RepairBench: Leaderboard of Frontier Models for Program Repair
AI-driven program repair uses AI models to repair buggy software by producing patches. Rapid advances in frontier models are likely to affect performance on the program repair task, yet there is a lack of frequent and standardized evaluations to understand the strengths and weaknesses of these models. To that end, we propose RepairBench, a novel leaderboard for AI-driven program repair. The key characteristics of RepairBench are: 1) it is execution-based: all patches are compiled and executed against a test suite; 2) it assesses frontier models in a frequent and standardized way. RepairBench leverages two high-quality benchmarks, Defects4J and GitBug-Java, to evaluate frontier models exclusively on real-world program repair tasks. At the time of writing, RepairBench shows that claude-3-5-sonnet-20241022 is the best model for program repair, and qwen-2.5-coder-32b-instruct the cheapest while maintaining good performance. We publicly release the evaluation framework of RepairBench as well as all patches generated in the course of the evaluation.
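To illustrate what execution-based validation typically involves, the sketch below applies a candidate patch to a Defects4J checkout and re-runs the project's test suite via the defects4j CLI. It is a minimal sketch, not RepairBench's actual implementation: the helper function, paths, bug identifier, and output parsing are assumptions made for illustration.

```python
# Minimal sketch of execution-based patch validation (not RepairBench's actual code).
# Assumes the Defects4J CLI and `patch` are installed and on PATH; the bug id,
# paths, and the output check are simplified illustrative assumptions.
import subprocess
import tempfile


def validate_patch(project: str, bug_id: str, patch_file: str) -> bool:
    """Apply a candidate patch to a buggy checkout and run the test suite."""
    workdir = tempfile.mkdtemp(prefix=f"{project}_{bug_id}_")

    # Check out the buggy revision of the project.
    subprocess.run(
        ["defects4j", "checkout", "-p", project, "-v", f"{bug_id}b", "-w", workdir],
        check=True,
    )

    # Apply the model-generated patch (unified diff) to the working copy.
    if subprocess.run(["patch", "-p1", "-i", patch_file], cwd=workdir).returncode != 0:
        return False  # patch does not apply

    # Compile; a patch that breaks the build cannot be plausible.
    if subprocess.run(["defects4j", "compile"], cwd=workdir).returncode != 0:
        return False

    # Run the test suite and look for zero failing tests (simplified check).
    test = subprocess.run(
        ["defects4j", "test"], cwd=workdir, capture_output=True, text=True
    )
    return test.returncode == 0 and "Failing tests: 0" in test.stdout


# Hypothetical usage: validate one candidate patch for bug Lang-1.
# print(validate_patch("Lang", "1", "/tmp/candidate.diff"))
```

A patch that applies, compiles, and passes the full test suite is commonly called plausible; checking patches this way, rather than by text similarity to the developer fix, is what "execution-based" refers to in the abstract.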
Sat 3 May (times in Eastern Time, US & Canada)
09:00 - 10:30
09:00 | 10m | Day opening | Opening (LLM4Code) | Lingming Zhang (University of Illinois at Urbana-Champaign), Prem Devanbu (University of California at Davis), Zijian Wang (AWS AI Labs)
09:10 | 60m | Keynote | Keynote 1: Building the Hybrid Human-AI Developer: From Code Completion to Agents (Zoom talk) (LLM4Code) | Federico Cassano (Cursor AI)
10:10 | 10m | Talk | Are Large Language Models Memorizing Bug Benchmarks? (LLM4Code) | Daniel Ramos, Claudia Mamede, Kush Jain, Paulo Canelas, Catarina Gamboa, Claire Le Goues (Carnegie Mellon University)
10:20 | 10m | Talk | RepairBench: Leaderboard of Frontier Models for Program Repair (LLM4Code)