What’s in a Benchmark? The Case of SWE-Bench in Automated Program Repair
The rapid progress in Automated Program Repair (APR) has been fueled by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a benchmark designed to evaluate repair systems on real issues mined from popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. In this paper, we present the first comprehensive study of these two leaderboards, examining who is submitting solutions, the products behind the submissions, the LLMs employed, and the openness of the approaches. We analyze 99 entries submitted to the Lite leaderboard and 133 submitted to the Verified leaderboard. Our results show that most entries on both leaderboards originate from industry, particularly small companies and large publicly traded companies. These submissions often achieve top results, although academic contributions, which are typically open source, remain competitive. We also find a clear dominance of proprietary LLMs, especially the Claude family, with state-of-the-art results on both leaderboards currently achieved by Claude 4 Sonnet. These findings offer insights into the SWE-Bench ecosystem that can guide greater transparency and diversity in future benchmark-driven research.
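The abstract describes categorizing each leaderboard entry by submitter origin, LLM family, and openness of the approach. The sketch below illustrates the kind of tally such a study produces; it is a minimal illustration in Python, not the authors' pipeline, and the entry records and field names are hypothetical.

    from collections import Counter

    # Hypothetical entry records; the actual study categorized 99 Lite and
    # 133 Verified submissions. Field names here are illustrative only.
    entries = [
        {"board": "Verified", "origin": "industry", "llm": "Claude", "open": False},
        {"board": "Verified", "origin": "academia", "llm": "GPT", "open": True},
        {"board": "Lite", "origin": "industry", "llm": "Claude", "open": False},
    ]

    # Break down each leaderboard by submitter origin and LLM family,
    # and compute the share of open-source approaches.
    for board in ("Lite", "Verified"):
        subset = [e for e in entries if e["board"] == board]
        by_origin = Counter(e["origin"] for e in subset)
        by_llm = Counter(e["llm"] for e in subset)
        open_share = sum(e["open"] for e in subset) / len(subset)
        print(f"{board}: {dict(by_origin)} {dict(by_llm)} open-source: {open_share:.0%}")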
Wed 15 Apr. Times shown in the Brasilia, Distrito Federal, Brazil time zone.
16:00 - 17:30 | Session: AI for Software Engineering 8 | Research Track / SE In Practice (SEIP) | Room: Asia IV | Chair: Yintong Huo (Singapore Management University, Singapore)
16:00 (15m talk) | Quantifying Memorization Advantage in Code LLMs | Research Track | Alberick Euraste Djire (University of Luxembourg), Abdoul Kader Kaboré (University of Luxembourg), Jordan Samhi (University of Luxembourg), Earl T. Barr (University College London), Jacques Klein (University of Luxembourg), Tegawendé F. Bissyandé (University of Luxembourg)
16:15 (15m talk) | Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models | Research Track | Changshu Liu (University of Illinois at Urbana-Champaign), Yang Chen (University of Illinois at Urbana-Champaign), Reyhaneh Jabbarvand (University of Illinois at Urbana-Champaign)
16:30 (15m talk) | Top General Performance = Top Domain Performance? DomainCodeBench: A Multi-domain Code Generation Benchmark | Research Track | Dewu Zheng (Sun Yat-sen University), Yanlin Wang (Sun Yat-sen University), Ensheng Shi (Huawei), Xilin Liu (Huawei Cloud), Yuchi Ma (Huawei Cloud Computing Technologies), Hongyu Zhang (Chongqing University), Zibin Zheng (Sun Yat-sen University)
16:45 (15m talk) | What's in a Benchmark? The Case of SWE-Bench in Automated Program Repair | SE In Practice (SEIP) | Matias Martinez (Universitat Politècnica de Catalunya (UPC)), Xavier Franch (Universitat Politècnica de Catalunya)
17:00 (15m talk) | The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason | SE In Practice (SEIP) | Shanchao Liang (Purdue University, USA), Spandan Garg (Microsoft Corporation), Roshanak Zilouchian Moghaddam (Microsoft)
17:15 (15m talk) | Rethinking the Evaluation of Secure Code Generation | Research Track | Shih-Chieh Dai (University of Utah, USA), Jun Xu (The University of Utah), Guanhong Tao (University of Utah)