ICSE 2026
Sun 12 - Sat 18 April 2026 Rio de Janeiro, Brazil
Wed 15 Apr 2026 16:45 - 17:00 at Asia IV - AI for Software Engineering 8 Chair(s): Yintong Huo

The rapid progress in Automated Program Repair (APR) has been fueled by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a benchmark designed to evaluate repair systems on real issues mined from popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. In this paper, we present the first comprehensive study of these two leaderboards, examining who submits solutions, the products behind the submissions, the LLMs employed, and the openness of the approaches. We analyze 99 entries submitted to the Lite leaderboard and 133 to the Verified leaderboard. Our results show that most entries on both leaderboards originate from industry, particularly small companies and large publicly traded companies. These submissions often achieve top results, although academic contributions, which are typically open source, remain competitive. We also find a clear dominance of proprietary LLMs, especially the Claude family, with state-of-the-art results on both leaderboards currently achieved by Claude 4 Sonnet. These findings offer insights into the SWE-Bench ecosystem that can guide greater transparency and diversity in future benchmark-driven research.

Wed 15 Apr

Displayed time zone: Brasilia, Distrito Federal, Brazil

16:00 - 17:30
AI for Software Engineering 8
Research Track / SE In Practice (SEIP) at Asia IV
Chair(s): Yintong Huo Singapore Management University, Singapore
16:00
15m
Talk
Quantifying Memorization Advantage in Code LLMs
Research Track
Alberick Euraste Djire University of Luxembourg, Abdoul Kader Kaboré University of Luxembourg, Jordan Samhi University of Luxembourg, Earl T. Barr University College London, Jacques Klein University of Luxembourg, Tegawendé F. Bissyandé University of Luxembourg
16:15
15m
Talk
Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models
Virtual Attendance
Research Track
Changshu Liu University of Illinois at Urbana-Champaign, Yang Chen University of Illinois at Urbana-Champaign, Reyhaneh Jabbarvand University of Illinois at Urbana-Champaign
Pre-print Media Attached
16:30
15m
Talk
Top General Performance = Top Domain Performance? DomainCodeBench: A Multi-domain Code Generation Benchmark
Virtual Attendance
Research Track
Dewu Zheng Sun Yat-sen University, Yanlin Wang Sun Yat-sen University, Ensheng Shi Huawei, Xilin Liu Huawei Cloud, Yuchi Ma Huawei Cloud Computing Technologies, Hongyu Zhang Chongqing University, Zibin Zheng Sun Yat-sen University
Media Attached
16:45
15m
Talk
What’s in a Benchmark? The Case of SWE-Bench in Automated Program Repair
SE In Practice (SEIP)
Matias Martinez Universitat Politècnica de Catalunya (UPC), Xavier Franch Universitat Politècnica de Catalunya
17:00
15m
Talk
The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason
Virtual Attendance
SE In Practice (SEIP)
Shanchao Liang Purdue University, USA, Spandan Garg Microsoft Corporation, Roshanak Zilouchian Moghaddam Microsoft
Media Attached
17:15
15m
Talk
Rethinking the Evaluation of Secure Code Generation
Research Track
Shih-Chieh Dai University of Utah, Jun Xu University of Utah, Guanhong Tao University of Utah