LLM-based automated program repair (APR) methods have attracted significant attention for their state-of-the-art performance. However, they have been evaluated primarily on a few well-known datasets such as Defects4J, raising questions about their effectiveness on new datasets. In this study, we evaluate 11 top-performing LLMs on DEFECTS4J-TRANS, a new dataset derived by transforming Defects4J while preserving fault semantics. Experiments on both Defects4J and DEFECTS4J-TRANS show that all studied LLMs have limited generalizability in APR tasks, with the average numbers of correct and plausible patches decreasing by 49.48% and 42.90%, respectively, on DEFECTS4J-TRANS. Further investigation into incorporating additional repair-relevant information into repair prompts reveals that, although this information significantly enhances the LLMs’ repair capabilities (increasing correct and plausible patches by up to 136.67% and 121.82%, respectively), their performance still falls short of that on the original dataset. This indicates that prompt engineering alone is insufficient to substantially improve LLMs’ repair capabilities. Based on our study, we also offer several recommendations for future research.