Synthetic Repo-level Bug Dataset for Training Automated Program Repair Models
Distinguished Paper Award
Automated program repair (APR) aims to fix software bugs autonomously, yet its effectiveness is hampered by the lack of diverse, real-world bug datasets needed for model training. Although combining large-scale mining with human effort can yield such datasets, the associated costs limit scalability. To address this, we introduce a novel, scalable synthetic data pipeline that uses large language models (LLMs) to generate synthetic bugs through targeted code rewriting. The pipeline can also synthesize intermediate repair steps, which enrich the training signal toward correct fixes. Using our method, we create SWE-Synth, a large and contextually rich dataset of bug-fix pairs that are natural, scalable, and automatically verifiable, and that contain intermediate repair steps. LLMs trained on our synthetic dataset learn context-aware repair strategies and achieve repair accuracy equivalent to that of models trained on manually curated GitHub datasets such as SWE-Gym, while delivering superior scalability through effortless bug synthesis, as demonstrated on popular benchmarks (SWE-Bench and BugsInPy).
Fri 17 Apr (displayed time zone: Brasilia, Distrito Federal, Brazil)
14:00 - 15:30 | AI for Software Engineering 23 | Research Track / Demonstrations / Journal-first Papers at Asia I | Chair(s): Wesley K.G. Assunção (North Carolina State University)
14:00 | 15m | Talk | CI-Bench: A Framework for Evaluating Large Language Model Tools on CI Failures | Demonstrations | Raian Latif Nabil (University of California, Davis), Hao-Nan Zhu (University of California, Davis), Cindy Rubio-González (University of California at Davis)
14:15 | 15m | Talk | Assessing the Latent Automated Program Repair Capabilities of Large Language Models using Round-Trip Translation | Journal-first Papers | Fernando Vallecillos Ruiz (Simula Research Laboratory), Anastasiia Grishina (Simula Research Laboratory), Max Hort (Simula Research Laboratory), Leon Moonen (Simula Research Laboratory)
14:30 | 15m | Talk | XRFix: Exploring Performance Bug Repair of Extended Reality Applications with Large Language Models | Research Track | Jingwen Wu (Department of Computer Science, Hong Kong Baptist University), Hanyang Guo (School of Software Engineering, Sun Yat-sen University), Hong-Ning Dai (Department of Computer Science, Hong Kong Baptist University), Xiapu Luo (Hong Kong Polytechnic University)
14:45 | 15m | Talk | Synthetic Repo-level Bug Dataset for Training Automated Program Repair Models (Distinguished Paper Award) | Research Track | Minh V. T. Pham (FPT Software AI Center), Huy N. Phan (FPT Software AI Center), Hoang Nhat Phan (Nanyang Technological University), Cuong Chi Le (The University of Texas at Dallas), Tien N. Nguyen (University of Texas at Dallas), Nghi D. Q. Bui (Google Research)
15:00 | 15m | Talk | PredicateFix: Repairing Static Analysis Alerts with Bridging Predicates | Research Track | Yuan-An Xiao (Peking University), Weixuan Wang (Peking University), Dong Liu (Center Research Institute, ZTE Corporation, China), Junwei Zhou (Center Research Institute, ZTE Corporation, China), Shengyu Cheng (ZTE Corporation), Yingfei Xiong (Peking University)
15:15 | 15m | Talk | Input Reduction Enhanced LLM-based Program Repair | Research Track | Boyang Yang (Yanshan University), Luyao Ren (Peking University), Xin Yin (Zhejiang University), Jiadong Ren (Yanshan University), Haoye Tian (Aalto University), Shunfu Jin (Yanshan University)