CI-Bench: A Framework for Evaluating Large Language Model Tools on CI Failures
Large language models (LLMs) have demonstrated their potential in performing complex software engineering (SE) tasks. Rigorous evaluation of LLMs and LLM-based tools requires massive, up-to-date data derived from real-world SE processes. Existing continuous integration and delivery (CI/CD) datasets, such as BugSwarm, provide evolving data mined from software build processes, including the complete context of build failures. To harness this data, we propose CI-Bench, a unified benchmarking framework designed to evaluate LLM-based program repair tools on software failures from CI/CD processes. CI-Bench retrieves data from the BugSwarm dataset, parses the build logs, and constructs appropriate prompts before invoking LLM-based program repair tools. Additionally, CI-Bench includes an executor that facilitates dynamic evaluation in the same environment as the original build process. With CI-Bench, we evaluate three state-of-the-art LLM-based program repair tools, Agentless, SWE-Agent, and AutoCodeRover, on a code repair task involving 100 real-world CI/CD failures, using GPT-4o, Claude-3.5-Sonnet, and DeepSeek-V3 as foundation models. The evaluation shows that Agentless, SWE-Agent, and AutoCodeRover achieve success rates of up to 32%, 36%, and 13%, respectively, in generating correct patches. CI-Bench is available at https://github.com/bugswarm/ci-bench and a demo of the tool can be found at https://youtu.be/BM0K-P38MOg.
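The log-parsing and prompt-construction steps described in the abstract can be pictured with a minimal sketch. It is not CI-Bench's implementation: the error pattern, the FailureContext fields, and the prompt wording are all hypothetical and chosen only for illustration, and no BugSwarm API is called (the build log is passed in as a plain string).

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical pattern for log lines that look like failures (an assumption,
# not the rule set CI-Bench actually uses).
ERROR_PATTERN = re.compile(r"\b(ERROR|FAILED|FAILURE|Traceback|Exception)\b")


@dataclass
class FailureContext:
    """Illustrative container; field names are not BugSwarm's schema."""
    repo: str
    failed_commit: str
    error_lines: List[str]


def extract_error_lines(build_log: str, max_lines: int = 20) -> List[str]:
    """Keep the log lines that match the error pattern, in original order."""
    hits = [
        line.strip()
        for line in build_log.splitlines()
        if ERROR_PATTERN.search(line)
    ]
    return hits[:max_lines]


def build_prompt(ctx: FailureContext) -> str:
    """Assemble a plain-text repair prompt from the failure context."""
    errors = "\n".join(ctx.error_lines) or "(no error lines found)"
    return (
        f"Repository: {ctx.repo}\n"
        f"Failing commit: {ctx.failed_commit}\n"
        f"Relevant build log lines:\n{errors}\n"
        "Task: propose a patch that makes the build pass."
    )


def success_rate(correct_patches: int, total_failures: int) -> float:
    """Percentage of failures for which a correct patch was generated."""
    return 100.0 * correct_patches / total_failures
```

Under these assumptions, a tool that produces correct patches for 32 of the 100 failures has success_rate(32, 100) equal to 32.0, which is how the percentages in the abstract map onto the 100-failure task.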
Fri 17 Apr (displayed time zone: Brasilia, Distrito Federal, Brazil)
14:00 - 15:30 | AI for Software Engineering 23 | Research Track / Demonstrations / Journal-first Papers | at Asia I | Chair(s): Wesley K.G. Assunção (North Carolina State University)
14:00 (15m talk) | CI-Bench: A Framework for Evaluating Large Language Model Tools on CI Failures | Demonstrations | Raian Latif Nabil (University of California, Davis), Hao-Nan Zhu (University of California, Davis), Cindy Rubio-González (University of California at Davis)
14:15 (15m talk) | Assessing the Latent Automated Program Repair Capabilities of Large Language Models using Round-Trip Translation | Journal-first Papers | Fernando Vallecillos Ruiz (Simula Research Laboratory), Anastasiia Grishina (Simula Research Laboratory), Max Hort (Simula Research Laboratory), Leon Moonen (Simula Research Laboratory)
14:30 (15m talk) | XRFix: Exploring Performance Bug Repair of Extended Reality Applications with Large Language Models | Research Track | Jingwen Wu (Department of Computer Science, Hong Kong Baptist University), Hanyang Guo (School of Software Engineering, Sun Yat-sen University), Hong-Ning Dai (Department of Computer Science, Hong Kong Baptist University), Xiapu Luo (Hong Kong Polytechnic University)
14:45 (15m talk) | Synthetic Repo-level Bug Dataset for Training Automated Program Repair Models (Distinguished Paper Award) | Research Track | Minh V. T. Pham (FPT Software AI Center), Huy N. Phan (FPT Software AI Center), Hoang Nhat Phan (Nanyang Technological University), Cuong Chi Le (The University of Texas at Dallas), Tien N. Nguyen (University of Texas at Dallas), Nghi D. Q. Bui (Google Research)
15:00 (15m talk) | PredicateFix: Repairing Static Analysis Alerts with Bridging Predicates | Research Track | Yuan-An Xiao (Peking University), Weixuan Wang (Peking University), Dong Liu (Center Research Institute, ZTE Corporation, China), Junwei Zhou (Center Research Institute, ZTE Corporation, China), Shengyu Cheng (ZTE Corporation), Yingfei Xiong (Peking University)
15:15 (15m talk) | Input Reduction Enhanced LLM-based Program Repair | Research Track | Boyang Yang (Yanshan University), Luyao Ren (Peking University), Xin Yin (Zhejiang University), Jiadong Ren (Yanshan University), Haoye Tian (Aalto University), Shunfu Jin (Yanshan University)