A Benchmark for Language Models in Real-World System Building
Software package build repair is a critical task for ensuring the reliability of software deployment and the stability of modern operating systems. While Large Language Models (LLMs) have shown promise in tackling this challenge, prior work has primarily focused on single instruction set architectures (ISAs) and homogeneous programming languages. To address this limitation, we introduce a new benchmark designed for software package build repair across diverse architectures and languages. Comprising 268 real-world software package build failures, the benchmark provides a standardized evaluation pipeline. We evaluate six state-of-the-art LLMs on the benchmark, and the results show that cross-ISA software package repair remains difficult and requires further advances. By systematically exposing this challenge, the benchmark establishes a foundation for advancing future methods aimed at improving software portability and bridging architectural gaps.
Tue 14 Apr | Displayed time zone: Brasilia, Distrito Federal, Brazil
16:00 - 16:40 | Analysis and Optimization for AI-generated Code (Virtual Session) | LLM4Code at Oceania I | Chair(s): Kexin Pei (The University of Chicago) | Zoom Link: https://us06web.zoom.us/j/89846180915
16:00 | 10m Talk | Do LLMs Dream of Energy-Efficient Code? | LLM4Code | Antimo Di Bernardo (Independent Researcher), Gianluca Capozzi (Sapienza University of Rome), Pasquale De Rosa (University of Neuchâtel), Daniele Cono D'Elia (Sapienza University of Rome), Leonardo Querzoni (Sapienza University of Rome), Giuseppe Antonio Di Luna (Sapienza University of Rome), Valerio Schiavoni (University of Neuchâtel)
16:10 | 10m Talk | A Benchmark for Language Models in Real-World System Building | LLM4Code | Weilin Jin (Peking University), Chenyu Zhao (Nankai University), Zeshun Huang (Nankai University), Chaoyun Zhang (Microsoft), Qingwei Lin (Microsoft), Chetan Bansal (Microsoft Research), Saravan Rajmohan (Microsoft), Shenglin Zhang (Nankai University), Yongqian Sun (Nankai University), Dan Pei (Tsinghua University), Yifan Wu (Peking University), Tong Jia (Institute for Artificial Intelligence, Peking University, Beijing, China), Ying Li (School of Software and Microelectronics, Peking University, Beijing, China), Zhonghai Wu (Peking University), Minghua Ma (Microsoft)
16:20 | 10m Talk | An Automated Methodology for Generating Labeled Datasets of Semantic Errors in Code | LLM4Code | Mahmoud Kassem (New York University Abu Dhabi), Francisco Ribeiro (New York University, Abu Dhabi), Sarah Nadi (New York University Abu Dhabi)
16:30 | 10m Talk | Code vs Serialized AST Inputs for LLM-Based Code Summarization: An Empirical Study | LLM4Code | Shijia Dong (School of Computing Science, University of Glasgow), Haoruo Zhao (School of Computing Science, University of Glasgow), Paul Harvey (University of Glasgow, UK)