Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models

Generative Large Language Models (LLMs) are increasingly used in non-generative software maintenance tasks, such as fault localization (FL). Success in FL tasks depends on a model’s ability to reason about program semantics beyond surface-level syntactic and lexical features. However, widely used LLM benchmarks primarily evaluate code generation, which differs fundamentally from program semantic reasoning. Meanwhile, traditional fault localization benchmarks like Defects4J and BugsInPy are either not scalable or obsolete because their datasets have become part of LLM training data, leading to biased results. This paper presents the first large-scale empirical investigation into the robustness of LLMs’ fault localizability. Inspired by mutation testing, we develop an end-to-end evaluation framework that addresses several limitations in current LLM evaluation, e.g., data contamination, scalability, automation, and extensibility. Given real-world seed programs with specifications, we inject unseen faults and ask LLMs to localize them. We filter out underspecified programs, where correct fault localization is inherently ambiguous. For each program an LLM localizes successfully, we apply semantic-preserving mutations (SPMs) and rerun localization to assess the LLM’s robustness and whether its reasoning relies on syntactic cues rather than semantics. We evaluate 10 state-of-the-art LLMs on 750,013 fault-localization tasks sourced from over 1,300 Java and Python programs. We observe that, in 78% of cases, SPMs cause an LLM to fail to localize a fault it had correctly localized earlier, and that LLMs reason noticeably better about code that appears earlier in the context. These results suggest that LLMs’ code reasoning is tied to code features irrelevant to semantics. We also identify code patterns that are challenging for LLMs to reason about.
To the best of our knowledge, no prior work has evaluated the robustness of LLMs’ code reasoning in fault localization at this scale. Overall, our findings motivate fundamental advances in how LLMs represent, interpret, and prioritize code semantics to reason more deeply about program logic.
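As an illustration of the evaluation idea described in the abstract, the following minimal Python sketch applies one semantic-preserving mutation and computes how often a previously correct localization flips to incorrect. It is not the authors' framework: the abstract does not say which mutations are used, so variable renaming is assumed here as a representative example, and the names `RenameVariable`, `apply_spm`, `run`, and `flip_rate` are hypothetical. It requires Python 3.9 or later for `ast.unparse`.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Hypothetical SPM: rename every occurrence of one identifier."""

    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Covers reads and writes of the variable.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        # Covers function parameters with the same name.
        if node.arg == self.old:
            node.arg = self.new
        return node


def apply_spm(source: str, old: str, new: str) -> str:
    """Return source text with `old` renamed to `new` (assumes `new` is unused)."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


def run(source: str, func: str, *args):
    """Execute source in a fresh namespace and call one of its functions."""
    namespace = {}
    exec(source, namespace)
    return namespace[func](*args)


def flip_rate(before, after):
    """Share of tasks localized correctly before the mutation but not after."""
    solved = [i for i, ok in enumerate(before) if ok]
    if not solved:
        return 0.0
    return sum(1 for i in solved if not after[i]) / len(solved)


# Illustrative seed program; not taken from the paper's dataset.
SEED = """
def total(values):
    acc = 0
    for v in values:
        acc += v
    return acc
"""

MUTATED = apply_spm(SEED, "acc", "running_sum")
```

Because the mutated program behaves identically to the seed, any change in a model's localization result between the two versions can be attributed to surface features rather than semantics. In `flip_rate`, `before` and `after` are assumed to be per-task booleans recording whether the model localized the fault correctly.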
Tue 19 May (displayed time zone: Seoul)
16:00 - 17:30 | Machine Learning for Code Analysis & Review | Research Papers / Industry | Room 103 | Chair(s): Dietmar Pfahl University of Tartu
16:00 | 25m | Talk | Understanding and Improving ML-based Static Analysis Result Classification via Explainable AI | Research Papers | Sai Yerramreddy University of Maryland, Mohammad Rafieian The University of Texas at Dallas, Shiyi Wei University of Texas at Dallas, Adam Porter University of Maryland, College Park
16:25 | 25m | Talk | Adaptive Mixing of Embeddings from Multiple Code Language Models for Fault Localization | Research Papers | Juyoung Yang Korea Advanced Institute of Science and Technology (KAIST), Eunchan Park Korea Advanced Institute of Science and Technology (KAIST), In-Young Ko Korea Advanced Institute of Science and Technology
16:50 | 25m | Talk | Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models | Research Papers | Sabaat Haroon Virginia Tech, Ahmand Faraz Khan Virginia Tech, Ahmad Humayun Virginia Tech, Waris Gill Virginia Tech, Abdul Haddi Amjad Palo Alto Networks, Ali R. Butt Virginia Tech, Mohammad Taha Khan Carnegie Mellon University, Muhammad Ali Gulzar Virginia Tech
17:15 | 15m | Talk | When Less Is More: Monolingual Fine-Tuning of Language Models for Industrial C# Code Review | Industry | Igli Begolli Technical University Dortmund, Lovion GmbH, Meltem Aksoy TU Dortmund University, Daniel Neider Technical University of Dortmund, Germany