ICST 2026
Mon 18 - Fri 22 May 2026 Daejeon, South Korea
Tue 19 May 2026 16:50 - 17:15 at Room 103 - Machine Learning for Code Analysis & Review Chair(s): Dietmar Pfahl

Generative Large Language Models (LLMs) are increasingly used in non-generative software maintenance tasks, such as fault localization (FL). Success in FL tasks depends on a model’s ability to reason about program semantics beyond surface-level syntactic and lexical features. However, widely used LLM benchmarks primarily evaluate code generation, which differs fundamentally from program semantic reasoning. Meanwhile, traditional fault localization benchmarks like Defects4J and BugsInPy are either not scalable or obsolete because their datasets have become part of LLM training data, leading to biased results. This paper presents the first large-scale empirical investigation into the robustness of LLMs’ fault localizability. Inspired by mutation testing, we develop an end-to-end evaluation framework that addresses several limitations in current LLM evaluation, e.g., data contamination, scalability, automation, and extensibility. Given real-world seed programs with specifications, we inject unseen faults and ask LLMs to localize them. We filter out underspecified programs, where correct fault localization is inherently ambiguous. For each program an LLM localizes successfully, we apply semantics-preserving mutations (SPMs) and rerun localization to assess the LLM’s robustness and whether the LLM’s reasoning relies on syntactic cues rather than semantics. We evaluate 10 state-of-the-art LLMs on 750,013 fault-localization tasks sourced from over 1,300 Java and Python programs. We observe that SPMs cause an LLM to fail to localize the same fault it correctly localized earlier in 78% of cases, and that LLMs reason noticeably better about code that appears earlier in the context. These results suggest that LLMs’ code reasoning is tied to code features irrelevant to semantics. We also identify code patterns that are challenging for LLMs to reason about.
To the best of our knowledge, no prior work has evaluated the robustness of LLMs’ code reasoning in fault localization at this scale. Overall, our findings motivate fundamental advances in how LLMs represent, interpret, and prioritize code semantics to reason more deeply about program logic.
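
The robustness check described in the abstract (localize a fault, apply a semantics-preserving mutation, localize again) can be illustrated with a minimal Python sketch. This is not the authors' framework: the names RenameVariables, apply_spm, flip_rate, and cue_based_localizer are hypothetical, variable renaming is only one assumed example of an SPM, and the toy localizer merely stands in for an LLM. The sketch also assumes the new names are unused elsewhere and that renamed parameters are never passed by keyword.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Rename identifiers per `mapping` (assumed example of an SPM)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def apply_spm(source, fault_line, mapping):
    """Return (mutated_source, line of the fault in the mutated source)."""
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    # Position (in walk order) of the first statement starting on the faulty line.
    idx = next(i for i, n in enumerate(nodes)
               if isinstance(n, ast.stmt) and n.lineno == fault_line)
    mutated = ast.unparse(RenameVariables(mapping).visit(tree))
    # Unparsing drops comments and blank lines, so lines may shift;
    # look up the same node position in the re-parsed tree.
    new_line = list(ast.walk(ast.parse(mutated)))[idx].lineno
    return mutated, new_line


def flip_rate(tasks, localize, mapping):
    """Share of initially correct localizations that fail after the mutation."""
    solved = flipped = 0
    for source, fault_line in tasks:
        if localize(source) != fault_line:
            continue  # only successfully localized programs are mutated
        solved += 1
        mutated, new_line = apply_spm(source, fault_line, mapping)
        if localize(mutated) != new_line:
            flipped += 1
    return flipped / solved if solved else 0.0


# Hypothetical seed program with an injected fault on line 5.
BUGGY = '''
def total(values):
    acc = 0
    for v in values:
        acc -= v  # injected fault: should be +=
    return acc
'''


def cue_based_localizer(source):
    """Toy stand-in for an LLM that keys on the name `acc`, not on semantics."""
    for number, line in enumerate(source.splitlines(), start=1):
        if "acc -=" in line:
            return number
    return -1


print(flip_rate([(BUGGY, 5)], cue_based_localizer, {"acc": "x"}))  # 1.0
```

Because the toy localizer depends on a lexical cue, renaming acc to x makes it miss a fault it found before, which is the kind of failure the reported 78% figure counts. A real study would replace cue_based_localizer with model queries and use a broader set of mutations.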

Tue 19 May

Displayed time zone: Seoul

16:00 - 17:30
Machine Learning for Code Analysis & Review (Research Papers / Industry) at Room 103
Chair(s): Dietmar Pfahl University of Tartu
16:00
25m
Talk
Understanding and Improving ML-based Static Analysis Result Classification via Explainable AI
Research Papers
Sai Yerramreddy University of Maryland, Mohammad Rafieian The University of Texas at Dallas, Shiyi Wei University of Texas at Dallas, Adam Porter University of Maryland, College Park
16:25
25m
Talk
Adaptive Mixing of Embeddings from Multiple Code Language Models for Fault Localization
Research Papers
Juyoung Yang Korea Advanced Institute of Science and Technology (KAIST), Eunchan Park Korea Advanced Institute of Science and Technology (KAIST), In-Young Ko Korea Advanced Institute of Science and Technology
16:50
25m
Talk
Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models (Artifact Reviewed, Artifact Available)
Research Papers
Sabaat Haroon Virginia Tech, Ahmad Faraz Khan Virginia Tech, Ahmad Humayun Virginia Tech, Waris Gill Virginia Tech, Abdul Haddi Amjad Palo Alto Networks, Ali R. Butt Virginia Tech, Mohammad Taha Khan Carnegie Mellon University, Muhammad Ali Gulzar Virginia Tech
17:15
15m
Talk
When Less Is More: Monolingual Fine-Tuning of Language Models for Industrial C# Code Review
Industry
Igli Begolli Technical University Dortmund, Lovion GmbH, Meltem Aksoy TU Dortmund University, Daniel Neider Technical University of Dortmund, Germany