SANER 2026
Tue 17 - Fri 20 March 2026, Limassol, Cyprus

We introduce MutEval, a reproducible dual-mode framework for evaluating the robustness of code generation models. MutEval examines (i) prompt-level robustness, by applying semantically equivalent paraphrases and syntactic perturbations to natural-language prompts, and (ii) code-level robustness, by introducing small, controlled mutations to task descriptions or partial code contexts. Built on the HumanEval benchmark, the framework generates deterministic mutations, queries LLMs, and measures robustness using pass@k, CodeBERTScore, and similarity metrics. Across four state-of-the-art code LLMs, we observe consistent degradation patterns. Mutated natural-language prompts cause moderate degradation: similarity falls to 63–68% and pass@1 drops by 8–18%, with the gpt-4o variants showing the strongest resilience. Mutated code contexts are substantially more harmful: similarity falls to 7–15% and pass@1 declines to 21–32%, more than 50% below original performance, even though CodeBERTScore remains above 79%. These results show that code models are markedly more brittle to perturbations in code than in natural language and underscore the importance of robustness-aware evaluation for reliable deployment. A video demonstration of MutEval is available at https://youtu.be/v5mc8wneqRA.
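As a concrete illustration of the pipeline the abstract describes (deterministic mutation, then pass@k scoring), the following minimal Python sketch may help. The paraphrase rule table and the mutate_prompt helper are hypothetical stand-ins, since the abstract does not spell out MutEval's actual mutation operators; pass_at_k implements the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021).

import math
import random

def mutate_prompt(prompt: str, seed: int = 0) -> str:
    # Hypothetical semantics-preserving paraphrase step; a fixed seed
    # makes the mutation deterministic and hence reproducible.
    rules = [("Return", "Give back"),
             ("Write a function", "Implement a function")]
    old, new = random.Random(seed).choice(rules)
    return prompt.replace(old, new)

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k for one task: 1 - C(n-c, k) / C(n, k),
    # given n generated samples of which c pass the unit tests.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 10 samples per task, 3 pass the HumanEval tests -> pass@1 = 0.30
print(pass_at_k(n=10, c=3, k=1))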

Thu 19 Mar

Displayed time zone: Athens

14:00 - 15:30
Session 5A - Robustness and Reliability of LLM Code Generation
Short Papers and Posters Track / Research Track / Tool Demo Track / Early Research Achievement (ERA) Track at Panorama
Chair(s): Mugdha Khedkar Heinz Nixdorf Institute, Paderborn University
14:00
7m
Talk
Failure-Aware Enhancements for Large Language Model (LLM) Code Generation: An Empirical Study on Decision Framework
Short Papers and Posters Track
Jianru Shen University of Montana, Zedong Peng University of Montana, Lucy Owen University of Montana
14:07
15m
Talk
Progressively Mitigating API Hallucination in LLM-Generated Code via Knowledge Graph Reasoning
Research Track
Yuxuan Li Peking University, Zexiong Ma Peking University, Yanzhen Zou Peking University, Yue Wang Peking University, Lihan Yang Peking University, Bing Xie Peking University
14:22
15m
Talk
Programming Language Confusion: When Code LLMs Can't Keep their Languages Straight
Research Track
Micheline Bénédicte MOUMOULA University of Luxembourg, NIKIEMA Beninwende Serge Lionel University of Luxembourg, Abdoul Kader Kaboré University of Luxembourg, Jacques Klein University of Luxembourg, Tegawendé F. Bissyandé University of Luxembourg
14:37
15m
Talk
Can LLMs Keep Up with Library Changes? An Exploratory Study on LLM-Generated Code
Research Track
Xiangrong Lin Zhejiang University, Jiakun Liu Harbin Institute of Technology, Lingfeng Bao Zhejiang University
14:52
15m
Talk
Leveraging Enhanced Test-Driven Development for Accurate Code Generation in LLMs
Research Track
Rui Zhang School of Artificial Intelligence, China University of Geosciences (Beijing), Weijie Shan School of Artificial Intelligence, China University of Geosciences (Beijing), Teng Long School of Artificial Intelligence, China University of Geosciences (Beijing), Ce Fu School of Artificial Intelligence, China University of Geosciences (Beijing)
15:07
7m
Talk
When RAG Lies: Link-Injection Knowledge-Base Poisoning in Code Generation
Short Papers and Posters Track
Nguyen Trung Hieu Hanoi University of Science and Technology, Trung-Hieu Nguyen Hanoi University of Science and Technology, Trong-Nghia Be University of Engineering and Technology, Bao-Huy Hoang Hanoi University of Science and Technology, Anh M. T. Bui Hanoi University of Science and Technology
15:14
7m
Talk
Grounding Generative AI in Software Engineering: Are We There Yet?
Early Research Achievement (ERA) Track
Mootez Saad Dalhousie University, José Antonio Hernández López Department of Computer Science and Systems, University of Murcia, Boqi Chen McGill University, Neil Ernst University of Victoria, Daniel Varro Linköping University / McGill University, Tushar Sharma Dalhousie University
Pre-print
15:21
7m
Talk
MutEval: NL-PL Prompt Mutation Framework for Robustness Evaluation of Code LLMs
Tool Demo Track
Pre-print
Media Attached