MutEval: NL-PL Prompt Mutation Framework for Robustness Evaluation of Code LLMs
We introduce MutEval, a reproducible dual-mode framework for evaluating the robustness of code generation models. MutEval examines: (i) prompt-level robustness, by applying semantically equivalent paraphrases and syntactic perturbations to natural-language prompts, and (ii) code-level robustness, by introducing small, controlled mutations to task descriptions or partial code contexts. Built on the HumanEval benchmark, the framework generates deterministic mutations, queries LLMs, and measures robustness using pass@k, CodeBERTScore, and similarity metrics. Across four state-of-the-art code LLMs, we observe consistent degradation patterns. Mutated natural-language prompts lead to moderate drops in similarity (63–68%) and pass@1 (8–18%), with gpt-4o variants showing the strongest resilience. Mutated code contexts, however, are substantially more harmful: similarity falls to 7–15% and pass@1 declines to 21–32%, more than 50% below original performance, despite CodeBERTScore remaining above 79%. These results show that code models are far more brittle to perturbations in code than in natural language, and they underscore the importance of robustness-aware evaluation for reliable deployment. A video demonstration of MutEval is available at https://youtu.be/v5mc8wneqRA.
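The pass@k scores reported above are conventionally computed with the unbiased estimator introduced with HumanEval (Chen et al., 2021): given n generated samples per task of which c pass the tests, pass@k is the probability that at least one of k drawn samples passes. A minimal sketch in Python (the function name and signature are illustrative, not MutEval's actual API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations passes,
    given that c of the n generations are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # samples must contain at least one correct solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a benchmark, the per-task estimates are averaged; e.g. with n=10 samples of which c=3 pass, pass_at_k(10, 3, 1) gives 0.3 for that task.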
Thu 19 Mar (displayed time zone: Athens)
14:00 - 15:30 | Session 5A: Robustness and Reliability of LLM Code Generation | Short Papers and Posters Track / Research Track / Tool Demo Track / Early Research Achievement (ERA) Track | at Panorama | Chair(s): Mugdha Khedkar (Heinz Nixdorf Institute, Paderborn University)
14:00 | 7m Talk | Failure-Aware Enhancements for Large Language Model (LLM) Code Generation: An Empirical Study on Decision Framework | Short Papers and Posters Track | Jianru Shen (University of Montana), Zedong Peng (University of Montana), Lucy Owen (University of Montana)
14:07 | 15m Talk | Progressively Mitigating API Hallucination in LLM-Generated Code via Knowledge Graph Reasoning | Research Track | Yuxuan Li (Peking University), Zexiong Ma (Peking University), Yanzhen Zou (Peking University), Yue Wang (Peking University), Lihan Yang (Peking University), Bing Xie (Peking University)
14:22 | 15m Talk | Programming Language Confusion: When Code LLMs Can't Keep their Languages Straight | Research Track | Micheline Bénédicte MOUMOULA (University of Luxembourg), NIKIEMA Beninwende Serge Lionel (University of Luxembourg), Abdoul Kader Kaboré (University of Luxembourg), Jacques Klein (University of Luxembourg), Tegawendé F. Bissyandé (University of Luxembourg)
14:37 | 15m Talk | Can LLMs Keep Up with Library Changes? An Exploratory Study on LLM-Generated Code | Research Track | Xiangrong Lin (Zhejiang University), Jiakun Liu (Harbin Institute of Technology), Lingfeng Bao (Zhejiang University)
14:52 | 15m Talk | Leveraging Enhanced Test-Driven Development for Accurate Code Generation in LLMs | Research Track | Rui Zhang (School of Artificial Intelligence, China University of Geosciences (Beijing)), Weijie Shan (School of Artificial Intelligence, China University of Geosciences (Beijing)), Teng Long (School of Artificial Intelligence, China University of Geosciences (Beijing)), Ce Fu (School of Artificial Intelligence, China University of Geosciences (Beijing))
15:07 | 7m Talk | When RAG Lies: Link-Injection Knowledge-Base Poisoning in Code Generation | Short Papers and Posters Track | Nguyen Trung Hieu (Hanoi University of Science and Technology), Trung-Hieu Nguyen (Hanoi University of Science and Technology, Hanoi, Vietnam), Trong-Nghia Be (University of Engineering and Technology), Bao-Huy Hoang (Hanoi University of Science and Technology), Anh M. T. Bui (Hanoi University of Science and Technology)
15:14 | 7m Talk | Grounding Generative AI in Software Engineering: Are We There Yet? | Early Research Achievement (ERA) Track | Mootez Saad (Dalhousie University), José Antonio Hernández López (Department of Computer Science and Systems, University of Murcia), Boqi Chen (McGill University), Neil Ernst (University of Victoria), Daniel Varro (Linköping University / McGill University), Tushar Sharma (Dalhousie University) | Pre-print
15:21 | 7m Talk | MutEval: NL-PL Prompt Mutation Framework for Robustness Evaluation of Code LLMs | Tool Demo Track | Pre-print, Media Attached