Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs
Large Language Models (LLMs) have shown remarkable capabilities in processing both natural and programming languages, enabling various applications in software engineering, such as requirements engineering, code generation, and software testing. However, existing code generation benchmarks do not necessarily assess the code understanding performance of LLMs, especially for subtle inconsistencies that may arise between code and its semantics as described in natural language.
In this paper, we propose a novel method to systematically assess the code understanding performance of LLMs, particularly focusing on subtle differences between code and its descriptions, by introducing code mutations to existing code generation datasets. Code mutations are small changes that alter the semantics of the original code, creating a mismatch with the natural language description. We apply different types of code mutations, such as operator replacement and statement deletion, to generate inconsistent code-description pairs. We then use these pairs to test the ability of LLMs to correctly detect the inconsistencies.
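As a concrete illustration (a minimal sketch, not the paper's actual tooling), the following Python snippet applies an operator-replacement mutation to a HumanEval-style code-description pair; the sample task, the function name above_threshold, and the mutation rule are assumptions made for illustration only.

import re

# Hypothetical HumanEval-style sample: a natural language description
# paired with a reference solution (illustrative, not from the benchmark).
description = "Return True if the number n is strictly greater than the threshold t."
solution = '''
def above_threshold(n, t):
    return n > t
'''

def mutate_operator(code: str) -> str:
    """Operator-replacement mutation: swap the first '>' for '<'.

    The mutated code stays syntactically valid but no longer matches
    the natural language description, yielding an inconsistent pair.
    """
    return re.sub(r">", "<", code, count=1)

mutated = mutate_operator(solution)

# (description, mutated) is now an inconsistent code-description pair:
# an LLM with good code understanding should flag the mismatch.
print(description)
print(mutated)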
We name this new LLM testing method Mutation-based Consistency Testing (MCT) and conduct a case study on two popular LLMs, GPT-3.5 and GPT-4, using the state-of-the-art code generation benchmark HumanEval-X, which covers six programming languages (Python, C++, Java, Go, JavaScript, and Rust). We compare the performance of the LLMs across mutation types and programming languages and analyze the results. We find that the LLMs show significant variation in their code understanding performance, with different strengths and weaknesses depending on the mutation type and language. We further explain the conditions under which the LLMs produce correct answers in terms of input characteristics (e.g., number of tokens) and investigate to what extent the test results can be improved using one-shot prompts (i.e., providing an additional example). Our MCT method and the case study results provide valuable implications for future research and development of LLM-based software engineering.
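For concreteness, the sketch below shows one way such mutated pairs could be turned into consistency-check queries, including a one-shot variant that prepends an additional worked example; the prompt wording and the helper name build_prompt are hypothetical and do not reproduce the paper's exact protocol.

def build_prompt(description: str, code: str, one_shot: bool = False) -> str:
    """Assemble a consistency-check prompt (hypothetical wording)."""
    example = ""
    if one_shot:
        # One-shot setting: prepend a single worked example of an
        # inconsistent pair together with its expected answer.
        example = (
            "Description: Return the sum of a and b.\n"
            "Code:\n"
            "def add(a, b):\n"
            "    return a - b\n"
            "Answer: INCONSISTENT\n\n"
        )
    return (
        "Decide whether the code is consistent with the description. "
        "Answer CONSISTENT or INCONSISTENT.\n\n"
        + example
        + f"Description: {description}\n"
        + f"Code:\n{code}\n"
        + "Answer:"
    )

# Usage: prompt = build_prompt(description, mutated, one_shot=True)
# The prompt would then be sent to GPT-3.5 or GPT-4 through an API client,
# and the reply compared against the known (in)consistency label.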
Mon 15 Apr (Lisbon time zone)

Session: 14:00 - 15:30

14:00 (15 min) Talk: A Combinatorial Testing Approach to Hyperparameter Optimization (Distinguished Paper Award Candidate). Research and Experience Papers. Krishna Khadka (The University of Texas at Arlington), Jaganmohan Chandrasekaran (Virginia Tech), Jeff Yu Lei (University of Texas at Arlington), Raghu Kacker (National Institute of Standards and Technology), D. Richard Kuhn (National Institute of Standards and Technology)

14:15 (15 min) Talk: Mutation-based Consistency Testing for Evaluating the Code Understanding Capability of LLMs. Research and Experience Papers.

14:30 (10 min) Talk: LLMs for Test Input Generation for Semantic Applications. Research and Experience Papers. Zafaryab Rasool (Applied Artificial Intelligence Institute, Deakin University), Scott Barnett (Applied Artificial Intelligence Institute, Deakin University), David Willie (Applied Artificial Intelligence Institute, Deakin University), Stefanus Kurniawan (Deakin University), Sherwin Balugo (Applied Artificial Intelligence Institute, Deakin University), Srikanth Thudumu (Deakin University), Mohamed Abdelrazek (Deakin University, Australia)

14:40 (10 min) Talk: (Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs. Research and Experience Papers. MA Wanqin (The Hong Kong University of Science and Technology), Chenyang Yang (Carnegie Mellon University), Christian Kästner (Carnegie Mellon University)

14:50 (10 min) Talk: Welcome Your New AI Teammate: On Safety Analysis by Leashing Large Language Models. Research and Experience Papers. Ali Nouri (Volvo cars & Chalmers University of Technology), Beatriz Cabrero-Daniel (University of Gothenburg), Fredrik Torner (Volvo cars), Hakan Sivencrona (Zenseact AB), Christian Berger (Chalmers University of Technology, Sweden)

15:00 (10 min) Talk: ML-On-Rails: Safeguarding Machine Learning Models in Software Systems – A Case Study. Research and Experience Papers. Hala Abdelkader (Applied Artificial Intelligence Institute, Deakin University), Mohamed Abdelrazek (Deakin University, Australia), Scott Barnett (Applied Artificial Intelligence Institute, Deakin University), Jean-Guy Schneider (Monash University), Priya Rani (RMIT University), Rajesh Vasa (Deakin University, Australia)

15:10 (20 min) Live Q&A: Test - Q&A Session. Research and Experience Papers.