Automatic Validation of LLM-Generated Code with Prompt Paraphrasing
Large Language Models (LLMs) are widely used for code generation. However, the quality of the generated code remains questionable, and validating it is still a challenging problem. In this paper, we propose a novel solution called metamorphic prompt testing. Our intuition is that correct code pieces are always intrinsically consistent with one another, whereas flawed code pieces may not be, so flaws can be detected by detecting inconsistencies. We therefore paraphrase a given prompt into multiple prompts, ask the LLM to generate a version of the code for each, and cross-validate whether the expected semantic relations hold among the generated versions. Our initial evaluation on HumanEval shows that metamorphic prompt testing detects 70.6% of the erroneous programs generated by GPT-4o, with a false positive rate of 6.1%.
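The cross-validation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the agreement threshold, and the hand-written stand-in programs are all assumptions made for illustration. In a real pipeline the variants would come from an LLM queried with paraphrased prompts, and how the test inputs are chosen is left open here.

```python
from typing import Callable, Sequence, Tuple

Program = Callable[..., object]


def run_safely(program: Program, args: Tuple) -> Tuple[str, object]:
    """Run a program on one input; an exception counts as its own outcome."""
    try:
        return ("ok", program(*args))
    except Exception as exc:
        return ("error", type(exc).__name__)


def is_suspicious(
    target: Program,
    variants: Sequence[Program],
    test_inputs: Sequence[Tuple],
    min_agreement: float = 0.5,  # hypothetical threshold, not from the paper
) -> bool:
    """Flag `target` if, on any input, too few variants agree with its output."""
    if not variants:
        raise ValueError("need at least one variant program")
    for args in test_inputs:
        expected = run_safely(target, args)
        agreeing = sum(run_safely(v, args) == expected for v in variants)
        if agreeing / len(variants) < min_agreement:
            return True
    return False


# Hand-written stand-ins for LLM output (assumption: task is "absolute value").
def target_abs(x):  # flawed program generated from the original prompt
    return x


def variant_a(x):  # stand-ins for programs generated from paraphrased prompts
    return x if x >= 0 else -x


def variant_b(x):
    return abs(x)


def variant_c(x):
    return max(x, -x)


inputs = [(3,), (0,), (-4,)]
flagged = is_suspicious(target_abs, [variant_a, variant_b, variant_c], inputs)
print("flagged as suspicious:", flagged)
```

In this sketch the flawed program is flagged because it disagrees with every variant on the negative input, while a correct program compared against the same variants would pass. A stricter or looser `min_agreement` trades detection rate against false positives.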
Thu 16 Apr | Displayed time zone: Brasilia, Distrito Federal, Brazil
11:00 - 12:30 | Testing and Analysis 9 | Research Track / Journal-first Papers / Demonstrations / New Ideas and Emerging Results (NIER) | Oceania II | Chair(s): Shiyi Wei (University of Texas at Dallas)
11:00 | 15m | Talk | GUISpector: An MLLM Agent Framework for Automated Verification of Natural Language Requirements in GUI Prototypes | Demonstrations | Kristian Kolthoff (Institute for Software and Systems Engineering, Clausthal University of Technology); Felix Kretzer (human-centered systems Lab (h-lab), Karlsruhe Institute of Technology (KIT)); Simone Paolo Ponzetto (Data and Web Science Group, University of Mannheim); Alexander Maedche (human-centered systems Lab (h-lab), Karlsruhe Institute of Technology (KIT)); Christian Bartelt (Institute for Software and Systems Engineering, TU Clausthal)
11:15 | 15m | Talk | Valg: A Fast Reinforcement Learning-Based Runtime Verification Tool for Java | Demonstrations | Shinhae Kim (Cornell University); Saikat Dutta (Cornell University); Owolabi Legunsen (Cornell University)
11:30 | 15m | Talk | Quantum Neural Network Classifier for Cancer Registry System Testing: A Feasibility Study | Journal-first Papers | Xinyi Wang (Simula Research Laboratory; University of Oslo); Shaukat Ali (Simula Research Laboratory and Oslo Metropolitan University); Paolo Arcaini (National Institute of Informatics); Narasimha Raghavan Veeraragavan (Cancer Registry of Norway and Norwegian Institute of Public Health); Jan F. Nygård (Cancer Registry of Norway)
11:45 | 15m | Talk | Testora: Using Natural Language Intent to Detect Behavioral Regressions | Research Track | Michael Pradel (CISPA Helmholtz Center for Information Security)
12:00 | 15m | Talk | Automatic Validation of LLM-Generated Code with Prompt Paraphrasing | New Ideas and Emerging Results (NIER)
12:15 | 15m | Talk | Causally Perturbed Fairness Testing | Journal-first Papers