Do LLMs Generate Useful Test Oracles? An Empirical Study with an Unbiased Dataset
This program is tentative and subject to change.
The oracle problem — the efficient generation of thorough test oracles — is still an open problem. Popular test case generators, like EvoSuite and Randoop, rely on implicit, rule-based, and regression oracles that miss failures that depend on the semantics of the program under test. Specified test oracles shift the costs of generating oracles to the production of formal specifications.
Large Language Models (LLMs) have the potential to overcome these limitations. The few studies that use LLMs to automatically generate test oracles validate them on modest-sized public benchmarks, such as Defects4J, that are likely to be included in the LLMs' training data, which poses severe threats to the validity of the results.
This paper presents an empirical study of the effectiveness of LLMs in generating test oracles. We report the results of experimenting with 13,866 test oracles that we mined from 135 Java projects, and that were created after the cut-off dates of the training of the LLMs used in the experiments, and are thus unbiased.
The results of the experiments that we report in this paper indicate that LLMs indeed generate effective oracles that largely increase the mutation score of the test cases, reaching a mutation score comparable to the score of human-designed test oracles. Our results also indicate that the test prefix and the methods called in the program under test provide sufficient information to generate good oracles, while additional code context does not bring relevant benefits. These findings provide actionable insights into using LLMs for automatic testing and highlight their current limitations in generating complex oracles.
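To make the terms in the abstract concrete, the following is a minimal Java sketch and not the study's tooling. It contrasts a test that has only a prefix and an implicit "no exception" oracle with the same prefix plus an explicit assertion, and it computes the mutation score as the fraction of mutants that a test kills. The class `OracleSketch`, the toy addition function, and the three hand-written mutants are all hypothetical and chosen only for illustration.

```java
import java.util.List;
import java.util.function.IntBinaryOperator;
import java.util.function.Predicate;

public class OracleSketch {
    // Hypothetical program under test: integer addition.
    static final IntBinaryOperator ORIGINAL = (a, b) -> a + b;

    // Hypothetical mutants: each is a small change to ORIGINAL.
    static final List<IntBinaryOperator> MUTANTS = List.of(
        (a, b) -> a - b,
        (a, b) -> a * b,
        (a, b) -> a + b + 1
    );

    // Test prefix with an implicit oracle only: the test passes
    // unless the call throws an exception.
    static boolean implicitOracleTest(IntBinaryOperator add) {
        try {
            add.applyAsInt(2, 3);
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    // Same test prefix, followed by an explicit oracle that checks
    // the expected result.
    static boolean explicitOracleTest(IntBinaryOperator add) {
        int result = add.applyAsInt(2, 3);
        return result == 5;
    }

    // Mutation score: fraction of mutants on which the test fails
    // (that is, mutants the test "kills").
    static double mutationScore(Predicate<IntBinaryOperator> test) {
        long killed = MUTANTS.stream().filter(m -> !test.test(m)).count();
        return (double) killed / MUTANTS.size();
    }

    public static void main(String[] args) {
        // Prints "implicit oracle: 0.0": no mutant throws, so none is killed.
        System.out.println("implicit oracle: " + mutationScore(OracleSketch::implicitOracleTest));
        // Prints "explicit oracle: 1.0": every mutant returns a value other than 5.
        System.out.println("explicit oracle: " + mutationScore(OracleSketch::explicitOracleTest));
    }
}
```

In this toy setting the explicit assertion raises the mutation score from 0.0 to 1.0, which is the kind of improvement the abstract measures when an oracle is added to an existing test prefix.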
| Paper (ase2025.pdf) | 387KiB |
Tue 18 Nov. Displayed time zone: Seoul
14:00 - 15:30 | Testing & Analysis 2 (Research Papers / Journal-First) at Vista. Chair(s): Xiaoyin Wang, University of Texas at San Antonio | ||
14:00 10mTalk | Quantum Circuit Mutants: Empirical Analysis and Recommendations Journal-First Eñaut Mendiluze Usandizaga Simula Research Laboratory, Norway, Shaukat Ali Simula Research Laboratory and Oslo Metropolitan University, Tao Yue Beihang University, Paolo Arcaini National Institute of Informatics Link to publication DOI | ||
14:10 10mTalk | MET-MAPF: A Metamorphic Testing Approach for Multi-Agent Path Finding Algorithms Journal-First Xiao-Yi Zhang University of Science and Technology Beijing, Yang Liu Nanyang Technological University, Paolo Arcaini National Institute of Informatics, Mingyue Jiang Zhejiang Sci-Tech University, Zheng Zheng Beihang University Link to publication DOI | ||
14:20 10mTalk | State Field Coverage: A Metric for Oracle Quality Research Papers Facundo Molina IMDEA Software Institute, Nazareno Aguirre University of Rio Cuarto/CONICET, Argentina, and Guangdong Technion-Israel Institute of Technology, China, Alessandra Gorla IMDEA Software Institute | ||
14:30 10mTalk | Do LLMs Generate Useful Test Oracles? An Empirical Study with an Unbiased Dataset Research Papers Davide Molinelli USI Lugano; Schaffhausen Institute of Technology, Luca Di Grazia University of St. Gallen, Alberto Martin-Lopez Software Institute - USI, Lugano, Michael D. Ernst University of Washington, Mauro Pezze Università della Svizzera italiana (USI) and Università degli Studi di Milano Bicocca Media Attached File Attached | ||
14:40 10mTalk | Finding Safety Violations of AI-Enabled Control Systems through the Lens of Synthesized Proxy Programs Journal-First Jieke Shi Singapore Management University, Zhou Yang University of Alberta, Alberta Machine Intelligence Institute, Junda He Singapore Management University, Bowen Xu North Carolina State University, Dongsun Kim Korea University, DongGyun Han Royal Holloway, University of London, David Lo Singapore Management University Link to publication DOI Pre-print | ||
14:50 10mTalk | ZendDiff: Differential Testing of PHP Interpreter Research Papers Yuancheng Jiang National University of Singapore, Jianing Wang National University of Singapore, Qiange Liu Beihang University, Yeqi Fu National University of Singapore, Jian Mao Beihang University, Roland H. C. Yap National University of Singapore, Zhenkai Liang National University of Singapore | ||
15:00 10mTalk | SATORI: Static Test Oracle Generation for REST APIs Research Papers Juan C. Alonso Universidad de Sevilla, Alberto Martin-Lopez Software Institute - USI, Lugano, Sergio Segura SCORE Lab, I3US Institute, Universidad de Sevilla, Seville, Spain, Gabriele Bavota Software Institute @ Università della Svizzera Italiana, Antonio Ruiz-Cortés University of Seville | ||
15:10 10mTalk | Exact Inference for Quantum Circuits: A Testing Oracle for Quantum Software Stacks Research Papers | ||
15:20 10mTalk | Identifying inconsistent software defect predictions with symmetry metamorphic relation pattern Journal-First Chan Pak Yuen Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China, Jacky Keung City University of Hong Kong, Zhen Yang Shandong University | ||