Do LLMs Generate Useful Test Oracles? An Empirical Study with an Unbiased Dataset (ASE 2025 - Research Papers)

Who

Davide Molinelli, Luca Di Grazia, Alberto Martin-Lopez, Michael D. Ernst, Mauro Pezze

Track

ASE 2025 Research Papers

This program is tentative and subject to change.

Time Zone

The program is currently displayed in (GMT+09:00) Seoul.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 18 Nov 2025 14:30 - 14:40 at Vista - Testing & Analysis 2. Chair(s): Xiaoyin Wang

Abstract

The oracle problem, that is, the efficient generation of thorough test oracles, remains open. Popular test case generators, like EvoSuite and Randoop, rely on implicit, rule-based, and regression oracles that miss failures that depend on the semantics of the program under test. Specified test oracles shift the costs of generating oracles to the production of formal specifications.

Large Language Models (LLMs) have the potential to overcome these limitations. The few studies that use LLMs to automatically generate test oracles validate them on modest-sized public benchmarks, such as Defects4J, that are likely to be included in the LLM training data, which poses severe threats to the validity of the results.

This paper presents an empirical study of the effectiveness of LLMs in generating test oracles. We report the results of experiments with 13,866 test oracles mined from 135 Java projects. All of these oracles were created after the training cut-off dates of the LLMs used in the experiments, and are thus unbiased.

The results indicate that LLMs do generate effective oracles that substantially increase the mutation score of the test cases, reaching a mutation score comparable to that of human-designed test oracles. Our results also indicate that the test prefix and the methods called in the program under test provide sufficient information to generate good oracles, while additional code context brings no relevant benefit. These findings provide actionable insights into using LLMs for automatic testing and highlight their current limitations in generating complex oracles.
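
To make the abstract's terms concrete, here is a toy Java sketch of a test prefix, an implicit oracle, an assertion oracle, and a mutation score. It is not the paper's experimental setup: the `max` function, the three hand-written mutants, and all names (`MutationScoreSketch`, `implicitOracleTest`, `assertionOracleTest`, `mutationScore`) are hypothetical, and real studies typically produce mutants with a mutation testing tool rather than by hand.

```java
import java.util.List;
import java.util.function.IntBinaryOperator;
import java.util.function.Predicate;

// Toy illustration only; every name here is hypothetical.
class MutationScoreSketch {

    // Program under test: returns the larger of two ints.
    public static final IntBinaryOperator ORIGINAL = (a, b) -> a > b ? a : b;

    // Hand-written mutants of the program under test.
    public static final List<IntBinaryOperator> MUTANTS = List.of(
        (a, b) -> a < b ? a : b,   // relational operator flipped
        (a, b) -> a,               // always returns the first argument
        (a, b) -> a >= b ? a : b   // equivalent mutant: behaves like the original
    );

    // Implicit oracle: the test passes unless the call throws.
    public static boolean implicitOracleTest(IntBinaryOperator max) {
        try {
            max.applyAsInt(2, 7);          // test prefix: set up and call
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    // Assertion oracle: same prefix, plus a check on the returned value.
    public static boolean assertionOracleTest(IntBinaryOperator max) {
        int result = max.applyAsInt(2, 7); // test prefix
        return result == 7;                // oracle
    }

    // Mutation score: fraction of mutants on which the test fails (is "killed").
    public static double mutationScore(Predicate<IntBinaryOperator> test) {
        long killed = MUTANTS.stream().filter(m -> !test.test(m)).count();
        return (double) killed / MUTANTS.size();
    }

    public static void main(String[] args) {
        // Both tests must pass on the original program before scoring mutants.
        System.out.println("passes on original: "
            + (implicitOracleTest(ORIGINAL) && assertionOracleTest(ORIGINAL)));
        System.out.println("implicit oracle score:  "
            + mutationScore(MutationScoreSketch::implicitOracleTest));
        System.out.println("assertion oracle score: "
            + mutationScore(MutationScoreSketch::assertionOracleTest));
    }
}
```

In this toy setting the implicit oracle kills none of the mutants, while the assertion kills two of the three. The third mutant is behaviorally equivalent to the original, so no oracle can kill it.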

File attachments

Paper (ase2025.pdf)	387KiB

Davide Molinelli

USI Lugano; Schaffhausen Institute of Technology

Switzerland

Luca Di Grazia

University of St. Gallen

Switzerland

Alberto Martin-Lopez

Software Institute - USI, Lugano

Switzerland

Michael D. Ernst

University of Washington

United States

Mauro Pezze

Università della Svizzera italiana (USI) and Università degli Studi di Milano Bicocca

Switzerland

Session Program

Tue 18 Nov
Displayed time zone: Seoul

14:00 - 15:30	Testing & Analysis 2 (Research Papers / Journal-First) at Vista. Chair(s): Xiaoyin Wang, University of Texas at San Antonio

14:00 10m Talk		Quantum Circuit Mutants: Empirical Analysis and Recommendations	Journal-First	Eñaut Mendiluze Usandizaga (Simula Research Laboratory, Norway), Shaukat Ali (Simula Research Laboratory and Oslo Metropolitan University), Tao Yue (Beihang University), Paolo Arcaini (National Institute of Informatics)
14:10 10m Talk		MET-MAPF: A Metamorphic Testing Approach for Multi-Agent Path Finding Algorithms	Journal-First	Xiao-Yi Zhang (University of Science and Technology Beijing), Yang Liu (Nanyang Technological University), Paolo Arcaini (National Institute of Informatics), Mingyue Jiang (Zhejiang Sci-Tech University), Zheng Zheng (Beihang University)
14:20 10m Talk		State Field Coverage: A Metric for Oracle Quality	Research Papers	Facundo Molina (IMDEA Software Institute), Nazareno Aguirre (University of Rio Cuarto/CONICET, Argentina, and Guangdong Technion-Israel Institute of Technology, China), Alessandra Gorla (IMDEA Software Institute)
14:30 10m Talk		Do LLMs Generate Useful Test Oracles? An Empirical Study with an Unbiased Dataset	Research Papers	Davide Molinelli (USI Lugano; Schaffhausen Institute of Technology), Luca Di Grazia (University of St. Gallen), Alberto Martin-Lopez (Software Institute - USI, Lugano), Michael D. Ernst (University of Washington), Mauro Pezze (Università della Svizzera italiana (USI) and Università degli Studi di Milano Bicocca)
14:40 10m Talk		Finding Safety Violations of AI-Enabled Control Systems through the Lens of Synthesized Proxy Programs	Journal-First	Jieke Shi (Singapore Management University), Zhou Yang (University of Alberta, Alberta Machine Intelligence Institute), Junda He (Singapore Management University), Bowen Xu (North Carolina State University), Dongsun Kim (Korea University), DongGyun Han (Royal Holloway, University of London), David Lo (Singapore Management University)
14:50 10m Talk		ZendDiff: Differential Testing of PHP Interpreter	Research Papers	Yuancheng Jiang (National University of Singapore), Jianing Wang (National University of Singapore), Qiange Liu (Beihang University), Yeqi Fu (National University of Singapore), Jian Mao (Beihang University), Roland H. C. Yap (National University of Singapore), Zhenkai Liang (National University of Singapore)
15:00 10m Talk		SATORI: Static Test Oracle Generation for REST APIs	Research Papers	Juan C. Alonso (Universidad de Sevilla), Alberto Martin-Lopez (Software Institute - USI, Lugano), Sergio Segura (SCORE Lab, I3US Institute, Universidad de Sevilla, Seville, Spain), Gabriele Bavota (Software Institute @ Università della Svizzera Italiana), Antonio Ruiz-Cortés (University of Seville)
15:10 10m Talk		Exact Inference for Quantum Circuits: A Testing Oracle for Quantum Software Stacks	Research Papers	Kanguk Lee (KAIST), Jaemin Hong (KAIST), Sukyoung Ryu (KAIST)
15:20 10m Talk		Identifying inconsistent software defect predictions with symmetry metamorphic relation pattern	Journal-First	Chan Pak Yuen (Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong, China), Jacky Keung (City University of Hong Kong), Zhen Yang (Shandong University)
