
This program is tentative and subject to change.

Fri 2 May 2025 14:15 - 14:30 at 214 - AI for Testing and QA 6

Recently, deep learning models have shown promising results in test oracle generation. Neural Oracle Generation (NOG) models are commonly evaluated with static (automatic) metrics that are mainly based on the textual similarity of the output, e.g., BLEU, ROUGE-L, METEOR, and Accuracy. However, these textual similarity metrics may not reflect the testing effectiveness of the generated oracle within a test suite, which is typically measured by dynamic (execution-based) test adequacy metrics such as code coverage and mutation score. In this work, we revisit existing oracle generation studies, plus GPT-3.5, to empirically investigate how they currently perform on textual similarity and test adequacy metrics. Specifically, we train and run four state-of-the-art test oracle generation models and evaluate them on seven textual similarity metrics and two test adequacy metrics. We apply two different correlation analyses between these two sets of metrics. Surprisingly, we find no significant correlation between the textual similarity metrics and the test adequacy metrics. For instance, among the studied NOGs, GPT-3.5 achieved the highest scores on all seven textual similarity metrics on the jackrabbit-oak project, yet it had the lowest test adequacy scores of all the studied NOGs. We further conduct a qualitative analysis to explore the reasons behind these observations. We find that oracles with high textual similarity but low test adequacy tend to contain complex or multiple chained method invocations within the oracle’s parameters, which are hard for the model to generate completely and therefore hurt the test adequacy metrics. Conversely, oracles with low textual similarity but high test adequacy tend to call different assertion types, or a different method that behaves like the one in the ground truth. Overall, this work complements prior studies on test oracle generation with an extensive performance evaluation on textual similarity and test adequacy metrics and provides guidelines for better assessment of deep learning applications in software test generation in the future.
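As a minimal sketch of the kind of analysis the abstract describes: per-model scores from the two metric families are collected and then correlated. The numbers below are illustrative assumptions, not the paper's data, and pearsonr/spearmanr merely stand in for the two correlation analyses, whose exact choice the abstract does not specify.

```python
# Sketch: correlate textual-similarity scores with test-adequacy scores
# across NOG models. All values are made up for illustration; the
# paper's exact correlation procedures may differ.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model scores on one project:
bleu = [0.41, 0.35, 0.52, 0.61]            # textual similarity (e.g., BLEU)
mutation_score = [0.33, 0.37, 0.29, 0.22]  # test adequacy

r, p_r = pearsonr(bleu, mutation_score)
rho, p_rho = spearmanr(bleu, mutation_score)
print(f"Pearson r = {r:.2f} (p = {p_r:.2f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.2f})")
# The absence of a significant positive correlation in such a test would
# mirror the paper's finding that textual similarity does not imply
# testing effectiveness.
```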

Fri 2 May

Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30
14:00 (15m) Talk · Research Track
Treefix: Enabling Execution with a Tree of Prefixes
Beatriz Souza (University of Stuttgart), Michael Pradel (University of Stuttgart)
Pre-print

14:15 (15m) Talk · Journal-first Papers
Assessing Evaluation Metrics for Neural Test Oracle Generation
Jiho Shin (York University), Hadi Hemmati (York University), Moshi Wei (York University), Song Wang (York University)

14:30 (15m) Talk · Journal-first Papers
Enhancing Energy-Awareness in Deep Learning through Fine-Grained Energy Measurement
Saurabhsingh Rajput (Dalhousie University), Tim Widmayer (University College London (UCL)), Ziyuan Shang (Nanyang Technological University), Maria Kechagia (National and Kapodistrian University of Athens), Federica Sarro (University College London), Tushar Sharma (Dalhousie University)

14:45 (15m) Talk · Journal-first Papers
Studying the Impact of TensorFlow and PyTorch Bindings on Machine Learning Software Quality
Hao Li (Queen's University), Gopi Krishnan Rajbahadur (Centre for Software Excellence, Huawei, Canada), Cor-Paul Bezemer (University of Alberta)

15:00 (15m) Talk · New Ideas and Emerging Results (NIER)
Evaluating the Generalizability of LLMs in Automated Program Repair
Fengjie Li (Tianjin University), Jiajun Jiang (Tianjin University), Jiajun Sun (Tianjin University), Hongyu Zhang (Chongqing University)
Pre-print

15:15 (15m) Talk · New Ideas and Emerging Results (NIER)
How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study
Alejandro Velasco (William & Mary), Daniel Rodriguez-Cardenas, David Nader Palacio (William & Mary), Lutfar Rahman Alif (University of Dhaka), Denys Poshyvanyk (William & Mary)