
This program is tentative and subject to change.

Fri 2 May 2025 14:15 - 14:30 at 214 - AI for Testing and QA 6

Recently, deep learning models have shown promising results in test oracle generation. Neural Oracle Generation (NOG) models are commonly evaluated with static (automatic) metrics that are mainly based on the textual similarity of the output, e.g., BLEU, ROUGE-L, METEOR, and Accuracy. However, these textual similarity metrics may not reflect the testing effectiveness of the generated oracle within a test suite, which is typically measured by dynamic (execution-based) test adequacy metrics such as code coverage and mutation score. In this work, we revisit existing oracle generation studies, plus GPT-3.5, to empirically investigate how they currently perform on textual similarity and test adequacy metrics. Specifically, we train and run four state-of-the-art test oracle generation models and evaluate them on seven textual similarity metrics and two test adequacy metrics. We apply two different correlation analyses between these two sets of metrics. Surprisingly, we find no significant correlation between the textual similarity metrics and the test adequacy metrics. For instance, among the studied NOGs, GPT-3.5 achieved the highest scores on all seven textual similarity metrics on the jackrabbit-oak project, yet it had the lowest test adequacy scores of all the studied NOGs. We further conduct a qualitative analysis to explore the reasons behind these observations. We find that oracles with high textual similarity but low test adequacy tend to contain complex or multiple chained method invocations within the oracle’s parameters, which are hard for the model to generate completely and therefore hurt the test adequacy metrics. Conversely, oracles with low textual similarity but high test adequacy tend to call different assertion types, or a different method that behaves like the one in the ground truth. Overall, this work complements prior studies on test oracle generation with an extensive performance evaluation on textual similarity and test adequacy metrics and provides guidelines for better assessment of deep learning applications in software test generation in the future.
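As a minimal sketch of the kind of analysis the abstract describes: per-model scores from the two metric families are collected and then correlated. The numbers below are illustrative assumptions, not the paper's data, and pearsonr/spearmanr merely stand in for the two correlation analyses, whose exact choice the abstract does not specify.

```python
# Sketch: correlate textual-similarity scores with test-adequacy scores
# across NOG models. All values are made up for illustration; the
# paper's exact correlation procedures may differ.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-model scores on one project:
bleu = [0.41, 0.35, 0.52, 0.61]            # textual similarity (e.g., BLEU)
mutation_score = [0.33, 0.37, 0.29, 0.22]  # test adequacy

r, p_r = pearsonr(bleu, mutation_score)
rho, p_rho = spearmanr(bleu, mutation_score)
print(f"Pearson r = {r:.2f} (p = {p_r:.2f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.2f})")
# The absence of a significant positive correlation in such a test would
# mirror the paper's finding that textual similarity does not imply
# testing effectiveness.
```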

Fri 2 May

Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30
14:00 (15m) Talk · Research Track
Treefix: Enabling Execution with a Tree of Prefixes
Beatriz Souza (University of Stuttgart), Michael Pradel (University of Stuttgart)
Pre-print

14:15 (15m) Talk · Journal-first Papers
Assessing Evaluation Metrics for Neural Test Oracle Generation
Jiho Shin (York University), Hadi Hemmati (York University), Moshi Wei (York University), Song Wang (York University)

14:30 (15m) Talk · Journal-first Papers
Enhancing Energy-Awareness in Deep Learning through Fine-Grained Energy Measurement
Saurabhsingh Rajput (Dalhousie University), Tim Widmayer (University College London (UCL)), Ziyuan Shang (Nanyang Technological University), Maria Kechagia (National and Kapodistrian University of Athens), Federica Sarro (University College London), Tushar Sharma (Dalhousie University)

14:45 (15m) Talk · Journal-first Papers
Studying the Impact of TensorFlow and PyTorch Bindings on Machine Learning Software Quality
Hao Li (Queen's University), Gopi Krishnan Rajbahadur (Centre for Software Excellence, Huawei, Canada), Cor-Paul Bezemer (University of Alberta)

15:00 (15m) Talk · New Ideas and Emerging Results (NIER)
Evaluating the Generalizability of LLMs in Automated Program Repair
Fengjie Li (Tianjin University), Jiajun Jiang (Tianjin University), Jiajun Sun (Tianjin University), Hongyu Zhang (Chongqing University)
Pre-print

15:15 (15m) Talk · New Ideas and Emerging Results (NIER)
How Propense Are Large Language Models at Producing Code Smells? A Benchmarking Study
Alejandro Velasco (William & Mary), Daniel Rodriguez-Cardenas, David Nader Palacio (William & Mary), Lutfar Rahman Alif (University of Dhaka), Denys Poshyvanyk (William & Mary)