SANER 2025
Tue 4 - Fri 7 March 2025 Montréal, Québec, Canada
Wed 5 Mar 2025 11:30 - 11:45 at L-1710 - Empirical Studies & LLM Chair(s): Diego Elias Costa

Software testing is a mainstream approach to software quality assurance. One fundamental challenge in testing is that, in many practical situations, it is very difficult to verify the correctness of test results for given inputs to the software under test (SUT), which is known as the oracle problem. Metamorphic Testing (MT) is a software testing technique that can effectively alleviate the oracle problem. The core component of MT is a set of Metamorphic Relations (MRs), which are necessary properties of the SUT, expressed as relationships among multiple inputs and their corresponding expected outputs. Different methods have been proposed to support systematic MR identification, but most of them still rely heavily on test engineers' understanding of the SUT and involve substantial manual work.
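As a concrete illustration (not taken from the paper), an MR can be checked in Python without ever knowing the exact expected output of the SUT. Here the hypothetical SUT is a sine implementation, and the assumed MR is sin(x) = sin(π − x):

```python
import math

# Hypothetical SUT: an implementation whose exact output for an
# arbitrary input is hard to verify directly (the oracle problem).
def sut_sin(x: float) -> float:
    return math.sin(x)

def check_mr(source_input: float, tol: float = 1e-9) -> bool:
    """MR for sine: sin(x) == sin(pi - x).

    The relation between the source and follow-up outputs replaces
    the oracle: a violation reveals a fault even though the exact
    expected value of sin(x) is never computed independently.
    """
    follow_up_input = math.pi - source_input
    return math.isclose(sut_sin(source_input), sut_sin(follow_up_input),
                        abs_tol=tol)

# Each source test case yields a follow-up case via the MR.
assert all(check_mr(x / 10.0) for x in range(-30, 31))
```

The manual effort the abstract refers to lies in identifying relations like this one for each SUT, which is the step the study asks LLMs to perform.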

Although a few preliminary studies have shown the viability of LLMs in generating MRs, there has been no thorough, in-depth investigation of their capability in MR identification. We were thus motivated to conduct a comprehensive and large-scale empirical study to systematically evaluate the performance of LLMs in identifying appropriate MRs for a wide variety of software systems. The study uses 37 SUTs collected from previous MT studies. Prompts are constructed for two LLMs, gpt-3.5-turbo-1106 and gpt-4-1106-preview, to perform MR identification for each SUT. The empirical results demonstrate that both LLMs can generate a large number of MR candidates (MRCs). Of these, 34.03% and 50.04%, respectively, are identified as MRs valid for the corresponding SUT. In addition, 82.81% and 88.22% of all valid MRs had never been identified in previous studies. Our study not only reinforces LLM-based MR identification as a promising research direction for MT, but also provides practical guidelines for further improving LLMs' performance in generating good MRs.

Wed 5 Mar

Displayed time zone: Eastern Time (US & Canada)

11:00 - 12:30
11:00
15m
Talk
Beyond pip install: Evaluating LLM agents for the automated installation of Python projects
Research Papers
Louis Mark Milliken KAIST, Sungmin Kang National University of Singapore, Shin Yoo Korea Advanced Institute of Science and Technology
Pre-print
11:18
12m
Talk
On the Compression of Language Models for Code: An Empirical Study on CodeBERT
Research Papers
Giordano d'Aloisio University of L'Aquila, Luca Traini University of L'Aquila, Federica Sarro University College London, Antinisca Di Marco University of L'Aquila
Pre-print
11:30
15m
Talk
Can Large Language Models Discover Metamorphic Relations? A Large-Scale Empirical Study
Research Papers
Jiaming Zhang University of Science and Technology Beijing, Chang-ai Sun University of Science and Technology Beijing, Huai Liu Swinburne University of Technology, Sijin Dong University of Science and Technology Beijing
11:45
15m
Talk
Revisiting the Non-Determinism of Code Generation by the GPT-3.5 Large Language Model
Reproducibility Studies and Negative Results (RENE) Track
Salimata Sawadogo Centre d'Excellence Interdisciplinaire en Intelligence Artificielle pour le Développement (CITADEL), Aminata Sabané Université Joseph KI-ZERBO, Centre d'Excellence CITADELLE, Rodrique Kafando Centre d'Excellence Interdisciplinaire en Intelligence Artificielle pour le Développement (CITADEL), Tegawendé F. Bissyandé University of Luxembourg
12:00
15m
Talk
Language Models to Support Multi-Label Classification of Industrial Data
Industrial Track
Waleed Abdeen Blekinge Institute of Technology, Michael Unterkalmsteiner, Krzysztof Wnuk Blekinge Institute of Technology, Alessio Ferrari CNR-ISTI, Panagiota Chatzipetrou