SANER 2025
Tue 4 - Fri 7 March 2025 Montréal, Québec, Canada
Wed 5 Mar 2025 11:30 - 11:45 at L-1710 - Empirical Studies & LLM Chair(s): Diego Elias Costa

Software testing is a mainstream approach to software quality assurance. One fundamental challenge in testing is that, in many practical situations, it is very difficult to verify the correctness of test results for given inputs to the software under test (SUT), which is known as the oracle problem. Metamorphic Testing (MT) is a software testing technique that can effectively alleviate the oracle problem. The core component of MT is a set of Metamorphic Relations (MRs), which are necessary properties of the SUT, expressed as relationships among multiple inputs and their corresponding expected outputs. Different methods have been proposed to support systematic MR identification, but most of them still rely heavily on test engineers' understanding of the SUT and involve substantial manual work.
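As a concrete illustration (not taken from the paper), an MR can be checked in Python without ever knowing the exact expected output of the SUT. Here the hypothetical SUT is a sine implementation, and the assumed MR is sin(x) = sin(π − x):

```python
import math

# Hypothetical SUT: an implementation whose exact output for an
# arbitrary input is hard to verify directly (the oracle problem).
def sut_sin(x: float) -> float:
    return math.sin(x)

def check_mr(source_input: float, tol: float = 1e-9) -> bool:
    """MR for sine: sin(x) == sin(pi - x).

    The relation between the source and follow-up outputs replaces
    the oracle: a violation reveals a fault even though the exact
    expected value of sin(x) is never computed independently.
    """
    follow_up_input = math.pi - source_input
    return math.isclose(sut_sin(source_input), sut_sin(follow_up_input),
                        abs_tol=tol)

# Each source test case yields a follow-up case via the MR.
assert all(check_mr(x / 10.0) for x in range(-30, 31))
```

The manual effort the abstract refers to lies in identifying relations like this one for each SUT, which is the step the study asks LLMs to perform.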

Although a few preliminary studies have shown the viability of LLMs in generating MRs, there has been no thorough, in-depth investigation of their capability in MR identification. We were thus motivated to conduct a comprehensive and large-scale empirical study to systematically evaluate the performance of LLMs in identifying appropriate MRs for a wide variety of software systems. The study uses 37 SUTs collected from previous MT studies. Prompts are constructed for two LLMs, gpt-3.5-turbo-1106 and gpt-4-1106-preview, to perform MR identification for each SUT. The empirical results demonstrate that both LLMs can generate a large number of MR candidates (MRCs). Of these, 34.03% and 50.04%, respectively, are identified as MRs valid for the corresponding SUT. In addition, 82.81% and 88.22% of all valid MRs had never been identified in previous studies. Our study not only reinforces LLM-based MR identification as a promising research direction for MT, but also provides practical guidelines for further improving LLMs' performance in generating good MRs.

Wed 5 Mar

Displayed time zone: Eastern Time (US & Canada)

11:00 - 12:30
11:00
15m
Talk
Beyond pip install: Evaluating LLM agents for the automated installation of Python projects
Research Papers
Louis Mark Milliken KAIST, Sungmin Kang National University of Singapore, Shin Yoo Korea Advanced Institute of Science and Technology
Pre-print
11:18
12m
Talk
On the Compression of Language Models for Code: An Empirical Study on CodeBERT
Research Papers
Giordano d'Aloisio University of L'Aquila, Luca Traini University of L'Aquila, Federica Sarro University College London, Antinisca Di Marco University of L'Aquila
Pre-print
11:30
15m
Talk
Can Large Language Models Discover Metamorphic Relations? A Large-Scale Empirical Study
Research Papers
Jiaming Zhang University of Science and Technology Beijing, Chang-ai Sun University of Science and Technology Beijing, Huai Liu Swinburne University of Technology, Sijin Dong University of Science and Technology Beijing
11:45
15m
Talk
Revisiting the Non-Determinism of Code Generation by the GPT-3.5 Large Language Model
Reproducibility Studies and Negative Results (RENE) Track
Salimata Sawadogo Centre d'Excellence Interdisciplinaire en Intelligence Artificielle pour le Développement (CITADEL), Aminata Sabané Université Joseph KI-ZERBO, Centre d'Excellence CITADELLE, Rodrique Kafando Centre d'Excellence Interdisciplinaire en Intelligence Artificielle pour le Développement (CITADEL), Tegawendé F. Bissyandé University of Luxembourg
12:00
15m
Talk
Language Models to Support Multi-Label Classification of Industrial Data
Industrial Track
Waleed Abdeen Blekinge Institute of Technology, Michael Unterkalmsteiner, Krzysztof Wnuk Blekinge Institute of Technology, Alessio Ferrari CNR-ISTI, Panagiota Chatzipetrou