ICSME 2025
Sun 7 - Fri 12 September 2025 Auckland, New Zealand
Thu 11 Sep 2025 13:30 - 13:45 at Case Room 3 260-055 - Session 9 - Testing 3 Chair(s): Sigrid Eldh

Using Large Language Models (LLMs) to perform Natural Language Processing (NLP) tasks has become increasingly pervasive. The versatile nature of LLMs makes them applicable to a wide range of such tasks. While the performance of recent LLMs is generally outstanding, several studies have shown that LLMs often produce incorrect results. Automatically identifying these faulty behaviors is extremely useful for improving the effectiveness of LLMs. One obstacle is the limited availability of labelled datasets, which would otherwise serve as an oracle for determining the correctness of LLM behaviors. Metamorphic Testing (MT) is a popular testing approach that alleviates the oracle problem. At the core of MT are Metamorphic Relations (MRs), which define the expected relationship between the outputs of related inputs. MT can thus expose faulty behaviors without explicit oracles (i.e., labelled datasets). This paper presents the most comprehensive study of MT for LLMs to date. We conducted a literature review and collected 191 MRs for NLP tasks. We implemented a representative subset of them (38 MRs) and conducted a series of experiments with four popular LLMs, running ∼550K metamorphic test cases. The results shed light on the capabilities and opportunities of MT for LLMs, as well as on its limitations.
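To make the MT idea in the abstract concrete, below is a minimal Python sketch of one metamorphic test for an LLM-based sentiment classifier. The MR used (replacing a word with a synonym should not flip the predicted label) is a common example of the kind of relation the paper catalogues; query_llm, classify_sentiment, and the hard-coded synonym swap are illustrative stand-ins, not the authors' implementation.

def query_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call. A trivial keyword
    # heuristic is used here so the sketch runs end-to-end; swap in a
    # real chat-completion client in practice.
    text = prompt.lower()
    positive_cues = ("great", "excellent", "loved")
    return "positive" if any(w in text for w in positive_cues) else "negative"

def classify_sentiment(text: str) -> str:
    # Wrap the NLP task as a prompt and normalize the model's reply.
    reply = query_llm(
        "Answer with exactly one word, positive or negative.\n"
        f"Sentiment of: {text}"
    )
    return reply.strip().lower()

def synonym_follow_up(text: str) -> str:
    # Source-to-follow-up transformation for this MR: swap one word for
    # a synonym. A real implementation might draw synonyms from WordNet;
    # hard-coded here purely for illustration.
    return text.replace("great", "excellent")

def metamorphic_test(source_input: str) -> bool:
    # The MR holds if both related inputs receive the same label.
    # No ground-truth label is needed, which is how MT sidesteps the
    # oracle problem described in the abstract.
    follow_up_input = synonym_follow_up(source_input)
    return classify_sentiment(source_input) == classify_sentiment(follow_up_input)

if __name__ == "__main__":
    ok = metamorphic_test("The movie was great and I loved it.")
    print("MR holds" if ok else "MR violated (potential faulty behavior)")

A violated MR flags a potentially faulty LLM behavior worth inspecting; scaling this loop over many MRs, inputs, and models is how a study like the one above can execute hundreds of thousands of metamorphic test cases without labelled data.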

Thu 11 Sep

Displayed time zone: Auckland, Wellington

13:30 - 15:00
Session 9 - Testing 3 (Journal First Track / NIER Track / Tool Demonstration Track / Research Papers Track / Registered Reports) at Case Room 3 260-055
Chair(s): Sigrid Eldh Ericsson AB, Mälardalen University, Carleton University
13:30
15m
Full-paper
Metamorphic Testing of Large Language Models for Natural Language Processing
Research Papers Track
Steven Cho The University of Auckland, New Zealand, Stefano Ruberto JRC European Commission, Valerio Terragni University of Auckland
Pre-print
13:45
15m
Onweer: Automated Resilience Testing through Fuzzing
Research Papers Track
Gilles Coremans Vrije Universiteit Brussel, Coen De Roover Vrije Universiteit Brussel
Pre-print
14:00
10m
Generating Highly Structured Test Inputs Leveraging Constraint-Guided Graph Refinement
Registered Reports
Zhaorui Yang University of California, Riverside, Yuxin Qiu University of California at Riverside, Haichao Zhu Meta, Qian Zhang University of California at Riverside
14:10
10m
Prioritizing Test Smells: An Empirical Evaluation of Quality Metrics and Developer Perceptions
NIER Track
Md Arif Hasan University of Dhaka, Bangladesh, Toukir Ahammed Institute of Information Technology, University of Dhaka
14:20
10m
LLMShot: Reducing snapshot testing maintenance via LLMs
NIER Track
Ergün Batuhan Kaynak Bilkent University, Mayasah Lami Bilkent University, Sahand Moslemi Yengejeh Bilkent University, Anil Koyuncu Bilkent University
Pre-print
14:30
15m
Combinatorial Transition Testing in Dynamically Adaptive Systems: Implementation and Test Oracle
Journal First Track
Pierre Martou UCLouvain / ICTEAM, Benoît Duhoux Université catholique de Louvain, Belgium, Kim Mens Université catholique de Louvain, ICTEAM institute, Belgium, Axel Legay Université Catholique de Louvain, Belgium
14:45
10m
LLMLOOP: Improving LLM-Generated Code and Tests through Automated Iterative Feedback Loops
Tool Demonstration Track
Ravin Ravi University of Auckland, Dylan Bradshaw University of Auckland, Stefano Ruberto JRC European Commission, Gunel Jahangirova King's College London, Valerio Terragni University of Auckland
Pre-print