LLMorph: Automated Metamorphic Testing of Large Language Models
This program is tentative and subject to change.
Automated testing is essential for evaluating and improving the reliability of Large Language Models (LLMs), yet the lack of automated oracles for verifying output correctness remains a key challenge. We present LLMorph, an automated testing tool specifically designed for LLMs performing NLP tasks, which leverages Metamorphic Testing (MT) to uncover faulty behaviors without relying on human-labeled data. MT uses Metamorphic Relations (MRs) to generate follow-up inputs from source test inputs, enabling the detection of inconsistencies in model outputs without the need for expensive labeled data. LLMorph is aimed at researchers and developers who want to evaluate the robustness of LLM-based NLP systems. In this paper, we detail the design, implementation, and practical usage of LLMorph, demonstrating how it can be easily extended to any LLM, NLP task, and set of MRs. In our evaluation, we applied 36 MRs across four NLP benchmarks, testing three state-of-the-art LLMs: GPT-4, LLAMA3, and HERMES 2. This produced over 561,000 test executions. The results demonstrate LLMorph's effectiveness in automatically exposing incorrect model behaviors at scale.
The tool source code is available at https://github.com/steven-b-cho/llmorph. A screencast demo is available at https://youtu.be/sHmqdieCfw4.
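To illustrate the workflow the abstract describes, the following is a minimal sketch of a metamorphic-testing loop: an MR derives a follow-up input from each source input, and outputs are compared without any labeled oracle. The model stub, the `mr_uppercase` relation, and the `run_mt` helper are illustrative assumptions for this sketch, not LLMorph's actual API.

```python
# Minimal sketch of the metamorphic-testing loop that tools like
# LLMorph automate. All names below are hypothetical stand-ins.

def model(text: str) -> str:
    """Stand-in for an LLM performing sentiment classification."""
    return "positive" if "good" in text.lower() else "negative"

def mr_uppercase(source: str) -> str:
    """Example MR: changing letter case should not change the label."""
    return source.upper()

def run_mt(inputs, mr, relation_holds):
    """Apply the MR to each source input; flag output inconsistencies."""
    failures = []
    for src in inputs:
        follow_up = mr(src)
        if not relation_holds(model(src), model(follow_up)):
            failures.append((src, follow_up))
    return failures

# For this MR the relation is simple equality of outputs.
failures = run_mt(["This movie is good", "Terrible plot"],
                  mr_uppercase,
                  lambda a, b: a == b)
print(failures)  # [] here, since the stub model is case-insensitive
```

Any pair reported in `failures` is a revealed inconsistency, i.e., a likely faulty behavior, found without ground-truth labels.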
Tue 18 Nov (displayed time zone: Seoul)
15:00 - 18:00

15:00 (3h) Demonstration | APIDA-Chat: Structured Synthesis of API Search Dialogues to Bootstrap Conversational Agents | Tool Demonstration Track

15:00 (3h) Demonstration | PROXiFY: A Bytecode Analysis Tool for Detecting and Classifying Proxy Contracts in Ethereum Smart Contracts | Tool Demonstration Track | Ilham Qasse (Reykjavik University), Mohammad Hamdaqa (Polytechnique Montreal), Björn Þór Jónsson (Reykjavik University)

15:00 (3h) Demonstration | DeepTx: Real-Time Transaction Risk Analysis via Multi-Modal Features and LLM Reasoning | Tool Demonstration Track | Yixuan Liu (Nanyang Technological University), Xinlei Li (Nanyang Technological University), Yi Li (Nanyang Technological University) | Pre-print

15:00 (3h) Demonstration | WIBE: Watermarks for generated Images - Benchmarking & Evaluation | Tool Demonstration Track | Aleksey Yakushev (ISP RAS), Aleksandr Akimenkov (ISP RAS), Khaled Abud (MSU AI Institute), Dmitry Obydenkov (ISP RAS), Irina Serzhenko (MIPT), Kirill Aistov (Huawei Research Center), Egor Kovalev (MSU), Stanislav Fomin (ISP RAS), Anastasia Antsiferova (ISP RAS Research Center, MSU AI Institute), Kirill Lukianov (ISP RAS Research Center, MIPT), Yury Markin (ISP RAS)

15:00 (3h) Demonstration | EyeNav: Accessible Webpage Interaction and Testing using Eye-tracking and NLP | Tool Demonstration Track | Juan Diego Yepes-Parra (Universidad de los Andes, Colombia), Camilo Escobar-Velásquez (Universidad de los Andes, Colombia) | Link to publication | Media Attached

15:00 (3h) Demonstration | Quirx: A Mutation-Based Framework for Evaluating Prompt Robustness in LLM-based Software | Tool Demonstration Track | Souhaila Serbout (University of Zurich, Zurich, Switzerland)

15:00 (3h) Demonstration | BenGQL: An Extensible Benchmarking Framework for Automated GraphQL Testing | Tool Demonstration Track | Media Attached

15:00 (3h) Demonstration | evalSmarT: An LLM-Based Evaluation Framework for Smart Contract Comment Generation | Tool Demonstration Track | Fatou Ndiaye Mbodji (SnT, University of Luxembourg), Mame Marieme Ciss Sougoufara (UCAD, Senegal), Wendkuuni Arzouma Marc Christian Ouedraogo (SnT, University of Luxembourg), Alioune Diallo (University of Luxembourg), Kui Liu (Huawei), Jacques Klein (University of Luxembourg), Tegawendé F. Bissyandé (University of Luxembourg) | Pre-print

15:00 (3h) Demonstration | LLMorph: Automated Metamorphic Testing of Large Language Models | Tool Demonstration Track | Steven Cho (The University of Auckland, New Zealand), Stefano Ruberto (JRC European Commission), Valerio Terragni (University of Auckland)

15:00 (3h) Demonstration | TRUSTVIS: A Multi-Dimensional Trustworthiness Evaluation Framework for Large Language Models | Tool Demonstration Track | Ruoyu Sun (University of Alberta, Canada), Da Song (University of Alberta), Jiayang Song (Macau University of Science and Technology), Yuheng Huang (The University of Tokyo), Lei Ma (The University of Tokyo & University of Alberta)

15:00 (3h) Demonstration | GUI-ReRank: Enhancing GUI Retrieval with Multi-Modal LLM-based Reranking | Tool Demonstration Track | Kristian Kolthoff (Institute for Software and Systems Engineering, Clausthal University of Technology), Felix Kretzer (human-centered systems Lab (h-lab), Karlsruhe Institute of Technology (KIT)), Christian Bartelt (Institute for Software and Systems Engineering, TU Clausthal), Alexander Maedche (human-centered systems Lab (h-lab), Karlsruhe Institute of Technology (KIT)), Simone Paolo Ponzetto (Data and Web Science Group, University of Mannheim) | Pre-print | Media Attached

15:00 (3h) Demonstration | StackPlagger: A System for Identifying AI-Code Plagiarism on Stack Overflow | Tool Demonstration Track | Aman Swaraj (Dept. of Computer Science & Engineering, Indian Institute of Technology, Roorkee, India), Harsh Goyal (Indian Institute of Technology, Roorkee), Sumit Chadgal (Indian Institute of Technology, Roorkee), Sandeep Kumar (Dept. of Computer Science & Engineering, Indian Institute of Technology, Roorkee, India)

15:00 (3h) Demonstration | AgentDroid: A Multi-Agent Tool for Detecting Fraudulent Android Applications | Tool Demonstration Track | Ruwei Pan (Chongqing University), Hongyu Zhang (Chongqing University), Zhonghao Jiang, Ran Hou (Chongqing University)