LLMs in Debate: Does Arguing Make Them Better at Detecting Metamorphic Relations? (AgenticSE 2025)

Sun 16 - Thu 20 November 2025 Seoul, South Korea

Who

Dibyendu Brinto Bose, Yoseph Berhanu Alebachew, Chris Brown

Track

AgenticSE 2025

Time Zone

The program is currently displayed in (GMT+09:00) Seoul.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Thu 20 Nov 2025 10:30 - 10:55 at Grand Hall 4 - Session 2

Abstract

Large Language Models (LLMs) are transforming software engineering, including the development of mobile Augmented Reality (AR) applications. AR software behavior often depends on dynamic environmental factors, which makes conventional testing and verification approaches difficult to apply. Metamorphic Testing (MT) offers an alternative by assessing whether expected transformations hold across varied conditions. However, there is limited work exploring how well LLMs can detect these transformations, known as Metamorphic Relations (MRs), in applications. We propose a stability-driven evaluation framework that examines whether LLMs consistently apply MRs across rephrasings. Our study finds that StarCoder and CodeLlama exhibit higher stability in MR identification than the general-purpose model Gemma. Additionally, we use a multi-agent debate framework to investigate whether combining multiple perspectives improves consistency in MR identification. The debate mechanism reduces MR inconsistencies, leading to more stable MR identification across all MRs. Although debate helps stabilize MR identification, our evaluation against human-labeled ground truth reveals that stability does not always correlate with correctness. Some models, such as CodeLlama, maintain stable yet incorrect predictions, whereas debate improves both consistency and alignment with the correct labels, making LLM reasoning more reliable. This work contributes a method to evaluate LLMs in the absence of ground truth, establishing stability as a metric for assessing model reliability. Applying a multi-agent debate framework offers a promising approach to enhancing LLM reliability, especially in contexts where the ground truth is elusive.
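The abstract does not say how stability is computed. As an illustration only, the sketch below assumes one plausible reading: stability is the share of rephrasings on which a model gives its most common answer, while correctness is the share that match a human label. The function names, the yes/no labels, and the sample data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def stability(predictions):
    """Share of predictions equal to the most common answer (assumed metric)."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    _, top_count = Counter(predictions).most_common(1)[0]
    return top_count / len(predictions)


def accuracy(predictions, truth):
    """Share of predictions that match a human-assigned label."""
    if not predictions:
        raise ValueError("predictions must be non-empty")
    return sum(p == truth for p in predictions) / len(predictions)


# Hypothetical answers to "does this MR hold?" over four rephrasings.
stable_but_wrong = ["yes", "yes", "yes", "yes"]
unstable = ["yes", "no", "yes", "no"]
human_label = "no"  # hypothetical ground truth

print(stability(stable_but_wrong), accuracy(stable_but_wrong, human_label))
print(stability(unstable), accuracy(unstable, human_label))
```

Under these assumptions the first model is perfectly stable yet never correct, which is the kind of gap between stability and correctness that the abstract reports.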

Dibyendu Brinto Bose

Virginia Tech, USA

United States

Yoseph Berhanu Alebachew

Virginia Tech

United States

Chris Brown

Virginia Tech

United States

Session Program

Thu 20 Nov
Displayed time zone: Seoul

10:30 - 12:30	Session 2 (AgenticSE) at Grand Hall 4

10:30	25m	Full-paper	LLMs in Debate: Does Arguing Make Them Better at Detecting Metamorphic Relations?	AgenticSE	Dibyendu Brinto Bose (Virginia Tech, USA), Yoseph Berhanu Alebachew (Virginia Tech), Chris Brown (Virginia Tech)
10:55	25m	Full-paper	A 3-Layer Agentic Model for Nonfunctional Requirements in Software Engineering	AgenticSE	Ehsan Zabardast (Nordea / Blekinge Institute of Technology), Tiago Vieira, Tony Gorschek (Blekinge Institute of Technology / DocEngineering)
11:20	15m	Talk	Transforming Natural Language into Formal Specifications	AgenticSE	Kuangxiangzi Liu, Alexander Liggesmeyer, Dhiman Chakraborty, Andreas Zeller (CISPA Helmholtz Center for Information Security)
11:35	15m	Talk	PRIMA: Enabling User Agency and Control in Mobile GUI Agent Autonomy	AgenticSE	Ching-Ting Lin, Zhi-Hong Ye, Yung-Ju Chang