LLMs in Debate: Does Arguing Make Them Better at Detecting Metamorphic Relations?
Large Language Models (LLMs) are transforming software engineering, including the development of mobile Augmented Reality (AR) applications. AR software behavior often depends on dynamic environmental factors, which makes conventional testing and verification approaches difficult to apply. Metamorphic Testing (MT) offers an alternative by assessing whether expected transformations hold across varied conditions. However, there is limited work exploring how well LLMs can detect these transformations, known as Metamorphic Relations (MRs), in applications. We propose a stability-driven evaluation framework that examines whether LLMs consistently apply MRs across rephrasings. Our study finds that the code-focused models StarCoder and CodeLlama exhibit higher stability in MR identification than the general-purpose model Gemma. Additionally, we use a multi-agent debate framework to investigate whether combining multiple perspectives improves consistency in MR identification. The debate mechanism reduces MR inconsistencies, leading to more stable identification across all MRs. While debate helps stabilize MR identification, our evaluation against human-labeled ground truth reveals that stability alone does not always correlate with correctness. Some models, such as CodeLlama, maintain stable yet incorrect predictions, whereas debate improves both consistency and alignment with the correct labels, making LLM reasoning more reliable. This work contributes a method for evaluating LLMs in the absence of ground truth, establishing stability as a metric for assessing model reliability. Applying a multi-agent debate framework offers a promising approach to enhancing LLM reliability, especially in contexts where the ground truth is elusive.
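To make the distinction between stability and correctness concrete, here is a minimal Python sketch. The abstract does not define the stability metric, so this is an assumption for illustration: stability is taken to be the fraction of a model's predictions, collected over several rephrasings, that agree with its most common prediction. The function names, the example predictions, and the ground-truth label are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def stability(labels):
    """Fraction of predictions that agree with the most common prediction.

    Assumed definition for illustration only; the paper may use another metric.
    """
    if not labels:
        raise ValueError("need at least one prediction")
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


def majority_label(labels):
    """Most common prediction across the rephrasings."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical "does this MR hold?" predictions over four rephrasings.
stable_but_wrong = [False, False, False, False]
less_stable = [True, False, True, True]
ground_truth = True  # hypothetical human label

for name, preds in [("stable_but_wrong", stable_but_wrong),
                    ("less_stable", less_stable)]:
    correct = majority_label(preds) == ground_truth
    print(name, stability(preds), correct)
```

Under these made-up inputs the first model is perfectly stable (1.0) yet disagrees with the human label, while the second is less stable (0.75) but its majority answer matches, which is the kind of gap between stability and correctness the abstract describes.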
Thu 20 Nov (displayed time zone: Seoul)
10:30 - 12:30
10:30 | 25m | Full-paper | LLMs in Debate: Does Arguing Make Them Better at Detecting Metamorphic Relations? | AgenticSE | Dibyendu Brinto Bose (Virginia Tech, USA), Yoseph Berhanu Alebachew (Virginia Tech), Chris Brown (Virginia Tech)
10:55 | 25m | Full-paper | A 3-Layer Agentic Model for Nonfunctional Requirements in Software Engineering | AgenticSE | Ehsan Zabardast (Nordea / Blekinge Institute of Technology), Tiago Vieira, Tony Gorschek (Blekinge Institute of Technology / DocEngineering)
11:20 | 15m | Talk | Transforming Natural Language into Formal Specifications | AgenticSE | Kuangxiangzi Liu, Alexander Liggesmeyer, Dhiman Chakraborty, Andreas Zeller (CISPA Helmholtz Center for Information Security)
11:35 | 15m | Talk | PRIMA: Enabling User Agency and Control in Mobile GUI Agent Autonomy | AgenticSE