ICSME 2025
Sun 7 - Fri 12 September 2025 Auckland, New Zealand
Thu 11 Sep 2025 14:20 - 14:30 at Case Room 3 260-055 - Session 9 - Testing 3 Chair(s): Sigrid Eldh

Snapshot testing has emerged as a critical technique for UI validation in modern software development, yet it suffers from substantial maintenance overhead due to frequent UI changes causing test failures that require manual inspection to distinguish between genuine regressions and intentional design changes. This manual triage process becomes increasingly burdensome as applications evolve, creating a need for automated analysis solutions. This paper introduces LLMShot, a novel framework that leverages Vision-Language Models (VLMs) to automatically analyze snapshot test failures through semantic classification of UI changes. To evaluate LLMShot’s effectiveness, we developed a comprehensive dataset using a feature-rich iOS application with configurable feature flags, creating realistic scenarios that produce authentic snapshot differences representative of real development workflows. Our evaluation using Gemma3 models demonstrates strong classification performance, with the 12B variant achieving over 84% recall in identifying failure root causes while the 4B model offers practical deployment advantages with acceptable performance for continuous integration environments. However, our exploration of selective ignore mechanisms revealed significant limitations in current prompting-based approaches for controllable visual reasoning. LLMShot represents the first automated approach to semantic snapshot test analysis, offering developers structured insights that can substantially reduce manual triage effort and advance toward more intelligent UI testing paradigms.
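The triage step the abstract describes — asking a vision-language model to classify a snapshot-test failure and report whether it is a genuine regression — can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: the label taxonomy, the JSON reply schema, and the canned model reply are all assumptions, and a real pipeline would attach the baseline and actual screenshots to an actual VLM call (e.g. a locally hosted Gemma 3 model).

```python
import json

# Hypothetical root-cause labels a VLM triage step might emit;
# the actual LLMShot taxonomy is not reproduced here.
LABELS = {"layout_shift", "color_change", "missing_element", "text_change"}

def build_triage_prompt(test_name: str) -> str:
    """Compose an instruction asking the VLM to compare two screenshots."""
    return (
        f"Snapshot test '{test_name}' failed. Compare the baseline and "
        "actual screenshots and reply with JSON: "
        '{"label": <root-cause category>, "regression": <true|false>, '
        '"rationale": <one sentence>}'
    )

def parse_triage_response(raw: str) -> dict:
    """Validate the model's JSON reply before surfacing it to developers."""
    result = json.loads(raw)
    if result.get("label") not in LABELS:
        raise ValueError(f"unknown label: {result.get('label')}")
    if not isinstance(result.get("regression"), bool):
        raise ValueError("'regression' must be a boolean")
    return result

# A canned reply stands in for the real VLM call.
reply = (
    '{"label": "color_change", "regression": false, '
    '"rationale": "Accent color updated to match the new theme."}'
)
verdict = parse_triage_response(reply)
print(verdict["label"], verdict["regression"])
```

Validating the structured reply before acting on it matters in a CI setting: a malformed or out-of-vocabulary answer should fail loudly rather than silently auto-dismiss a real regression.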

Thu 11 Sep

Displayed time zone: Auckland, Wellington

13:30 - 15:00
Session 9 - Testing 3 (Journal First Track / NIER Track / Tool Demonstration Track / Research Papers Track / Registered Reports) at Case Room 3 260-055
Chair(s): Sigrid Eldh Ericsson AB, Mälardalen University, Carleton University
13:30
15m
Full-paper
Metamorphic Testing of Large Language Models for Natural Language Processing
Research Papers Track
Steven Cho The University of Auckland, New Zealand, Stefano Ruberto JRC European Commission, Valerio Terragni University of Auckland
Pre-print
13:45
15m
Onweer: Automated Resilience Testing through Fuzzing
Research Papers Track
Gilles Coremans Vrije Universiteit Brussel, Coen De Roover Vrije Universiteit Brussel
Pre-print
14:00
10m
Generating Highly Structured Test Inputs Leveraging Constraint-Guided Graph Refinement
Registered Reports
Zhaorui Yang University of California, Riverside, Yuxin Qiu University of California at Riverside, Haichao Zhu Meta, Qian Zhang University of California at Riverside
14:10
10m
Prioritizing Test Smells: An Empirical Evaluation of Quality Metrics and Developer Perceptions
NIER Track
Md Arif Hasan University of Dhaka, Bangladesh, Toukir Ahammed Institute of Information Technology, University of Dhaka
14:20
10m
LLMShot: Reducing snapshot testing maintenance via LLMs
NIER Track
Ergün Batuhan Kaynak Bilkent University, Mayasah Lami Bilkent University, Sahand Moslemi Yengejeh Bilkent University, Anil Koyuncu Bilkent University
Pre-print
14:30
15m
Combinatorial Transition Testing in Dynamically Adaptive Systems: Implementation and Test Oracle
Journal First Track
Pierre Martou UCLouvain / ICTEAM, Benoît Duhoux Université catholique de Louvain, Belgium, Kim Mens Université catholique de Louvain, ICTEAM institute, Belgium, Axel Legay Université Catholique de Louvain, Belgium
14:45
10m
LLMLOOP: Improving LLM-Generated Code and Tests through Automated Iterative Feedback Loops
Tool Demonstration Track
Ravin Ravi University of Auckland, Dylan Bradshaw University of Auckland, Stefano Ruberto JRC European Commission, Gunel Jahangirova King's College London, Valerio Terragni University of Auckland
Pre-print