PROFES 2025
Mon 1 - Wed 3 December 2025, Salerno, Italy

Flaky tests yield inconsistent results without any code changes, undermining software reliability and potentially increasing development costs; effective detection methods are therefore important. Despite 15 years of research effort, existing techniques often show limited accuracy and adoption. This study explores whether commonly available Large Language Models (LLMs) are suitable for detecting flaky tests in software. Using the International Dataset of Flaky Tests, we asked selected LLMs, including GPT and Gemini, to classify Java test cases as flaky or non-flaky. The results show that the LLMs are unable to do so consistently. This research underscores the challenges of using general-purpose LLMs for flaky test detection and highlights the need for more effective solutions.
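
To make the classification setup concrete, below is a minimal sketch of how a single Java test case could be sent to an LLM for a flaky/non-flaky verdict. It assumes the OpenAI Python client; the prompt wording, model name, and classify_test helper are illustrative assumptions, not the study's actual protocol.

# Hypothetical sketch: prompting an LLM to label one Java test as flaky or not.
# The prompt, model name, and helper are illustrative, not the authors' setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are a software testing expert. Given the Java test case below, "
    "answer with exactly one word, FLAKY or NOT_FLAKY.\n\n{test_code}"
)

def classify_test(test_code: str, model: str = "gpt-4o") -> str:
    """Ask the model for a one-word flakiness verdict on a single test case."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(test_code=test_code)}],
        temperature=0,  # reduce (but not eliminate) run-to-run variation
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    example = """
    @Test
    public void testAsyncCallback() throws Exception {
        Future<String> result = service.fetchAsync();
        Thread.sleep(100);  // timing dependence, a common flakiness smell
        assertEquals("done", result.get());
    }
    """
    print(classify_test(example))

Repeating such a query on the same test case is the natural way to probe the consistency issue the abstract reports: if the same input yields different labels across runs, the model's verdicts cannot be relied on for detection.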