PROFES 2025
Mon 1 - Wed 3 December 2025, Salerno, Italy

Flaky tests yield inconsistent results without any code changes, undermining software reliability and potentially increasing development costs; effective detection methods are therefore important. Despite 15 years of research effort, existing techniques often show limited accuracy and adoption. This study explores whether commonly available Large Language Models (LLMs) are suitable for detecting flaky tests in software. Using the International Dataset of Flaky Tests, we asked selected LLMs, including GPT and Gemini, to classify Java test cases as flaky or non-flaky. The results show that the LLMs are unable to do so consistently. This research underscores the challenges of using general-purpose LLMs for flaky test detection and highlights the need for more effective solutions.
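
To make the classification setup concrete, below is a minimal sketch of how a single Java test case could be sent to an LLM for a flaky/non-flaky verdict. It assumes the OpenAI Python client; the prompt wording, model name, and classify_test helper are illustrative assumptions, not the study's actual protocol.

# Hypothetical sketch: prompting an LLM to label one Java test as flaky or not.
# The prompt, model name, and helper are illustrative, not the authors' setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are a software testing expert. Given the Java test case below, "
    "answer with exactly one word, FLAKY or NOT_FLAKY.\n\n{test_code}"
)

def classify_test(test_code: str, model: str = "gpt-4o") -> str:
    """Ask the model for a one-word flakiness verdict on a single test case."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(test_code=test_code)}],
        temperature=0,  # reduce (but not eliminate) run-to-run variation
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    example = """
    @Test
    public void testAsyncCallback() throws Exception {
        Future<String> result = service.fetchAsync();
        Thread.sleep(100);  // timing dependence, a common flakiness smell
        assertEquals("done", result.get());
    }
    """
    print(classify_test(example))

Repeating such a query on the same test case is the natural way to probe the consistency issue the abstract reports: if the same input yields different labels across runs, the model's verdicts cannot be relied on for detection.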