Can ChatGPT Repair Non-Order-Dependent Tests? (FTW 2024 - 1st International Flaky Tests Workshop 2024 (FTW 2024))

Who

Yang Chen, Reyhaneh Jabbarvand

Track

FTW 2024

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

Use conference time zone: (GMT+01:00) LisbonSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Sun 14 Apr 2024 15:00 - 15:30 at Amália Rodrigues - Debugging Flaky Tests in Different Domains Chair(s): Owain Parry

Abstract

Regression testing helps developers check whether the latest code changes break software functionality. Flaky tests, which can non- deterministically pass or fail on the same code version, may mislead developers’ concerns, resulting in missing some bugs or spending time pinpointing bugs that do not exist. Existing flakiness detection and mitigation techniques have primarily focused on general order-dependent (OD) and implementation-dependent (ID) flaky tests. There is also a dearth of research on repairing test flakiness, out of which, mostly have focused on repairing OD flaky tests, and a few have explored repairing a subcategory of non-order-dependent (NOD) flaky tests that are caused by asynchronous waits. As a result, there is a demand for devising techniques to reproduce, detect, and repair NOD flaky tests. Large language models (LLMs) have shown great effectiveness in several programming tasks. To explore the potential of LLMs in addressing NOD flakiness, this paper investigates the possibility of using ChatGPT to repair different categories of NOD flaky tests. Our comprehensive study on 118 from the IDoFT dataset shows that ChatGPT, despite as a leading LLM with notable success in multiple code generation tasks, is ineffective in repairing NOD test flakiness, even by following the best practices for prompt crafting. We investigated the reasons behind the failure of using ChatGPT in repairing NOD tests, which provided us valuable insights about the next step to advance the field of NOD test flakiness repair.

Yang ChenAuthor

University of Illinois at Urbana-Champaign

United States

Reyhaneh JabbarvandAuthor