Do Automatic Test Generation Tools Generate Flaky Tests? (ICSE 2024 - Research Track)

Who

Martin Gruber, Muhammad Firhard Roslan, Owain Parry, Fabian Scharnböck, Phil McMinn, Gordon Fraser

Track

ICSE 2024 Research Track

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

When

Thu 18 Apr 2024, 11:00 - 11:15 at Grande Auditório, session Testing 3. Chair(s): José Miguel Rojas

Abstract

Non-deterministic test behavior, or flakiness, is common and dreaded among developers. Researchers have studied the issue and proposed approaches to mitigate it. However, the vast majority of previous work has only considered developer-written tests. The prevalence and nature of flaky tests produced by test generation tools remains largely unknown. We ask whether such tools also produce flaky tests and how these differ from developer-written ones. Furthermore, we evaluate mechanisms that suppress flaky test generation. We sample 6 356 projects written in Java or Python. For each project, we generate tests using EvoSuite (Java) and Pynguin (Python), and execute each test 200 times, looking for inconsistent outcomes. Our results show that flakiness is at least as common in generated tests as in developer-written tests. Nevertheless, existing flakiness suppression mechanisms are effective in alleviating this issue (71.7 % fewer flaky tests). Compared to developer-written flaky tests, the causes of generated flaky tests are distributed differently. Their non-deterministic behavior is more frequently caused by randomness, rather than by networking and concurrency. Using flakiness suppression, the remaining flaky tests differ significantly from any flakiness previously reported, where most are attributable to runtime optimizations and EvoSuite-internal resource thresholds. These insights, with the accompanying dataset, can help maintainers to improve test generation tools, give recommendations for developers using these tools, and serve as a foundation for future research in test flakiness or test generation.
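The rerun-based detection mentioned in the abstract (execute each test 200 times and look for inconsistent outcomes) can be illustrated with a minimal Python sketch. This is not the authors' tooling: the names `run_once`, `classify`, and the three example tests are hypothetical, and a test is assumed to be a plain callable that raises an exception on failure.

```python
import random
from typing import Callable


def run_once(test: Callable[[], None]) -> str:
    """Run a test callable once and reduce its result to 'pass' or 'fail'."""
    try:
        test()
    except Exception:
        # Assertion failures and unexpected errors both count as a failure here.
        return "fail"
    return "pass"


def classify(test: Callable[[], None], reruns: int = 200) -> str:
    """Label a test 'flaky' if repeated runs disagree, else 'pass' or 'fail'."""
    if reruns < 1:
        raise ValueError("reruns must be at least 1")
    outcomes = {run_once(test) for _ in range(reruns)}
    if len(outcomes) > 1:
        return "flaky"
    return outcomes.pop()


# Hypothetical example tests (assumptions for illustration only).
_rng = random.Random(0)  # seeded so the sketch itself behaves reproducibly


def stable_test() -> None:
    assert sorted([3, 1, 2]) == [1, 2, 3]


def broken_test() -> None:
    assert 1 + 1 == 3


def randomness_dependent_test() -> None:
    # Fails whenever the drawn value falls below 0.5.
    assert _rng.random() >= 0.5


if __name__ == "__main__":
    for t in (stable_test, broken_test, randomness_dependent_test):
        print(t.__name__, classify(t))
```

Rerunning can only show that a test is flaky; a test that yields the same outcome in every run may still be flaky under conditions the reruns did not exercise.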

Link to Preprint

https://arxiv.org/pdf/2310.05223.pdf

Martin Gruber

BMW Group, University of Passau

Germany

Muhammad Firhard Roslan

University of Sheffield

Malaysia

Owain Parry

The University of Sheffield

United Kingdom

Fabian Scharnböck

University of Passau