ICSE 2026
Sun 12 - Sat 18 April 2026, Rio de Janeiro, Brazil

Flaky tests, those that fail non-deterministically against the same version of code, pose a well-established challenge to software developers. In this paper, we characterize the overlooked phenomenon of test FLIMsiness: FLakiness Induced by Mutations to the code under test. These mutations are generated by the same operators found in out-of-the-box mutation testing tools. Flimsiness has profound implications for researchers in software testing. While previous work analyzed the impact of pre-existing flaky tests on mutation testing, we reveal that mutations themselves can induce flakiness, exposing a previously neglected threat. This has impacts beyond mutation testing, calling into question the reliability of any technique that relies on deterministic test outcomes in response to mutations. On the other hand, flimsiness also presents an opportunity to surface potential flakiness that may otherwise remain hidden. Where prior work has perturbed the execution environment to augment rerunning, or the test code to support benchmarking, our work advances these efforts by perturbing a third major source of flakiness: the code under test. We conducted an empirical study on over half a million test suite executions across 28 diverse Python projects. Our statistical analysis of more than 30 million mutant-test pairs unveiled flimsiness in 54% of projects, highlighting its prevalence. We found that augmenting the standard rerun-based flaky test detection strategy with mutations to the code under test detects a substantially larger number of flaky tests (median 740 vs. 163) and uncovers many that the standard strategy is unlikely to detect.
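To make the phenomenon concrete, here is a minimal, hypothetical sketch of flimsiness in Python. The function name `unique_tags` and the specific mutation are illustrative and not taken from the paper: a mutation operator that replaces a `sorted(...)` call with `list(...)` leaves the original test passing or failing depending on set iteration order, which in CPython varies across interpreter runs for strings because of hash randomization. Rerunning the mutant's test in fresh processes therefore yields non-deterministic outcomes:

```python
import os
import subprocess
import sys

# Code under test, and a hypothetical mutant produced by a
# call-replacement mutation operator (sorted -> list).
ORIGINAL = "def unique_tags(tags):\n    return sorted(set(tags))\n"
MUTANT = "def unique_tags(tags):\n    return list(set(tags))\n"

# A simple test that asserts a deterministic, sorted result.
TEST = """
{code}
result = unique_tags(["b", "d", "a", "c", "b"])
print("PASS" if result == ["a", "b", "c", "d"] else "FAIL")
"""

def run_test(code: str, seed: int) -> str:
    """Run the test in a fresh interpreter with a fixed hash seed,
    simulating one process-level rerun of the test suite."""
    proc = subprocess.run(
        [sys.executable, "-c", TEST.format(code=code)],
        env={**os.environ, "PYTHONHASHSEED": str(seed)},
        capture_output=True,
        text=True,
    )
    return proc.stdout.strip()

# The original code is deterministic: every rerun passes.
outcomes_original = {run_test(ORIGINAL, seed) for seed in range(10)}
print(outcomes_original)  # {'PASS'}

# Under the mutant, set iteration order depends on the hash seed,
# so the same test flips between PASS and FAIL across reruns:
# the test is not flaky on the original code, only "flimsy".
outcomes_mutant = {run_test(MUTANT, seed) for seed in range(10)}
print(outcomes_mutant)
```

This also illustrates why rerun-based detection on the unmutated code misses such tests: every rerun of the original passes, and only the perturbation of the code under test surfaces the latent order-dependence.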