Can Search-Based Testing with Pareto Optimization Effectively Cover Failure-Revealing Test Inputs? (ICST 2025 - Journal-First Papers)

Mon 31 March - Fri 4 April 2025 Naples, Italy

Who

Lev Sorokin, Damir Safin, Shiva Nejati

Track

ICST 2025 Journal-First Papers

Time Zone

The program is currently displayed in (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

Use conference time zone: (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, ViennaSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Fri 4 Apr 2025 12:00 - 12:15 at Aula Magna (AM) - Automated Testing Chair(s): Cristian Cadar

Abstract

Search-based software testing (SBST) is a widely-adopted technique for testing complex systems with large input spaces, such as Deep Learning-enabled (DL-enabled) systems. Many SBST techniques focus on Pareto-based optimization where multiple objectives are optimized in parallel to reveal failures. However, it is important to ensure that identified failures are spread throughout the entire failure-inducing area of a search domain, and not clustered in a sub-region. This ensures that identified failures are semantically diverse and reveal a wide range of underlying causes. In this paper, we present a theoretical argument explaining why testing based on Pareto optimization is inadequate for covering failure-inducing areas within a search domain. We support our argument with empirical results obtained by applying two widely used types of Pareto-based optimization techniques, namely NSGA-II (an evolutionary algorithm) and OMOPSO (a swarm-based algorithm), to two DL-enabled systems: an industrial Automated Valet Parking (AVP) system and a system for classifying handwritten digits. We measure the coverage of failure-revealing test inputs in the input space using a metric, that we refer to as the Coverage Inverted Distance (CID) quality indicator. Our results show that NSGA-II and OMOPSO are not more effective than a naïve random search baseline in covering test inputs that reveal failures. We show that this comparison remains valid for failure-inducing regions of various sizes of these two case studies. Further, we show that incorporating a diversity-focused fitness function as well as a repopulation operator in NSGA-II improves, on average, the coverage difference between NSGA-II and random search by 52.1%. However, even after diversification, NSGA-II still does not outperform random testing in covering test inputs that reveal failures. The replication package for this study is available in a GitHub repository (Replication package. https://github.com/ast-fortiss-tum/coverage-emse-24.

Lev Sorokin

Technische Universität München, Germany

Germany

Damir Safin

fortiss

Shiva Nejati

University of Ottawa

Canada