Balancing Effectiveness and Flakiness of Non-Deterministic Machine Learning Tests
Testing Machine Learning (ML) projects is challenging due to the inherent non-determinism of many ML algorithms and the lack of reliable ways to compute reference results. Developers typically rely on their intuition when writing tests to check whether ML algorithms produce accurate results. However, this approach leads to conservative choices of assertion bounds for comparing actual and expected results in test assertions. Because developers want to avoid false-positive test failures, they often set bounds that are too loose, potentially missing critical bugs.
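To make the trade-off concrete, here is a minimal, hypothetical sketch (all names, distributions, and numbers are invented for illustration, not taken from the paper): a non-deterministic "training" routine whose accuracy fluctuates across runs, tested against an expected value with an assertion bound. A loose bound essentially never fails on correct code but would also tolerate sizable regressions; a tight bound is flaky.

```python
import random

def train_and_score(seed):
    """Hypothetical non-deterministic training routine (a stand-in for a
    real ML algorithm): accuracy fluctuates slightly from run to run."""
    rng = random.Random(seed)
    return 0.90 + rng.gauss(0.0, 0.01)

EXPECTED = 0.90
LOOSE_BOUND = 0.05   # conservative, intuition-chosen bound: rarely flaky
TIGHT_BOUND = 0.001  # tight bound: catches small regressions, but flaky

def passes(seed, bound):
    """The assertion a test like `test_model_accuracy` would make."""
    return abs(train_and_score(seed) - EXPECTED) < bound

# Empirical pass rates over many independent runs (seeds):
loose_rate = sum(passes(s, LOOSE_BOUND) for s in range(1000)) / 1000
tight_rate = sum(passes(s, TIGHT_BOUND) for s in range(1000)) / 1000
```

With this synthetic noise (standard deviation 0.01), the loose bound sits five standard deviations out and passes virtually every run, while the tight bound fails most runs, i.e., the test is highly flaky.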
We present FASER – the first systematic approach for balancing the trade-off between fault-detection effectiveness and flakiness of non-deterministic tests by computing optimal assertion bounds. FASER frames this trade-off as an optimization problem over the assertion bound, with effectiveness and flakiness as the competing objectives. FASER leverages 1) statistical methods to estimate the flakiness rate, and 2) mutation testing to estimate the fault-detection effectiveness. We evaluate FASER on 87 non-deterministic tests collected from 22 popular ML projects. FASER finds that 26% of the studied tests have conservative bounds and proposes tighter assertion bounds that maximize the fault-detection effectiveness of the tests while limiting flakiness. We have sent 19 pull requests to developers; 12 have already been accepted.
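As an illustration only, not FASER's actual implementation, the optimization described above can be sketched as follows: estimate the flakiness rate of a candidate bound from repeated runs of the original code, estimate its effectiveness from runs of a mutant (a program with an injected fault), and pick the tightest bound whose estimated flakiness stays under a budget. The distributions, fault magnitude, and flakiness budget below are all invented for the sketch.

```python
import random

EXPECTED = 0.90      # expected test result
MUTANT_SHIFT = 0.03  # hypothetical fault: shifts the mean output
N = 2000             # number of sampled runs per version
rng = random.Random(0)

def score(shift=0.0):
    """Hypothetical non-deterministic test output; a mutant shifts its mean."""
    return EXPECTED + shift + rng.gauss(0.0, 0.01)

correct_runs = [score() for _ in range(N)]
mutant_runs = [score(MUTANT_SHIFT) for _ in range(N)]

def flakiness(bound):
    # Fraction of correct-code runs the assertion would (wrongly) fail.
    return sum(abs(x - EXPECTED) >= bound for x in correct_runs) / N

def effectiveness(bound):
    # Fraction of mutant runs the assertion catches (mutants killed).
    return sum(abs(x - EXPECTED) >= bound for x in mutant_runs) / N

# Tightest candidate bound whose estimated flakiness is within a 1% budget;
# flakiness is non-increasing in the bound, so min() gives the tightest.
candidates = [b / 1000 for b in range(1, 101)]  # bounds 0.001 .. 0.100
best = min(b for b in candidates if flakiness(b) <= 0.01)
```

Under these synthetic distributions, the selected bound lands near 2.6 noise standard deviations from the mean, tight enough to kill a majority of the mutant's runs while keeping the false-failure rate at or below the budget.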
Fri 19 May (displayed time zone: Hobart)
11:00 - 12:30 | AI testing 2 | Technical Track / Journal-First Papers | Meeting Room 101 | Chair(s): Gunel Jahangirova (USI Lugano, Switzerland)
11:00 | 15m | Talk | Aries: Efficient Testing of Deep Neural Networks via Labeling-Free Accuracy Estimation | Technical Track | Qiang Hu (University of Luxembourg), Yuejun Guo (University of Luxembourg), Xiaofei Xie (Singapore Management University), Maxime Cordy (University of Luxembourg), Lei Ma (University of Alberta), Mike Papadakis (University of Luxembourg), Yves Le Traon (University of Luxembourg) | Pre-print
11:15 | 15m | Talk | Testing the Plasticity of Reinforcement Learning Based Systems | Journal-First Papers | Link to publication, DOI, Pre-print
11:30 | 15m | Talk | CC: Causality-Aware Coverage Criterion for Deep Neural Networks | Technical Track | Zhenlan Ji (The Hong Kong University of Science and Technology), Pingchuan Ma (HKUST), Yuanyuan Yuan (The Hong Kong University of Science and Technology), Shuai Wang (Hong Kong University of Science and Technology)
11:45 | 15m | Talk | Balancing Effectiveness and Flakiness of Non-Deterministic Machine Learning Tests | Technical Track | Chunqiu Steven Xia (University of Illinois at Urbana-Champaign), Saikat Dutta (University of Illinois at Urbana-Champaign), Sasa Misailovic (University of Illinois at Urbana-Champaign), Darko Marinov (University of Illinois at Urbana-Champaign), Lingming Zhang (University of Illinois at Urbana-Champaign)
12:00 | 15m | Talk | Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems | Technical Track | Fitash Ul Haq, Donghwan Shin (The University of Sheffield), Lionel Briand (University of Luxembourg; University of Ottawa) | Pre-print
12:15 | 15m | Talk | Reliability Assurance for Deep Neural Network Architectures Against Numerical Defects | Technical Track | Linyi Li (University of Illinois at Urbana-Champaign), Yuhao Zhang (University of Wisconsin-Madison), Luyao Ren (Peking University, China), Yingfei Xiong (Peking University), Tao Xie (Peking University) | Pre-print