When and Why Test Generators for Deep Learning Produce Invalid Inputs: an Empirical Study
Testing Deep Learning (DL) based systems inherently requires large and representative test sets to evaluate whether DL systems generalise beyond their training datasets. Diverse Test Input Generators (TIGs) have been proposed to produce artificial inputs that expose issues in DL systems by triggering misbehaviours. Unfortunately, such generated inputs may be invalid, i.e., not recognisable as part of the input domain, thus providing an unreliable quality assessment. Automated validators can relieve human testers of the burden of manually checking input validity, although validity is a concept that is difficult to formalise and, thus, to automate.
In this paper, we investigate to what extent TIGs can generate valid inputs, according to both automated and human validators. We conduct a large empirical study involving 2 different automated validators, 220 human assessors, 5 different TIGs and 3 classification tasks. Our results show that 84% of the artificially generated inputs are valid according to automated validators, but their expected label is not always preserved. Automated validators reach a good consensus with humans (78% accuracy), but still have limitations when dealing with feature-rich datasets.
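To make the notion of an automated validator more concrete, the sketch below shows one common way such a check can be implemented: an autoencoder is trained on the nominal (in-distribution) data, and a generated input is flagged as invalid when its reconstruction error exceeds a threshold derived from that data. This is only an illustrative sketch, not the validators evaluated in the study; the architecture, the 95th-percentile threshold rule, and the function names are assumptions made for illustration.

```python
# Minimal sketch of an automated input validator for generated test inputs.
# NOT the paper's validator: it only illustrates rejecting generated inputs
# that fall outside the training distribution, using the reconstruction error
# of an autoencoder trained on nominal data (e.g., flattened MNIST images).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models


def build_autoencoder(input_dim=784, latent_dim=32):
    """A small dense autoencoder over flattened images (assumed architecture)."""
    inputs = layers.Input(shape=(input_dim,))
    encoded = layers.Dense(128, activation="relu")(inputs)
    encoded = layers.Dense(latent_dim, activation="relu")(encoded)
    decoded = layers.Dense(128, activation="relu")(encoded)
    decoded = layers.Dense(input_dim, activation="sigmoid")(decoded)
    autoencoder = models.Model(inputs, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder


def fit_validity_threshold(autoencoder, nominal_inputs, percentile=95):
    """Derive a reconstruction-error threshold from in-distribution data."""
    reconstructions = autoencoder.predict(nominal_inputs, verbose=0)
    errors = np.mean((nominal_inputs - reconstructions) ** 2, axis=1)
    return np.percentile(errors, percentile)  # assumed 95th-percentile rule


def is_valid(autoencoder, generated_input, threshold):
    """Accept a TIG-generated input only if its reconstruction error is low."""
    reconstruction = autoencoder.predict(generated_input[None, :], verbose=0)[0]
    error = np.mean((generated_input - reconstruction) ** 2)
    return error <= threshold
```

In such a setup, the autoencoder is first fitted on the training set of the system under test, the threshold is calibrated on held-out nominal data, and each TIG-generated input is then accepted or rejected before being used for quality assessment.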
Session date: Thu 18 May (times shown in the Hobart time zone)
11:00 - 12:30 | Session: AI testing 1 (Technical Track / DEMO - Demonstrations / Journal-First Papers) | Meeting Room 102 | Chair(s): Matthew B Dwyer (University of Virginia)

11:00 (15m Talk) | When and Why Test Generators for Deep Learning Produce Invalid Inputs: an Empirical Study | Technical Track | Pre-print
11:15 (15m Talk) | Fuzzing Automatic Differentiation in Deep-Learning Libraries | Technical Track | Chenyuan Yang (University of Illinois at Urbana-Champaign), Yinlin Deng (University of Illinois at Urbana-Champaign), Jiayi Yao (The Chinese University of Hong Kong, Shenzhen), Yuxing Tu (Huazhong University of Science and Technology), Hanchi Li (University of Science and Technology of China), Lingming Zhang (University of Illinois at Urbana-Champaign)
11:30 (15m Talk) | Lightweight Approaches to DNN Regression Error Reduction: An Uncertainty Alignment Perspective | Technical Track | Zenan Li (Nanjing University, China), Maorun Zhang (Nanjing University, China), Jingwei Xu, Yuan Yao (Nanjing University), Chun Cao (Nanjing University), Taolue Chen (Birkbeck University of London), Xiaoxing Ma (Nanjing University), Jian Lu (Nanjing University) | Pre-print
11:45 (7m Talk) | DeepJudge: A Testing Framework for Copyright Protection of Deep Learning Models | DEMO - Demonstrations | Jialuo Chen (Zhejiang University), Youcheng Sun (The University of Manchester), Jingyi Wang (Zhejiang University), Peng Cheng (Zhejiang University), Xingjun Ma (Deakin University)
11:52 (7m Talk) | DeepCrime: from Real Faults to Mutation Testing Tool for Deep Learning | DEMO - Demonstrations
12:00 (7m Talk) | DiverGet: a Search-Based Software Testing approach for Deep Neural Network Quantization assessment | Journal-First Papers | Ahmed Haj Yahmed (École Polytechnique de Montréal), Houssem Ben Braiek (École Polytechnique de Montréal), Foutse Khomh (Polytechnique Montréal), Sonia Bouzidi (National Institute of Applied Science and Technology), Rania Zaatour (Potsdam Institute for Climate Impact Research)
12:07 (15m Talk) | Revisiting Neuron Coverage for DNN Testing: A Layer-Wise and Distribution-Aware Criterion | Technical Track | Yuanyuan Yuan (The Hong Kong University of Science and Technology), Qi Pang (HKUST), Shuai Wang (Hong Kong University of Science and Technology)