Cats Are Not Fish: Deep Learning Testing Calls for Out-Of-Distribution Awareness
As Deep Learning (DL) is increasingly adopted in industrial applications, its quality and reliability have started to raise concerns. As in the traditional software development process, testing DL software to uncover its defects at an early stage is an effective way to reduce risks after deployment. Under the fundamental assumption of deep learning, DL software provides no statistical guarantee and has limited capability in handling data that go beyond its learned distribution, i.e., out-of-distribution (OOD) data. Recent progress has been made in designing novel testing techniques for DL software, which can detect thousands of errors. However, the current state-of-the-art DL testing techniques do not take the distribution of the generated test data into consideration. It is therefore hard to judge whether the “identified errors” are meaningful errors for the DL application (i.e., due to a quality issue of the model) or outliers that the current model cannot handle (i.e., due to a lack of training data).
To fill this gap, we take the first step and conduct a large-scale empirical study, with a total of 451 experiment configurations, 42 DNNs, and over 1.2 million test data instances, to investigate and characterize the capability of DL software from a data distribution perspective and to understand its impact on DL testing techniques. We first perform a large-scale empirical study of five state-of-the-art OOD detection techniques to investigate their performance in distinguishing in-distribution (ID) data from OOD data. Based on the results, we select the best OOD detection technique and investigate the characteristics of the test data generated by different DL testing techniques, i.e., 8 mutation operators and 6 testing criteria. The results demonstrate that some mutation operators and testing criteria tend to guide the generation of OOD test data, while others do the opposite. After identifying the ID and OOD errors, we further investigate their effectiveness in enhancing DL model robustness. The results confirm the importance of data distribution awareness in both the testing and enhancement phases, which outperforms distribution-unaware retraining by up to 21.5%. As deep learning follows the data-driven development paradigm, in which model behavior highly depends on the training data, the results of this paper confirm the importance of, and call for the inclusion of, data awareness when designing new testing and analysis techniques for DL software.
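The idea of separating ID errors from OOD errors can be illustrated with a minimal sketch. The abstract does not name the OOD detector the study selected, so the example below substitutes the simple maximum-softmax-probability baseline as a stand-in; the function names (`softmax`, `split_errors`) and the threshold value of 0.7 are hypothetical choices made only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_errors(
    logits_batch: Sequence[Sequence[float]],
    labels: Sequence[int],
    threshold: float = 0.7,  # assumed cut-off, not taken from the paper
) -> Tuple[List[int], List[int]]:
    """Partition misclassified inputs into ID errors and OOD errors.

    An input is treated as in-distribution when its maximum softmax
    probability is at least `threshold`; this is a stand-in detector,
    not necessarily the technique chosen in the study.
    """
    id_errors: List[int] = []
    ood_errors: List[int] = []
    for i, (logits, label) in enumerate(zip(logits_batch, labels)):
        probs = softmax(logits)
        confidence = max(probs)
        predicted = probs.index(confidence)
        if predicted == label:
            continue  # correct prediction, not an error
        if confidence >= threshold:
            id_errors.append(i)   # confident but wrong: likely a model defect
        else:
            ood_errors.append(i)  # low confidence: likely outside the learned distribution
    return id_errors, ood_errors


if __name__ == "__main__":
    batch = [[5.0, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 4.0, 0.0]]
    print(split_errors(batch, labels=[1, 1, 1]))
```

Under this sketch, only the errors flagged as ID would count as evidence of a model quality issue, while the OOD errors point to missing training data.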
Thu 24 Sep (times displayed in UTC)
|09:30 - 09:50|
David Berend (Nanyang Technological University, Singapore), Xiaofei Xie (Nanyang Technological University), Lei Ma (Kyushu University), Lingjun Zhou (College of Intelligence and Computing, Tianjin University), Yang Liu (Nanyang Technological University, Singapore), Chi Xu (Singapore Institute of Manufacturing Technology, A*Star), Jianjun Zhao (Kyushu University)