Cats Are Not Fish: Deep Learning Testing Calls for Out-Of-Distribution Awareness (ASE 2020 - Research Papers)

Who

David Berend, Xiaofei Xie, Lei Ma, Lingjun Zhou, Yang Liu, Chi Xu, Jianjun Zhao

Track

ASE 2020 Research Papers

Time Zone

All times are shown in Coordinated Universal Time (UTC).

When

Thu 24 Sep 2020 09:30 - 09:50 at Koala - Testing and AI
Chair(s): Xiaoyuan Xie

Abstract

As Deep Learning (DL) is continuously adopted in many industrial applications, its quality and reliability are starting to raise concerns. As in the traditional software development process, testing DL software to uncover its defects at an early stage is an effective way to reduce risks after deployment. By the fundamental assumption of deep learning, DL software provides no statistical guarantee and has limited capability in handling data that go beyond its learned distribution, i.e., out-of-distribution (OOD) data. Recent progress has been made in designing novel testing techniques for DL software, which can detect thousands of errors. However, current state-of-the-art DL testing techniques do not take the distribution of the generated test data into consideration. It is therefore hard to judge whether the “identified errors” are indeed meaningful errors for the DL application (i.e., due to quality issues in the model) or outliers that cannot be handled by the current model (i.e., due to a lack of training data).

To fill this gap, we take the first step and conduct a large-scale empirical study, with a total of 451 experiment configurations, 42 DNNs and over 1.2 million test data instances, to investigate and characterize the capability of DL software from a data distribution perspective, towards understanding its impact on DL testing techniques. We first perform a large-scale empirical study on five state-of-the-art OOD detection techniques to investigate their performance in distinguishing in-distribution (ID) data from OOD data. Based on the results, we select the best OOD detection technique and investigate the characteristics of the test data generated by different DL testing techniques, i.e., 8 mutation operators and 6 testing criteria. The results demonstrate that some mutation operators and testing criteria tend to guide test generation towards OOD data, while others do the opposite. After identifying the ID and OOD errors, we further investigate their effectiveness in enhancing DL model robustness. The results confirm the importance of data distribution awareness in both the testing and enhancement phases, outperforming distribution-unaware retraining by up to 21.5%. As deep learning follows the data-driven development paradigm, in which behavior depends heavily on the training data, the results of this paper confirm the importance of data awareness and call for its inclusion when designing new testing and analysis techniques for DL software.
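The idea of separating ID errors from OOD errors can be illustrated with a minimal sketch. The abstract does not name the five OOD detection techniques studied, so the sketch below uses the well-known maximum-softmax-probability baseline purely as a stand-in; it is not claimed to be the technique the authors selected. The function names (`ood_score`, `split_errors`), the threshold of 0.5, and the toy logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw model outputs."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ood_score(logits):
    """Stand-in OOD score: 1 minus the maximum softmax probability.

    Higher values mean the model is less confident, which this baseline
    treats as evidence that the input is out-of-distribution.
    """
    return 1.0 - max(softmax(logits))


def split_errors(cases, threshold=0.5):
    """Split misclassified test cases into ID errors and OOD errors.

    cases: list of (logits, true_label) pairs.
    threshold: hypothetical cut-off on the OOD score (an assumption).
    Returns two lists of indices into `cases`: (id_errors, ood_errors).
    """
    id_errors, ood_errors = [], []
    for i, (logits, label) in enumerate(cases):
        predicted = max(range(len(logits)), key=logits.__getitem__)
        if predicted == label:
            continue  # correct prediction, not an error
        if ood_score(logits) > threshold:
            ood_errors.append(i)
        else:
            id_errors.append(i)
    return id_errors, ood_errors


# Toy data: (logits, true_label); values are invented for illustration.
cases = [
    ([4.0, 0.1, 0.2], 0),  # predicted correctly
    ([4.0, 0.1, 0.2], 1),  # confident but wrong
    ([0.5, 0.4, 0.3], 2),  # wrong with low confidence
]
id_errors, ood_errors = split_errors(cases)
print("ID errors:", id_errors, "OOD errors:", ood_errors)
```

Under this stand-in score, a confident misclassification is counted as an ID error (a likely model quality issue), while a low-confidence one is counted as an OOD error (a likely gap in the training data). A real study would calibrate the detector and threshold on held-out ID and OOD data rather than fixing them by hand.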

David Berend

Nanyang Technological University, Singapore

Singapore

Xiaofei Xie

Nanyang Technological University

Lei Ma

Kyushu University

Japan

Lingjun Zhou

College of Intelligence and Computing, Tianjin University

China

Yang Liu

Nanyang Technological University, Singapore

Singapore

Chi Xu

Singapore Institute of Manufacturing Technology, A*Star

Jianjun Zhao