AEON: A Method for Automatic Evaluation of NLP Test Cases (ISSTA 2022 - Technical Papers)

Who

Jen-tse Huang, Jianping Zhang, Wenxuan Wang, Pinjia He, Yuxin Su, Michael Lyu

Track

ISSTA 2022 Technical Papers

Time Zone

The program is currently displayed in (GMT+09:00) Seoul.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Thu 21 Jul 2022 16:20 - 16:40 at ISSTA 2 - Session 3-6: Neural Networks, Learning, NLP F
Fri 22 Jul 2022 00:00 - 00:20 at ISSTA 2 - Session 1-10: Neural Networks, Learning, NLP A

Abstract

Due to the labor-intensive nature of manual test oracle construction, various automated testing techniques have been proposed to enhance the reliability of Natural Language Processing (NLP) software. In theory, these techniques mutate an existing test case (e.g., a sentence with its label) and assume the generated one preserves an equivalent or similar semantic meaning and, thus, the same label. However, in practice, many of the generated test cases fail to preserve similar semantic meaning and are unnatural (e.g., they contain grammar errors), which leads to a high false alarm rate and unnatural test cases. Our evaluation study finds that 44% of the test cases generated by the state-of-the-art (SOTA) approaches are false alarms. These test cases require extensive manual checking effort, and instead of improving NLP software, they can even degrade NLP software when utilized in model training. To address this problem, we propose AEON for Automatic Evaluation Of NLP test cases. For each generated test case, it outputs scores based on semantic similarity and language naturalness. We employ AEON to evaluate test cases generated by four popular testing techniques on five datasets across three typical NLP tasks. The results show that AEON aligns the best with human judgment. In particular, AEON achieves the best average precision in detecting semantically inconsistent test cases, outperforming the best baseline metric by 10%. In addition, AEON also has the highest average precision in finding unnatural test cases, surpassing the baselines by more than 15%. Moreover, model training with test cases prioritized by AEON leads to models that are more accurate and robust, demonstrating AEON’s potential in improving NLP software.
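
The abstract reports results as average precision (AP). As a rough illustration of how that metric is usually computed, and not the authors' evaluation code, the sketch below ranks test cases by a detector score and averages the precision at each rank where a truly flawed test case appears. The function name, the score values, and the labels are all hypothetical; the sketch assumes a higher score means "more likely inconsistent or unnatural" and that label 1 marks a test case that human annotators judged flawed.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Average precision of a ranking induced by `scores`.

    scores: hypothetical detector scores, higher = more likely flawed.
    labels: 1 if the test case is truly flawed (per human judgment), else 0.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    # Rank test cases from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0             # flawed test cases seen so far in the ranking
    precision_sum = 0.0  # sum of precision@rank at each flawed test case
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank

    # With no flawed test cases at all, AP is undefined; return 0.0 by convention here.
    return precision_sum / hits if hits else 0.0


# Made-up example: four generated test cases, two of them flawed.
example_scores = [0.9, 0.8, 0.3, 0.1]
example_labels = [1, 0, 1, 0]
ap = average_precision(example_scores, example_labels)
```

In this made-up example the flawed test cases sit at ranks 1 and 3, so the precision values averaged are 1/1 and 2/3. A metric that ranks every flawed test case ahead of every valid one would reach an AP of 1.0.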

DOI

https://doi.org/10.1145/3533767.3534394

Jen-tse Huang

Johns Hopkins University

United States

Jianping Zhang

The Chinese University of Hong Kong

Wenxuan Wang

The Chinese University of Hong Kong

Pinjia He

The Chinese University of Hong Kong, Shenzhen

China

Yuxin Su

Sun Yat-sen University

Michael Lyu

The Chinese University of Hong Kong

Session Program

Thu 21 Jul
Displayed time zone: Seoul

16:20 - 17:40	Session 3-6: Neural Networks, Learning, NLP F (Technical Papers) at ISSTA 2

16:20 20m Talk		AEON: A Method for Automatic Evaluation of NLP Test Cases (Technical Papers). Jen-tse Huang (Johns Hopkins University); Jianping Zhang (The Chinese University of Hong Kong); Wenxuan Wang (The Chinese University of Hong Kong); Pinjia He (The Chinese University of Hong Kong, Shenzhen); Yuxin Su (Sun Yat-sen University); Michael Lyu (The Chinese University of Hong Kong)
16:40 20m Talk		HybridRepair: Towards Annotation-Efficient Repair for Deep Learning Models (Technical Papers). Yu Li (The Chinese University of Hong Kong); Muxi Chen (The Chinese University of Hong Kong); Xu, Qiang
17:00 20m Talk		Improving Cross-Platform Binary Analysis using Representation Learning via Graph Alignment (Technical Papers). Geunwoo Kim (University of California, Irvine, USA); Sanghyun Hong (Oregon State University); Michael Franz (University of California, Irvine); Dokyung Song (Yonsei University, South Korea)
17:20 20m Talk		Human-in-the-Loop Oracle Learning for Semantic Bugs in String Processing Programs (Technical Papers). Charaka Geethal (Monash University); Thuan Pham (The University of Melbourne); Aldeida Aleti (Monash University); Marcel Böhme (MPI-SP, Germany and Monash University, Australia)

Fri 22 Jul
Displayed time zone: Seoul

00:00 - 01:00	Session 1-10: Neural Networks, Learning, NLP A (Technical Papers) at ISSTA 2

00:00 20m Talk		AEON: A Method for Automatic Evaluation of NLP Test Cases (Technical Papers). Jen-tse Huang (Johns Hopkins University); Jianping Zhang (The Chinese University of Hong Kong); Wenxuan Wang (The Chinese University of Hong Kong); Pinjia He (The Chinese University of Hong Kong, Shenzhen); Yuxin Su (Sun Yat-sen University); Michael Lyu (The Chinese University of Hong Kong)
00:20 20m Talk		Simple Techniques Work Surprisingly Well for Neural Network Test Prioritization and Active Learning (Replicability Study) (Technical Papers). Michael Weiss (Università della Svizzera italiana (USI)); Paolo Tonella (USI Lugano)
00:40 20m Talk		ε-weakened Robustness of Deep Neural Networks (Technical Papers). Pei Huang (State Key Laboratory of Computer Science, Institution of Software, Chinese Academy of Sciences); Yuting Yang (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences); Minghao Liu (Institute of Software, Chinese Academy of Sciences); Fuqi Jia (State Key Laboratory of Computer Science, Institution of Software, Chinese Academy of Sciences); Feifei Ma (Institute of Software, Chinese Academy of Sciences); Jian Zhang (Institute of Software at Chinese Academy of Sciences, China)