Testing Your Question Answering Software via Asking Recursively
Question Answering (QA) is an attractive and challenging area in the NLP community: diverse algorithms have been proposed, and various benchmark datasets covering different topics and task formats have been constructed. QA software is now also widely used in daily life. However, current QA software is mainly tested in a reference-based paradigm, in which the expected outputs (labels) of test cases must be annotated with considerable human effort before testing. As a result, neither just-in-time testing during usage nor extensible testing on massive unlabeled real-life data is feasible, which keeps the current testing of QA software from being flexible and sufficient. In this paper, we propose QAAskeR, a method with three novel Metamorphic Relations for testing QA software. QAAskeR requires no annotated labels; instead, it tests QA software by checking its behavior on multiple recursively asked questions that relate to the same knowledge. Experimental results show that QAAskeR reveals violations on over 80% of valid cases without using any pre-annotated labels, and it uncovers diverse answering issues in a state-of-the-art QA algorithm, especially limited generalization over question types across datasets.
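To make the label-free idea concrete, the following is a minimal Python sketch of one recursive-asking consistency check in the spirit of the abstract. It is not the paper's actual three Metamorphic Relations: the model interface, the hand-written follow-up template, and the toy `fake_model` under test are all hypothetical stand-ins, and the real QAAskeR derives follow-up questions automatically rather than from a fixed template.

```python
from typing import Callable

# Hypothetical interface for the QA software under test:
# (question, context) -> answer string.
QAModel = Callable[[str, str], str]


def recursive_ask_check(model: QAModel, context: str, question: str,
                        follow_up_template: str, expected_entity: str) -> bool:
    """One metamorphic check: ask `question`, splice its answer into a
    follow-up question about the same knowledge, and require the second
    answer to mention `expected_entity` (an entity already present in the
    first question). No ground-truth label for `question` is needed."""
    first_answer = model(question, context)
    follow_up = follow_up_template.format(answer=first_answer)
    second_answer = model(follow_up, context)
    # The oracle is consistency between the two answers, not a label.
    return expected_entity.lower() in second_answer.lower()


def fake_model(question: str, context: str) -> str:
    """Toy stand-in for real QA software, used only to run the sketch."""
    if "who wrote" in question.lower():
        return "William Shakespeare"
    if "what did" in question.lower():
        return "the tragedy Hamlet"
    return "unknown"


if __name__ == "__main__":
    ctx = "William Shakespeare wrote the tragedy Hamlet around 1600."
    consistent = recursive_ask_check(
        model=fake_model,
        context=ctx,
        question="Who wrote Hamlet?",
        follow_up_template="What did {answer} write?",
        expected_entity="Hamlet",
    )
    print("consistent" if consistent else "violation revealed")
```

The key property this sketch illustrates is that the expected answer to the first question is never consulted: the oracle is agreement between the original and follow-up answers, which is what allows such tests to run on massive unlabeled real-life inputs.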
Tue 16 Nov (displayed time zone: Hobart)
23:00 - 00:00 | Artefacts Plenary (Any Day Band 2): Artifact Evaluation at Kangaroo. Chair(s): Aldeida Aleti (Monash University), Tim Menzies (North Carolina State University)

23:00 (5m) Day opening: Opening
23:05 (7m) Keynote: Keynote. Dirk Beyer (LMU Munich, Germany)
23:12 (3m) Talk: CiFi: Versatile Analysis of Class and Field Immutability. Tobias Roth, Dominik Helm, Michael Reif, Mira Mezini (Technische Universität Darmstadt)
23:15 (3m) Talk: Testing Your Question Answering Software via Asking Recursively. Songqiang Chen, Shuo Jin, Xiaoyuan Xie (School of Computer Science, Wuhan University, China)
23:18 (3m) Talk: Restoring the Executability of Jupyter Notebooks by Automatic Upgrade of Deprecated APIs. Chenguang Zhu (University of Texas at Austin), Ripon Saha (Fujitsu Laboratories of America, Inc.), Mukul Prasad (Fujitsu Research of America), Sarfraz Khurshid (The University of Texas at Austin)
23:21 (3m) Talk: Context Debloating for Object-Sensitive Pointer Analysis
23:24 (3m) Talk: Understanding and Detecting Performance Bugs in Markdown Compilers. Penghui Li, Yinxi Liu (The Chinese University of Hong Kong), Wei Meng (Chinese University of Hong Kong)
23:27 (5m) Product release: Reuse graphs
23:32 (10m) Talk: Most reused artefacts
23:42 (18m) Live Q&A: Discussion