ICSE 2024
Fri 12 - Sun 21 April 2024 Lisbon, Portugal

In the wake of developments in the field of Natural Language Processing, Question Answering (QA) software has penetrated our daily life. Due to the data-driven programming paradigm, QA software inevitably contains bugs, i.e., misbehaving in real-world applications. Current testing techniques for testing QA software include two folds, reference-based testing and metamorphic testing.

This paper adopts a different angle to achieve testing for QA software: we notice that answers to questions would have inference relations, i.e., the answers to some questions could be logically inferred from the answers to other questions. If these answers on QA software do not satisfy the inference relations, an inference bug is detected. To generate the questions with the inference relations automatically, we propose a novel testing method Knowledge Graph driven Inference Testing (KGIT), which employs facts in the Knowledge Graph (KG) as the seeds to logically construct test cases containing questions and contexts with inference relations. To evaluate the effectiveness of KGIT, we conduct an extensive empirical study with more than 2.8 million test cases generated from the large-scale KG YAGO4 and three QA models based on the state-of-the-art QA model structure. The experimental results show that our method (a) could detect a considerable number of inference bugs in all three studied QA models and (b) is helpful in retraining QA models to improve their inference ability.