ASE 2023
Mon 11 - Fri 15 September 2023 Kirchberg, Luxembourg
Tue 12 Sep 2023 10:30 - 10:42 at Room C - Testing AI Systems 1 Chair(s): Leonardo Mariani

Automated detection of software failures is an important but challenging software engineering task. It involves finding, in a vast search space, the failure-inducing test cases that contain an input triggering the software fault and an oracle asserting the incorrect execution. We are motivated to study how far this outstanding challenge can be solved by recent advances in large language models (LLMs) such as ChatGPT. However, our study reveals that ChatGPT has a relatively low success rate (28.8%) in finding correct failure-inducing test cases for buggy programs. A possible conjecture is that finding failure-inducing test cases requires analyzing the subtle differences (nuances) between the tokens of a program's correct version and those of its buggy version. When these two versions have similar sets of tokens and attentions, ChatGPT is weak at distinguishing their differences.
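To make the task concrete, here is a minimal illustration (an invented example, not taken from the paper): a failure-inducing test case pairs an input that triggers the fault with an oracle asserting the correct behavior, so the test fails when run against the buggy version.

# Invented buggy program for illustration: the loop bound skips the last element.
def max_of(xs):
    best = xs[0]
    for i in range(1, len(xs) - 1):  # bug: should be range(1, len(xs))
        if xs[i] > best:
            best = xs[i]
    return best

# A failure-inducing test case: the input [1, 3, 5] triggers the fault, and the
# oracle asserts the correct answer 5; the buggy version returns 3, so the test fails.
def test_max_of():
    assert max_of([1, 3, 5]) == 5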

We find that ChatGPT can successfully generate failure-inducing test cases when it is guided to focus on the nuances. Our solution is inspired by the observation that ChatGPT can infer the intended functionality of buggy code if it is similar to the correct version. Driven by this inspiration, we develop a novel technique, called Differential Prompting, to effectively find failure-inducing test cases with the help of compilable code synthesized from the inferred intention. Prompts are constructed based on the nuances between the given version and the synthesized code. We evaluate Differential Prompting on QuixBugs (a popular benchmark of buggy programs) and recent programs published on Codeforces (a popular programming contest portal, which is also an official benchmark of ChatGPT). We compare Differential Prompting with two baselines constructed using conventional ChatGPT prompting and Pynguin (the state-of-the-art unit test generation tool for Python programs). Our evaluation results show that for programs from QuixBugs, Differential Prompting achieves a success rate of 75.0% in finding failure-inducing test cases, outperforming the best baseline by 2.6X. For programs from Codeforces, Differential Prompting's success rate is 66.7%, outperforming the best baseline by 4.0X.
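The workflow the abstract describes can be summarized in a short sketch. This is one reading of the described steps, not the paper's code: chat, run_program, and parse_inputs are hypothetical callables standing in for an LLM chat API, a sandboxed program runner, and a test-input parser.

from typing import Callable, Iterable

def differential_prompting(
    buggy_code: str,
    chat: Callable[[str], str],                    # hypothetical LLM chat wrapper
    run_program: Callable[[str, str], str],        # hypothetical sandboxed runner
    parse_inputs: Callable[[str], Iterable[str]],  # hypothetical input extractor
) -> list[str]:
    # Step 1: infer the intended functionality of the given (possibly buggy) program.
    intention = chat("Describe the intended functionality of this program:\n" + buggy_code)

    # Step 2: synthesize a compilable reference version from the inferred intention.
    reference = chat("Write a Python program that does the following:\n" + intention)

    # Step 3: prompt for test inputs that focus on the nuances between the
    # given version and the synthesized reference.
    raw = chat(
        "These two programs should behave identically. Suggest inputs on which "
        "their outputs may differ:\n--- given ---\n" + buggy_code +
        "\n--- reference ---\n" + reference
    )

    # Step 4: differential testing. Inputs on which the two versions disagree
    # are failure-inducing; the reference's output serves as the oracle.
    return [t for t in parse_inputs(raw)
            if run_program(buggy_code, t) != run_program(reference, t)]

In this reading, the synthesized reference acts as a pseudo-oracle, which is why success hinges on ChatGPT inferring the intended functionality correctly.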

Tue 12 Sep

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

10:30 - 12:00
Testing AI Systems 1 (NIER Track / Research Papers) at Room C
Chair(s): Leonardo Mariani University of Milano-Bicocca
10:30
12m
Talk
Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting
Research Papers
Tsz-On Li (The Hong Kong University of Science and Technology), Wenxi Zong (Northeastern University), Yibo Wang (Northeastern University), Haoye Tian (University of Luxembourg), Ying Wang (Northeastern University), Shing-Chi Cheung (Hong Kong University of Science and Technology), Jeffrey Kramer (Imperial College London)
Pre-print
10:42
12m
Talk
SOCRATEST: Towards Autonomous Testing Agents via Conversational Large Language Models
NIER Track
Robert Feldt (Chalmers University of Technology, Sweden), Sungmin Kang (KAIST), Juyeon Yoon (Korea Advanced Institute of Science and Technology), Shin Yoo (KAIST)
Pre-print File Attached
10:54
12m
Research paper
Semantic Data Augmentation for Deep Learning Testing using Generative AI
NIER Track
Sondess Missaoui (University of York), Simos Gerasimou (University of York), Nicholas Matragkas (Université Paris-Saclay, CEA, List)
File Attached
11:06
12m
Talk
Robin: A Novel Method to Produce Robust Interpreters for Deep Learning-Based Code Classifiers
Research Papers
Zhen Li (Huazhong University of Science and Technology), Ruqian Zhang (Huazhong University of Science and Technology), Deqing Zou (Huazhong University of Science and Technology), Ning Wang (Huazhong University of Science and Technology), Yating Li (Huazhong University of Science and Technology), Shouhuai Xu (University of Colorado Colorado Springs), Chen Chen (University of Central Florida), Hai Jin (Huazhong University of Science and Technology)
Pre-print
11:18
12m
Talk
The Devil is in the Tails: How Long-Tailed Code Distributions Impact Large Language Models
Research Papers
Xin Zhou (Singapore Management University, Singapore), Kisub Kim (Singapore Management University, Singapore), Bowen Xu (North Carolina State University), Jiakun Liu (Singapore Management University), DongGyun Han (Royal Holloway, University of London), David Lo (Singapore Management University)
Pre-print
11:30
12m
Talk
CertPri: Certifiable Prioritization for Deep Neural Networks via Movement Cost in Feature Space
Recorded talk
Research Papers
Haibin Zheng (Zhejiang University of Technology), Jinyin Chen (College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China), Haibo Jin (Zhejiang University of Technology)
Pre-print Media Attached