Kitten: A Simple Yet Effective Baseline for Evaluating LLM-Based Compiler Testing Techniques
Compiler testing is indispensable for improving the correctness of compilers. Spurred by recent advances in Large Language Models (LLMs), LLM-based compiler testing techniques such as Fuzz4All have demonstrated their potential to uncover real bugs in diverse compilers while reducing the engineering effort required to design program generators. Given the continuous evolution of LLMs and the emergence of new LLM-based approaches, establishing robust baselines is crucial for rigorous evaluation and for driving future advances in this promising research direction.
To this end, we introduce Kitten, a mutation-based, language-agnostic program generator. Kitten leverages a corpus of seed programs, analogous to the training set of LLMs, and utilizes the target language’s syntax, akin to the knowledge learned by LLMs. Furthermore, Kitten’s mutation operators generate diverse test programs, mirroring the ability of LLM inference to generate new code.
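To make the mutation loop concrete, the sketch below shows one minimal instance of such a generator: draw a seed from the corpus, apply a randomly chosen mutation operator, compile the mutant, and flag abnormal compiler exits. This is an illustrative assumption of how a mutation-based baseline can operate, not Kitten’s actual implementation; in particular, the crude line-level operators here stand in for Kitten’s syntax-aware ones, and the directory layout, compiler command, and crash oracle are all hypothetical.

```python
import random
import subprocess
import tempfile
from pathlib import Path


def delete_span(lines, _donor):
    """Drop a random contiguous span of lines from the program."""
    if len(lines) < 2:
        return lines
    i = random.randrange(len(lines))
    j = random.randrange(i, len(lines))
    return lines[:i] + lines[j + 1:]


def splice_span(lines, donor):
    """Insert a random span copied from another seed program."""
    if not donor:
        return lines
    i = random.randrange(len(donor))
    j = random.randrange(i, len(donor))
    k = random.randrange(len(lines) + 1)
    return lines[:k] + donor[i:j + 1] + lines[k:]


MUTATORS = [delete_span, splice_span]


def fuzz(seed_dir, compiler_cmd, iterations=1000):
    """Repeatedly mutate seed programs and compile the mutants."""
    seeds = [p.read_text(errors="ignore").splitlines()
             for p in Path(seed_dir).iterdir() if p.is_file()]
    for _ in range(iterations):
        base, donor = random.choice(seeds), random.choice(seeds)
        mutant = "\n".join(random.choice(MUTATORS)(list(base), donor))
        with tempfile.NamedTemporaryFile("w", suffix=".c",
                                         delete=False) as f:
            f.write(mutant)
        try:
            result = subprocess.run(compiler_cmd + [f.name],
                                    capture_output=True, timeout=30)
        except subprocess.TimeoutExpired:
            continue  # a compiler hang would also be worth reporting
        # Crude oracle: ordinary rejections of invalid input exit with
        # 0 or 1; anything else (e.g., a signal, reported as a negative
        # code) suggests a compiler crash.
        if result.returncode not in (0, 1):
            print(f"potential compiler bug: {f.name} "
                  f"(exit status {result.returncode})")


if __name__ == "__main__":
    # Assumed setup: a ./seeds directory of C programs, tested with GCC.
    fuzz("seeds", ["gcc", "-O2", "-c", "-o", "/dev/null"])
```

Because the loop only needs a seed corpus and a set of mutation operators, swapping in another language’s test suite and syntax-aware operators retargets the same skeleton to a different compiler, which is what makes the approach language-agnostic.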
Our evaluation demonstrates that, using existing compiler test suites as seed programs, Kitten outperforms Fuzz4All in both code coverage and bug detection. Within 24 hours, Kitten achieved 48.3%, 9.9%, and 33.8% higher coverage than Fuzz4All on GCC, LLVM, and Rustc, respectively, while also identifying on average 19.3 bugs in GCC, 20.3 bugs in LLVM, and 15.7 bugs in Rustc. Over the nine months dedicated to Kitten’s development and testing, we identified a total of 328 bugs across the compilers GCC, LLVM, Rustc, Solc, JerryScript, scalac, and slang, of which 310 have been confirmed or fixed. We strongly believe that Kitten serves as an effective baseline, enabling the identification of limitations in existing LLM-based approaches and consequently driving advances in this promising research direction.
Wed 25 Jun (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
14:00 - 15:30 | LLM-based Testing 1 | Research Papers / Tool Demonstrations | Cosmos 3A | Chair(s): Qingkai Shi (Nanjing University)
14:00 (25m, Talk) | A Large-scale Empirical Study on Fine-tuning Large Language Models for Unit Testing | Research Papers | Ye Shang (Nanjing University), Quanjun Zhang (School of Computer Science and Engineering, Nanjing University of Science and Technology), Chunrong Fang (Nanjing University), Siqi Gu (Nanjing University), Jianyi Zhou (Huawei Cloud Computing Technologies Co., Ltd.), Zhenyu Chen (Nanjing University)
14:25 (25m, Talk) | Validating Network Protocol Parsers with Traceable RFC Document Interpretation | Research Papers | Mingwei Zheng (Purdue University), Danning Xie (Purdue University), Qingkai Shi (Nanjing University), Chengpeng Wang (Purdue University), Xiangyu Zhang (Purdue University)
14:50 (25m, Talk) | Tratto: A Neuro-Symbolic Approach to Deriving Axiomatic Test Oracles | Research Papers | Davide Molinelli (USI Lugano; Schaffhausen Institute of Technology), Alberto Martin-Lopez (Software Institute, USI Lugano), Elliott Zackrone (University of Washington), Beyza Eken (Sakarya University), Michael D. Ernst (University of Washington), Mauro Pezze (Università della Svizzera italiana (USI), Università degli Studi di Milano Bicocca, and CIT Constructor Institute of Technology)
15:15 (15m, Demonstration) | Kitten: A Simple Yet Effective Baseline for Evaluating LLM-Based Compiler Testing Techniques | Tool Demonstrations | Yuanmin Xie (Tsinghua University), Zhenyang Xu (University of Waterloo), Yongqiang Tian, Min Zhou, Xintong Zhou (University of Waterloo), Chengnian Sun (University of Waterloo)
Cosmos 3A is the first room in the Cosmos 3 wing.
When facing the main Cosmos Hall, access to the Cosmos 3 wing is on the left, close to the stairs. The area is accessed through a large door with the number “3”, which will stay open during the event.