On the evaluation of test suites generated by large language models
Automating test suite generation requires knowledge about the system under test and the underlying test case generation approach. Querying a chatbot such as ChatGPT with the program under test to obtain test cases is an appealing alternative for reducing costs and effort. However, the quality of the obtained test suite, in terms of code coverage or mutation score, may be questionable. Hence, an experimental evaluation focusing on the quality metrics of the resulting test suites is important. In this paper, we provide such an experimental evaluation considering 12 Python programs and five different large language models, including ChatGPT-4o. We measure the statement and line coverage as well as the mutation score of the resulting test suites. Furthermore, we vary the temperature used by the large language models and the prompting strategy. Besides reporting on the corresponding research questions, we also compare our results with similar, previously published studies.
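To make the experimental setup more concrete, the following is a minimal sketch of how such an experiment could be scripted. It is not the authors' actual pipeline: the OpenAI Python client, the model name, the prompt wording, and the file names are illustrative assumptions. The generated suite could then be assessed with standard tools such as coverage.py (statement/line coverage) and mutmut (mutation score).

```python
# Minimal sketch (not the authors' pipeline): ask an LLM to generate a pytest
# suite for a given Python program at a chosen temperature, then save it so
# coverage and mutation score can be measured afterwards.
# Assumes the OpenAI Python client; model name and prompt are illustrative.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are a software tester. Write a pytest test suite for the following "
    "Python program. Return only the test code.\n\n{source}"
)


def generate_test_suite(program_path: str, model: str = "gpt-4o",
                        temperature: float = 0.7) -> str:
    """Query the LLM with the program under test and return the generated tests."""
    source = Path(program_path).read_text()
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(source=source)}],
    )
    # In practice, markdown fences around the returned code may need stripping.
    return response.choices[0].message.content


if __name__ == "__main__":
    tests = generate_test_suite("example_program.py", temperature=0.2)
    Path("test_example_program.py").write_text(tests)
    # Quality metrics could then be collected with standard tools, e.g.:
    #   coverage run -m pytest && coverage report   (statement/line coverage)
    #   mutmut run                                  (mutation score)
```

Varying the temperature argument and the prompt template in such a script corresponds to the two experimental factors mentioned in the abstract.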
Wed 17 Sep (displayed time zone: Athens)

16:00 - 17:40  Automated Test Generation and AI-Driven Testing (General Track), Atrium C
Chair(s): Tolgahan Bardakci, University of Antwerp and Flanders Make

16:00 (30m) Talk: On the evaluation of test suites generated by large language models (General Track)
16:30 (30m) Talk: On the use of imbalanced datasets for learning-based vulnerability detection (General Track)
17:00 (20m) Talk: Tracing Vulnerability Propagation Across Open Source Software Ecosystems (General Track)
            Jukka Ruohonen (University of Southern Denmark), Qusai Ramadan (The Maersk Mc-Kinney Moller Institute, University of Southern Denmark)
17:20 (20m) Talk: Localization Testing in Video Games using Text Recognition (General Track)
            Guillermo Jimenez-Diaz (Universidad Complutense de Madrid), Dewei Chen (Universidad Complutense de Madrid)