On the evaluation of test suites generated by large language models
Automating test suite generation requires knowledge about the system under test and the underlying test case generation approach. Querying a chatbot such as ChatGPT with the program under test to obtain test cases is an appealing alternative for reducing costs and effort. However, the quality of the obtained test suite, in terms of code coverage or mutation score, may be questionable. Hence, an experimental evaluation focusing on the quality metrics of the resulting test suites is important. In this paper, we provide such an experimental evaluation considering 12 Python programs and five different large language models, including ChatGPT-4o. We measure the statement and line coverage as well as the mutation score of the resulting test suites. Furthermore, we vary the temperature used by the large language models and the prompting strategy. Besides reporting on the corresponding research questions, we also compare our results with similar, previously published studies.
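To make the experimental setup more concrete, the following is a minimal sketch of how such an experiment could be scripted. It is not the authors' actual pipeline: the OpenAI Python client, the model name, the prompt wording, and the file names are illustrative assumptions. The generated suite could then be assessed with standard tools such as coverage.py (statement/line coverage) and mutmut (mutation score).

```python
# Minimal sketch (not the authors' pipeline): ask an LLM to generate a pytest
# suite for a given Python program at a chosen temperature, then save it so
# coverage and mutation score can be measured afterwards.
# Assumes the OpenAI Python client; model name and prompt are illustrative.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT_TEMPLATE = (
    "You are a software tester. Write a pytest test suite for the following "
    "Python program. Return only the test code.\n\n{source}"
)


def generate_test_suite(program_path: str, model: str = "gpt-4o",
                        temperature: float = 0.7) -> str:
    """Query the LLM with the program under test and return the generated tests."""
    source = Path(program_path).read_text()
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(source=source)}],
    )
    # In practice, markdown fences around the returned code may need stripping.
    return response.choices[0].message.content


if __name__ == "__main__":
    tests = generate_test_suite("example_program.py", temperature=0.2)
    Path("test_example_program.py").write_text(tests)
    # Quality metrics could then be collected with standard tools, e.g.:
    #   coverage run -m pytest && coverage report   (statement/line coverage)
    #   mutmut run                                  (mutation score)
```

Varying the temperature argument and the prompt template in such a script corresponds to the two experimental factors mentioned in the abstract.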
Wed 17 Sep (displayed time zone: Athens)

16:00 - 17:40  Automated Test Generation and AI-Driven Testing (General Track), Atrium C
Chair(s): Tolgahan Bardakci, University of Antwerp and Flanders Make

16:00 (30m) Talk: On the evaluation of test suites generated by large language models (General Track)
16:30 (30m) Talk: On the use of imbalanced datasets for learning-based vulnerability detection (General Track)
17:00 (20m) Talk: Tracing Vulnerability Propagation Across Open Source Software Ecosystems (General Track)
            Jukka Ruohonen (University of Southern Denmark), Qusai Ramadan (The Maersk Mc-Kinney Moller Institute, University of Southern Denmark)
17:20 (20m) Talk: Localization Testing in Video Games using Text Recognition (General Track)
            Guillermo Jimenez-Diaz (Universidad Complutense de Madrid), Dewei Chen (Universidad Complutense de Madrid)