AST 2025
Sat 26 April - Sun 4 May 2025 Ottawa, Ontario, Canada
co-located with ICSE 2025
Tue 29 Apr 2025 14:00 - 14:30 at 211 - Session 5: Testing of LLMs

Abstract: Large Language Models (LLMs) power many modern software systems and are required to be highly accurate and reliable. Evaluating LLMs is challenging because manually labeling data, and validating the labeled data, is costly. This study investigates the suitability of probabilistic operational testing for effective and efficient LLM evaluation. To this aim, we adopt an existing framework for DNN testing (DeepSample) and adapt it to the LLM domain by introducing auxiliary variables tailored to LLMs and classification tasks. Through a comprehensive case study, we show how sampling-based operational testing can be used, depending on the tester’s needs, to yield reliable LLM accuracy estimates, to effectively expose LLM failures, or to balance multiple evaluation objectives under testing budget constraints. The evaluation with a popular LLM on three sentiment analysis datasets shows that sampling-based methods can provide effective and efficient operational accuracy assessment of LLMs, thereby bridging critical gaps in current LLM quality assessment practices. Practical implications for testers are drawn based on this experimental evaluation.

Tue 29 Apr

Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30
Session 5: Testing of LLMs, AST 2025, at 211

Session chair: Annibale Panichella

14:00
30m
Full-paper
Adaptive Probabilistic Operational Testing for Large Language Models Evaluation
AST 2025
Ali Asgari TU Delft, Antonio Guerriero Università di Napoli Federico II, Roberto Pietrantuono Università di Napoli Federico II, Stefano Russo Università di Napoli Federico II
Pre-print
14:30
30m
Full-paper
ASTRAL: Automated Safety Testing of Large Language Models
AST 2025
Miriam Ugarte Mondragon University, Pablo Valle Mondragon University, José Antonio Parejo Maestre SCORE Lab, I3US Institute, Universidad de Sevilla, Seville, Spain, Sergio Segura SCORE Lab, I3US Institute, Universidad de Sevilla, Seville, Spain, Aitor Arrieta Mondragon University
Pre-print
15:00
30m
Full-paper
A Taxonomy of Failures in Tool-Augmented LLMs
AST 2025
Cailin Winston University of Washington, René Just University of Washington