Adaptive Probabilistic Operational Testing for Large Language Models Evaluation
Abstract—Large Language Models (LLMs) empower many modern software systems and are required to be highly accurate and reliable. Evaluating LLMs poses challenges due to the high costs of manually labeling data and of validating those labels. This study investigates the suitability of probabilistic operational testing for effective and efficient LLM evaluation. To this aim, we adopt an existing framework for DNN testing (DeepSample) and adapt it to the LLM domain by introducing auxiliary variables tailored to LLMs and classification tasks. Through a comprehensive case study, we show how sampling-based operational testing can be used, depending on the tester's needs, to yield reliable LLM accuracy estimates, to effectively expose LLM failures, or to balance multiple evaluation objectives under testing budget constraints. A comprehensive evaluation with a popular LLM on three sentiment analysis datasets shows that sampling-based methods can provide effective and efficient operational accuracy assessment of LLMs, thereby bridging critical gaps in current LLM quality assessment practices. Practical implications for testers are drawn from this experimental evaluation.
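For readers unfamiliar with the idea, the sketch below illustrates what sampling-based operational accuracy estimation can look like in code. It is not the paper's DeepSample implementation; all function names, the choice of predictive entropy as the auxiliary variable, and the Hansen-Hurwitz reweighting are illustrative assumptions. The point it demonstrates: label only a small, auxiliary-variable-guided sample of operational inputs, then reweight the labeled outcomes so the estimate stays unbiased for whole-dataset accuracy.

```python
"""Minimal sketch of sampling-based operational accuracy estimation.

NOT the DeepSample implementation described in the paper: the auxiliary
variable (predictive entropy) and the with-replacement Hansen-Hurwitz
estimator are assumptions chosen for illustration.
"""
import numpy as np

rng = np.random.default_rng(0)


def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of each prediction's class distribution (auxiliary variable)."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def sample_and_estimate(probs, oracle_label, pred_label, budget):
    """Estimate whole-dataset accuracy from `budget` labeled samples.

    probs:        (N, C) class probabilities from the LLM on operational inputs
    oracle_label: callable i -> true label (the costly manual-labeling step)
    pred_label:   (N,) predicted labels
    """
    n = probs.shape[0]
    # Selection probabilities proportional to the auxiliary variable:
    # uncertain inputs are sampled more often. A small floor keeps every
    # input selectable, which the unbiasedness argument requires.
    q = predictive_entropy(probs) + 1e-3
    q = q / q.sum()
    idx = rng.choice(n, size=budget, replace=True, p=q)
    # Hansen-Hurwitz correction: weight each labeled outcome by 1/(n*q_i)
    # so the weighted mean is an unbiased estimate of true accuracy.
    correct = np.array([oracle_label(i) == pred_label[i] for i in idx], dtype=float)
    weights = 1.0 / (n * q[idx])
    return float((correct * weights).mean())


if __name__ == "__main__":
    # Toy usage on synthetic 3-class "sentiment" predictions (~80% accurate).
    N, C = 5000, 3
    logits = rng.normal(size=(N, C))
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    pred = probs.argmax(axis=1)
    true = np.where(rng.random(N) < 0.8, pred, rng.integers(0, C, N))
    est = sample_and_estimate(probs, lambda i: true[i], pred, budget=200)
    print(f"estimated accuracy: {est:.3f}  (true: {(pred == true).mean():.3f})")
```

Sampling proportionally to uncertainty concentrates the labeling budget on likely failures, which serves the failure-exposure objective, while the reweighting keeps the accuracy estimate itself unbiased; uniform sampling is recovered as the special case q_i = 1/n, where all weights equal 1.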
Tue 29 Apr (displayed time zone: Eastern Time, US & Canada)
14:00 - 15:30

14:00 (30m, Full paper): Adaptive Probabilistic Operational Testing for Large Language Models Evaluation (AST 2025)
  Ali Asgari (TU Delft); Antonio Guerriero, Roberto Pietrantuono, Stefano Russo (Università di Napoli Federico II)
  Pre-print available

14:30 (30m, Full paper): ASTRAL: Automated Safety Testing of Large Language Models (AST 2025)
  Miriam Ugarte, Pablo Valle, Aitor Arrieta (Mondragon University); José Antonio Parejo Maestre, Sergio Segura (SCORE Lab, I3US Institute, Universidad de Sevilla, Seville, Spain)
  Pre-print available

15:00 (30m, Full paper): A Taxonomy of Failures in Tool-Augmented LLMs (AST 2025)