Adaptive Probabilistic Operational Testing for Large Language Models Evaluation
Abstract—Large Language Models (LLMs) empower many modern software systems and are required to be highly accurate and reliable. Evaluating LLMs is challenging due to the high cost of manually labeling data and of validating those labels. This study investigates the suitability of probabilistic operational testing for effective and efficient LLM evaluation. To this aim, we adopt an existing framework for DNN testing (DeepSample) and adapt it to the LLM domain by introducing auxiliary variables tailored to LLMs and classification tasks. Through a comprehensive case study, we show how sampling-based operational testing can be used, depending on the tester's needs, to yield reliable LLM accuracy estimates, to effectively expose LLM failures, or to balance multiple evaluation objectives under testing budget constraints. The evaluation with a popular LLM on three sentiment analysis datasets shows that sampling-based methods can provide effective and efficient operational accuracy assessment of LLMs, thereby bridging critical gaps in current LLM quality assessment practices. Practical implications for testers are drawn from this experimental evaluation.
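
To make the idea of sampling-based operational accuracy assessment concrete, the sketch below illustrates one plausible instantiation: selecting test inputs with probability proportional to an auxiliary variable (here assumed to be the LLM's prediction confidence) and estimating operational accuracy with an unbiased Hansen-Hurwitz-style weighted estimator under a fixed labeling budget. This is an illustrative assumption, not the DeepSample framework or the method evaluated in the paper; the function and variable names (`sample_based_accuracy`, `confidence`, `is_correct`) are hypothetical.

```python
# Illustrative sketch: budget-constrained, auxiliary-variable-driven sampling
# to estimate the operational accuracy of an LLM classifier.
import numpy as np


def sample_based_accuracy(confidence, is_correct, budget, rng=None):
    """Estimate operational accuracy from a labeled sample of size `budget`.

    confidence : auxiliary-variable values, one per operational input; low
                 confidence is assumed to correlate with failures.
    is_correct : oracle verdicts (correct/incorrect), queried only for the
                 sampled inputs, which is where the labeling cost lies.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(confidence)

    # Selection probabilities proportional to the auxiliary variable:
    # favor low-confidence inputs, which are more likely to expose failures.
    size = 1.0 - confidence + 1e-6
    p = size / size.sum()

    # Draw the test sample (with replacement, to keep the estimator simple).
    idx = rng.choice(n, size=budget, replace=True, p=p)

    # Hansen-Hurwitz estimator of the per-input failure rate, then accuracy.
    y = 1.0 - is_correct[idx].astype(float)   # 1 if the LLM failed on the input
    est_failure_rate = np.mean(y / (n * p[idx]))
    return 1.0 - est_failure_rate, idx


# Toy usage with synthetic data standing in for an operational dataset.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10_000)
correct = rng.uniform(size=10_000) < conf     # failures concentrate at low confidence
acc_hat, sampled = sample_based_accuracy(conf, correct, budget=200, rng=rng)
print(f"estimated accuracy: {acc_hat:.3f}  (true: {correct.mean():.3f})")
```

The inverse-probability weighting keeps the accuracy estimate unbiased even though the sample is deliberately skewed toward likely failures, which is what lets a single sampling scheme serve both estimation and failure-exposure objectives under a budget.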