Adaptive Probabilistic Operational Testing for Large Language Models Evaluation
Abstract—Large Language Models (LLMs) empower many modern software systems and are required to be highly accurate and reliable. Evaluating LLMs is challenging due to the high cost of manually labeling data and of validating those labels. This study investigates the suitability of probabilistic operational testing for effective and efficient LLM evaluation. To this aim, we adopt an existing framework for DNN testing (DeepSample) and adapt it to the LLM domain by introducing auxiliary variables tailored to LLMs and classification tasks. Through a comprehensive case study, we show how sampling-based operational testing can be used, depending on the tester's needs, to yield reliable LLM accuracy estimates, to effectively expose LLM failures, or to balance multiple evaluation objectives under testing budget constraints. The evaluation with a popular LLM on three sentiment analysis datasets shows that sampling-based methods can provide effective and efficient operational accuracy assessment of LLMs, thereby bridging critical gaps in current LLM quality assessment practices. Practical implications for testers are drawn from this experimental evaluation.
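
To make the idea of sampling-based operational accuracy assessment concrete, the sketch below illustrates one plausible instantiation: selecting test inputs with probability proportional to an auxiliary variable (here assumed to be the LLM's prediction confidence) and estimating operational accuracy with an unbiased Hansen-Hurwitz-style weighted estimator under a fixed labeling budget. This is an illustrative assumption, not the DeepSample framework or the method evaluated in the paper; the function and variable names (`sample_based_accuracy`, `confidence`, `is_correct`) are hypothetical.

```python
# Illustrative sketch: budget-constrained, auxiliary-variable-driven sampling
# to estimate the operational accuracy of an LLM classifier.
import numpy as np


def sample_based_accuracy(confidence, is_correct, budget, rng=None):
    """Estimate operational accuracy from a labeled sample of size `budget`.

    confidence : auxiliary-variable values, one per operational input; low
                 confidence is assumed to correlate with failures.
    is_correct : oracle verdicts (correct/incorrect), queried only for the
                 sampled inputs, which is where the labeling cost lies.
    """
    if rng is None:
        rng = np.random.default_rng()
    n = len(confidence)

    # Selection probabilities proportional to the auxiliary variable:
    # favor low-confidence inputs, which are more likely to expose failures.
    size = 1.0 - confidence + 1e-6
    p = size / size.sum()

    # Draw the test sample (with replacement, to keep the estimator simple).
    idx = rng.choice(n, size=budget, replace=True, p=p)

    # Hansen-Hurwitz estimator of the per-input failure rate, then accuracy.
    y = 1.0 - is_correct[idx].astype(float)   # 1 if the LLM failed on the input
    est_failure_rate = np.mean(y / (n * p[idx]))
    return 1.0 - est_failure_rate, idx


# Toy usage with synthetic data standing in for an operational dataset.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10_000)
correct = rng.uniform(size=10_000) < conf     # failures concentrate at low confidence
acc_hat, sampled = sample_based_accuracy(conf, correct, budget=200, rng=rng)
print(f"estimated accuracy: {acc_hat:.3f}  (true: {correct.mean():.3f})")
```

The inverse-probability weighting keeps the accuracy estimate unbiased even though the sample is deliberately skewed toward likely failures, which is what lets a single sampling scheme serve both estimation and failure-exposure objectives under a budget.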