SANER 2026
Tue 17 - Fri 20 March 2026 Limassol, Cyprus

Large Language Model (LLM)-based applications are increasingly deployed across domains such as customer service, education, and mobility, yet their robustness remains poorly understood. These systems are prone to inaccurate, fictitious, or harmful responses, and the vast, high-dimensional input space makes systematic testing particularly difficult.

We present STELLAR, an automated search-based testing framework for LLM-based applications that systematically uncovers text inputs leading to inappropriate system responses. The approach models test generation as an optimization problem and discretizes the input space into stylistic, content-related, and perturbation features. Guided by this discretization, an evolutionary algorithm explores feature combinations to generate and mutate textual inputs that are more likely to expose failures. Unlike prior work that focuses on prompt optimization or coverage heuristics, our framework leverages feature discretization to improve both the effectiveness and efficiency of test generation.
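The combination of a discretized feature space with evolutionary search can be illustrated with a minimal sketch. Everything here is hypothetical: the feature names, the mutation operator, and the stand-in fitness function are illustrative assumptions, not the paper's actual implementation (in practice fitness would come from querying the LLM-based application and scoring its response).

```python
import random

# Hypothetical discretized feature space (dimension and value names are
# illustrative assumptions, not taken from the paper).
FEATURES = {
    "style": ["formal", "slang", "leetspeak"],
    "content": ["benign", "safety_critical", "malicious"],
    "perturbation": ["none", "typo", "paraphrase"],
}

def random_individual(rng):
    """Sample one feature combination from the discretized space."""
    return {dim: rng.choice(vals) for dim, vals in FEATURES.items()}

def mutate(ind, rng):
    """Flip exactly one randomly chosen feature to a different value."""
    child = dict(ind)
    dim = rng.choice(list(FEATURES))
    child[dim] = rng.choice([v for v in FEATURES[dim] if v != ind[dim]])
    return child

def toy_fitness(ind):
    """Stand-in for the real objective: in STELLAR this would mean
    generating a text input from the feature combination, sending it
    to the system under test, and scoring how inappropriate the
    response is. Here we just reward a fixed combination."""
    score = 0
    score += ind["content"] == "malicious"
    score += ind["perturbation"] != "none"
    score += ind["style"] == "leetspeak"
    return score

def evolve(generations=30, pop_size=10, seed=0):
    """Simple elitist evolutionary loop over feature combinations."""
    rng = random.Random(seed)
    pop = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=toy_fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # keep the fittest half
        pop = survivors + [mutate(rng.choice(survivors), rng)
                           for _ in survivors]    # refill via mutation
    return max(pop, key=toy_fitness)

best = evolve()
```

The key design point the sketch tries to capture is that the search operates on the small discrete feature space rather than on raw text, which keeps mutation and selection tractable; concretizing a feature vector into an actual textual input is a separate step.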

We evaluate our approach on two representative case studies of conversational question-answering systems. The first benchmarks public and proprietary LLMs on handling safety-critical and malicious user inputs. The second is an industrial automotive application in which a retrieval-augmented conversational system provides in-car venue recommendations. Across both studies, our framework uncovers more categories of failures and a substantially larger number of failing inputs than state-of-the-art baselines. These results show that search-based testing with feature discretization is an effective and generalizable strategy for assessing the robustness of LLM-based applications.

Preprint: stellar.pdf (572 KiB)