Quality Assurance for LLM-RAG Systems: Empirical Insights from Tourism Application Testing
Best Paper Candidate
This paper presents a framework for testing and evaluating the quality characteristics of Large Language Model (LLM) systems enhanced with Retrieval-Augmented Generation (RAG) in tourism applications. Through systematic empirical evaluation of three LLM variants across multiple parameter configurations, we demonstrate the effectiveness of our testing methodology in assessing both functional correctness and extra-functional properties. Our framework implements 17 distinct metrics covering syntactic analysis, semantic evaluation, and behavioral evaluation through LLM judges. The study shows how different architectural choices and parameter configurations affect system performance, and in particular how the temperature and top-p parameters affect response quality. The tests were carried out on a tourism recommendation system for the Värmland region, using both standard and RAG-enhanced configurations. The results indicate that newer LLM versions show modest improvements in performance metrics, with differences more pronounced in response length and complexity than in semantic quality. The research contributes practical insights for implementing robust testing practices in LLM-RAG systems and offers guidance to organizations deploying these architectures in production environments.
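For readers unfamiliar with the two sampling parameters named in the abstract, the following is a minimal sketch of how temperature and top-p (nucleus) filtering reshape a next-token distribution. It is not taken from the paper: the logits are made-up values, and the function names `apply_temperature` and `top_p_filter` are hypothetical helpers written for illustration only.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


# Made-up logits for a four-token vocabulary (illustrative assumption).
logits = [2.0, 1.0, 0.5, -1.0]

for temperature in (0.2, 1.0):
    for top_p in (0.7, 1.0):
        dist = top_p_filter(apply_temperature(logits, temperature), top_p)
        print(temperature, top_p, [round(p, 3) for p in dist])
```

Lower temperatures concentrate probability on the most likely token, and lower top-p values discard the tail of the distribution, so both settings reduce the variety of sampled responses. This is one plausible reason such parameters would influence the response-level metrics an evaluation framework measures.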
Mon 31 Mar
Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
11:00 - 12:30
11:00 | 30m | Talk | Quality Assurance for LLM-RAG Systems: Empirical Insights from Tourism Application Testing (Best Paper Candidate) | ITEQS | Bestoun S. Ahmed (Karlstad University), Ludwig Otto Baader (Ludwig Maximilians University Munich), Firas Bayram (Karlstad University), Siri Jagstedt (Karlstad University), Peter Magnusson (Karlstad University)
11:30 | 30m | Talk | Using Reinforcement Learning for Security Testing: A Systematic Mapping Study | ITEQS | Tanwir Ahmad (Åbo Akademi University), Matko Butkovic (Åbo Akademi University), Dragos Truscan (Åbo Akademi University)
12:00 | 30m | Talk | Visual spectrum-based fault localization for Python programs based on the differentiation of execution slices | ITEQS | Shehroz Khan (Åbo Akademi University), Gaadha Sudheerbabu (Åbo Akademi University), Bianca Elena Staicu (Åbo Akademi University), Tanwir Ahmad (Åbo Akademi University), Dragos Truscan (Åbo Akademi University)