RAGProbe: Breaking RAG Pipelines with Evaluation Scenarios (Distinguished Paper Award Candidate)
Retrieval Augmented Generation (RAG) is increasingly employed in building Generative AI applications, yet evaluating these applications often relies on manual, trial-and-error processes. Automating this evaluation requires generating test data that triggers failures involving context comprehension, data formatting, specificity, and content completeness; random question-answer generation is insufficient for this. Prior works rely on standard QA datasets, benchmarks, and tactics that are not tailored to specific domain requirements, so current approaches and datasets do not trigger sufficiently broad and context-specific failures. In this paper, we introduce evaluation scenarios that describe the process of generating question-answer pairs from content indexed by RAG pipelines; they are designed to trigger a wider range of failures and to simplify automation, enabling developers to identify and address weaknesses more effectively. We validate our approach on five open-source RAG pipelines using three datasets. Our approach triggers high failure rates by generating prompts that combine multiple questions (up to a 91% failure rate), highlighting the need for developers to prioritize handling such queries. We obtained failure rates of 60% on an academic-domain dataset, and of 53% and 64% on open-domain datasets. Compared to existing state-of-the-art methods, our approach triggers 77% more failures on average per RAG pipeline and 53% more failures on average per dataset, offering a mechanism that helps developers improve RAG pipeline quality.
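The multi-question probing idea from the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the function names (`make_multi_question_prompt`, `grade_response`) and the simple containment-based grading are assumptions made for illustration only.

```python
# Hypothetical sketch of the multi-question probing idea, not RAGProbe's
# actual scenario generation or grading logic.

def make_multi_question_prompt(qa_pairs):
    """Combine several question-answer pairs into one multi-question
    prompt, the query style the abstract reports as the hardest for
    pipelines to handle (up to 91% failure rate)."""
    return " ".join(f"{i + 1}. {q}" for i, (q, _) in enumerate(qa_pairs))

def grade_response(response, qa_pairs):
    """Mark the probe as failed if any expected answer is missing from
    the pipeline's response (naive case-insensitive containment check)."""
    missing = [a for _, a in qa_pairs if a.lower() not in response.lower()]
    return {"failed": bool(missing), "missing_answers": missing}

# Example: a response that answers only one of two combined questions
# is flagged as a failure.
qa = [
    ("What does RAG stand for?", "Retrieval Augmented Generation"),
    ("How many pipelines were evaluated?", "five"),
]
prompt = make_multi_question_prompt(qa)
result = grade_response("RAG stands for Retrieval Augmented Generation.", qa)
```

A real harness would replace the containment check with an LLM- or rubric-based grader, but the structure (generate combined prompt, query pipeline, check each expected answer) stays the same.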
Sun 27 Apr. Displayed time zone: Eastern Time (US & Canada).
14:00 - 15:30 | Architecting and Testing AI Systems (Research and Experience Papers, Room 208). Chair(s): Jan-Philipp Steghöfer (XITASO GmbH IT & Software Solutions)

14:00 (15m) Talk | How Do Model Export Formats Impact the Development of ML-Enabled Systems? A Case Study on Model Integration (Distinguished Paper Award Candidate) | Research and Experience Papers | Shreyas Kumar Parida (ETH Zurich), Ilias Gerostathopoulos (Vrije Universiteit Amsterdam), Justus Bogner (Vrije Universiteit Amsterdam) | Pre-print

14:15 (15m) Talk | RAGProbe: Breaking RAG Pipelines with Evaluation Scenarios (Distinguished Paper Award Candidate) | Research and Experience Papers | Shangeetha Sivasothy (Applied Artificial Intelligence Institute, Deakin University), Scott Barnett (Deakin University, Australia), Stefanus Kurniawan (Deakin University), Zafaryab Rasool (Applied Artificial Intelligence Institute, Deakin University), Rajesh Vasa (Deakin University, Australia)

14:30 (15m) Talk | On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Content | Research and Experience Papers | Vince Nguyen (Vrije Universiteit Amsterdam), Hieu Huynh (Vrije Universiteit Amsterdam), Vidya Dhopate (Vrije Universiteit Amsterdam), Anusha Annengala (Vrije Universiteit Amsterdam), Hiba Bouhlal (Vrije Universiteit Amsterdam), Gian Luca Scoccia (Gran Sasso Science Institute), Matias Martinez (Universitat Politècnica de Catalunya (UPC)), Vincenzo Stoico (Vrije Universiteit Amsterdam), Ivano Malavolta (Vrije Universiteit Amsterdam) | Pre-print, Media Attached

14:45 (10m) Talk | LoCoML: A Framework for Real-World ML Inference Pipelines | Research and Experience Papers | Kritin Maddireddy (IIIT Hyderabad), Santhosh Kotekal Methukula (IIIT Hyderabad), Chandrasekar S (IIIT Hyderabad), Karthik Vaidhyanathan (IIIT Hyderabad)

14:55 (10m) Talk | Towards Continuous Experiment-driven MLOps | Research and Experience Papers | Keerthiga Rajenthiram (Vrije Universiteit Amsterdam), Milad Abdullah (Charles University), Ilias Gerostathopoulos (Vrije Universiteit Amsterdam), Petr Hnětynka (Charles University), Tomas Bures (Charles University, Czech Republic), Gerard Pons (Universitat Politècnica de Catalunya, Barcelona, Spain), Besim Bilalli (Universitat Politècnica de Catalunya, Barcelona, Spain), Anna Queralt (Universitat Politècnica de Catalunya, Barcelona, Spain)

15:05 (25m) Other | Discussion | Research and Experience Papers