RAGProbe: Breaking RAG Pipelines with Evaluation Scenarios (Distinguished Paper Award Candidate)
Retrieval Augmented Generation (RAG) is increasingly employed in building Generative AI applications, yet evaluating these applications often relies on manual, trial-and-error processes. Automating this evaluation requires generating test data that triggers failures involving context comprehension, data formatting, specificity, and content completeness; random question-answer generation is insufficient for this. Prior works rely on standard QA datasets, benchmarks, and tactics that are not tailored to specific domain requirements, so current approaches and datasets do not trigger sufficiently broad and context-specific failures. In this paper, we introduce evaluation scenarios that describe the process of generating question-answer pairs from content indexed by RAG pipelines; they are designed to trigger a wider range of failures and to simplify automation, enabling developers to identify and address weaknesses more effectively. We validate our approach on five open-source RAG pipelines using three datasets. Our approach triggers high failure rates by generating prompts that combine multiple questions (up to a 91% failure rate), highlighting the need for developers to prioritize handling such queries. We obtained failure rates of 60% on an academic-domain dataset, and of 53% and 64% on open-domain datasets. Compared to existing state-of-the-art methods, our approach triggers 77% more failures on average per RAG pipeline and 53% more failures on average per dataset, offering a mechanism that helps developers improve RAG pipeline quality.
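The multi-question probing idea from the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the function names (`make_multi_question_prompt`, `grade_response`) and the simple containment-based grading are assumptions made for illustration only.

```python
# Hypothetical sketch of the multi-question probing idea, not RAGProbe's
# actual scenario generation or grading logic.

def make_multi_question_prompt(qa_pairs):
    """Combine several question-answer pairs into one multi-question
    prompt, the query style the abstract reports as the hardest for
    pipelines to handle (up to 91% failure rate)."""
    return " ".join(f"{i + 1}. {q}" for i, (q, _) in enumerate(qa_pairs))

def grade_response(response, qa_pairs):
    """Mark the probe as failed if any expected answer is missing from
    the pipeline's response (naive case-insensitive containment check)."""
    missing = [a for _, a in qa_pairs if a.lower() not in response.lower()]
    return {"failed": bool(missing), "missing_answers": missing}

# Example: a response that answers only one of two combined questions
# is flagged as a failure.
qa = [
    ("What does RAG stand for?", "Retrieval Augmented Generation"),
    ("How many pipelines were evaluated?", "five"),
]
prompt = make_multi_question_prompt(qa)
result = grade_response("RAG stands for Retrieval Augmented Generation.", qa)
```

A real harness would replace the containment check with an LLM- or rubric-based grader, but the structure (generate combined prompt, query pipeline, check each expected answer) stays the same.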
Sun 27 Apr. Displayed time zone: Eastern Time (US & Canada).
14:00 - 15:30 | Architecting and Testing AI Systems (Research and Experience Papers, Room 208). Chair(s): Jan-Philipp Steghöfer (XITASO GmbH IT & Software Solutions)

14:00 (15m) Talk | How Do Model Export Formats Impact the Development of ML-Enabled Systems? A Case Study on Model Integration (Distinguished Paper Award Candidate) | Research and Experience Papers | Shreyas Kumar Parida (ETH Zurich), Ilias Gerostathopoulos (Vrije Universiteit Amsterdam), Justus Bogner (Vrije Universiteit Amsterdam) | Pre-print

14:15 (15m) Talk | RAGProbe: Breaking RAG Pipelines with Evaluation Scenarios (Distinguished Paper Award Candidate) | Research and Experience Papers | Shangeetha Sivasothy (Applied Artificial Intelligence Institute, Deakin University), Scott Barnett (Deakin University, Australia), Stefanus Kurniawan (Deakin University), Zafaryab Rasool (Applied Artificial Intelligence Institute, Deakin University), Rajesh Vasa (Deakin University, Australia)

14:30 (15m) Talk | On-Device or Remote? On the Energy Efficiency of Fetching LLM-Generated Content | Research and Experience Papers | Vince Nguyen (Vrije Universiteit Amsterdam), Hieu Huynh (Vrije Universiteit Amsterdam), Vidya Dhopate (Vrije Universiteit Amsterdam), Anusha Annengala (Vrije Universiteit Amsterdam), Hiba Bouhlal (Vrije Universiteit Amsterdam), Gian Luca Scoccia (Gran Sasso Science Institute), Matias Martinez (Universitat Politècnica de Catalunya (UPC)), Vincenzo Stoico (Vrije Universiteit Amsterdam), Ivano Malavolta (Vrije Universiteit Amsterdam) | Pre-print, Media Attached

14:45 (10m) Talk | LoCoML: A Framework for Real-World ML Inference Pipelines | Research and Experience Papers | Kritin Maddireddy (IIIT Hyderabad), Santhosh Kotekal Methukula (IIIT Hyderabad), Chandrasekar S (IIIT Hyderabad), Karthik Vaidhyanathan (IIIT Hyderabad)

14:55 (10m) Talk | Towards Continuous Experiment-driven MLOps | Research and Experience Papers | Keerthiga Rajenthiram (Vrije Universiteit Amsterdam), Milad Abdullah (Charles University), Ilias Gerostathopoulos (Vrije Universiteit Amsterdam), Petr Hnětynka (Charles University), Tomas Bures (Charles University, Czech Republic), Gerard Pons (Universitat Politècnica de Catalunya, Barcelona, Spain), Besim Bilalli (Universitat Politècnica de Catalunya, Barcelona, Spain), Anna Queralt (Universitat Politècnica de Catalunya, Barcelona, Spain)

15:05 (25m) Other | Discussion | Research and Experience Papers