ASTRAL: Automated Safety Testing of Large Language Models (AST 2025)

Who

Miriam Ugarte, Pablo Valle, José Antonio Parejo Maestre, Sergio Segura, Aitor Arrieta

Track

AST 2025

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

Use conference time zone: (GMT-04:00) Eastern Time (US & Canada)Select other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Tue 29 Apr 2025 14:30 - 15:00 at 211 - Session 5: Testing of LLMs

Abstract

Large Language Models (LLMs) have recently gained significant attention due to their ability to understand and generate sophisticated human-like content. However, ensuring their safety is paramount as they might provide harmful and unsafe responses. Existing LLM testing frameworks address various safety-related concerns (e.g., drugs, terrorism, animal abuse) but often face challenges due to unbalanced and obsolete datasets. In this paper, we present \tool, a tool that automates the generation and execution of test cases (i.e., prompts) for testing the safety of LLMs. First, we introduce a novel black-box coverage criterion to generate balanced and diverse unsafe test inputs across a diverse set of safety categories as well as linguistic writing characteristics (i.e. different style and persuasive writing techniques). Second, we propose an LLM-based approach that leverages Retrieval Augmented Generation (RAG), few-shot prompting strategies and web browsing to generate up-to-date test inputs. Lastly, similar to current LLM test automation techniques, we leverage LLMs as test oracles to distinguish between safe and unsafe test outputs, allowing a fully automated testing approach. We conduct an extensive evaluation on well-known LLMs, revealing the following key findings: i) GPT3.5 outperforms other LLMs when acting as the test oracle, accurately detecting unsafe responses, and even surpassing more recent LLMs (e.g., GPT-4), as well as LLMs that are specifically tailored to detect unsafe LLM outputs (e.g., LlamaGuard); ii) the results confirm that our approach can uncover nearly twice as many unsafe LLM behaviors with the same number of test inputs compared to currently used static datasets; and iii) our black-box coverage criterion combined with web browsing can effectively guide the LLM on generating up-to-date unsafe prompts, significantly increasing the number of test inputs that lead to an unsafe LLM output.

Link to Preprint

https://arxiv.org/pdf/2501.17132

Miriam Ugarte

Mondragon University

Pablo Valle

Mondragon University

Spain

José Antonio Parejo Maestre

SCORE Lab, I3US Institute, Universidad de Sevilla, Seville, Spain

Spain

Sergio Segura

SCORE Lab, I3US Institute, Universidad de Sevilla, Seville, Spain

Spain

Aitor Arrieta

Mondragon University

Spain

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

Use conference time zone: (GMT-04:00) Eastern Time (US & Canada)Select other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

Display full programSpecify a time band

Save

Session Program

Tue 29 Apr
Displayed time zone: Eastern Time (US & Canada) change

14:00 - 15:30	Session 5: Testing of LLMsAST 2025 at 211 Session chair: Annibale Panichella

14:00 30m Full-paper		Adaptive Probabilistic Operational Testing for Large Language Models Evaluation AST 2025 Ali Asgari TU Delft, Antonio Guerriero Università di Napoli Federico II, Roberto Pietrantuono Università di Napoli Federico II, Stefano Russo Università di Napoli Federico II Pre-print
14:30 30m Full-paper		ASTRAL: Automated Safety Testing of Large Language Models AST 2025 Miriam Ugarte Mondragon University, Pablo Valle Mondragon University, José Antonio Parejo Maestre SCORE Lab, I3US Institute, Universidad de Sevilla, Seville, Spain, Sergio Segura SCORE Lab, I3US Institute, Universidad de Sevilla, Seville, Spain, Aitor Arrieta Mondragon University Pre-print
15:00 30m Full-paper		A Taxonomy of Failures in Tool-Augmented LLMs AST 2025 Cailin Winston University of Washington, René Just University of Washington