Safeguarding LLM-Applications: Specify or Train?
Large Language Models (LLMs) are powerful tools used in applications such as conversational AI and code generation. However, significant robustness concerns arise when LLMs are deployed in production, including hallucinations, prompt injection attacks, harmful content generation, and challenges in maintaining accurate domain-specific content moderation. Guardrails aim to mitigate these challenges by aligning LLM outputs with desired behaviours without modifying the underlying models. NVIDIA NeMo Guardrails, for instance, relies on specifying acceptable and unacceptable behaviours. However, it is difficult to anticipate and address potential LLM issues in advance when writing such guardrails, and maintaining and refining them often requires manual updates from software engineers. We introduce LLM-Guards, specialised machine learning (ML) models trained to function as protective guards. We also present an automation pipeline for training and continually fine-tuning these guards using reinforcement learning from human feedback (RLHF). We evaluated several small LLMs, including Llama-3, Mistral, and Gemma, as LLM-Guards for challenges such as moderation and detecting off-topic queries, and compared their performance against NeMo Guardrails. The proposed Llama-3 LLM-Guard outperformed NeMo Guardrails in detecting off-topic queries, achieving an accuracy of 98.7% compared to 81%. Furthermore, the LLM-Guard detected 97.86% of harmful queries, surpassing NeMo Guardrails by 19.86%.
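To make the "train" side of the specify-or-train contrast concrete, the sketch below shows one way a small LLM can sit in front of an application model as a guard that flags off-topic or harmful queries. It is a minimal illustration only: the checkpoint name, guard prompt, and ALLOW/BLOCK protocol are assumptions for the example, not the authors' released LLM-Guard implementation.

```python
# Minimal sketch: a small LLM used as a guard in front of an application LLM.
# The checkpoint, prompt wording, and decision protocol are illustrative
# assumptions; in the paper's setup the guard would be a fine-tuned model.
from transformers import pipeline

guard = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

GUARD_PROMPT = (
    "You are a moderation guard for a domain-specific assistant. "
    "Answer with exactly one word: ALLOW if the user query is on-topic and safe, "
    "BLOCK otherwise.\n\nUser query: {query}\nDecision:"
)

def is_allowed(query: str) -> bool:
    """Ask the guard model to classify the query; block anything not marked ALLOW."""
    out = guard(GUARD_PROMPT.format(query=query), max_new_tokens=3, do_sample=False)
    decision = out[0]["generated_text"].split("Decision:")[-1].strip().upper()
    return decision.startswith("ALLOW")

user_query = "How do I reset my account password?"
if is_allowed(user_query):
    pass  # forward the query to the application LLM
else:
    pass  # return a refusal or safe fallback response
```

In contrast to a specification-based guardrail, such a guard improves by retraining (e.g. via the RLHF pipeline described above) rather than by engineers editing behaviour specifications by hand.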