Safeguarding LLM-Applications: Specify or Train?
Large Language Models (LLMs) are powerful tools used in applications such as conversational AI and code generation. However, significant robustness concerns arise when LLMs are deployed in production, including hallucinations, prompt injection attacks, harmful content generation, and challenges in maintaining accurate domain-specific content moderation. Guardrails aim to mitigate these challenges by aligning LLM outputs with desired behaviours without modifying the underlying models. NVIDIA NeMo Guardrails, for instance, relies on specifying acceptable and unacceptable behaviours. However, it is difficult to anticipate and address potential LLM issues in advance when writing such guardrails, and maintaining and refining them often requires manual updates from software engineers. We introduce LLM-Guards, specialised machine learning (ML) models trained to function as protective guards. We also present an automation pipeline for training and continually fine-tuning these guards using reinforcement learning from human feedback (RLHF). We evaluated several small LLMs, including Llama-3, Mistral, and Gemma, as LLM-Guards for challenges such as moderation and detecting off-topic queries, and compared their performance against NeMo Guardrails. The proposed Llama-3 LLM-Guard outperformed NeMo Guardrails in detecting off-topic queries, achieving an accuracy of 98.7% compared to 81%. Furthermore, the LLM-Guard detected 97.86% of harmful queries, surpassing NeMo Guardrails by 19.86%.
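To make the "train" side of the specify-or-train contrast concrete, the sketch below shows one way a small LLM can sit in front of an application model as a guard that flags off-topic or harmful queries. It is a minimal illustration only: the checkpoint name, guard prompt, and ALLOW/BLOCK protocol are assumptions for the example, not the authors' released LLM-Guard implementation.

```python
# Minimal sketch: a small LLM used as a guard in front of an application LLM.
# The checkpoint, prompt wording, and decision protocol are illustrative
# assumptions; in the paper's setup the guard would be a fine-tuned model.
from transformers import pipeline

guard = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

GUARD_PROMPT = (
    "You are a moderation guard for a domain-specific assistant. "
    "Answer with exactly one word: ALLOW if the user query is on-topic and safe, "
    "BLOCK otherwise.\n\nUser query: {query}\nDecision:"
)

def is_allowed(query: str) -> bool:
    """Ask the guard model to classify the query; block anything not marked ALLOW."""
    out = guard(GUARD_PROMPT.format(query=query), max_new_tokens=3, do_sample=False)
    decision = out[0]["generated_text"].split("Decision:")[-1].strip().upper()
    return decision.startswith("ALLOW")

user_query = "How do I reset my account password?"
if is_allowed(user_query):
    pass  # forward the query to the application LLM
else:
    pass  # return a refusal or safe fallback response
```

In contrast to a specification-based guardrail, such a guard improves by retraining (e.g. via the RLHF pipeline described above) rather than by engineers editing behaviour specifications by hand.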