"Is It Responsible?" Emerging Results on Comparing Guardrails for Harm Mitigation in LLM-enhanced Software Applications
The rapid adoption of Large Language Models (LLMs) in the engineering of software applications, such as customer service chatbots, has brought significant benefits but also poses substantial risks. The generation of biased, inappropriate, or harmful responses is among the problems that can arise when an LLM is used as a commercial off-the-shelf (COTS) component, with its chat output connected directly to the user interface of a software application. This paper presents emerging results of an exploratory study that compares commercial guardrail frameworks with respect to their ability to detect and block inappropriate content during a chat conversation. We empirically evaluate three guardrail frameworks (LLM Guard, Llama Guard, and OpenAI Moderation) against two datasets of toxic and offensive content. The results show that improvements are still needed: the assessed guardrail frameworks achieved high accuracy on one of the datasets (more than 90%) but underperformed on other metrics, indicating that toxic or dangerous content could still reach users if these guardrails were deployed in a chatbot, for instance. We hope these results assist researchers and practitioners in selecting appropriate guardrails to improve harm mitigation in LLM-based applications.
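To make the evaluated setup concrete, the following minimal Python sketch illustrates how a single chat message might be screened with one of the compared services, the OpenAI Moderation endpoint, before it is shown to a user. This is an illustrative assumption, not the paper's experimental pipeline; the model name and the simple flagged/not-flagged decision are choices made here for the example.

    # Illustrative sketch only: screening one message with the OpenAI
    # Moderation endpoint before delivering it in a chatbot reply.
    # Model name is an assumption; consult the current API documentation.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def is_flagged(message: str) -> bool:
        """Return True if the moderation endpoint flags the message as harmful."""
        response = client.moderations.create(
            model="omni-moderation-latest",
            input=message,
        )
        return response.results[0].flagged

    if __name__ == "__main__":
        # A guardrail-protected chatbot would suppress or rewrite flagged replies.
        print(is_flagged("You are completely useless and stupid."))

LLM Guard and Llama Guard expose analogous checks (scanner pipelines and a safety-classifier model, respectively), so the same pre-delivery filtering pattern applies to all three frameworks studied.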