This program is tentative and subject to change.
Abstract (A Taxonomy of Failures in Tool-Augmented LLMs): Large language models (LLMs) can perform a variety of tasks given an input prompt that describes the task. To enhance the performance and capabilities of LLMs, recent research has augmented them with external tools such as Python functions, REST APIs, and other deep learning models. While much of the research on tool-augmented LLMs (TaLLMs) has focused on improving their capabilities, work on understanding and characterizing the failures that can occur in these systems is lacking. To address this gap, this paper proposes a taxonomy of failures in TaLLMs and their root causes, analyzes the failures that occur in two published TaLLMs (Gorilla and Chameleon), and provides recommendations for fault localization and repair of TaLLMs.
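For readers unfamiliar with the setup the abstract describes, the minimal sketch below illustrates how a TaLLM pipeline typically dispatches a model-emitted tool call, and the points at which failures of the kind the taxonomy covers can arise. All names (`get_weather`, `dispatch`, the JSON call format) are hypothetical illustrations, not taken from the paper or from Gorilla/Chameleon.

```python
import json

# Hypothetical tool registry: plain Python functions exposed to the model.
def get_weather(city: str) -> str:
    """Toy tool; a real TaLLM might call a REST API here."""
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse a model-emitted tool call and execute it.

    Each step is a potential failure point in a TaLLM pipeline:
    malformed JSON, a hallucinated tool name, or arguments that
    do not match the tool's signature.
    """
    try:
        call = json.loads(model_output)       # failure: malformed JSON
        tool = TOOLS[call["tool"]]            # failure: unknown/hallucinated tool
        return tool(**call["arguments"])      # failure: wrong or missing arguments
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return f"tool-call failure: {exc!r}"

# Simulated model outputs; a real system would receive these from the LLM.
print(dispatch('{"tool": "get_weather", "arguments": {"city": "Delft"}}'))
print(dispatch('{"tool": "book_flight", "arguments": {}}'))  # unknown tool
```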
Tue 29 Apr (displayed time zone: Eastern Time, US & Canada)
14:00 - 15:30: Session (AST 2025)

14:00 (30m, Full-paper): Adaptive Probabilistic Operational Testing for Large Language Models Evaluation. Ali Asgari (TU Delft), Antonio Guerriero (Università di Napoli Federico II), Roberto Pietrantuono (Università di Napoli Federico II), Stefano Russo (Università di Napoli Federico II).

14:30 (30m, Full-paper): ASTRAL: Automated Safety Testing of Large Language Models. Miriam Ugarte (Mondragon University), Pablo Valle (Mondragon University), José Antonio Parejo Maestre (University of Seville), Sergio Segura (University of Seville), Aitor Arrieta (Mondragon University). Pre-print available.

15:00 (30m, Full-paper): A Taxonomy of Failures in Tool-Augmented LLMs.