This program is tentative and subject to change.
Abstract (A Taxonomy of Failures in Tool-Augmented LLMs): Large language models (LLMs) can perform a variety of tasks given an input prompt that describes the task. To enhance the performance and capabilities of LLMs, recent research has augmented them with external tools such as Python functions, REST APIs, and other deep learning models. While much of the research on tool-augmented LLMs (TaLLMs) has focused on improving their capabilities, work on understanding and characterizing the failures that can occur in these systems is lacking. To address this gap, this paper proposes a taxonomy of failures in TaLLMs and their root causes, analyzes the failures that occur in two published TaLLMs (Gorilla and Chameleon), and provides recommendations for fault localization and repair of TaLLMs.
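For readers unfamiliar with the setup the abstract describes, the minimal sketch below illustrates how a TaLLM pipeline typically dispatches a model-emitted tool call, and the points at which failures of the kind the taxonomy covers can arise. All names (`get_weather`, `dispatch`, the JSON call format) are hypothetical illustrations, not taken from the paper or from Gorilla/Chameleon.

```python
import json

# Hypothetical tool registry: plain Python functions exposed to the model.
def get_weather(city: str) -> str:
    """Toy tool; a real TaLLM might call a REST API here."""
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse a model-emitted tool call and execute it.

    Each step is a potential failure point in a TaLLM pipeline:
    malformed JSON, a hallucinated tool name, or arguments that
    do not match the tool's signature.
    """
    try:
        call = json.loads(model_output)       # failure: malformed JSON
        tool = TOOLS[call["tool"]]            # failure: unknown/hallucinated tool
        return tool(**call["arguments"])      # failure: wrong or missing arguments
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return f"tool-call failure: {exc!r}"

# Simulated model outputs; a real system would receive these from the LLM.
print(dispatch('{"tool": "get_weather", "arguments": {"city": "Delft"}}'))
print(dispatch('{"tool": "book_flight", "arguments": {}}'))  # unknown tool
```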
Tue 29 Apr (displayed time zone: Eastern Time, US & Canada)
14:00 - 15:30: Session (AST 2025)

14:00 (30m, Full-paper): Adaptive Probabilistic Operational Testing for Large Language Models Evaluation. Ali Asgari (TU Delft), Antonio Guerriero (Università di Napoli Federico II), Roberto Pietrantuono (Università di Napoli Federico II), Stefano Russo (Università di Napoli Federico II).

14:30 (30m, Full-paper): ASTRAL: Automated Safety Testing of Large Language Models. Miriam Ugarte (Mondragon University), Pablo Valle (Mondragon University), José Antonio Parejo Maestre (University of Seville), Sergio Segura (University of Seville), Aitor Arrieta (Mondragon University). Pre-print available.

15:00 (30m, Full-paper): A Taxonomy of Failures in Tool-Augmented LLMs.