Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks
This program is tentative and subject to change.
Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the interactions, communication mechanisms, and failure causes within these systems. To bridge this gap, we present a benchmark of 34 representative programmable tasks designed to rigorously assess autonomous agents. Using this benchmark, we evaluate three popular open-source agent frameworks combined with two LLM backbones, observing a task completion rate of approximately 50% on average. Through in-depth failure analysis, we develop a three-tier taxonomy of failure causes aligned with task phases, highlighting planning errors, task execution issues, and incorrect response generation. Based on these insights, we propose actionable improvements to enhance agent planning and self-diagnosis capabilities. Our failure taxonomy, together with mitigation advice, provides an empirical foundation for developing more robust and effective autonomous agent systems in the future.
Mon 17 Nov (displayed time zone: Seoul)
16:00 - 17:00
16:00 | 10m Talk | A Characterization Study of Bugs in LLM Agent Workflow Orchestration Frameworks | Industry Showcase | Ziluo Xue (Huazhong University of Science and Technology); Yanjie Zhao (Huazhong University of Science and Technology); Shenao Wang (Huazhong University of Science and Technology); Kai Chen (Huazhong University of Science and Technology); Haoyu Wang (Huazhong University of Science and Technology)
16:10 | 10m Talk | Debugging the Undebuggable: Why Multi-Fault Programs Break Debugging and Repair Tools | NIER Track | Omar I. Al-Bataineh (Gran Sasso Science Institute (GSSI))
16:20 | 10m Talk | ErrorPrism: Reconstructing Error Propagation Paths in Cloud Service Systems | Industry Showcase | Junsong Pu (School of Software Engineering, Sun Yat-sen University); Yichen LI (ByteDance); Zhuangbin Chen (Sun Yat-sen University); Jinyang Liu (ByteDance); Zhihan Jiang (The Chinese University of Hong Kong); Jianjun Chen (ByteDance); Rui Shi (ByteDance); Zibin Zheng (Sun Yat-sen University); Tieying Zhang (ByteDance)
16:30 | 10m Talk | Fault Injection for Simulink-based CPS Models: Insights and Future Directions | NIER Track | Drishti Yadav (University of Luxembourg, Luxembourg); Claudio Mandrioli (University of Luxembourg); Ezio Bartocci (TU Wien); Domenico Bianculli (University of Luxembourg)
16:40 | 10m Talk | How Does ChatGPT Make Assumptions When Creating Erroneous Programs? | NIER Track
16:50 | 10m Talk | Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks | NIER Track | Ruofan Lu (The Chinese University of Hong Kong); Yichen LI (ByteDance); Yintong Huo (Singapore Management University, Singapore)