StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis
This program is tentative and subject to change.
Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist SREs in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution DAGs from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to guarantee correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ∼94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs.
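The third stage's core idea, running each TSG step only after its prerequisites finish while letting independent steps run in parallel, can be illustrated with a minimal sketch. This is not the authors' implementation: the function `run_dag`, the step names, and the plain dictionary standing in for the memory system are all hypothetical, and the steps here are trivial lambdas rather than LLM-driven actions or queries.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_dag(steps, deps, max_workers=4):
    """Run each step once all of its prerequisites have finished.

    steps: dict mapping step name -> callable(memory) returning a result
    deps:  dict mapping step name -> set of prerequisite step names
    Returns (memory, order): results per step and the completion order.
    """
    memory = {}  # hypothetical stand-in for a shared memory of step results
    remaining = {name: set(deps.get(name, ())) for name in steps}
    running = {}  # future -> step name
    order = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while remaining or running:
            # Submit every step whose prerequisites are all satisfied.
            ready = [name for name, pre in remaining.items() if not pre]
            for name in ready:
                del remaining[name]
                # Pass a snapshot so workers never see concurrent writes.
                running[pool.submit(steps[name], dict(memory))] = name
            if not running:
                raise ValueError("cycle or missing dependency in DAG")
            # Block until at least one running step completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for fut in done:
                name = running.pop(fut)
                memory[name] = fut.result()
                order.append(name)
                for pre in remaining.values():
                    pre.discard(name)
    return memory, order


# Hypothetical four-step TSG: two independent queries between a check and a diagnosis.
steps = {
    "check_alert": lambda m: "cpu_high",
    "query_logs": lambda m: f"logs for {m['check_alert']}",
    "query_metrics": lambda m: f"metrics for {m['check_alert']}",
    "diagnose": lambda m: (m["query_logs"], m["query_metrics"]),
}
deps = {
    "query_logs": {"check_alert"},
    "query_metrics": {"check_alert"},
    "diagnose": {"query_logs", "query_metrics"},
}

memory, order = run_dag(steps, deps)
```

In this sketch `query_logs` and `query_metrics` are submitted together once `check_alert` completes, which is the kind of step-level parallelism the abstract credits for the execution time reduction on parallelizable TSGs.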
Thu 9 Jul
Displayed time zone: Eastern Time (US & Canada)
14:00 - 15:30
14:00 | 20m Talk | StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis | Research Papers | Jiayi Mao (Tsinghua University), Liqun Li (Microsoft Research), Yanjie Gao (Microsoft Research), Zegang Peng (Tsinghua University), Shilin He (Microsoft Research), Chaoyun Zhang (Microsoft), Si Qin (Microsoft Research), Samia Khalid (Microsoft), Qingwei Lin (Microsoft), Saravan Rajmohan (Microsoft), Sitaram Lanka (Microsoft), Dongmei Zhang (Microsoft)
14:20 | 20m Talk | Spectrum-based Failure Attribution for Multi-Agent Systems | Research Papers | Yu Ge (Nanjing University), Linna Xie (Nanjing University), Zhong Li (Nanjing University), Yu Pei (Hong Kong Polytechnic University), Tian Zhang (Nanjing University)
14:40 | 10m Talk | RuntimeSlicer: Towards Generalizable Unified Runtime State Representation for Failure Management | Ideas, Visions and Reflections | Lingzhe Zhang (Peking University, China), Tong Jia (Institute for Artificial Intelligence, Peking University, Beijing, China), Weijie Hong (Peking University), Mingyu Wang (Peking University), Chiming Duan (Peking University), Minghua He (Peking University), Rongqian Wang (Huawei Theory Lab), Xi Peng (Huawei Theory Lab), Meiling Wang (Huawei America Lab), Nicholas Zhang (Huawei Theory Lab), Renhai Chen (Huawei Theory Lab), Ying Li (School of Software and Microelectronics, Peking University, Beijing, China)
14:50 | 10m Talk | Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation | Ideas, Visions and Reflections | Lingzhe Zhang (Peking University, China), Tong Jia (Institute for Artificial Intelligence, Peking University, Beijing, China), Mingyu Wang (Peking University), Weijie Hong (Peking University), Chiming Duan (Peking University), Minghua He (Peking University), Rongqian Wang (Huawei Theory Lab), Xi Peng (Huawei Theory Lab), Meiling Wang (Huawei America Lab), Nicholas Zhang (Huawei Theory Lab), Renhai Chen (Huawei Theory Lab), Ying Li (School of Software and Microelectronics, Peking University, Beijing, China)
15:00 | 20m Talk | FaultWeave: Bounded Resilience Testing with Failure Diagnosis Capability for Microservice Applications | Industry Papers | Mingzhuo Zheng (Institute of Software, Chinese Academy of Sciences), Guoquan Wu (Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences; University of Chinese Academy of Sciences Nanjing College; China Southern Power Grid), Jinbo Zhang (Information Center, Guangdong Power Grid), Jun Wei (Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences), Wei Chen (Institute of Software at Chinese Academy of Sciences), Jiaxin Zhu (Institute of Software at Chinese Academy of Sciences), Zheheng Liang (Joint Laboratory on Cyberspace Security of China Southern Power Grid)
15:20 | 10m Talk | From Syntactic to Semantic Spectra for Fault Localization | Ideas, Visions and Reflections | Zhaorui Yang (University of California, Riverside), Qian Zhang (University of California at Riverside), Rajiv Gupta (University of California at Riverside), Ashish Kundu (Cisco Research)