FSE 2026
Sun 5 - Thu 9 July 2026 Montreal, Canada

This program is tentative and subject to change.

Thu 9 Jul 2026 14:00 - 14:20 at MB 5.215 - Anomaly and failure 2

Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist SREs in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution DAGs from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to guarantee correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ∼94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs.
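The DAG-guided execution described in the abstract can be illustrated with a minimal sketch. This is not StepFly's implementation: the function `run_dag`, the step names, and the plain dictionary standing in for the memory system are all hypothetical, chosen only to show how steps with no dependency on each other can run in parallel while dependent steps wait for their inputs.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from typing import Callable, Dict, List

# A step receives a snapshot of the results so far and returns its own result.
Step = Callable[[Dict[str, object]], object]


def run_dag(steps: Dict[str, Step], deps: Dict[str, List[str]],
            max_workers: int = 4) -> Dict[str, object]:
    """Run each step once all of its dependencies have finished."""
    memory: Dict[str, object] = {}  # hypothetical shared store: step name -> result
    pending = set(steps)            # steps not yet submitted
    running = {}                    # future -> step name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while pending or running:
            # A step is ready when every dependency already has a result.
            ready = [s for s in pending
                     if all(d in memory for d in deps.get(s, []))]
            for name in ready:
                pending.remove(name)
                running[pool.submit(steps[name], dict(memory))] = name
            if not running:
                # Nothing is ready and nothing is in flight: the graph is not a DAG.
                raise ValueError(f"cycle or missing dependency: {sorted(pending)}")
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                memory[running.pop(future)] = future.result()
    return memory


# Hypothetical four-step guide: the two queries are independent of each other.
steps = {
    "check_alert": lambda m: "disk",
    "query_logs": lambda m: f"logs for {m['check_alert']}",
    "query_metrics": lambda m: f"metrics for {m['check_alert']}",
    "diagnose": lambda m: [m["query_logs"], m["query_metrics"]],
}
deps = {
    "query_logs": ["check_alert"],
    "query_metrics": ["check_alert"],
    "diagnose": ["query_logs", "query_metrics"],
}
result = run_dag(steps, deps)
```

In this sketch `query_logs` and `query_metrics` are submitted together once `check_alert` finishes, which is the kind of independent-step parallelism the abstract refers to; a thread pool is an assumption made here for brevity.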


Thu 9 Jul

Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30
14:00
20m
Talk
StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis
Research Papers
Jiayi Mao Tsinghua University, Liqun Li Microsoft Research, Yanjie Gao Microsoft Research, Zegang Peng Tsinghua University, Shilin He Microsoft Research, Chaoyun Zhang Microsoft, Si Qin Microsoft Research, Samia Khalid Microsoft, Qingwei Lin Microsoft, Saravan Rajmohan Microsoft, Sitaram Lanka Microsoft, Dongmei Zhang Microsoft
14:20
20m
Talk
Spectrum-based Failure Attribution for Multi-Agent Systems
Research Papers
Yu Ge Nanjing University, Linna Xie Nanjing University, Zhong Li Nanjing University, Yu Pei Hong Kong Polytechnic University, Tian Zhang Nanjing University
14:40
10m
Talk
RuntimeSlicer: Towards Generalizable Unified Runtime State Representation for Failure Management
Ideas, Visions and Reflections
Lingzhe Zhang Peking University, China, Tong Jia Institute for Artificial Intelligence, Peking University, Beijing, China, Weijie Hong Peking University, Mingyu Wang Peking University, Chiming Duan Peking University, Minghua He Peking University, Rongqian Wang Huawei Theory Lab, Xi Peng Huawei Theory Lab, Meiling Wang Huawei America Lab, Nicholas Zhang Huawei Theory Lab, Renhai Chen Huawei Theory Lab, Ying Li School of Software and Microelectronics, Peking University, Beijing, China
14:50
10m
Talk
Efficient Failure Management for Multi-Agent Systems with Reasoning Trace Representation
Ideas, Visions and Reflections
Lingzhe Zhang Peking University, China, Tong Jia Institute for Artificial Intelligence, Peking University, Beijing, China, Mingyu Wang Peking University, Weijie Hong Peking University, Chiming Duan Peking University, Minghua He Peking University, Rongqian Wang Huawei Theory Lab, Xi Peng Huawei Theory Lab, Meiling Wang Huawei America Lab, Nicholas Zhang Huawei Theory Lab, Renhai Chen Huawei Theory Lab, Ying Li School of Software and Microelectronics, Peking University, Beijing, China
15:00
20m
Talk
FaultWeave: Bounded Resilience Testing with Failure Diagnosis Capability for Microservice Applications
Industry Papers
Mingzhuo Zheng Institute of Software, Chinese Academy of Sciences, Guoquan Wu Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences; University of Chinese Academy of Sciences Nanjing College; China Southern Power Grid, Jinbo Zhang Information Center, Guangdong Power Grid, Jun Wei Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences, Wei Chen Institute of Software at Chinese Academy of Sciences, Jiaxin Zhu Institute of Software at Chinese Academy of Sciences, Zheheng Liang Joint Laboratory on Cyberspace Security of China Southern Power Grid
15:20
10m
Talk
From Syntactic to Semantic Spectra for Fault Localization
Ideas, Visions and Reflections
Zhaorui Yang University of California, Riverside, Qian Zhang University of California at Riverside, Rajiv Gupta University of California at Riverside, Ashish Kundu Cisco Research