eARCO: Efficient Automated Root Cause Analysis with Prompt Optimization
This program is tentative and subject to change.
Root cause analysis (RCA) for incidents in large-scale cloud systems is a complex, knowledge-intensive task that often requires significant manual effort from on-call engineers (OCEs). Improving RCA is vital for accelerating incident resolution and reducing service downtime and manual effort. Recent advances in large language models (LLMs) have proven effective across different stages of the incident management lifecycle, including RCA. However, existing LLM-based RCA recommendations typically leverage default finetuning or retrieval-augmented generation (RAG) methods with static, manually designed prompts, which lead to sub-optimal recommendations. In this work, we leverage ‘PromptWizard’, a state-of-the-art prompt optimization technique, to automatically identify the best optimized prompt instruction, which is combined with semantically similar historical examples to query the underlying LLMs during inference. Moreover, using data from more than 180K historical incidents at Microsoft, we developed cost-effective finetuned small language models (SLMs) for RCA recommendation generation and demonstrate the power of prompt optimization on such domain-adapted models. Our extensive experimental results show that ‘PromptWizard’ can improve the accuracy of RCA recommendations by 21% and 13% on 3K test incidents over RAG-based LLMs and finetuned SLMs, respectively. Lastly, our human evaluation with incident owners has demonstrated the efficacy of prompt optimization on RCA recommendation tasks. These findings underscore the advantages of incorporating prompt optimization into AI for Operations (AIOps) systems, delivering substantial gains without increasing computational overhead.
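The inference-time step described in the abstract, where an optimized instruction is combined with semantically similar historical incidents, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the incident records and the instruction string are invented, the function names are hypothetical, and a bag-of-words cosine similarity stands in for whatever semantic retrieval the authors actually use. It does not call PromptWizard or any LLM.

```python
import math
import re
from collections import Counter

# Hypothetical historical incidents, invented purely for illustration.
HISTORY = [
    {"summary": "Storage service latency spike after certificate rotation",
     "root_cause": "Expired certificate on storage front end"},
    {"summary": "VM deployment failures in one region due to quota",
     "root_cause": "Regional quota misconfigured"},
    {"summary": "Login errors after certificate rotation on auth service",
     "root_cause": "Certificate not propagated to auth nodes"},
]

# Placeholder for the instruction a prompt optimizer would produce.
INSTRUCTION = "You are an on-call engineer. Identify the most likely root cause."


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k_similar(query, history, k=2):
    """Return the k incidents whose summaries are most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        history,
        key=lambda inc: cosine(q, Counter(tokenize(inc["summary"]))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(instruction, query, examples):
    """Assemble instruction, few-shot examples, and the new incident."""
    parts = [instruction, ""]
    for ex in examples:
        parts.append(f"Incident: {ex['summary']}")
        parts.append(f"Root cause: {ex['root_cause']}")
        parts.append("")
    parts.append(f"Incident: {query}")
    parts.append("Root cause:")
    return "\n".join(parts)


query = "Latency spike on storage service following certificate rotation"
examples = top_k_similar(query, HISTORY, k=2)
prompt = build_prompt(INSTRUCTION, query, examples)
print(prompt)
```

In a real system the similarity function would presumably be replaced by embedding-based retrieval and the assembled prompt sent to an LLM or finetuned SLM; both are outside the scope of this sketch.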
Mon 17 Nov (displayed time zone: Seoul)
16:00 - 17:10

| Time | Format | Title | Track | Authors |
|---|---|---|---|---|
| 16:00 | 10m Talk | Interaction-Aware Patch Assessment for Multi-Fault Automated Program Repair | NIER Track | Omar I. Al-Bataineh (Gran Sasso Science Institute (GSSI)) |
| 16:10 | 10m Talk | Simulated Interactive Debugging | NIER Track | Yannic Noller (Ruhr University Bochum); Erick Chandra (Singapore University of Technology and Design); Srinidhi HC (Singapore University of Technology and Design); Kenny Choo (Singapore University of Technology and Design); Cyrille Jegourel (ISTD, Singapore University of Technology and Design); Oka Kurniawan (Singapore University of Technology and Design); Chris Poskitt (Singapore Management University) |
| 16:20 | 10m Talk | KAIOps: A Platform Solution of End-to-End Multi-Modal AIOps for AI Training at Scale | Industry Showcase | Zeying Wang (Beihang University); Junhong Liu (Beihang University); Penghao Zhang (Kuaishou Inc.); Xiaoyang Sun (University of Leeds); Xu Wang (Beihang University); Tianyu Wo; Chunming Hu (Beihang University); Chengru Song (Kuaishou Technology); Jin Ouyang (Kuaishou Inc.); Renyu Yang |
| 16:30 | 10m Talk | KAIR: A Statistical and Causal Approach to Pinpointing Stragglers in Distributed Model Training | Industry Showcase | Yitang Yang (Beihang University); Junhong Liu (Beihang University); Jiapeng Chen (Kuaishou Inc.); Xiaoyang Sun (University of Leeds); Tianyu Wo; Chunming Hu (Beihang University); Chengru Song (Kuaishou Technology); Jin Ouyang (Kuaishou Inc.); Renyu Yang |
| 16:40 | 10m Talk | BitsAI-Fix: LLM-Driven Approach for Automated Lint Error Resolution in Practice | Industry Showcase | Yuanpeng Li (ByteDance); Qi Long (Carnegie Mellon University); Zhiyuan Yao (Zhejiang University); Jian Xu (ByteDance); Lintao Xie (ByteDance); Xu He (ByteDance); Lu Geng (ByteDance); Xin Han (ByteDance); Yueyan Chen (ByteDance); Wenbo Duan (ByteDance) |
| 16:50 | 10m Talk | TrioXpert: An automated incident management framework for microservice system | Industry Showcase | Yongqian Sun (Nankai University); Yu Luo (Nankai University); Xidao Wen (BizSeer); Yuan Yuan (National University of Defense Technology, China); Xiaohui Nie (Computer Network Information Center at Chinese Academy of Sciences); Shenglin Zhang (Nankai University); Tong Liu (Lenovo (TianJin) Co., Ltd.); Xi Luo (Lenovo (TianJin) Co., Ltd.) |
| 17:00 | 10m Talk | eARCO: Efficient Automated Root Cause Analysis with Prompt Optimization | Industry Showcase | Drishti Goel (University of Illinois Urbana Champaign); Raghav Magazine (Microsoft Research); Supriyo Ghosh (Microsoft); Akshay Nambi (Microsoft Research); Prathamesh Deshpande (Microsoft); Xuchao Zhang (Microsoft); Chetan Bansal (Microsoft Research); Saravan Rajmohan (Microsoft) |


