Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models
Incident management for cloud services is a complex, multi-step process with a major impact on both service health and developer productivity. On-call engineers need a significant amount of domain knowledge and manual effort to root-cause and mitigate production incidents. Recent advances in artificial intelligence have resulted in state-of-the-art large language models such as GPT-3, which have been used to solve a variety of problems ranging from question answering to text summarization. In this work, we conduct the first large-scale study to evaluate the effectiveness of these models in helping engineers root-cause and mitigate production incidents. We perform a rigorous study on more than 40,000 incidents and compare several large language models in zero-shot, fine-tuned, and multi-task settings using semantic and lexical metrics. Lastly, our human evaluation with actual incident owners shows the efficacy and future potential of using artificial intelligence for resolving cloud incidents.
Fri 19 May (displayed time zone: Hobart)
11:00 - 12:30 | Runtime analysis and self-adaptation | Technical Track / NIER - New Ideas and Emerging Results / SEIP - Software Engineering in Practice / Journal-First Papers at Level G - Plenary Room 1 | Chair(s): Domenico Bianculli (University of Luxembourg)
11:00 | 15m Talk | Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention | Technical Track | Cheryl Lee (The Chinese University of Hong Kong), Tianyi Yang (The Chinese University of Hong Kong), Zhuangbin Chen (Chinese University of Hong Kong, China), Yuxin Su (Sun Yat-sen University), Yongqiang Yang (Huawei Technologies), Michael Lyu (The Chinese University of Hong Kong) | Pre-print
11:15 | 15m Talk | Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models | Technical Track | Toufique Ahmed (University of California at Davis), Supriyo Ghosh (Microsoft), Chetan Bansal (Microsoft Research), Thomas Zimmermann (Microsoft Research), Xuchao Zhang (Microsoft), Saravanakumar Rajmohan (Microsoft 365) | Pre-print
11:30 | 15m Talk | Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data | Technical Track | Cheryl Lee (The Chinese University of Hong Kong), Tianyi Yang (The Chinese University of Hong Kong), Zhuangbin Chen (Chinese University of Hong Kong, China), Yuxin Su (Sun Yat-sen University), Michael Lyu (The Chinese University of Hong Kong) | Pre-print
11:45 | 15m Talk | LogReducer: Identify and Reduce Log Hotspots in Kernel on the Fly | Technical Track | Guangba Yu (Sun Yat-Sen University), Pengfei Chen (Sun Yat-Sen University), Pairui Li (Tencent Inc.), Tianjun Weng (Tencent Inc.), Haibing Zheng (Tencent), Yuetang Deng (Tencent), Zibin Zheng (School of Software Engineering, Sun Yat-sen University) | Pre-print
12:00 | 15m Talk | TraceArk: Towards Actionable Performance Anomaly Alerting for Online Service Systems | SEIP - Software Engineering in Practice | Zhengran Zeng (Southern University of Science and Technology), Yuqun Zhang (Southern University of Science and Technology), Yong Xu (Microsoft Research), Minghua Ma (Microsoft Research), Bo Qiao (Microsoft Research), Wentao Zou, Qingjun Chen, Meng Zhang, Xu Zhang (Microsoft Research), Hongyu Zhang (The University of Newcastle), Xuedong Gao, Hao Fan, Saravan Rajmohan (Microsoft 365), Qingwei Lin (Microsoft Research), Dongmei Zhang (Microsoft Research)
12:15 | 7m Talk | ActivFORMS: A Formally-Founded Model-Based Approach to Engineer Self-Adaptive Systems | Journal-First Papers
12:22 | 7m Talk | Auto-Logging: AI-centred Logging Instrumentation | NIER - New Ideas and Emerging Results | Pre-print