AlertGuardian: Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems
This program is tentative and subject to change.
Alerts are critical for detecting anomalies in large-scale cloud systems and for ensuring reliability and a good user experience. However, current systems generate overwhelming volumes of alerts, and ineffective alert life-cycle management degrades operational efficiency. This experience paper details the efforts of Company-X to optimize alert life-cycle management and address alert fatigue in cloud systems. We propose AlertGuardian, a framework that combines large language models (LLMs) with lightweight graph models to optimize the alert life-cycle in three phases: Alert Denoise uses a graph learning model with virtual noise to filter noisy alerts, Alert Summary employs Retrieval-Augmented Generation (RAG) with LLMs to create actionable alert summaries, and Alert Rule Refinement leverages iterative multi-agent feedback to improve alert rule quality. Evaluated on four real-world datasets from Company-X’s services, AlertGuardian significantly mitigates alert fatigue (94.8% alert reduction ratio) and accelerates fault diagnosis (90.5% diagnosis accuracy). Moreover, AlertGuardian improves 1,174 alert rules, of which 375 were accepted by SREs (32% acceptance rate). Finally, we share success stories and lessons learned about alert life-cycle management from the deployment of AlertGuardian at Company-X.
Mon 17 Nov (displayed time zone: Seoul)
14:00 - 15:30
14:00 | 10m | Talk | LAURA: Enhancing Code Review Generation with Context-Enriched Retrieval-Augmented LLM | Research Papers | Yuxin Zhang (Beijing Institute of Technology), Yuxia Zhang (Beijing Institute of Technology), Zeyu Sun (Institute of Software, Chinese Academy of Sciences), Yanjie Jiang (Peking University), Hui Liu (Beijing Institute of Technology)
14:10 | 10m | Talk | AlertGuardian: Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems | Research Papers | Guangba Yu (The Chinese University of Hong Kong), Genting Mai (Sun Yat-sen University), Rui Wang (Tencent), Ruipeng Li (Tencent), Pengfei Chen (Sun Yat-sen University), Long Pan (Tencent), Ruijie Xu (Tencent)
14:20 | 10m | Talk | SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation | Research Papers | Aaditya Bhatia (Queen's University), Gustavo Oliva (Centre for Software Excellence, Huawei Canada), Gopi Krishnan Rajbahadur (Centre for Software Excellence, Huawei, Canada), Haoxiang Zhang (Huawei), Yihao Chen (Center for Software Excellence, Huawei Canada), Zhilong Chen (Center for Software Excellence, Huawei Canada), Arthur Leung (Center for Software Excellence, Huawei Canada), Dayi Lin (Centre for Software Excellence, Huawei Canada), Boyuan Chen (Centre for Software Excellence, Huawei Canada), Ahmed E. Hassan (Queen’s University)
14:30 | 10m | Talk | Managing the variability of a logistics robotic system | Journal-First Track
14:40 | 10m | Talk | Sprint2Vec: A Deep Characterization of Sprints in Iterative Software Development | Journal-First Track | Morakot Choetkiertikul (Mahidol University, Thailand), Peerachai Banyongrakkul (Mahidol University), Chaiyong Rakhitwetsagul (Mahidol University, Thailand), Suppawong Tuarob (Mahidol University), Hoa Khanh Dam (University of Wollongong), Thanwadee Sunetnanta (Mahidol University)
14:50 | 10m | Talk | Supporting Emotional Intelligence, Productivity and Team Goals while Handling Software Requirements Changes | Journal-First Track | Kashumi Madampe (Monash University, Australia), Rashina Hoda (Monash University), John Grundy (Monash University)
15:00 | 10m | Talk | Rechecking Recheck Requests in Continuous Integration: An Empirical Study of OpenStack | Research Papers | Yelizaveta Brus (University of Waterloo), Rungroj Maipradit (University of Waterloo), Earl T. Barr (University College London), Shane McIntosh (University of Waterloo)
15:10 | 10m | Talk | An LLM-based multi-agent framework for agile effort estimation | Research Papers | Long Bui (University of Wollongong), Hoa Khanh Dam (University of Wollongong), Rashina Hoda (Monash University)
15:20 | 10m | Talk | From Characters to Structure: Rethinking Real-Time Collaborative Programming Models | Research Papers