AlertGuardian: Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems (ASE 2025 - Research Papers)

Who

Guangba Yu, Genting Mai, Rui Wang, Ruipeng Li, Pengfei Chen, Long Pan, Ruijie Xu

Track

ASE 2025 Research Papers

This program is tentative and subject to change.

Time Zone

The program is currently displayed in (GMT+09:00) Seoul.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Mon 17 Nov 2025 14:10 - 14:20 at Grand Hall 5 - Software Process

Abstract

Alerts are critical for detecting anomalies in large-scale cloud systems and for ensuring reliability and a good user experience. However, current systems generate overwhelming volumes of alerts, and ineffective alert life-cycle management degrades operational efficiency. This experience paper details the efforts of Company-X to optimize alert life-cycle management and address alert fatigue in cloud systems. We propose AlertGuardian, a framework that combines large language models (LLMs) with lightweight graph models to optimize the alert life-cycle in three phases: Alert Denoise uses a graph learning model with virtual noise to filter out noisy alerts, Alert Summary employs Retrieval-Augmented Generation (RAG) with LLMs to create actionable alert summaries, and Alert Rule Refinement leverages iterative multi-agent feedback to improve alert rule quality. Evaluated on four real-world datasets from Company-X’s services, AlertGuardian significantly mitigates alert fatigue (94.8% alert reduction ratio) and accelerates fault diagnosis (90.5% diagnosis accuracy). Moreover, AlertGuardian improves 1,174 alert rules, of which 375 were accepted by SREs (a 32% acceptance rate). Finally, we share success stories and lessons learned about alert life-cycle management from the deployment of AlertGuardian at Company-X.
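
The headline figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the definition of the alert reduction ratio (the share of alerts removed) is an assumption, since the abstract does not define the metric. Only the counts 1,174 and 375 come from the abstract; the 1,000 and 52 alert counts are made-up inputs chosen to reproduce a 94.8% ratio.

```python
# Counts quoted in the abstract.
IMPROVED_RULES = 1174  # alert rules improved by AlertGuardian
ACCEPTED_RULES = 375   # improved rules accepted by SREs


def acceptance_rate(accepted: int, proposed: int) -> float:
    """Fraction of proposed rule improvements that were accepted."""
    if proposed <= 0:
        raise ValueError("proposed must be positive")
    return accepted / proposed


def reduction_ratio(alerts_before: int, alerts_after: int) -> float:
    """Share of alerts removed (ASSUMED definition, not taken from the paper)."""
    if alerts_before <= 0:
        raise ValueError("alerts_before must be positive")
    return 1 - alerts_after / alerts_before


rate = acceptance_rate(ACCEPTED_RULES, IMPROVED_RULES)
print(f"Acceptance rate: {rate:.1%}")  # 31.9%, which the abstract rounds to 32%

# Hypothetical volumes: 1,000 raw alerts reduced to 52 after processing.
example_ratio = reduction_ratio(1000, 52)
print(f"Example reduction ratio: {example_ratio:.1%}")
```

Under the assumed definition, a 94.8% reduction ratio would mean that roughly 52 of every 1,000 raw alerts remain after processing.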

Guangba Yu

The Chinese University of Hong Kong

Hong Kong SAR China

Genting Mai

Sun Yat-sen University

Rui Wang

Tencent

Ruipeng Li

Tencent

Pengfei Chen

Sun Yat-sen University

Long Pan

Tencent

Ruijie Xu

Tencent

Session Program

Mon 17 Nov
Displayed time zone: Seoul

14:00 - 15:30	Software Process (Research Papers / Journal-First Track) at Grand Hall 5

14:00 10m Talk		LAURA: Enhancing Code Review Generation with Context-Enriched Retrieval-Augmented LLM | Research Papers | Yuxin Zhang (Beijing Institute of Technology); Yuxia Zhang (Beijing Institute of Technology); Zeyu Sun (Institute of Software, Chinese Academy of Sciences); Yanjie Jiang (Peking University); Hui Liu (Beijing Institute of Technology)
14:10 10m Talk		AlertGuardian: Intelligent Alert Life-Cycle Management for Large-scale Cloud Systems | Research Papers | Guangba Yu (The Chinese University of Hong Kong); Genting Mai (Sun Yat-sen University); Rui Wang (Tencent); Ruipeng Li (Tencent); Pengfei Chen (Sun Yat-sen University); Long Pan (Tencent); Ruijie Xu (Tencent)
14:20 10m Talk		SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation | Research Papers | Aaditya Bhatia (Queen's University); Gustavo Oliva (Centre for Software Excellence, Huawei Canada); Gopi Krishnan Rajbahadur (Centre for Software Excellence, Huawei, Canada); Haoxiang Zhang (Huawei); Yihao Chen (Center for Software Excellence, Huawei Canada); Zhilong Chen (Center for Software Excellence, Huawei Canada); Arthur Leung (Center for Software Excellence, Huawei Canada); Dayi Lin (Centre for Software Excellence, Huawei Canada); Boyuan Chen (Centre for Software Excellence, Huawei Canada); Ahmed E. Hassan (Queen’s University)
14:30 10m Talk		Managing the variability of a logistics robotic system | Journal-First Track | Kentaro Yoshimura (Hitachi, Ltd.); Yuta Yamauchi (Hitachi, Ltd.); Hideo Takahashi (Hitachi, Ltd.)
14:40 10m Talk		Sprint2Vec: A Deep Characterization of Sprints in Iterative Software Development | Journal-First Track | Morakot Choetkiertikul (Mahidol University, Thailand); Peerachai Banyongrakkul (Mahidol University); Chaiyong Rakhitwetsagul (Mahidol University, Thailand); Suppawong Tuarob (Mahidol University); Hoa Khanh Dam (University of Wollongong); Thanwadee Sunetnanta (Mahidol University)
14:50 10m Talk		Supporting Emotional Intelligence, Productivity and Team Goals while Handling Software Requirements Changes | Journal-First Track | Kashumi Madampe (Monash University, Australia); Rashina Hoda (Monash University); John Grundy (Monash University)
15:00 10m Talk		Rechecking Recheck Requests in Continuous Integration: An Empirical Study of OpenStack | Research Papers | Yelizaveta Brus (University of Waterloo); Rungroj Maipradit (University of Waterloo); Earl T. Barr (University College London); Shane McIntosh (University of Waterloo)
15:10 10m Talk		An LLM-based multi-agent framework for agile effort estimation | Research Papers | Long Bui (University of Wollongong); Hoa Khanh Dam (University of Wollongong); Rashina Hoda (Monash University)
15:20 10m Talk		From Characters to Structure: Rethinking Real-Time Collaborative Programming Models | Research Papers | Leon Freudenthaler (FH Campus Wien); Bernhard Taufner (FH Campus Wien); Karl M. Göschka (TU Wien)