Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
Due to the scale and complexity of cloud systems, a system failure would trigger an “alert storm”, i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts.
To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.
Fri 19 AprDisplayed time zone: Lisbon change
14:00 - 15:30 | Dependability and Formal methods 3Research Track / Software Engineering in Practice / New Ideas and Emerging Results at Maria Helena Vieira da Silva Chair(s): Shahar Maoz Tel Aviv University | ||
14:00 15mTalk | It's Not a Feature, It's a Bug: Fault-Tolerant Model Mining from Noisy Data Research Track Felix Wallner Graz University of Technology, Institute of Software Technology, Bernhard Aichernig Graz University of Technology, Christian Burghard AVL List GmbH Link to publication DOI | ||
14:15 15mTalk | Verifying Declarative Smart Contracts Research Track Haoxian Chen ShanghaiTech University, Lan Lu University of Pennsylvania, Brendan Massey University of Pennsylvania, Yuepeng Wang Simon Fraser University, Boon Thau Loo University of Pennsylvania | ||
14:30 15mTalk | Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach Software Engineering in Practice Jinxi Kuang The Chinese University of Hong Kong, Jinyang Liu The Chinese University of Hong Kong, Junjie Huang The Chinese University of Hong Kong, Renyi Zhong The Chinese University of Hong Kong, Jiazhen Gu The Chinese University of Hong Kong, Lan Yu Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Rui Tan Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Zengyin Yang Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Michael Lyu The Chinese University of Hong Kong | ||
14:45 15mTalk | Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach Software Engineering in Practice Pooja Srinivas Microsoft, Fiza Husain Microsoft, Anjaly Parayil Microsoft, Ayush Choure Microsoft, Chetan Bansal Microsoft Research, Saravan Rajmohan Microsoft | ||
15:00 15mTalk | FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems Software Engineering in Practice Junjie Huang The Chinese University of Hong Kong, Jinyang Liu The Chinese University of Hong Kong, Zhuangbin Chen School of Software Engineering, Sun Yat-sen University, Zhihan Jiang The Chinese University of Hong Kong, Yichen LI The Chinese University of Hong Kong, Jiazhen Gu The Chinese University of Hong Kong, Cong Feng Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Zengyin Yang Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Yongqiang Yang Huawei Technologies, Michael Lyu The Chinese University of Hong Kong | ||
15:15 7mTalk | Translating between SQL Dialects for Cloud Migration Software Engineering in Practice Ran Zmigrod JP Morgan - Chase, Salwa Alamir J.P. Morgan AI Research, Xiaomo Liu JP Morgan AI Research | ||
15:22 7mTalk | Designing Trustful Cooperation Ecosystems is Key to the New Space Exploration Era New Ideas and Emerging Results Renan Lima Baima University of Luxembourg, Loïck Chovet University of Luxembourg, Johannes Sedlmeir University of Luxembourg, Miguel A. Olivares-Mendez University of Luxembourg, Gilbert Fridgen University of Luxembourg |