Alert Summarization for Online Service Systems by Validating Propagation Paths of Faults
In online service systems, alerts are crucial for root cause analysis because they capture the symptoms triggered by system faults. In real-world scenarios, a single fault can propagate across multiple system components and generate a large volume of alerts. Various approaches have been proposed that use system topology information to summarize alerts into incidents and thereby accelerate root cause analysis. However, these approaches consider only the connectivity of the topology and neglect its semantics, which significantly degrades their performance. In this paper, we introduce ProAlert, a novel topology-based approach that summarizes alerts into incidents by validating fault propagation paths. ProAlert first learns fault propagation patterns offline, in an unsupervised manner, from historical alerts and the system topology. It then uses these patterns to validate fault propagation paths among real-time alerts, leading to more accurate incident summarization. Moreover, the fault propagation paths provided by ProAlert improve the interpretability of incidents, helping maintenance engineers understand the root causes of faults. To demonstrate the effectiveness and efficiency of ProAlert, we conduct extensive experiments on real-world data. The results show that ProAlert outperforms state-of-the-art approaches.
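
To make the two-stage idea concrete, the following is a minimal illustrative sketch, not ProAlert's actual algorithm. It assumes that a "propagation pattern" can be approximated by counting how often a pair of alert types co-occurs on topologically linked components within a time window. All names here (`learn_patterns`, `summarize`, the alert fields `time`, `node`, `type`, and the parameters `window` and `min_support`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical topology: (upstream, downstream) pairs along which a fault may spread.
EDGES = {("db", "api"), ("api", "web")}


def learn_patterns(history, edges, window=60):
    """Offline stage (assumed form): count alert-type pairs seen on linked components."""
    counts = Counter()
    ordered = sorted(history, key=lambda alert: alert["time"])
    for a, b in combinations(ordered, 2):  # a is never later than b
        if b["time"] - a["time"] > window:
            continue
        if (a["node"], b["node"]) in edges:
            counts[(a["type"], b["type"])] += 1
    return counts


def summarize(alerts, edges, patterns, min_support=2, window=60):
    """Online stage (assumed form): merge alerts only across validated links."""
    parent = list(range(len(alerts)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    order = sorted(range(len(alerts)), key=lambda i: alerts[i]["time"])
    for i, j in combinations(order, 2):
        a, b = alerts[i], alerts[j]
        linked = (a["node"], b["node"]) in edges
        validated = patterns.get((a["type"], b["type"]), 0) >= min_support
        if linked and validated and b["time"] - a["time"] <= window:
            parent[find(j)] = find(i)

    incidents = {}
    for i in range(len(alerts)):
        incidents.setdefault(find(i), []).append(i)
    return sorted(incidents.values())


# Made-up historical alerts: the same cascade twice, plus one unrelated pair.
HISTORY = [
    {"time": 0, "node": "db", "type": "disk_full"},
    {"time": 5, "node": "api", "type": "timeout"},
    {"time": 9, "node": "web", "type": "http_5xx"},
    {"time": 1000, "node": "db", "type": "disk_full"},
    {"time": 1004, "node": "api", "type": "timeout"},
    {"time": 1010, "node": "web", "type": "http_5xx"},
    {"time": 2000, "node": "db", "type": "cpu_high"},
    {"time": 2003, "node": "api", "type": "cert_expiry"},
]

# Made-up real-time alerts; all four sit on connected components.
ALERTS = [
    {"time": 0, "node": "db", "type": "disk_full"},
    {"time": 3, "node": "api", "type": "timeout"},
    {"time": 4, "node": "api", "type": "cert_expiry"},
    {"time": 8, "node": "web", "type": "http_5xx"},
]

patterns = learn_patterns(HISTORY, EDGES)
incidents = summarize(ALERTS, EDGES, patterns)
print(incidents)  # [[0, 1, 3], [2]]
```

In this toy input, a connectivity-only grouping would place all four alerts in one incident because `db`, `api`, and `web` are linked. The sketch instead keeps the `cert_expiry` alert separate, since no sufficiently frequent historical pattern connects it to its neighbors, which is the intuition behind validating propagation paths rather than relying on connectivity alone.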