ASE 2024
Sun 27 October - Fri 1 November 2024 Sacramento, California, United States
Tue 29 Oct 2024 14:00 - 14:15 at Compagno - Root-cause analysis Chair(s): Curtis Atkisson

Due to the complexity and dynamic nature of large-scale microservice systems, manual troubleshooting is time-consuming and impractical, so automated Root Cause Analysis (RCA) is essential. However, existing RCA approaches face significant challenges. (1) Multi-modal data (e.g., traces, logs, and metrics) record the status of microservice systems, but most existing RCA approaches rely on a single data source and therefore fail to capture the full system state. (2) Existing RCA approaches ignore services' anomaly states and anomaly intensity. (3) Service-level RCA approaches lack the detailed information needed for quick issue resolution. To tackle these challenges, we propose MRCA, a metric-level RCA approach using multi-modal data. Our key insight is that multi-modal data allows for a comprehensive understanding of the system, enabling the localization of root causes across more anomaly scenarios. MRCA first uses traces and logs to obtain a ranking list of abnormal services based on reconstruction probability. It then builds causal graphs from services with high anomaly probability to discover the order in which abnormal metrics of different services occur. By incorporating a reward mechanism, MRCA stops excessive expansion of the causal graph and significantly reduces the time taken for causal analysis. Finally, MRCA prunes the ranking list based on the causal graph and identifies metric-level root causes. Experiments on two widely used microservice benchmarks demonstrate that MRCA outperforms state-of-the-art approaches in both accuracy and efficiency.
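
The rank-then-prune idea in the abstract can be illustrated with a toy sketch. This is not the MRCA implementation: the anomaly scores, the threshold, the node names, and the causal edges are all hypothetical inputs, and the functions `rank_by_anomaly` and `prune_with_causal_graph` are invented here for illustration. The sketch assumes a higher score means "more anomalous" and that a causal edge `(cause, effect)` is already known.

```python
from typing import Dict, List, Tuple


def rank_by_anomaly(scores: Dict[str, float]) -> List[str]:
    """Order nodes from most to least anomalous (assumes higher = worse)."""
    return sorted(scores, key=lambda node: scores[node], reverse=True)


def prune_with_causal_graph(
    ranking: List[str],
    scores: Dict[str, float],
    edges: List[Tuple[str, str]],
    threshold: float,
) -> List[str]:
    """Keep abnormal nodes that are not explained by an abnormal cause.

    A node whose anomaly can be attributed to an abnormal upstream node
    is treated as a symptom and dropped from the candidate list.
    """
    abnormal = {node for node in ranking if scores[node] >= threshold}
    explained = {
        effect
        for cause, effect in edges
        if cause in abnormal and effect in abnormal
    }
    return [n for n in ranking if n in abnormal and n not in explained]


# Hypothetical anomaly scores for four services.
scores = {"frontend": 0.9, "cart": 0.8, "db": 0.7, "auth": 0.1}
# Hypothetical causal edges: a db anomaly propagates to cart, then frontend.
edges = [("db", "cart"), ("cart", "frontend")]

ranking = rank_by_anomaly(scores)
candidates = prune_with_causal_graph(ranking, scores, edges, threshold=0.5)
print(ranking)
print(candidates)
```

In this toy input, `frontend` ranks highest by score alone, yet pruning leaves only `db`, because the other abnormal nodes each have an abnormal upstream cause. That is the intuition behind using a causal graph to refine a score-based ranking.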

Tue 29 Oct

Displayed time zone: Pacific Time (US & Canada)

13:30 - 15:00
Root-cause analysis - Research Papers at Compagno
Chair(s): Curtis Atkisson UW
13:30
15m
Talk
Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?
Research Papers
Luan Pham RMIT University, Huong Ha RMIT University, Hongyu Zhang Chongqing University
13:45
15m
Talk
The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier
Research Papers
Yongqi Han Tongji University, Qingfeng Du Tongji University, Ying Huang Tongji University, Jiaqi Wu Zhejiang University, Fulong Tian Di-Matrix(Shanghai) Information Technology Co., Ltd, Cheng He Di-Matrix(Shanghai) Information Technology Co., Ltd
14:00
15m
Talk
MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data
Research Papers
Yidan Wang The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Zhouruixing Zhu Chinese University of Hong Kong, Shenzhen, Qiuai Fu Huawei Cloud Computing Technologies CO., LTD., Yuchi Ma Huawei Cloud Computing Technologies, Pinjia He Chinese University of Hong Kong, Shenzhen
14:15
15m
Talk
Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization
Research Papers
Lei Tao Nankai University, Shenglin Zhang Nankai University, Zedong Jia Nankai University, Jinrui Sun Nankai University, Minghua Ma Microsoft Research, Zhengdan Li Nankai University, Yongqian Sun Nankai University, Canqun Yang National University of Defense Technology, Yuzhi Zhang Nankai University, Dan Pei Tsinghua University