MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data
Due to the complexity and dynamic nature of large-scale microservice systems, manual troubleshooting is time-consuming and impractical. Therefore, automated Root Cause Analysis (RCA) is essential. However, existing RCA approaches face significant challenges. (1) Multi-modal data (e.g., traces, logs, and metrics) record the status of microservice systems, but most existing RCA approaches rely on a single data source and therefore fail to understand the system fully. (2) Existing RCA approaches ignore the services' anomaly state and their anomaly intensity. (3) Service-level RCA approaches lack the detailed information needed for quick issue resolution. To tackle these challenges, we propose MRCA, a metric-level RCA approach using multi-modal data. Our key insight is that using multi-modal data allows for a comprehensive understanding of the system, enabling the localization of root causes across more anomaly scenarios. MRCA first utilizes traces and logs to obtain a ranking list of abnormal services based on reconstruction probability. It then builds causal graphs from services with high anomaly probability to discover the order in which abnormal metrics of different services occur. By incorporating a reward mechanism, MRCA terminates the excessive expansion of the causal graph and significantly reduces the time taken for causal analysis. Finally, MRCA prunes the ranking list based on the causal graph and identifies metric-level root causes. Experiments on two widely-used microservice benchmarks demonstrate that MRCA outperforms state-of-the-art approaches in terms of both accuracy and efficiency.
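The first step described in the abstract, ranking abnormal services by reconstruction probability, can be illustrated with a minimal sketch. This is not MRCA's implementation: it assumes that each service already has a reconstruction probability produced by some model over its traces and logs, and that a low probability signals an anomaly (a common convention in reconstruction-based detectors). The function name, the service names, the probability values, and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List, Tuple


def rank_abnormal_services(
    recon_prob: Dict[str, float], threshold: float = 0.5
) -> List[Tuple[str, float]]:
    """Keep services below the (hypothetical) threshold, most anomalous first."""
    # Assumption: a low reconstruction probability means the observed
    # behaviour is poorly explained by the model, i.e. likely abnormal.
    candidates = [(svc, p) for svc, p in recon_prob.items() if p < threshold]
    # Sort ascending so the least well-reconstructed service comes first.
    return sorted(candidates, key=lambda item: item[1])


# Hypothetical per-service reconstruction probabilities.
scores = {"frontend": 0.92, "cart": 0.08, "payment": 0.35, "shipping": 0.81}
ranking = rank_abnormal_services(scores)
print(ranking)  # [('cart', 0.08), ('payment', 0.35)]
```

In this sketch, only the services that survive the threshold would be passed on to the later causal-graph stage, which is where the abstract places the metric-level analysis.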
Tue 29 Oct (displayed time zone: Pacific Time, US & Canada)
13:30 - 15:00
13:30 | 15m Talk | Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We? | Research Papers
13:45 | 15m Talk | The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier | Research Papers | Yongqi Han (Tongji University), Qingfeng Du (Tongji University), Ying Huang (Tongji University), Jiaqi Wu (Zhejiang University), Fulong Tian (Di-Matrix(Shanghai) Information Technology Co., Ltd), Cheng He (Di-Matrix(Shanghai) Information Technology Co., Ltd)
14:00 | 15m Talk | MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data | Research Papers | Wang yidan (The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen)), Zhouruixing Zhu (Chinese University of Hong Kong, Shenzhen), Qiuai Fu (Huawei Cloud Computing Technologies CO., LTD.), Yuchi Ma (Huawei Cloud Computing Technologies), Pinjia He (Chinese University of Hong Kong, Shenzhen)
14:15 | 15m Talk | Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization | Research Papers | Lei Tao (Nankai University), Shenglin Zhang (Nankai University), Zedong Jia (Nankai University), Jinrui Sun (Nankai University), Minghua Ma (Microsoft Research), Zhengdan Li (Nankai University), Yongqian Sun (Nankai University), Canqun Yang (National University of Defense Technology), Yuzhi Zhang (Nankai University), Dan Pei (Tsinghua University)