ASE 2024
Sun 27 October - Fri 1 November 2024 Sacramento, California, United States
Tue 29 Oct 2024 13:45 - 14:00 at Compagno - Root-cause analysis Chair(s): Curtis Atkisson

Failure root cause analysis (RCA), which systematically identifies underlying faults, is essential for ensuring the reliability of widely adopted microservice-based applications and cloud-native systems. However, manual analysis based on simple rules is burdensome because resource entities are heterogeneous and the volume of observability data is massive. Furthermore, existing approaches for automating RCA struggle to perform in-depth fault analysis without extensive fault labels. To address the scarcity of fault labels, we examine an extreme RCA scenario in which each fault type has only one example (one-shot). We propose LasRCA, a framework for one-shot RCA in cloud-native systems that leverages the collaboration of a large language model (LLM) and a small classifier. During the training stage, LasRCA first trains a small classifier on the one-shot fault examples. The small classifier then iteratively selects high-confusion samples and receives feedback on their fault types from LLM-driven fault labeling. These samples are then used to retrain the small classifier. During the inference stage, LasRCA performs joint RCA through the collaboration of the LLM and the small classifier, achieving a trade-off between effectiveness and cost. Experimental results on public datasets with heterogeneous resource entities and prevalent fault types show the effectiveness of LasRCA in one-shot RCA.
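The training and inference stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a nearest-centroid model as a stand-in for the small classifier, uses the distance margin between the two closest fault types as a hypothetical measure of confusion, and takes a caller-supplied `llm_label` function as a placeholder for LLM-driven fault labeling. The abstract does not say how the joint inference is carried out, so the threshold-based routing in `joint_infer` is likewise only an assumption.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Model = Dict[str, List[float]]


def centroids(samples: List[Tuple[Vector, str]]) -> Model:
    """Fit the stand-in small classifier: one mean vector per fault type."""
    grouped: Dict[str, List[Vector]] = {}
    for x, label in samples:
        grouped.setdefault(label, []).append(x)
    return {
        label: [sum(col) / len(xs) for col in zip(*xs)]
        for label, xs in grouped.items()
    }


def predict(model: Model, x: Vector) -> Tuple[str, float]:
    """Return (fault type, margin); a small margin means high confusion."""
    dists = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(x, c))), label)
        for label, c in model.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def train(
    one_shot: List[Tuple[Vector, str]],
    unlabeled: List[Vector],
    llm_label: Callable[[Vector], str],  # hypothetical LLM labeling hook
    rounds: int = 2,
    per_round: int = 2,
) -> Model:
    """Start from one example per fault type, then iteratively retrain."""
    labeled = list(one_shot)
    pool = list(unlabeled)
    model = centroids(labeled)
    for _ in range(rounds):
        if not pool:
            break
        # Pick the samples the current classifier is least sure about.
        pool.sort(key=lambda x: predict(model, x)[1])
        picked, pool = pool[:per_round], pool[per_round:]
        # Ask the (stubbed) LLM for their fault types and retrain.
        labeled += [(x, llm_label(x)) for x in picked]
        model = centroids(labeled)
    return model


def joint_infer(
    model: Model,
    x: Vector,
    llm_label: Callable[[Vector], str],
    margin_threshold: float = 0.5,  # assumed cost/effectiveness knob
) -> str:
    """Use the cheap classifier when confident, otherwise defer to the LLM."""
    label, margin = predict(model, x)
    return label if margin >= margin_threshold else llm_label(x)
```

In this sketch, raising `margin_threshold` sends more samples to the LLM (higher cost, potentially better labels), while lowering it keeps more decisions with the small classifier. A real system would replace `llm_label` with an actual LLM call over observability data.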

Tue 29 Oct

Displayed time zone: Pacific Time (US & Canada)

13:30 - 15:00
Root-cause analysis - Research Papers at Compagno
Chair(s): Curtis Atkisson UW
13:30
15m
Talk
Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?
Research Papers
Luan Pham RMIT University, Huong Ha RMIT University, Hongyu Zhang Chongqing University
Pre-print
13:45
15m
Talk
The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier
Research Papers
Yongqi Han Tongji University, Qingfeng Du Tongji University, Ying Huang Tongji University, Jiaqi Wu Zhejiang University, Fulong Tian Di-Matrix(Shanghai) Information Technology Co., Ltd, Cheng He Di-Matrix(Shanghai) Information Technology Co., Ltd
14:00
15m
Talk
MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data
Research Papers
Wang yidan The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Zhouruixing Zhu Chinese University of Hong Kong, Shenzhen, Qiuai Fu Huawei Cloud Computing Technologies CO., LTD., Yuchi Ma Huawei Cloud Computing Technologies, Pinjia He Chinese University of Hong Kong, Shenzhen
14:15
15m
Talk
Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization
Research Papers
Lei Tao Nankai University, Shenglin Zhang Nankai University, Zedong Jia Nankai University, Jinrui Sun Nankai University, Minghua Ma Microsoft Research, Zhengdan Li Nankai University, Yongqian Sun Nankai University, Canqun Yang National University of Defense Technology, Yuzhi Zhang Nankai University, Dan Pei Tsinghua University