The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier
Failure root cause analysis (RCA), which systematically identifies underlying faults, is essential for ensuring the reliability of widely adopted microservice-based applications and cloud-native systems. However, manual analysis based on simple rules is burdensome because of the heterogeneous nature of resource entities and the massive amount of observability data. Furthermore, existing approaches for automating RCA struggle to perform in-depth fault analysis without extensive fault labels. To address the scarcity of fault labels, we examine an extreme RCA scenario in which each fault type has only one example (one-shot). We propose \textit{LasRCA}, a framework for one-shot RCA in cloud-native systems that leverages the collaboration of a large language model (LLM) and a small classifier. During the training stage, \textit{LasRCA} first trains a small classifier on the one-shot fault examples. The small classifier then iteratively selects high-confusion samples and receives feedback on their fault types from LLM-driven fault labeling. These newly labeled samples are used to retrain the small classifier. During the inference stage, \textit{LasRCA} performs joint RCA through the collaboration of the LLM and the small classifier, achieving a trade-off between effectiveness and cost. Experimental results on public datasets of a heterogeneous nature with prevalent fault types show the effectiveness of \textit{LasRCA} in one-shot RCA.
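The training and inference loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: the small classifier is a nearest-centroid model, "high confusion" is measured as a small distance margin between the two closest fault types, the LLM is a caller-supplied function `ask_llm`, and inference defers to the LLM only when the margin falls below a hypothetical `threshold`.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Model = Dict[str, List[float]]


def fit_centroids(examples: List[Tuple[Vector, str]]) -> Model:
    """Stand-in small classifier: one mean feature vector per fault type."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for x, label in examples:
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict(model: Model, x: Vector) -> Tuple[str, float]:
    """Return (fault type, margin). A small margin means high confusion."""
    dists = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(c, x))), lab)
        for lab, c in model.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def train_with_llm_feedback(
    one_shot: List[Tuple[Vector, str]],
    unlabeled: List[Vector],
    ask_llm: Callable[[Vector], str],  # hypothetical LLM labeling function
    rounds: int = 2,
    per_round: int = 2,
) -> Model:
    """Train on one-shot examples, then retrain on LLM-labeled confusing samples."""
    labeled = list(one_shot)
    pool = list(unlabeled)
    model = fit_centroids(labeled)
    for _ in range(rounds):
        if not pool:
            break
        # Assumption: the most confusing samples are those with the smallest margin.
        pool.sort(key=lambda x: predict(model, x)[1])
        picked, pool = pool[:per_round], pool[per_round:]
        labeled += [(x, ask_llm(x)) for x in picked]
        model = fit_centroids(labeled)
    return model


def joint_rca(
    model: Model, x: Vector, ask_llm: Callable[[Vector], str], threshold: float
) -> str:
    """Assumed routing rule: keep the cheap prediction unless it is too uncertain."""
    label, margin = predict(model, x)
    return label if margin >= threshold else ask_llm(x)
```

In this sketch the cost side of the trade-off is the number of `ask_llm` calls, which grows as `per_round`, `rounds`, or `threshold` increase.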
Tue 29 Oct (displayed time zone: Pacific Time, US & Canada)
13:30 - 15:00
13:30 | 15m | Talk | Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We? | Research Papers
13:45 | 15m | Talk | The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier | Research Papers | Yongqi Han (Tongji University), Qingfeng Du (Tongji University), Ying Huang (Tongji University), Jiaqi Wu (Zhejiang University), Fulong Tian (Di-Matrix(Shanghai) Information Technology Co., Ltd), Cheng He (Di-Matrix(Shanghai) Information Technology Co., Ltd)
14:00 | 15m | Talk | MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data | Research Papers | Wang yidan (The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen)), Zhouruixing Zhu (Chinese University of Hong Kong, Shenzhen), Qiuai Fu (Huawei Cloud Computing Technologies CO., LTD.), Yuchi Ma (Huawei Cloud Computing Technologies), Pinjia He (Chinese University of Hong Kong, Shenzhen)
14:15 | 15m | Talk | Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization | Research Papers | Lei Tao (Nankai University), Shenglin Zhang (Nankai University), Zedong Jia (Nankai University), Jinrui Sun (Nankai University), Minghua Ma (Microsoft Research), Zhengdan Li (Nankai University), Yongqian Sun (Nankai University), Canqun Yang (National University of Defense Technology), Yuzhi Zhang (Nankai University), Dan Pei (Tsinghua University)