KAIOps: A Platform Solution of End-to-End Multi-Modal AIOps for AI Training at Scale
This program is tentative and subject to change.
The resilience and reliability of large-scale AI training platforms are fundamental to enabling contemporary AI innovation and business development. However, with the rapid increase in the scale and complexity of AI model training tasks, anomalies have become the norm rather than the exception. Failing to handle them properly may lead to enormous resource waste and prolonged development cycles. Traditional anomaly detection methods struggle with the complex temporal characteristics and extreme class imbalance inherent in training tasks, and they fall short of automating root cause analysis and follow-up remediation. This paper proposes KAIOps, an end-to-end automated platform solution for anomaly handling in large-scale AI training clusters, and details its implementation for improving AIOps in daily operational maintenance at Kuaishou. KAIOps employs a Temporal Context Encoding mechanism to capture and encode long-term trends and critical temporal context information within fault evolution. The detection model integrates a dynamic class-weighted loss function to improve detection performance under class imbalance. KAIOps further extends these capabilities by integrating a knowledge graph and a large language model for automated root cause analysis and actionable solution generation, delivering a complete end-to-end intelligent processing pipeline. We conducted a systematic evaluation of KAIOps using data collected from Kuaishou’s production-grade training clusters, and the results show the competitive performance of the proposed approach. KAIOps has been deployed at Kuaishou in both testbed and production-grade environments comprising more than 10,000 GPUs, where it accelerates reliability assurance for industry-scale model training and serving.
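The abstract does not define the dynamic class-weighted loss, so the following is only a minimal sketch of one common way to realise the idea, not the KAIOps implementation. It assumes inverse-frequency weights recomputed from each batch and a weight-normalised cross-entropy; the function names, the `smoothing` parameter, and the two-class setup are all hypothetical.

```python
import math
from collections import Counter


def dynamic_class_weights(labels, num_classes, smoothing=1.0):
    """Inverse-frequency class weights recomputed from the current batch.

    Assumption: "dynamic" here means the weights follow the class counts
    of each batch. `smoothing` keeps the weight finite for absent classes.
    """
    counts = Counter(labels)
    total = len(labels) + smoothing * num_classes
    # Rarer classes (e.g. anomalies) receive proportionally larger weights.
    return [
        total / (num_classes * (counts.get(c, 0) + smoothing))
        for c in range(num_classes)
    ]


def weighted_cross_entropy(probs, labels, weights):
    """Cross-entropy where each sample is scaled by the weight of its class.

    `probs` holds one probability vector per sample; the result is
    normalised by the sum of the applied weights.
    """
    numerator = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator


# Hypothetical imbalanced batch: nine normal samples (0), one anomaly (1).
labels = [0] * 9 + [1]
weights = dynamic_class_weights(labels, num_classes=2)
probs = [[0.5, 0.5] for _ in labels]
loss = weighted_cross_entropy(probs, labels, weights)
```

In this sketch a misclassified anomaly contributes more to the loss than a misclassified normal sample, which is the usual motivation for class weighting under extreme imbalance.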
Mon 17 Nov (displayed time zone: Seoul)
16:00 - 17:10
16:00 | 10m Talk | Interaction-Aware Patch Assessment for Multi-Fault Automated Program Repair | NIER Track | Omar I. Al-Bataineh (Gran Sasso Science Institute (GSSI))
16:10 | 10m Talk | Simulated Interactive Debugging | NIER Track | Yannic Noller (Ruhr University Bochum), Erick Chandra (Singapore University of Technology and Design), Srinidhi HC (Singapore University of Technology and Design), Kenny Choo (Singapore University of Technology and Design), Cyrille Jegourel (ISTD, Singapore University of Technology and Design), Oka Kurniawan (Singapore University of Technology and Design), Chris Poskitt (Singapore Management University)
16:20 | 10m Talk | KAIOps: A Platform Solution of End-to-End Multi-Modal AIOps for AI Training at Scale | Industry Showcase | Zeying Wang (Beihang University), Junhong Liu (Beihang University), Penghao Zhang (Kuaishou Inc.), Xiaoyang Sun (University of Leeds), Xu Wang (Beihang University), Tianyu Wo, Chunming Hu (Beihang University), Chengru Song (Kuaishou Technology), Jin Ouyang (Kuaishou Inc.), Renyu Yang
16:30 | 10m Talk | KAIR: A Statistical and Causal Approach to Pinpointing Stragglers in Distributed Model Training | Industry Showcase | Yitang Yang (Beihang University), Junhong Liu (Beihang University), Jiapeng Chen (Kuaishou Inc.), Xiaoyang Sun (University of Leeds), Tianyu Wo, Chunming Hu (Beihang University), Chengru Song (Kuaishou Technology), Jin Ouyang (Kuaishou Inc.), Renyu Yang
16:40 | 10m Talk | BitsAI-Fix: LLM-Driven Approach for Automated Lint Error Resolution in Practice | Industry Showcase | Yuanpeng Li (ByteDance), Qi Long (Carnegie Mellon University), Zhiyuan Yao (Zhejiang University), Jian Xu (ByteDance), Lintao Xie (ByteDance), Xu He (ByteDance), Lu Geng (ByteDance), Xin Han (ByteDance), Yueyan Chen (ByteDance), Wenbo Duan (ByteDance)
16:50 | 10m Talk | TrioXpert: An automated incident management framework for microservice system | Industry Showcase | Yongqian Sun (Nankai University), Yu Luo (Nankai University), Xidao Wen (BizSeer), Yuan Yuan (National University of Defense Technology, China), Xiaohui Nie (Computer Network Information Center at Chinese Academy of Sciences), Shenglin Zhang (Nankai University), Tong Liu (Lenovo (TianJin) Co., Ltd.), Xi Luo (Lenovo (TianJin) Co., Ltd.)
17:00 | 10m Talk | eARCO: Efficient Automated Root Cause Analysis with Prompt Optimization | Industry Showcase | Drishti Goel (University of Illinois Urbana Champaign), Raghav Magazine (Microsoft Research), Supriyo Ghosh (Microsoft), Akshay Nambi (Microsoft Research), Prathamesh Deshpande (Microsoft), Xuchao Zhang (Microsoft), Chetan Bansal (Microsoft Research), Saravan Rajmohan (Microsoft)