KAIR: A Statistical and Causal Approach to Pinpointing Stragglers in Distributed Model Training
This program is tentative and subject to change.
Large-scale distributed training is the driver of modern artificial intelligence, but its performance is critically vulnerable to "stragglers", i.e., workers that slow down the entire process, leading to immense resource waste in industrial settings. Existing profiling tools focus on single-node analysis and lack the global perspective needed to identify systemic bottlenecks in distributed settings. In addition, recently emerging performance analysis tools can detect stragglers but lack the diagnostic insights that engineers need to determine root causes. We introduce KAIR, a production-grade observability system that fills this diagnostic gap. KAIR is built on two core principles: (1) a low-overhead, scalable architecture that collects and aggregates fine-grained traces from thousands of nodes and (2) a novel hierarchical analysis engine that moves from statistical anomaly detection to causal inference. KAIR goes beyond simple timeline visualization by introducing a suite of techniques, including Kolmogorov-Smirnov statistics to identify divergent or unhealthy nodes and z-score analysis to pinpoint anomalous operators. This allows KAIR to pinpoint not only which worker is a straggler, but also the specific root cause responsible for the delay. Evaluations were performed in Kuaishou production clusters. The experimental results indicate that KAIR is highly effective in identifying latent performance bottlenecks overlooked by conventional tools. It incurs negligible overhead, scales to large clusters, and provides actionable insights that have significantly reduced computational waste and engineering effort.
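The two statistical techniques named in the abstract can be illustrated with a minimal Python sketch. This is not KAIR's implementation: the function names, the data layout (per-worker step times, per-operator latencies), and the thresholds are all assumptions made for illustration. It computes a two-sample Kolmogorov-Smirnov statistic between each worker's step times and the pooled step times of its peers, then z-scores a suspect worker's operator latencies against peer latencies.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The maximum gap is always attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def divergent_workers(step_times, threshold=0.5):
    """Flag workers whose step-time distribution diverges from their peers.

    step_times: hypothetical mapping worker id -> list of step durations.
    threshold: illustrative cutoff on the KS statistic (an assumption).
    """
    flagged = {}
    for worker, times in step_times.items():
        # Pool the samples of every other worker as the reference distribution.
        rest = [t for w, ts in step_times.items() if w != worker for t in ts]
        d = ks_statistic(times, rest)
        if d > threshold:
            flagged[worker] = d
    return flagged


def anomalous_operators(worker_ops, peer_ops, z_threshold=3.0):
    """Flag operators whose latency on one worker is far above the peer mean.

    worker_ops: hypothetical mapping operator name -> latency on the suspect worker.
    peer_ops: hypothetical mapping operator name -> list of peer latencies.
    """
    flagged = {}
    for op, latency in worker_ops.items():
        mu, sigma = mean(peer_ops[op]), pstdev(peer_ops[op])
        if sigma == 0:
            continue  # no spread among peers, so a z-score is undefined
        z = (latency - mu) / sigma
        if z > z_threshold:
            flagged[op] = z
    return flagged
```

Under these assumptions, a worker whose step times are uniformly slower than every peer gets a KS statistic of 1.0, and the z-score pass then narrows the delay down to individual operators on that worker. The causal-inference stage mentioned in the abstract is not covered by this sketch.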
Mon 17 Nov (displayed time zone: Seoul)
16:00 - 17:10
16:00 | 10m Talk | Interaction-Aware Patch Assessment for Multi-Fault Automated Program Repair | NIER Track | Omar I. Al-Bataineh (Gran Sasso Science Institute (GSSI))
16:10 | 10m Talk | Simulated Interactive Debugging | NIER Track | Yannic Noller (Ruhr University Bochum); Erick Chandra (Singapore University of Technology and Design); Srinidhi HC (Singapore University of Technology and Design); Kenny Choo (Singapore University of Technology and Design); Cyrille Jegourel (ISTD, Singapore University of Technology and Design); Oka Kurniawan (Singapore University of Technology and Design); Chris Poskitt (Singapore Management University) | Pre-print
16:20 | 10m Talk | KAIOps: A Platform Solution of End-to-End Multi-Modal AIOps for AI Training at Scale | Industry Showcase | Zeying Wang (Beihang University); Junhong Liu (Beihang University); Penghao Zhang (Kuaishou Inc.); Xiaoyang Sun (University of Leeds); Xu Wang (Beihang University); Tianyu Wo; Chunming Hu (Beihang University); Chengru Song (Kuaishou Technology); Jin Ouyang (Kuaishou Inc.); Renyu Yang
16:30 | 10m Talk | KAIR: A Statistical and Causal Approach to Pinpointing Stragglers in Distributed Model Training | Industry Showcase | Yitang Yang (Beihang University); Junhong Liu (Beihang University); Jiapeng Chen (Kuaishou Inc.); Xiaoyang Sun (University of Leeds); Tianyu Wo; Chunming Hu (Beihang University); Chengru Song (Kuaishou Technology); Jin Ouyang (Kuaishou Inc.); Renyu Yang
16:40 | 10m Talk | BitsAI-Fix: LLM-Driven Approach for Automated Lint Error Resolution in Practice | Industry Showcase | Yuanpeng Li (ByteDance); Qi Long (Carnegie Mellon University); Zhiyuan Yao (Zhejiang University); Jian Xu (ByteDance); Lintao Xie (ByteDance); Xu He (ByteDance); Lu Geng (ByteDance); Xin Han (ByteDance); Yueyan Chen (ByteDance); Wenbo Duan (ByteDance)
16:50 | 10m Talk | TrioXpert: An automated incident management framework for microservice system | Industry Showcase | Yongqian Sun (Nankai University); Yu Luo (Nankai University); Xidao Wen (BizSeer); Yuan Yuan (National University of Defense Technology, China); Xiaohui Nie (Computer Network Information Center at Chinese Academy of Sciences); Shenglin Zhang (Nankai University); Tong Liu (Lenovo (TianJin) Co., Ltd.); Xi Luo (Lenovo (TianJin) Co., Ltd.)
17:00 | 10m Talk | eARCO: Efficient Automated Root Cause Analysis with Prompt Optimization | Industry Showcase | Drishti Goel (University of Illinois Urbana Champaign); Raghav Magazine (Microsoft Research); Supriyo Ghosh (Microsoft); Akshay Nambi (Microsoft Research); Prathamesh Deshpande (Microsoft); Xuchao Zhang (Microsoft); Chetan Bansal (Microsoft Research); Saravan Rajmohan (Microsoft)