ASE 2025
Sun 16 - Thu 20 November 2025 Seoul, South Korea

This program is tentative and subject to change.

Mon 17 Nov 2025 16:20 - 16:30 at Grand Hall 1 - Program Repair 3

The resilience and reliability of large-scale AI training platforms are fundamental to enabling contemporary AI innovation and business development. However, with the rapid increase in the scale and complexity of AI model training tasks, anomalies have become the norm rather than the exception. Failing to handle them properly may lead to enormous resource waste and prolonged development cycles. Traditional anomaly detection methods struggle with the complex temporal characteristics and extreme class imbalance inherent in training tasks, and fall short of automating root cause analysis and follow-up remediation. This paper proposes KAIOps, an end-to-end automated platform solution for anomaly handling in large-scale AI training clusters, and details its implementation for improving AIOps in daily operational maintenance at Kuaishou. KAIOps employs a Temporal Context Encoding mechanism to capture and encode long-term trends and critical temporal context information within fault evolution. The detection model integrates a dynamic class-weighted loss function to improve detection performance. KAIOps further advances these capabilities by integrating a knowledge graph and a large language model for automated root cause analysis and actionable solution generation, delivering a complete end-to-end intelligent processing pipeline. We conducted a systematic evaluation of KAIOps on data collected from Kuaishou’s production-grade training clusters, and the results show the competitive performance of the proposed approach. KAIOps has been deployed at Kuaishou in both testbed and production-grade environments comprising over 10,000 GPUs, where it accelerates reliability assurance for industry-scale model training and serving.

Mon 17 Nov

Displayed time zone: Seoul

16:00 - 17:10
16:00
10m
Talk
Interaction-Aware Patch Assessment for Multi-Fault Automated Program Repair
NIER Track
Omar I. Al-Bataineh Gran Sasso Science Institute (GSSI)
16:10
10m
Talk
Simulated Interactive Debugging
NIER Track
Yannic Noller Ruhr University Bochum, Erick Chandra Singapore University of Technology and Design, Srinidhi HC Singapore University of Technology and Design, Kenny Choo Singapore University of Technology and Design, Cyrille Jegourel ISTD, Singapore University of Technology and Design, Oka Kurniawan Singapore University of Technology and Design, Chris Poskitt Singapore Management University
Pre-print
16:20
10m
Talk
KAIOps: A Platform Solution of End-to-End Multi-Modal AIOps for AI Training at Scale
Industry Showcase
Zeying Wang Beihang University, Junhong Liu Beihang University, Penghao Zhang Kuaishou Inc., Xiaoyang Sun University of Leeds, Xu Wang Beihang University, Tianyu Wo, Chunming Hu Beihang University, Chengru Song Kuaishou Technology, Jin Ouyang Kuaishou Inc., Renyu Yang
16:30
10m
Talk
KAIR: A Statistical and Causal Approach to Pinpointing Stragglers in Distributed Model Training
Industry Showcase
Yitang Yang Beihang University, Junhong Liu Beihang University, Jiapeng Chen Kuaishou Inc., Xiaoyang Sun University of Leeds, Tianyu Wo, Chunming Hu Beihang University, Chengru Song Kuaishou Technology, Jin Ouyang Kuaishou Inc., Renyu Yang
16:40
10m
Talk
BitsAI-Fix: LLM-Driven Approach for Automated Lint Error Resolution in Practice
Industry Showcase
Yuanpeng Li ByteDance, Qi Long Carnegie Mellon University, Zhiyuan Yao Zhejiang University, Jian Xu ByteDance, Lintao Xie ByteDance, Xu He ByteDance, Lu Geng ByteDance, Xin Han ByteDance, Yueyan Chen ByteDance, Wenbo Duan ByteDance
16:50
10m
Talk
TrioXpert: An automated incident management framework for microservice system
Industry Showcase
Yongqian Sun Nankai University, Yu Luo Nankai University, Xidao Wen BizSeer, Yuan Yuan National University of Defense Technology, China, Xiaohui Nie Computer Network Information Center at Chinese Academy of Sciences, Shenglin Zhang Nankai University, Tong Liu Lenovo (TianJin) Co., Ltd., Xi Luo Lenovo (TianJin) Co., Ltd.
17:00
10m
Talk
eARCO: Efficient Automated Root Cause Analysis with Prompt Optimization
Industry Showcase
Drishti Goel University of Illinois Urbana Champaign, Raghav Magazine Microsoft Research, Supriyo Ghosh Microsoft, Akshay Nambi Microsoft Research, Prathamesh Deshpande Microsoft, Xuchao Zhang Microsoft, Chetan Bansal Microsoft Research, Saravan Rajmohan Microsoft