ASE 2025
Sun 16 - Thu 20 November 2025 Seoul, South Korea

This program is tentative and subject to change.

Mon 17 Nov 2025 16:30 - 16:40 at Grand Hall 1 - Program Repair 3

Large-scale distributed training is the driver of modern artificial intelligence, but its performance is critically vulnerable to “stragglers”, i.e., workers that slow down the entire process, leading to immense resource waste in industrial settings. Existing profiling tools focus on single-node analysis and lack the global perspective needed to identify systemic bottlenecks in distributed settings. In addition, recently emerging performance analysis tools can detect stragglers but lack the diagnostic insights that engineers need to determine root causes. We introduce KAIR, a production-grade observability system that fills this diagnostic gap. KAIR is built on two core principles: (1) a low-overhead, scalable architecture to collect and aggregate fine-grained traces from thousands of nodes and (2) a novel hierarchical analysis engine that moves from statistical anomaly detection to causal inference. KAIR goes beyond simple timeline visualization by introducing a suite of techniques, including Kolmogorov-Smirnov statistics to identify divergent or unhealthy nodes and z-score analysis to pinpoint anomalous operators. This allows KAIR to pinpoint not only which worker is a straggler, but also the specific root cause responsible for the delay. Evaluations were performed in Kuaishou production clusters. The experimental results indicate that KAIR is highly effective in identifying latent performance bottlenecks overlooked by conventional tools. It incurs negligible overhead, scales to large clusters, and provides actionable insights that have significantly reduced computational waste and engineering effort.
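
The two statistical techniques named in the abstract can be illustrated with a minimal sketch. This is not KAIR's implementation, which the abstract does not describe. It assumes per-worker step times and per-operator durations are already collected. All function names, the sample data, and both thresholds (0.5 for the Kolmogorov-Smirnov statistic, 3.0 for the z-score) are hypothetical choices made for illustration only.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The largest gap is always attained at one of the observed values.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def divergent_workers(step_times, threshold=0.5):
    """Flag workers whose step-time distribution differs from all other workers pooled."""
    flagged = {}
    for worker, times in step_times.items():
        rest = [t for w, ts in step_times.items() if w != worker for t in ts]
        d = ks_statistic(times, rest)
        if d > threshold:  # illustrative cutoff, not a calibrated significance level
            flagged[worker] = d
    return flagged


def anomalous_operators(op_times, peer_op_times, z_threshold=3.0):
    """Flag operators on one worker that are much slower than the same operators on peers."""
    flagged = {}
    for op, duration in op_times.items():
        peers = peer_op_times[op]
        mu, sigma = mean(peers), pstdev(peers)
        if sigma == 0:
            continue  # no variation among peers, z-score is undefined
        z = (duration - mu) / sigma
        if z > z_threshold:
            flagged[op] = z
    return flagged


# Hypothetical per-step wall-clock times in seconds for four workers.
step_times = {
    "worker-0": [1.00, 1.02, 0.99, 1.01],
    "worker-1": [1.01, 0.98, 1.00, 1.03],
    "worker-2": [1.45, 1.50, 1.48, 1.52],
    "worker-3": [0.99, 1.01, 1.02, 1.00],
}

# Hypothetical operator durations on the suspect worker and on its healthy peers.
op_times = {"all_reduce": 0.40, "matmul": 0.60}
peer_op_times = {
    "all_reduce": [0.10, 0.11, 0.09, 0.10],
    "matmul": [0.60, 0.61, 0.59, 0.60],
}

print(divergent_workers(step_times))
print(anomalous_operators(op_times, peer_op_times))
```

The sketch mirrors the two stages the abstract describes: first narrow the search to a divergent worker by comparing whole distributions, then look inside that worker for the operator whose duration is an outlier relative to its peers. The causal inference stage mentioned in the abstract is not covered here.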


Mon 17 Nov

Displayed time zone: Seoul

16:00 - 17:10
16:00
10m
Talk
Interaction-Aware Patch Assessment for Multi-Fault Automated Program Repair
NIER Track
Omar I. Al-Bataineh Gran Sasso Science Institute (GSSI)
16:10
10m
Talk
Simulated Interactive Debugging
NIER Track
Yannic Noller Ruhr University Bochum, Erick Chandra Singapore University of Technology and Design, Srinidhi HC Singapore University of Technology and Design, Kenny Choo Singapore University of Technology and Design, Cyrille Jegourel ISTD, Singapore University of Technology and Design, Oka Kurniawan Singapore University of Technology and Design, Chris Poskitt Singapore Management University
Pre-print
16:20
10m
Talk
KAIOps: A Platform Solution of End-to-End Multi-Modal AIOps for AI Training at Scale
Industry Showcase
Zeying Wang Beihang University, Junhong Liu Beihang University, Penghao Zhang Kuaishou Inc., Xiaoyang Sun University of Leeds, Xu Wang Beihang University, Tianyu Wo, Chunming Hu Beihang University, Chengru Song Kuaishou Technology, Jin Ouyang Kuaishou Inc., Renyu Yang
16:30
10m
Talk
KAIR: A Statistical and Causal Approach to Pinpointing Stragglers in Distributed Model Training
Industry Showcase
Yitang Yang Beihang University, Junhong Liu Beihang University, Jiapeng Chen Kuaishou Inc., Xiaoyang Sun University of Leeds, Tianyu Wo, Chunming Hu Beihang University, Chengru Song Kuaishou Technology, Jin Ouyang Kuaishou Inc., Renyu Yang
16:40
10m
Talk
BitsAI-Fix: LLM-Driven Approach for Automated Lint Error Resolution in Practice
Industry Showcase
Yuanpeng Li ByteDance, Qi Long Carnegie Mellon University, Zhiyuan Yao Zhejiang University, Jian Xu ByteDance, Lintao Xie ByteDance, Xu He ByteDance, Lu Geng ByteDance, Xin Han ByteDance, Yueyan Chen ByteDance, Wenbo Duan ByteDance
16:50
10m
Talk
TrioXpert: An automated incident management framework for microservice system
Industry Showcase
Yongqian Sun Nankai University, Yu Luo Nankai University, Xidao Wen BizSeer, Yuan Yuan National University of Defense Technology, China, Xiaohui Nie Computer Network Information Center at Chinese Academy of Sciences, Shenglin Zhang Nankai University, Tong Liu Lenovo (TianJin) Co., Ltd., Xi Luo Lenovo (TianJin) Co., Ltd.
17:00
10m
Talk
eARCO: Efficient Automated Root Cause Analysis with Prompt Optimization
Industry Showcase
Drishti Goel University of Illinois Urbana Champaign, Raghav Magazine Microsoft Research, Supriyo Ghosh Microsoft, Akshay Nambi Microsoft Research, Prathamesh Deshpande Microsoft, Xuchao Zhang Microsoft, Chetan Bansal Microsoft Research, Saravan Rajmohan Microsoft