Aloha: Localizing Batch Failures in Large-scale Cloud Systems via Contrast Analysis and Human-in-the-Loop Agent
This program is tentative and subject to change.
Large-scale cloud systems underpin modern computing, hosting diverse components to deliver critical services worldwide. A single fault, such as an outage or misconfiguration, can simultaneously disrupt thousands of users. Such large-scale faults, referred to as batch failures, are characterized by many affected instances across the same subject within a short time window, typically stemming from a shared root cause. Handling these failures efficiently requires anomaly localization, but existing approaches offer insufficient support to engineers, making the process time-consuming and cognitively demanding. To address this, we propose Aloha, a human-in-the-loop agent framework for anomaly localization based on contrast analysis. Aloha operationalizes the entire batch failure handling pipeline, providing scenario- and data-aware guidance along with interpretable root-cause patterns for engineers. Pilots on real-world batch failure cases in Microsoft’s cloud show that Aloha streamlines data handling, supports contrast-based anomaly localization, and makes the process more practical and accessible, offering a promising step toward human-centered, scalable failure management in large-scale cloud systems.
Wed 8 Jul
Displayed time zone: Eastern Time (US & Canada)
10:30 - 12:30
10:30 | 20m | Talk | Aloha: Localizing Batch Failures in Large-scale Cloud Systems via Contrast Analysis and Human-in-the-Loop Agent | Industry Papers | Shenglin Zhang Nankai University, Yujia Wu Nankai University, Jinghuan Ren Nankai University, College of Software, Yongqian Sun Nankai University, Wenwei Gu Nankai University, Chaoyun Zhang Microsoft, Liqun Li Microsoft Research, Qingwei Lin Microsoft, Dongmei Zhang Microsoft, Saravanakumar Rajmohan Microsoft 365, Chetan Bansal Microsoft Research, Minghua Ma Microsoft
10:50 | 20m | Talk | Attention Enhanced Entity Recommendation for Intelligent Monitoring in Cloud Systems | Industry Papers | Fiza Husain Independent, Anson Bastos Microsoft, Anjaly Parayil Microsoft, Ayush Choure Independent, Chetan Bansal Microsoft Research, Rujia Wang Microsoft, Saravanakumar Rajmohan Microsoft 365
11:10 | 20m | Talk | An Agentic Framework for Triaging Incidents in Production Cloud Infrastructure | Industry Papers | Yuhan Yao Microsoft, Yuxuan Jiang University of Michigan Ann-Arbor, Minghua Ma Microsoft, Madhura Vaidya Microsoft, Jieren Deng Microsoft, Yigong Hu Boston University, Chetan Bansal Microsoft Research, Ze Li Microsoft Azure, Murali Chintalapati Microsoft Azure
11:30 | 20m | Talk | TSGuard: Automated User-Centric Incident Diagnosis for AI Workloads in the Cloud | Research Papers | Yitao Yang The Chinese University of Hong Kong, Yangtao Deng The Chinese University of Hong Kong, Yifan Xiong Microsoft Research, Baochun Li University of Toronto, Hong Xu The Chinese University of Hong Kong, Peng Cheng Microsoft Research Asia
11:50 | 20m | Talk | Exploring the impact of cloud computing on software architecture for sustainability: A practitioners' perspective | Journal-First Paper
12:10 | 20m | Talk | AccessRefinery: Fast Mining Concise Access Control Intents on Public Cloud | Research Papers | Ning Kang Xi'an Jiaotong University, Peng Zhang Xi'an Jiaotong University, Jianyuan Zhang Xi'an Jiaotong University, Hao Li Xi'an Jiaotong University, Dan Wang Xi'an Jiaotong University, Zhenrong Gu Xi'an Jiaotong University, Weibo Lin Huawei Cloud, Shibiao Jiang Huawei Cloud, Zhu He Huawei Cloud, Xu Du Huawei Cloud, Longfei Chen Huawei Cloud, Jun Li Huawei, Xiaohong Guan Xi'an Jiaotong University