Sat 3 May 2025 14:56 - 15:12 at 210 - Session 3 Chair(s): Jian Zhang

Cloud infrastructure in production constantly experiences gray failures: a degraded state in which failures go undetected by system mechanisms yet adversely affect end users. Resolving the underlying anomalies on host nodes is crucial to addressing gray failures. However, current approaches suffer from two key limitations. First, existing detection relies solely on single-dimension signals from hosts, and so often suffers from biased views due to differential observability. Second, existing mitigation actions are often insufficient, consisting primarily of host-level operations such as reboots, which leaves most production issues to manual intervention. This paper presents PANACEA, a holistic framework that automatically detects and mitigates host anomalies to address gray failures in production cloud infrastructure. PANACEA expands beyond the host-level scope: it aggregates and correlates insights from the VM and application layers to bridge the detection gap, and orchestrates fine-grained and safe mitigation across all levels. PANACEA is versatile, designed to support a wide range of anomalies. It has been deployed in production on millions of hosts.

Sat 3 May

Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30
Session 3: AIOps at 210
Chair(s): Jian Zhang Microsoft
14:00
40m
Keynote
Keynote2: AIOps in the Era of Large Language Models
AIOps
Wahab Hamou-Lhadj Concordia University, Montreal, Canada
14:40
16m
Talk
Automated Lifting for Cloud Infrastructure-as-Code Programs
AIOps
Jingjia Peng University of Michigan, Yiming Qiu University of Michigan, Patrick Tser Jern Kon University of Michigan, Pinhan Zhao University of Michigan, Yibo Huang University of Michigan, Zheng Guo University of California, San Diego, Xinyu Wang University of Michigan, Ang Chen University of Michigan
14:56
16m
Talk
Orchestrating Cross-Layer Anomaly Detection and Mitigation to Address Gray Failures in Large-Scale Cloud Infrastructure
AIOps
Ze Li Microsoft Azure, Chang Lou University of Virginia, Vignatha Yenugutala Microsoft Azure, Vivek Ramamurthy Microsoft Azure, Eion Blanchard Microsoft Azure, Minghua Ma Microsoft, Murali Chintalapati Microsoft Azure
15:12
18m
Talk
Automated Service Design with Cerulean (Project Showcase)
AIOps
Vaastav Anand, Alok Kumbhare Microsoft Research, n.n., Celine Irvene Microsoft, Chetan Bansal Microsoft Research, Gagan Somashekar Microsoft, Jonathan Mace Microsoft, Pedro Las-Casas Microsoft, Rodrigo Fonseca Microsoft Research