Orchestrating Cross-Layer Anomaly Detection and Mitigation to Address Gray Failures in Large-Scale Cloud Infrastructure (AIOps 2025)

Who

Ze Li, Chang Lou, Vignatha Yenugutala, Vivek Ramamurthy, Eion Blanchard, Minghua Ma, Murali Chintalapati

Track

AIOps 2025 AI for Cloud Service

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Sat 3 May 2025 14:56 - 15:12 at 210 - Session 3 Chair(s): Jian Zhang

Abstract

Cloud infrastructure in production constantly experiences gray failures: a degraded state in which failures go undetected by system mechanisms, yet adversely affect end-users. Addressing the underlying anomalies on host nodes is crucial to address gray failures. However, current approaches suffer from two key limitations: first, existing detection relies solely on singular-dimension signals from hosts, thus often suffering from biased views due to differential observability; second, existing mitigation actions are often insufficient, primarily consisting of host-level operations such as reboots, which leave most production issues to manual intervention. This paper presents PANACEA, a holistic framework to automatically detect and mitigate host anomalies, addressing gray failures in production cloud infrastructure. PANACEA expands beyond host-level scope: it aggregates and correlates insights from VMs and application layers to bridge the detection gap, and orchestrates fine-grained and safe mitigation across all levels. PANACEA is versatile, designed to support a wide range of anomalies. It has been deployed in production on millions of hosts.

Ze Li

Microsoft Azure

United States

Chang Lou

University of Virginia

United States

Vignatha Yenugutala

Microsoft Azure

United States

Vivek Ramamurthy

Microsoft Azure

United States

Eion Blanchard

Microsoft Azure

United States

Minghua Ma

Microsoft

United States

Murali Chintalapati

Microsoft Azure

United States

Session Program

Sat 3 May
Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30	Session 3, AIOps, at 210. Chair(s): Jian Zhang (Microsoft)

14:00 40m Keynote		Keynote2: AIOps in the Era of Large Language Models AIOps Wahab Hamou-Lhadj Concordia University, Montreal, Canada
14:40 16m Talk		Automated Lifting for Cloud Infrastructure-as-Code Programs AIOps Jingjia Peng University of Michigan, Yiming Qiu University of Michigan, Patrick Tser Jern Kon University of Michigan, Pinhan Zhao University of Michigan, Yibo Huang University of Michigan, Zheng Guo University of California, San Diego, Xinyu Wang University of Michigan, Ang Chen University of Michigan
14:56 16m Talk		Orchestrating Cross-Layer Anomaly Detection and Mitigation to Address Gray Failures in Large-Scale Cloud Infrastructure AIOps Ze Li Microsoft Azure, Chang Lou University of Virginia, Vignatha Yenugutala Microsoft Azure, Vivek Ramamurthy Microsoft Azure, Eion Blanchard Microsoft Azure, Minghua Ma Microsoft, Murali Chintalapati Microsoft Azure
15:12 18m Talk		Automated Service Design with Cerulean (Project Showcase) AIOps Vaastav Anand, Alok Kumbhare Microsoft Research, n.n., Celine Irvene Microsoft, Chetan Bansal Microsoft Research, Gagan Somashekar Microsoft, Jonathan Mace Microsoft, Pedro Las-Casas Microsoft, Rodrigo Fonseca Microsoft Research