Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems (ASE 2022 - Research Papers)

Who

Zilong He, Pengfei Chen, Yu Luo, Qiuyu Yan, Hongyang Chen, Guangba Yu, Fangyuan Li

Track

ASE 2022 Research Papers

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 12 Oct 2022 17:10 - 17:30 at Gold A - Technical Session 20 - Web, Cloud, Networking Chair(s): Karine Even-Mendoza

Abstract

With the ever-increasing scale and complexity of online systems, incidents are becoming commonplace. Without appropriate handling, they can seriously harm system availability. However, in large-scale online systems, these incidents are usually drowned in a slew of issues (i.e., something abnormal, but not necessarily an incident), which makes them difficult to handle. Typically, these issues cause a cascading effect across the system, and proper management of the incidents depends heavily on a thorough analysis of this effect. Therefore, in this paper, we propose a method to automatically analyze the cascading effect of availability issues in online systems and extract the corresponding graph-based issue representations, which incorporate both the issue symptoms and the affected service attributes. With the extracted representations, we train and use a graph neural network-based model to perform incident detection. Then, for each detected incident, we leverage the PageRank algorithm with a flexible transition matrix design to locate its root cause. We evaluate our approach using real-world data collected from a very large instant messaging company. The results confirm the effectiveness of our approach. Moreover, our approach has been successfully deployed in the company and eases the burden on operators facing a flood of issues and related alert signals.
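The root-cause step mentioned in the abstract can be illustrated with a small sketch. The abstract does not describe the paper's actual transition matrix, so everything below is an assumption made for illustration: the call graph, the anomaly scores, the function name `rank_root_causes`, and the choice to weight each caller-to-callee transition by the callee's anomaly score. It shows only the general idea of running PageRank over a service graph with a custom transition matrix, in pure Python.

```python
from typing import Dict, List, Tuple


def rank_root_causes(
    calls: Dict[str, List[str]],
    anomaly: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 100,
) -> List[Tuple[str, float]]:
    """Rank services as root-cause candidates with PageRank.

    Hypothetical design (not taken from the paper): the walker moves from a
    caller to one of its callees with probability proportional to the
    callee's anomaly score.
    """
    services = sorted(set(calls) | {t for ts in calls.values() for t in ts})
    n = len(services)

    # Build a row-stochastic transition matrix as nested dicts.
    transition: Dict[str, Dict[str, float]] = {}
    for s in services:
        callees = calls.get(s, [])
        weights = {t: anomaly.get(t, 0.0) for t in callees}
        total = sum(weights.values())
        if total > 0:
            # Weight each callee by its anomaly score.
            transition[s] = {t: w / total for t, w in weights.items()}
        elif callees:
            # No anomalous callee: choose among the callees uniformly.
            transition[s] = {t: 1.0 / len(weights) for t in weights}
        else:
            # Dangling node: jump to any service uniformly.
            transition[s] = {t: 1.0 / n for t in services}

    # Power iteration with uniform teleportation.
    rank = {s: 1.0 / n for s in services}
    for _ in range(iterations):
        nxt = {s: (1.0 - damping) / n for s in services}
        for s, row in transition.items():
            for t, p in row.items():
                nxt[t] += damping * rank[s] * p
        rank = nxt

    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)


# Made-up example: gateway calls auth and orders; orders calls db.
calls = {"gateway": ["auth", "orders"], "orders": ["db"]}
anomaly = {"gateway": 0.6, "auth": 0.1, "orders": 0.8, "db": 0.9}
ranked = rank_root_causes(calls, anomaly)
for service, score in ranked:
    print(f"{service}: {score:.3f}")
```

In this toy graph the walk follows calls toward the most anomalous dependencies, so the service at the end of the anomalous call chain ends up with the highest score. A real deployment would need a transition design informed by the issue symptoms and service attributes the abstract refers to.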

Zilong He

Sun Yat-Sen University

Pengfei Chen

Sun Yat-Sen University

Yu Luo

Tencent Inc.

Qiuyu Yan

Tencent Inc.

Hongyang Chen

School of Computer Science and Engineering, Sun Yat-sen University

Guangba Yu

Sun Yat-Sen University

Fangyuan Li

Tencent Inc.

Session Program

Wed 12 Oct
Displayed time zone: Eastern Time (US & Canada)

16:00 - 18:00	Technical Session 20 - Web, Cloud, Networking (Journal-first Papers / Late Breaking Results / Research Papers / Tool Demonstrations / Industry Showcase) at Gold A. Chair(s): Karine Even-Mendoza, Imperial College London

16:00 20m Paper		Mutation-based Analysis of Queueing Network Performance Models -- Journal First Research Journal-first Papers Thomas Laurent Lero & University College Dublin, Paolo Arcaini National Institute of Informatics, Catia Trubiani Gran Sasso Science Institute, Anthony Ventresque University College Dublin & Lero, Ireland
16:20 10m Demonstration		WebMonitor: https://youtu.be/hqVw0JU3k9c Tool Demonstrations Ennio Visconti TU Wien, Christos Tsigkanos University of Bern, Switzerland, Laura Nenzi University of Trieste
16:30 20m Research paper		Exploiting Epochs and Symmetries in Analysing MPI Programs Research Papers Rishabh Ranjan IIT Delhi, Ishita Agrawal IIT Delhi, Subodh Sharma IIT Delhi
16:50 20m Paper		MLASP: Machine learning assisted capacity planning Journal-first Papers Arthur Vitui Concordia University, Tse-Hsun (Peter) Chen Concordia University
17:10 20m Research paper		Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems (Virtual) Research Papers Zilong He Sun Yat-Sen University, Pengfei Chen Sun Yat-Sen University, Yu Luo Tencent Inc., Qiuyu Yan Tencent Inc., Hongyang Chen School of Computer Science and Engineering, Sun Yat-sen University, Guangba Yu Sun Yat-Sen University, Fangyuan Li Tencent Inc.
17:30 10m Paper		ESAVE: Estimating Server and Virtual Machine Energy (Virtual) Late Breaking Results Priyavanshi Pathania Accenture Labs, Rohit Mehra Accenture Labs, Vibhu Saujanya Sharma Accenture Labs, Vikrant Kaulgud Accenture Labs, India, Sanjay Podder Accenture, Adam P. Burden Accenture
17:40 20m Industry talk		MCDA Framework for Edge-Aware Multi-Cloud Hybrid Architecture Recommendation (Virtual) Industry Showcase Manish Ahuja Accenture Labs, Narendranath Sukhavasi Accenture Labs, Swapnajeet Choudhury Accenture Labs, Kaushik Amar Das Accenture Labs, Kapil Singi Accenture, Kuntal Dey Accenture Labs, India, Vikrant Kaulgud Accenture Labs, India