ICSE 2021
Mon 17 May - Sat 5 June 2021

Cloud-based services have surged in popularity in recent years. However, outages, i.e., severe incidents that always impact multiple services, can dramatically degrade user experience and incur severe economic loss. Locating the root cause service, i.e., the service where the propagation of the anomaly originates, is a crucial step in mitigating the impact of an outage. In current industrial practice, this is generally performed in a bootstrap manner that depends largely on human effort: a candidate service that directly causes the outage is identified first, and the suspected root cause is then traced back manually from service to service during diagnosis until the actual root cause is found. Unfortunately, production clouds typically contain a large number of interdependent services, so such manual root cause analysis is often time-consuming and labor-intensive. In this work, we propose COT, the first correlation-based outage triage approach, which constructs a global view of service correlations. COT mines the correlations among performance indicators collected from hundreds of services. After learning from historical outages, COT can infer the root cause of emerging ones accordingly. We implement COT and evaluate it on a real-world dataset containing one year of data collected from a production cloud, Cloud A, one of the representative cloud computing platforms around the world. Our experimental results show that COT reaches a triage accuracy of 82.1∼83.5%, which outperforms the state-of-the-art triage approach by 28.0∼29.7%.
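To make the idea of mining correlations among per-service performance indicators concrete, here is a minimal Python sketch. It is not COT's algorithm, which the abstract does not specify. It assumes one numeric indicator series per service and uses plain Pearson correlation with a fixed threshold. The function names, the threshold value, and the sample data are all hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def correlation_graph(indicators, threshold=0.8):
    """Return {(service_a, service_b): r} for strongly correlated pairs.

    `indicators` maps a service name to its indicator time series.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    names = sorted(indicators)
    edges = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(indicators[a], indicators[b])
            if abs(r) >= threshold:
                edges[(a, b)] = r
    return edges


# Hypothetical indicator series (e.g. error counts per time window).
indicators = {
    "storage": [1, 2, 8, 9, 3, 2],
    "compute": [2, 3, 9, 10, 4, 3],
    "dns": [5, 5, 4, 5, 5, 4],
}
edges = correlation_graph(indicators)
```

In this toy data only the "compute" and "storage" series move together, so they form the single edge of the graph. A real triage system would still need to learn from labeled historical outages which correlated service is the root cause. That step is not sketched here.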

Thu 27 May

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

15:05 - 16:05
3.3.1. Monitoring Cloud-Based Services
Technical Track / SEIP - Software Engineering in Practice at Blended Sessions Room 1 +12h
Chair(s): Andrea Zisman The Open University
15:05
20m
Paper
Fast Outage Analysis of Large-scale Production Clouds with Service Correlation Mining
Technical Track
Yaohui Wang Fudan University, Guozheng Li Peking University, Zijian Wang Fudan University, Yu Kang Microsoft Research, Beijing, China, Yangfan Zhou Fudan University, Hongyu Zhang The University of Newcastle, Feng Gao Microsoft Azure, Jeffrey Sun Microsoft Azure, Li Yang Microsoft Azure, Pochian Lee Microsoft Azure, Zhangwei Xu Microsoft Azure, Pu Zhao Microsoft Research, Beijing, China, Bo Qiao Microsoft Research, Beijing, China, Liqun Li Microsoft Research, Beijing, China, Xu Zhang Microsoft Research, Beijing, China, Qingwei Lin Microsoft Research, Beijing, China
15:25
20m
Paper
Neural Knowledge Extraction From Cloud Service Incidents
SEIP - Software Engineering in Practice
Manish Shetty Microsoft Research, India, Chetan Bansal Microsoft Research, Sumit Kumar Microsoft, Nikitha Rao Microsoft Research, Nachiappan Nagappan Microsoft Research, Thomas Zimmermann Microsoft Research
15:45
20m
Paper
FIXME: Enhance Software Reliability with Hybrid Approaches in Cloud
SEIP - Software Engineering in Practice
Jinho Hwang IBM Research, Larisa Shwartz IBM, Qing Wang Institute of Software, Chinese Academy of Sciences, Raghav Batta IBM, Harshit Kumar IBM, Michael Nidd IBM

Fri 28 May

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

03:05 - 04:05
03:05
20m
Paper
Fast Outage Analysis of Large-scale Production Clouds with Service Correlation Mining
Technical Track
Yaohui Wang Fudan University, Guozheng Li Peking University, Zijian Wang Fudan University, Yu Kang Microsoft Research, Beijing, China, Yangfan Zhou Fudan University, Hongyu Zhang The University of Newcastle, Feng Gao Microsoft Azure, Jeffrey Sun Microsoft Azure, Li Yang Microsoft Azure, Pochian Lee Microsoft Azure, Zhangwei Xu Microsoft Azure, Pu Zhao Microsoft Research, Beijing, China, Bo Qiao Microsoft Research, Beijing, China, Liqun Li Microsoft Research, Beijing, China, Xu Zhang Microsoft Research, Beijing, China, Qingwei Lin Microsoft Research, Beijing, China
03:25
20m
Paper
Neural Knowledge Extraction From Cloud Service Incidents
SEIP - Software Engineering in Practice
Manish Shetty Microsoft Research, India, Chetan Bansal Microsoft Research, Sumit Kumar Microsoft, Nikitha Rao Microsoft Research, Nachiappan Nagappan Microsoft Research, Thomas Zimmermann Microsoft Research
03:45
20m
Paper
FIXME: Enhance Software Reliability with Hybrid Approaches in Cloud
SEIP - Software Engineering in Practice
Jinho Hwang IBM Research, Larisa Shwartz IBM, Qing Wang Institute of Software, Chinese Academy of Sciences, Raghav Batta IBM, Harshit Kumar IBM, Michael Nidd IBM