ICSE 2021
Sun 16 May - Sat 5 June 2021

This program is tentative and subject to change.

Cloud-based services have surged in popularity in recent years. However, outages, i.e., severe incidents that always impact multiple services, can dramatically affect user experience and incur severe economic losses. Locating the root cause service, i.e., the service where the propagation of the anomaly originates, is a crucial step in mitigating the impact of an outage. In current industrial practice, this is generally performed in a bootstrap manner that largely depends on human effort: a candidate service that directly causes the outage is identified first, and the suspected root cause is then traced back manually from service to service during diagnosis until the actual root cause is found. Unfortunately, production clouds typically contain a large number of interdependent services, so such manual root cause analysis is often time-consuming and labor-intensive. In this work, we propose COT, the first correlation-based outage triage approach, which constructs a global view of service correlations. COT mines the correlations among the performance indicators collected from hundreds of services. After learning from historical outages, COT can infer the root cause of emerging ones accordingly. We implement COT and evaluate it on a real-world dataset containing one year of data collected from production Cloud A, one of the representative cloud computing platforms around the world. Our experimental results show that COT can reach a triage accuracy of 82.1∼83.5%, which outperforms the state-of-the-art triage approach by 28.0∼29.7%.

Thu 27 May
Times are displayed in time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

15:05 - 16:05
3.3.1. Monitoring Cloud-Based Services
Technical Track / SEIP - Software Engineering in Practice at Blended Sessions Room 1 +12h
Chair(s): Andrea Zisman (The Open University)
15:05
20m
Paper
Fast Outage Analysis of Large-scale Production Clouds with Service Correlation Mining
Technical Track
Yaohui Wang (Fudan University), Guozheng Li (Peking University), Zijian Wang (Fudan University), Yu Kang (Microsoft Research, Beijing, China), Yangfan Zhou (Fudan University), Hongyu Zhang (The University of Newcastle), Feng Gao (Microsoft Azure), Jeffrey Sun (Microsoft Azure), Li Yang (Microsoft Azure), Pochian Lee (Microsoft Azure), Zhangwei Xu (Microsoft Azure), Pu Zhao (Microsoft Research, Beijing, China), Bo Qiao (Microsoft Research, Beijing, China), Liqun Li (Microsoft Research, Beijing, China), Xu Zhang (Microsoft Research, Beijing, China), Qingwei Lin (Microsoft Research, Beijing, China)
Pre-print
15:25
20m
Paper
Neural Knowledge Extraction From Cloud Service Incidents
SEIP - Software Engineering in Practice
Manish Shetty (Microsoft Research, India), Chetan Bansal (Microsoft Research), Sumit Kumar (Microsoft), Nikitha Rao (Microsoft Research), Nachiappan Nagappan (Microsoft Research), Thomas Zimmermann (Microsoft Research)
Pre-print
15:45
20m
Paper
FIXME: Enhance Software Reliability with Hybrid Approaches in Cloud
SEIP - Software Engineering in Practice
Jinho Hwang (IBM Research), Larisa Shwartz (IBM), Qing Wang (Institute of Software, Chinese Academy of Sciences), Raghav Batta (IBM), Harshit Kumar (IBM), Michael Nidd (IBM)
Pre-print

Fri 28 May
Times are displayed in time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

03:05 - 04:05
03:05
20m
Paper
Fast Outage Analysis of Large-scale Production Clouds with Service Correlation Mining
Technical Track
Yaohui Wang (Fudan University), Guozheng Li (Peking University), Zijian Wang (Fudan University), Yu Kang (Microsoft Research, Beijing, China), Yangfan Zhou (Fudan University), Hongyu Zhang (The University of Newcastle), Feng Gao (Microsoft Azure), Jeffrey Sun (Microsoft Azure), Li Yang (Microsoft Azure), Pochian Lee (Microsoft Azure), Zhangwei Xu (Microsoft Azure), Pu Zhao (Microsoft Research, Beijing, China), Bo Qiao (Microsoft Research, Beijing, China), Liqun Li (Microsoft Research, Beijing, China), Xu Zhang (Microsoft Research, Beijing, China), Qingwei Lin (Microsoft Research, Beijing, China)
Pre-print
03:25
20m
Paper
Neural Knowledge Extraction From Cloud Service Incidents
SEIP - Software Engineering in Practice
Manish Shetty (Microsoft Research, India), Chetan Bansal (Microsoft Research), Sumit Kumar (Microsoft), Nikitha Rao (Microsoft Research), Nachiappan Nagappan (Microsoft Research), Thomas Zimmermann (Microsoft Research)
Pre-print
03:45
20m
Paper
FIXME: Enhance Software Reliability with Hybrid Approaches in Cloud
SEIP - Software Engineering in Practice
Jinho Hwang (IBM Research), Larisa Shwartz (IBM), Qing Wang (Institute of Software, Chinese Academy of Sciences), Raghav Batta (IBM), Harshit Kumar (IBM), Michael Nidd (IBM)
Pre-print
