ASE 2023
Mon 11 - Fri 15 September 2023 Kirchberg, Luxembourg
Thu 14 Sep 2023 15:55 - 16:08 at Room D - Configuration and Version Management Chair(s): Shahar Maoz

Modern cloud services are prone to failures due to their complex architecture, making diagnosis a critical process. Site Reliability Engineers (SREs) spend hours leveraging multiple sources of data, including alerts, error logs, and domain expertise gained through past experience, to locate the root cause(s). These experiences are documented as natural language text in outage reports for previous outages. However, systematically utilizing the raw yet rich semi-structured information in these reports is time-consuming. Structured information, on the other hand, such as the alerts often used during fault diagnosis, is voluminous and requires expert knowledge to discern. Several strategies have been proposed to use each source of data separately for root cause analysis. In this work, we build a diagnostic service called ESRO that recommends root causes and remediation for failures by systematically utilizing structured as well as semi-structured sources of data. ESRO constructs a causal graph using alerts and a knowledge graph using outage reports, and merges them in a novel way to form a unified graph during training. At inference time, a retrieval-based mechanism searches the unified graph and ranks the likely root causes and remediation techniques based on the alerts fired during an outage. The recommendation takes into account not only the individual alerts but also their respective importance in predicting an outage group. We evaluated our model on several cloud service outages of a large SaaS enterprise over the course of ∼2 years, and obtained an average improvement of 27% in ROUGE scores over state-of-the-art baselines when comparing the likely root causes against the ground truth. We further establish the effectiveness of ESRO through qualitative analysis on multiple real outage examples.
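
The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not ESRO's implementation: the abstract does not specify the scoring function or the graph representation, so the sketch assumes a weighted Jaccard overlap between the fired alerts and the alerts linked to each past outage, and it skips the causal graph and knowledge graph entirely. The names `PastOutage`, `weighted_overlap`, `rank_root_causes`, and the `importance` weights are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PastOutage:
    """One historical outage (hypothetical record layout, not ESRO's schema)."""
    outage_id: str
    alerts: frozenset[str]   # alerts that fired during this outage
    root_cause: str          # text taken from the outage report
    remediation: str         # text taken from the outage report


def weighted_overlap(fired: set[str], past_alerts: frozenset[str],
                     importance: dict[str, float]) -> float:
    """Weighted Jaccard similarity between two alert sets.

    Alerts missing from `importance` get a default weight of 1.0.
    This scoring rule is an assumption made for illustration.
    """
    union = fired | past_alerts
    total = sum(importance.get(a, 1.0) for a in union)
    if total == 0:
        return 0.0
    shared = sum(importance.get(a, 1.0) for a in fired & past_alerts)
    return shared / total


def rank_root_causes(fired: set[str], history: list[PastOutage],
                     importance: dict[str, float], top_k: int = 3):
    """Return up to `top_k` (score, root_cause, remediation) tuples, best first."""
    scored = [(weighted_overlap(fired, o.alerts, importance), o) for o in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(s, o.root_cause, o.remediation) for s, o in scored[:top_k] if s > 0]
```

Given a list of `PastOutage` records and the set of alerts currently firing, `rank_root_causes` returns the past root causes whose alert patterns overlap most with the current one, so a heavily weighted alert influences the ranking more than a noisy, low-weight one.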

ESRO_ppt (Ase 2023 Esro.pptx) 2.93 MiB
ESRO: EXPERIENCE ASSISTED SERVICE RELIABILITY AGAINST OUTAGES (ESRO_ASE23_paper.pdf) 6.8 MiB

Thu 14 Sep

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

15:30 - 17:00
Configuration and Version Management (Research Papers) at Room D
Chair(s): Shahar Maoz Tel Aviv University
15:30
12m
Talk
A Large-Scale Empirical Study on Semantic Versioning in Golang Ecosystem
Research Papers
Wenke Li Huazhong University of Science and Technology, Feng Wu Tencent Technology (Shenzhen) Co. Ltd, Cai Fu Huazhong University of Science and Technology, Fan Zhou Tencent Technology (Shenzhen) Co. Ltd
15:42
12m
Talk
Where to Go Now? Finding Alternatives for Declining Packages in the npm Ecosystem
Research Papers
Suhaib Mujahid Mozilla, Diego Costa Concordia University, Canada, Rabe Abdalkareem Omar Al-Mukhtar University, Emad Shihab Concordia University
15:55
12m
Talk
ESRO: Experience Assisted Service Reliability against Outages
Research Papers
Sarthak Chakraborty Adobe Research, Shubham Agarwal Adobe Research, Shaddy Garg Adobe, Abhimanyu Sethia Indian Institute of Technology Kanpur, Udit Narayan Pandey Indian Institute of Technology Kanpur, Videh Aggarwal Indian Institute of Technology Kanpur, Shiv Saini Adobe Research
16:08
12m
Talk
Fixing Privilege Escalations in Cloud Access Control with MaxSAT and Graph Neural Networks
Research Papers
Yang Hu University of Texas at Austin, Wenxi Wang University of Texas at Austin, Sarfraz Khurshid University of Texas at Austin, Kenneth L. McMillan University of Texas at Austin, Mohit Tiwari University of Texas at Austin
16:21
12m
Talk
Merge Conflict Resolution: Classification or Generation?
Research Papers
Jinhao Dong Peking University, Qihao Zhu Peking University, Zeyu Sun Zhongguancun Laboratory, Yiling Lou Fudan University, Dan Hao Peking University
16:34
12m
Talk
Repeated Builds During Code Review: An Empirical Study of the OpenStack Community
Research Papers
Rungroj Maipradit University of Waterloo, Dong Wang Kyushu University, Japan, Patanamon Thongtanunam University of Melbourne, Raula Gaikovina Kula Nara Institute of Science and Technology, Yasutaka Kamei Kyushu University, Shane McIntosh University of Waterloo
16:47
12m
Talk
Automated Software Entity Matching Between Successive Versions (Recorded talk)
Research Papers
Bo Liu Beijing Institute of Technology, Hui Liu Beijing Institute of Technology, Nan Niu University of Cincinnati, Yuxia Zhang Beijing Institute of Technology, Guangjie Li National Innovation Institute of Defense Technology, Yanjie Jiang Beijing Institute of Technology