ICSE 2024
Fri 12 - Sun 21 April 2024 Lisbon, Portugal

Postmortem analysis is essential to incident management in cloud systems, as it provides valuable insights for improving system reliability and robustness. At CloudA, fault pattern profiling is performed during the postmortem phase: the faults behind each incident are classified into unique categories, dubbed fault patterns. By aggregating and analyzing these fault patterns, engineers can discern common faults, vulnerable components, and prevailing trends. However, this process currently relies on manual labeling, which has inherent drawbacks. On the one hand, the sheer volume of incidents means that only the most severe ones are analyzed, which skews the overview of fault patterns. On the other hand, the complexity of the task demands extensive domain knowledge, which leads to errors and inconsistencies. To address these limitations, we propose an automated approach, named FaultProfIT, for Fault pattern Profiling of Incident Tickets. It leverages hierarchy-guided contrastive learning to train a hierarchy-aware incident encoder and predicts fault patterns from the enhanced incident representations. We evaluate FaultProfIT using production incidents from CloudA. The results demonstrate that FaultProfIT outperforms state-of-the-art methods. Our ablation study and analysis also verify the effectiveness of hierarchy-guided contrastive learning. Additionally, we have deployed FaultProfIT at CloudA for six months. To date, FaultProfIT has analyzed 10,000+ incidents from 30+ cloud services, successfully revealing several fault trends that have informed system improvements.
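As a rough illustration of the aggregation step mentioned in the abstract, the sketch below counts incidents per fault pattern at different levels of a label hierarchy. It is not FaultProfIT itself and does not reflect CloudA's taxonomy: the incident records, service names, fault pattern paths, and function names are all hypothetical, chosen only to show how hierarchical labels might be rolled up into an overview.

```python
from collections import Counter

# Hypothetical incident records: (service, fault pattern path, coarse to fine).
# These labels are invented for illustration and are not CloudA's taxonomy.
INCIDENTS = [
    ("storage", ("Software", "Configuration", "Wrong parameter")),
    ("storage", ("Software", "Code defect", "Null dereference")),
    ("network", ("Hardware", "NIC", "Link flapping")),
    ("network", ("Software", "Configuration", "Wrong parameter")),
    ("compute", ("Software", "Configuration", "Missing entry")),
]


def count_by_level(incidents, depth):
    """Count incidents per fault pattern prefix of the given depth."""
    counts = Counter()
    for _service, path in incidents:
        counts[path[:depth]] += 1
    return counts


def count_by_service(incidents):
    """Count incidents per service to highlight frequently affected components."""
    counts = Counter()
    for service, _path in incidents:
        counts[service] += 1
    return counts


if __name__ == "__main__":
    # Print the fault pattern distribution at the top two hierarchy levels.
    for depth in (1, 2):
        print(f"Level {depth}:")
        for prefix, n in count_by_level(INCIDENTS, depth).most_common():
            print("  " + " > ".join(prefix), n)
```

Counting on path prefixes keeps coarse and fine views consistent with each other: every incident counted under a fine-grained pattern is also counted under its parent category.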

Fri 19 Apr

Displayed time zone: Lisbon

14:00 - 15:30
14:00
15m
Talk
It's Not a Feature, It's a Bug: Fault-Tolerant Model Mining from Noisy Data
Research Track
Felix Wallner Graz University of Technology, Institute of Software Technology, Bernhard Aichernig Graz University of Technology, Christian Burghard AVL List GmbH
14:15
15m
Talk
Verifying Declarative Smart Contracts
Research Track
Haoxian Chen ShanghaiTech University, Lan Lu University of Pennsylvania, Brendan Massey University of Pennsylvania, Yuepeng Wang Simon Fraser University, Boon Thau Loo University of Pennsylvania
14:30
15m
Talk
Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
Software Engineering in Practice
Jinxi Kuang The Chinese University of Hong Kong, Jinyang Liu The Chinese University of Hong Kong, Junjie Huang The Chinese University of Hong Kong, Renyi Zhong The Chinese University of Hong Kong, Jiazhen Gu The Chinese University of Hong Kong, Lan Yu Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Rui Tan Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Zengyin Yang Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Michael Lyu The Chinese University of Hong Kong
14:45
15m
Talk
Intelligent Monitoring Framework for Cloud Services: A Data-Driven Approach
Software Engineering in Practice
Pooja Srinivas Microsoft, Fiza Husain Microsoft, Anjaly Parayil Microsoft, Ayush Choure Microsoft, Chetan Bansal Microsoft Research, Saravan Rajmohan Microsoft
15:00
15m
Talk
FaultProfIT: Hierarchical Fault Profiling of Incident Tickets in Large-scale Cloud Systems
Software Engineering in Practice
Junjie Huang The Chinese University of Hong Kong, Jinyang Liu The Chinese University of Hong Kong, Zhuangbin Chen School of Software Engineering, Sun Yat-sen University, Zhihan Jiang The Chinese University of Hong Kong, Yichen LI The Chinese University of Hong Kong, Jiazhen Gu The Chinese University of Hong Kong, Cong Feng Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Zengyin Yang Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd, Yongqiang Yang Huawei Technologies, Michael Lyu The Chinese University of Hong Kong
15:15
7m
Talk
Translating between SQL Dialects for Cloud Migration
Software Engineering in Practice
Ran Zmigrod JP Morgan - Chase, Salwa Alamir J.P. Morgan AI Research, Xiaomo Liu JP Morgan AI Research
15:22
7m
Talk
Designing Trustful Cooperation Ecosystems is Key to the New Space Exploration Era
New Ideas and Emerging Results
Renan Lima Baima University of Luxembourg, Loïck Chovet University of Luxembourg, Johannes Sedlmeir University of Luxembourg, Miguel A. Olivares-Mendez University of Luxembourg, Gilbert Fridgen University of Luxembourg