Autonomous Monitors for Detecting Failures Early and Reporting Interpretable Alerts in Cloud Operations
Detecting failures early in cloud-based software systems is important because it can reduce operational costs, enhance service reliability, and improve user experience. Many existing approaches rely on anomaly detection in metrics or on a blend of metric and log features. However, such approaches tend to be very complex and hard to explain, and are consequently non-trivial to implement and evaluate in industrial contexts. In collaboration with a case company and their cloud-based system in the domain of PIM (Product Information Management), we propose and implement autonomous monitors for proactive monitoring across multiple services of a distributed software architecture, fused with anomaly detection in performance metrics and log analysis using GPT-3. We demonstrated that operations engineers tend to be more efficient when they have access to interpretable alert notifications based on detected anomalies that contain information about implications and potential root causes. Additionally, the proposed autonomous monitors turned out to be beneficial for the timely identification and revision of potential issues before they propagate and cause severe consequences.
Wed 17 Apr (displayed time zone: Lisbon)
14:00 - 15:30 | Dependability and Formal methods 1 | Software Engineering in Practice / Demonstrations / Research Track at Maria Helena Vieira da Silva | Chair(s): Domenico Bianculli (University of Luxembourg)
14:00 | 15m | Talk | REDriver: Runtime Enforcement for Autonomous Vehicles | Research Track | Yang Sun (Singapore Management University), Chris Poskitt (Singapore Management University), Xiaodong Zhang, Jun Sun (Singapore Management University)
14:15 | 15m | Talk | Scalable Relational Analysis via Relational Bound Propagation | Research Track
14:30 | 15m | Talk | Kind Controllers and Fast Heuristics for Non-Well-Separated GR(1) Specifications | Research Track | Ariel Gorenstein (Tel Aviv University), Shahar Maoz (Tel Aviv University), Jan Oliver Ringert (Bauhaus-University Weimar)
14:45 | 15m | Talk | On the Difficulty of Identifying Incident-Inducing Changes | Software Engineering in Practice | Eileen Kapel (ING & Delft University of Technology), Luís Cruz (Delft University of Technology), Diomidis Spinellis (Athens University of Economics and Business & Delft University of Technology), Arie van Deursen (Delft University of Technology)
15:00 | 15m | Talk | Autonomous Monitors for Detecting Failures Early and Reporting Interpretable Alerts in Cloud Operations | Software Engineering in Practice | Adha Hrusto (Lund University, Sweden), Per Runeson (Lund University), Magnus C Ohlsson (System Verification)
15:15 | 7m | Talk | nvshare: Practical GPU Sharing without Memory Size Constraints | Demonstrations
15:22 | 7m | Talk | Daedalux: An Extensible Platform for Variability-Aware Model Checking | Demonstrations | Sami Lazreg (Visteon Electronics and Université Côte d'Azur), Maxime Cordy (University of Luxembourg, Luxembourg), Simon Thrane Hansen (SnT, University of Luxembourg), Axel Legay (Université Catholique de Louvain, Belgium)