
This program is tentative and subject to change.

Fri 2 May 2025 15:00 - 15:15 at 203 - Real-Time SE

Cloud infrastructure is the collective term for all physical devices within cloud systems. Failures within the cloud infrastructure system can severely compromise the stability and availability of cloud services. In particular, batch server outages, the most severe type of failure, can render all upstream services completely unavailable. In this work, we focus on the problem of diagnosing batch server outages, aiming to analyze the root cause of an outage accurately and promptly to facilitate troubleshooting. However, our empirical study of a real industrial system indicates that this is a challenging task. First, the single-modal, coarse-grained failure monitoring data collected in the cloud infrastructure system (i.e., alerts, incidents, or changes) is insufficient for comprehensive failure profiling. Second, because of the intricate dependencies among devices, outages are often the cumulative result of multiple failures, yet the correlations between those failures are difficult to ascertain. To address these problems, we propose BSODiag, an unsupervised and lightweight framework for diagnosing batch server outages. BSODiag provides a global analytical perspective: it thoroughly explores failure information from multi-source monitoring data, models the spatio-temporal correlations among failures, and delivers accurate and interpretable diagnostic results. Experiments conducted on the Alibaba Cloud infrastructure system show that BSODiag achieves 87.5% PR@3 and 46.3% PCR, outperforming baseline methods by 10.2% and 3.7%, respectively.
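The abstract reports PR@3 without defining it. In the root-cause-analysis literature, PR@k is commonly defined as the fraction of true root causes that appear among a method's top-k ranked candidates, averaged over all failure cases (normalized by min(k, number of true root causes) per case). The sketch below assumes that common definition; the BSODiag paper may define it differently, and PCR is not sketched because its definition is not given here. The function name `pr_at_k` and the component names in the example are hypothetical.

```python
from typing import Sequence, Set


def pr_at_k(rankings: Sequence[Sequence[str]], truths: Sequence[Set[str]], k: int) -> float:
    """Assumed common PR@k: per case, count true root causes in the top-k
    candidates, divide by min(k, |true causes|), then average over cases."""
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not rankings or k < 1:
        raise ValueError("need at least one case and k >= 1")
    total = 0.0
    for ranked, truth in zip(rankings, truths):
        if not truth:
            raise ValueError("each case needs at least one true root cause")
        hits = sum(1 for candidate in ranked[:k] if candidate in truth)
        total += hits / min(k, len(truth))
    return total / len(rankings)


# Hypothetical ranked diagnoses for four outage cases (names are illustrative).
rankings = [
    ["disk", "nic", "psu"],
    ["psu", "fan", "nic"],
    ["fan", "cpu", "disk"],
    ["nic", "disk", "psu"],
]
truths = [{"nic"}, {"psu"}, {"mem"}, {"psu"}]

print(pr_at_k(rankings, truths, 3))  # → 0.75
print(pr_at_k(rankings, truths, 1))  # → 0.25
```

Under this definition, PR@3 rewards a diagnosis whenever the true cause lands anywhere in the top three candidates, which suits an operator who inspects a short list rather than a single answer.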


Fri 2 May

Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30
14:00
15m
Talk
Closing the Gap between Sensor Inputs and Driving Properties: A Scene Graph Generator for CARLA
Demonstrations
Trey Woodlief University of Virginia, Felipe Toledo, Sebastian Elbaum University of Virginia, Matthew B Dwyer University of Virginia
14:15
15m
Talk
LEGOS-SLEEC: Tool for Formalizing and Analyzing Normative Requirements
Demonstrations
Kevin Kolyakov University of Toronto, Lina Marsso École Polytechnique de Montréal, Nick Feng University of Toronto, Junwei Quan University of Toronto, Marsha Chechik University of Toronto
14:30
15m
Talk
MarMot: Metamorphic Runtime Monitoring of Autonomous Driving Systems
Journal-first Papers
Jon Ayerdi Mondragon University, Asier Iriarte Mondragon University, Pablo Valle Mondragon University, Ibai Roman Mondragon University, Miren Illarramendi Mondragon University, Aitor Arrieta Mondragon University
14:45
15m
Talk
Automatically Generating Content for Testing Autonomous Vehicles from User Descriptions
New Ideas and Emerging Results (NIER)
Benedikt Steininger IMC FH Krems, Chrysanthi Papamichail BeamNG GmbH, David Stark BeamNG GmbH, Dejan Nickovic Austrian Institute of Technology, Alessio Gambi Austrian Institute of Technology (AIT)
15:00
15m
Talk
BSODiag: A Global Diagnosis Framework for Batch Servers Outage in Large-scale Cloud Infrastructure Systems
SE In Practice (SEIP)
Tao Duan Xi'an Jiaotong University, Runqing Chen Alibaba, Pinghui Wang Xi'an Jiaotong University, Junzhou Zhao Xi'an Jiaotong University, Jiongzhou Liu Alibaba, Shujie Han Northwestern Polytechnical University, Yi Liu Alibaba, Fan Xu Alibaba
15:15
15m
Talk
On Large Language Models in Mission-Critical IT Governance: Are We Ready Yet?
SE In Practice (SEIP)
Matteo Esposito University of Oulu, Francesco Palagiano Multitel di Lerede Alessandro & C. s.a.s., Valentina Lenarduzzi University of Oulu, Davide Taibi University of Oulu