TraceLingo: Trace representation and learning for performance issue diagnosis in cloud services
Large-scale cloud systems such as Microsoft Azure, Google Cloud, and Amazon AWS provide a wide variety of online services that serve millions of customers around the world. Adverse service behaviors and high latencies can severely degrade performance, which in turn affects user satisfaction. Besides system KPI metrics and logs, trace data is of great value for analyzing system performance status, detecting anomalous workstreams, and localizing performance bottlenecks. However, existing work mostly represents a trace as a sequence of events with execution time information, which ignores the runtime context and graph structure of the trace. In this paper, we propose a trace representation and learning model, TraceLingo, which adopts a tree-based RNN model to capture the dependencies between spans in various traces for automatic and effective performance diagnosis.
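To make the contrast between a flat event sequence and a tree-structured trace concrete, the following is a minimal illustrative sketch and not the TraceLingo model itself. It assumes a hypothetical span schema (span_id, parent_id, operation, duration_ms) and replaces the learned tree-based RNN cell with a fixed-weight child-sum recursion, so the names Span, build_tree, encode, W_X, and W_H are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Hypothetical span record; the field names are assumptions, not the paper's schema."""
    span_id: str
    parent_id: Optional[str]
    operation: str
    duration_ms: float
    children: List["Span"] = field(default_factory=list)


def build_tree(spans: List[Span]) -> Span:
    """Link each span to its parent and return the root (the span with no parent)."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


# Placeholder weights; a real tree-based RNN would learn these parameters.
W_X = 0.1  # weight on the span's own features
W_H = 0.1  # weight on the summed child states


def encode(span: Span) -> List[float]:
    """Bottom-up child-sum recursion: a span's state depends on its own
    features and on the states of all of its child spans."""
    child_sum = [0.0, 0.0]
    for child in span.children:
        h = encode(child)
        child_sum = [a + b for a, b in zip(child_sum, h)]
    x = [math.log1p(span.duration_ms), float(len(span.children))]
    return [math.tanh(W_X * xi + W_H * ci) for xi, ci in zip(x, child_sum)]


# Toy trace: one request span with two child spans.
spans = [
    Span("a", None, "GET /checkout", 100.0),
    Span("b", "a", "auth", 40.0),
    Span("c", "a", "db.query", 30.0),
]
root = build_tree(spans)
state = encode(root)
```

Because the root state is computed from the child states, a slowdown in a child span changes the representation of the whole trace, which a purely sequential encoding of per-event durations would not capture structurally.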
Sat 29 May
Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
11:55 - 12:55 | Technical paper session #1 | CloudIntelligence 2021 at CloudIntelligence Room | Chair(s): Qingwei Lin (Microsoft Research, Beijing, China)
11:55 | 15m | Paper | PerfEstimator: A Generic and Extensible Performance Estimator for Data Parallel DNN Training | CloudIntelligence 2021 | Chengru Yang (University of Science and Technology of China), Zhehao Li (University of Science and Technology of China), Chaoyi Ruan (University of Science and Technology of China), Guanbin Xu (University of Science and Technology of China), Cheng Li (University of Science and Technology of China), Ruichuan Chen (Nokia Bell Labs), Feng Yan (University of Nevada Reno)
12:10 | 15m | Paper | Kmon: An In-kernel Transparent Monitoring System for Microservice Systems with eBPF | CloudIntelligence 2021 | Tianjun Weng (Sun Yat-Sen University), Wanqi Yang (Sun Yat-Sen University), Guangba Yu (Sun Yat-Sen University), Pengfei Chen (Sun Yat-Sen University), Jieqi Cui (Sun Yat-Sen University), Chuanfu Zhang (Sun Yat-Sen University)
12:25 | 15m | Paper | TraceLingo: Trace representation and learning for performance issue diagnosis in cloud services | CloudIntelligence 2021 | Yong Xu (Microsoft, China), Yaokang Zhu (Microsoft Research Asia), Bo Qiao (Microsoft Research, Beijing, China), Hongshu Che (Microsoft Research, Beijing, China), Pu Zhao (Microsoft Research, Beijing, China), Xu Zhang (Microsoft Research, Beijing, China), Ze Li (Microsoft, USA), Yingnong Dang (Microsoft, USA), Qingwei Lin (Microsoft Research, Beijing, China)
12:40 | 15m | Paper | MicroDiag: Fine-grained Performance Diagnosis for Microservice Systems | CloudIntelligence 2021 | Li Wu (Elastisys AB/Technische Universität Berlin), Johan Tordsson (Elastisys AB), Jasmin Bogatinovski, Erik Elmroth (Elastisys AB/Umea University), Odej Kao (Technische Universität Berlin)