CloudIntelligence 2021
Sat 29 May 2021
co-located with ICSE 2021
Sat 29 May 2021 11:55 - 12:10 at CloudIntelligence Room - Technical paper session #1 Chair(s): Qingwei Lin

Understanding the performance of data parallel DNN training at large scale is crucial for supporting efficient DNN cloud deployment and for facilitating the design and optimization of scalable DNN systems. Existing work adopts analytical modeling, which may fall short of capturing the system behaviors that result from fast-evolving DNN systems and continually proposed optimizations. In this paper, we present PerfEstimator, a generic and extensible estimator for accurate performance estimation of large-scale data parallel DNN training. PerfEstimator is driven by three major components: an extensible, attributed-graph-based performance model; a profiling and simulation tool that obtains computation and synchronization time costs on a single machine; and a computation-synchronization pipeline builder that derives the scaling factors. Our evaluation shows that PerfEstimator can accurately predict the performance of data parallel DNN training jobs, with a prediction error of 0.2-11%.

Sat 29 May

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

11:55 - 12:55
Technical paper session #1, CloudIntelligence 2021, at CloudIntelligence Room
Chair(s): Qingwei Lin Microsoft Research, Beijing, China
11:55
15m
Paper
PerfEstimator: A Generic and Extensible Performance Estimator for Data Parallel DNN Training
Chengru Yang University of Science and Technology of China, Zhehao Li University of Science and Technology of China, Chaoyi Ruan University of Science and Technology of China, Guanbin Xu University of Science and Technology of China, Cheng Li University of Science and Technology of China, Ruichuan Chen Nokia Bell Labs, Feng Yan University of Nevada Reno
12:10
15m
Paper
Kmon: An In-kernel Transparent Monitoring System for Microservice Systems with eBPF
Tianjun Weng Sun Yat-Sen University, Wanqi Yang Sun Yat-Sen University, Guangba Yu Sun Yat-Sen University, Pengfei Chen Sun Yat-Sen University, Jieqi Cui Sun Yat-Sen University, Chuanfu Zhang Sun Yat-Sen University
12:25
15m
Paper
TraceLingo: Trace representation and learning for performance issue diagnosis in cloud services
Yong Xu Microsoft, China, Yaokang Zhu Microsoft Research Asia, Bo Qiao Microsoft Research, Beijing, China, Hongshu Che Microsoft Research, Beijing, China, Pu Zhao Microsoft Research, Beijing, China, Xu Zhang Microsoft Research, Beijing, China, Ze Li Microsoft, USA, Yingnong Dang Microsoft, USA, Qingwei Lin Microsoft Research, Beijing, China
12:40
15m
Paper
MicroDiag: Fine-grained Performance Diagnosis for Microservice Systems
Li Wu Elastisys AB/Technische Universität Berlin, Johan Tordsson Elastisys AB, Jasmin Bogatinovski, Erik Elmroth Elastisys AB/Umeå University, Odej Kao Technische Universität Berlin
