ICSE 2024
Fri 12 - Sun 21 April 2024 Lisbon, Portugal

Deep learning plays a critical role in numerous intelligent software applications. Enterprise developers submit and run deep learning jobs on shared, multi-tenant platforms to efficiently train and test models. These platforms are typically equipped with a large number of graphics processing units (GPUs) to expedite deep learning computations. However, certain jobs exhibit rather low utilization of the allocated GPUs, resulting in substantial resource waste and reduced development productivity. This paper presents a comprehensive empirical study on low GPU utilization of deep learning jobs, based on 400 real jobs (with an average GPU utilization of 50% or less) collected from Microsoft’s internal deep learning platform. We discover 706 low-GPU-utilization issues through meticulous examination of job metadata, execution logs, runtime metrics, scripts, and programs. Furthermore, we identify the common root causes and propose corresponding fixes. Our main findings include: (1) Low GPU utilization of deep learning jobs stems from insufficient GPU computations and interruptions caused by non-GPU tasks; (2) Approximately half (46.03%) of the issues are attributed to data operations; (3) 45.18% of the issues are related to deep learning models and manifest during both model training and evaluation stages; (4) Most (84.99%) low-GPU-utilization issues could be fixed with a small number of code/script modifications. Based on the study results, we propose potential research directions that could help developers utilize GPUs better in cloud-based platforms.

Thu 18 Apr

Displayed time zone: Lisbon

14:00 - 15:30
LLM, NN and other AI technologies 4: Research Track / Industry Challenge Track / New Ideas and Emerging Results at Pequeno Auditório
Chair(s): David Nader Palacio William & Mary
14:00
15m
Talk
Programming Assistant for Exception Handling with CodeBERT
Research Track
Yuchen Cai University of Texas at Dallas, Aashish Yadavally University of Texas at Dallas, Abhishek Mishra University of Texas at Dallas, Genesis Montejo University of Texas at Dallas, Tien N. Nguyen University of Texas at Dallas
14:15
15m
Talk
An Empirical Study on Noisy Label Learning for Program Understanding
Research Track
Wenhan Wang Nanyang Technological University, Yanzhou Li Nanyang Technological University, Anran Li Nanyang Technological University, Jian Zhang Nanyang Technological University, Wei Ma Nanyang Technological University, Singapore, Yang Liu Nanyang Technological University
Pre-print
14:30
15m
Talk
An Empirical Study on Low GPU Utilization of Deep Learning Jobs
Research Track
Yanjie Gao Microsoft Research, Yichen He, Xinze Li Microsoft Research, Bo Zhao Microsoft Research, Haoxiang Lin Microsoft Research, Yoyo Liang Microsoft, Jing Zhong Microsoft, Hongyu Zhang Chongqing University, Jingzhou Wang Microsoft Research, Yonghua Zeng Microsoft, Keli Gui Microsoft, Jie Tong Microsoft, Mao Yang Microsoft Research
DOI Pre-print
14:45
15m
Talk
Using an LLM to Help With Code Understanding
Research Track
Daye Nam Carnegie Mellon University, Andrew Macvean Google, Inc., Vincent J. Hellendoorn Carnegie Mellon University, Bogdan Vasilescu Carnegie Mellon University, Brad A. Myers Carnegie Mellon University
15:00
15m
Talk
MissConf: LLM-Enhanced Reproduction of Configuration-Triggered Bugs
Industry Challenge Track
Ying Fu National University of Defense Technology, Teng Wang National University of Defense Technology, Shanshan Li National University of Defense Technology, Jinyan Ding National University of Defense Technology, Shulin Zhou National University of Defense Technology, Zhouyang Jia National University of Defense Technology, Wang Li National University of Defense Technology, Yu Jiang Tsinghua University, Liao Xiangke National University of Defense Technology
File Attached
15:15
7m
Talk
XAIport: A Service Framework for the Early Adoption of XAI in AI Model Development
New Ideas and Emerging Results
Zerui Wang Concordia University, Yan Liu Concordia University, Abishek Arumugam Thiruselvi Concordia University, Wahab Hamou-Lhadj Concordia University, Montreal, Canada
DOI Pre-print
15:22
7m
Talk
Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code?
New Ideas and Emerging Results
Alejandro Velasco William & Mary, David Nader Palacio William & Mary, Daniel Rodriguez-Cardenas, Denys Poshyvanyk William & Mary
Pre-print