ICSE 2023
Sun 14 - Sat 20 May 2023 Melbourne, Australia
Fri 19 May 2023 16:30 - 16:45 at Level G - Plenary Room 1 - Software quality Chair(s): Valentina Lenarduzzi

In recent years, deep learning (DL) has been increasingly adopted in many application areas. To help DL developers better train and test their models, enterprises have built dedicated, multi-tenant platforms equipped with large fleets of computing devices such as GPUs. The service quality of these platforms plays a critical role in system efficiency and user experience. Nevertheless, diverse types of quality issues arise on them that not only waste significant computing resources but also severely slow down development productivity. In this paper, we present a comprehensive empirical study on quality issues of Platform-X in Microsoft. Platform-X is an internal production deep learning platform that serves hundreds of developers and researchers. We manually examined 360 real issues and investigated their common symptoms, root causes, and mitigation actions. Our major findings include: (1) 28.33% of the quality issues are caused by hardware faults (in the GPU, network, and compute nodes); (2) another 28.33% result from system-side faults (e.g., system defects and service outages); (3) user-side faults (e.g., user bugs and policy violations) account for more than two-fifths (43.34%) of all the common causes; (4) nearly three-fifths of all the quality issues can be mitigated simply by resubmitting jobs (34.72%) or improving user code (24.72%). Our study results provide valuable guidance on improving the service quality of deep learning platforms from both the development and maintenance perspectives. The results further motivate possible research directions and tooling support.
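As a quick sanity check on the abstract's breakdown, the reported cause percentages cover all 360 issues, while the two most common mitigations together account for just under three-fifths. A minimal sketch (the numbers are taken verbatim from the abstract; the category labels are paraphrased):

```python
# Percentages reported in the abstract for the 360 examined quality issues.
causes = {
    "hardware faults": 28.33,
    "system-side faults": 28.33,
    "user-side faults": 43.34,
}
mitigations = {
    "resubmitting jobs": 34.72,
    "improving user code": 24.72,
}

total_causes = sum(causes.values())          # ~100.0: the three cause groups cover all issues
top_mitigations = sum(mitigations.values())  # ~59.44: just under three-fifths of issues
print(round(total_causes, 2), round(top_mitigations, 2))
```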

Fri 19 May

Displayed time zone: Hobart

15:45 - 17:15
15:45
15m
Talk
DuetCS: Code Style Transfer through Generation and Retrieval
Technical Track
Binger Chen Technische Universität Berlin, Ziawasch Abedjan Leibniz Universität Hannover
16:00
15m
Talk
Understanding Why and Predicting When Developers Adhere to Code-Quality Standards
SEIP - Software Engineering in Practice
Manish Motwani Georgia Institute of Technology, Yuriy Brun University of Massachusetts
Pre-print
16:15
15m
Talk
Code Compliance Assessment as a Learning Problem
SEIP - Software Engineering in Practice
16:30
15m
Talk
An Empirical Study on Quality Issues of Deep Learning Platform
SEIP - Software Engineering in Practice
Yanjie Gao Microsoft Research, Xiaoxiang Shi, Haoxiang Lin Microsoft Research, Hongyu Zhang The University of Newcastle, Hao Wu, Rui Li, Mao Yang Microsoft Research
Pre-print
16:45
7m
Talk
Can static analysis tools find more defects? A qualitative study of design rule violations found by code review
Journal-First Papers
Sahar Mehrpour George Mason University, USA, Thomas LaToza George Mason University
16:52
7m
Talk
DebtFree: minimizing labeling cost in self-admitted technical debt identification using semi-supervised learning
Journal-First Papers
Huy Tu North Carolina State University, USA, Tim Menzies North Carolina State University
Link to publication · Pre-print
17:00
7m
Talk
FIXME: synchronize with database! An empirical study of data access self-admitted technical debt
Journal-First Papers
Biruk Asmare Muse Polytechnique Montréal, Csaba Nagy Software Institute - USI, Lugano, Anthony Cleve University of Namur, Foutse Khomh Polytechnique Montréal, Giuliano Antoniol Polytechnique Montréal
17:07
7m
Talk
How does quality deviate in stable releases by backporting?
NIER - New Ideas and Emerging Results
Jarin Tasnim University of Saskatchewan, Debasish Chakroborti University of Saskatchewan, Chanchal K. Roy University of Saskatchewan, Kevin Schneider University of Saskatchewan
Link to publication · Pre-print