An Empirical Study on Quality Issues of Deep Learning Platform
In recent years, deep learning (DL) has been increasingly adopted in many application areas. To help DL developers train and test their models, enterprises have built dedicated, multi-tenant platforms equipped with large numbers of computing devices such as GPUs. The service quality of these platforms plays a critical role in system efficiency and user experience. Nevertheless, these platforms suffer from diverse types of quality issues that not only waste significant computing resources but also severely reduce development productivity. In this paper, we present a comprehensive empirical study on quality issues of Platform-X in Microsoft. Platform-X is an internal production deep learning platform that serves hundreds of developers and researchers. We have manually examined 360 real issues and investigated their common symptoms, root causes, and mitigation actions. Our major findings include: (1) 28.33% of the quality issues are caused by hardware faults (in the GPU, network, and compute node); (2) 28.33% of them result from system-side faults (e.g., system defects and service outages); (3) user-side faults (e.g., user bugs and policy violations) account for more than two-fifths (43.34%) of all the common causes; (4) nearly three-fifths of all the quality issues can be mitigated by simply resubmitting jobs (34.72%) or improving user code (24.72%). Our study results provide valuable guidance on improving the service quality of deep learning platforms from both the development and maintenance aspects. The results further motivate possible research directions and tooling support.