In recent years, deep learning has seen widespread adoption across various domains, giving rise to large-scale models such as large language models. Training these models, particularly in distributed environments, presents substantial computational and communication challenges. A critical issue is the communication deadlock, a state in which processes stall indefinitely while awaiting network messages from one another, wasting resources and reducing productivity. Current approaches to deadlock handling either are unsuitable for deep learning because of its unique hybrid programming paradigm or limit optimization opportunities. This paper presents dl², a novel dynamic analysis tool designed to detect communication deadlocks in deep learning jobs. dl² models the runtime trace of a job as an execution graph, detects unmatched communications, and constructs a wait-for graph to identify deadlock cycles. dl² can also handle nondeterministic communication behaviors, providing replay and diagnostic support for root cause analysis. We evaluate dl² using PyTorch with a combination of synthetic test cases and real-world deep learning workloads. The experimental results show that dl² successfully detects all communication deadlocks, achieving 100% precision and recall, which highlights its effectiveness.
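To make the wait-for-graph idea in the abstract concrete, here is a minimal sketch in Python. It is not dl²'s implementation. It assumes a hypothetical trace format in which each rank lists its blocking point-to-point operations as `(kind, peer)` pairs, and it assumes synchronous (rendezvous) send semantics. The helper names `build_wait_for` and `find_cycle` are invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical trace format (an assumption, not dl²'s): for each rank, the
# ordered list of blocking operations, written as (kind, peer) with kind
# being "send" or "recv".
Op = Tuple[str, int]


def build_wait_for(trace: Dict[int, List[Op]]) -> Dict[int, int]:
    """Match operations pairwise and return rank -> peer for blocked ranks.

    Two ranks make progress only when their current operations address
    each other and one is a send while the other is a recv.
    """
    pos = {rank: 0 for rank in trace}
    progressed = True
    while progressed:
        progressed = False
        for rank, ops in trace.items():
            if pos[rank] >= len(ops):
                continue  # this rank has finished its trace
            kind, peer = ops[pos[rank]]
            peer_ops = trace.get(peer, [])
            if pos.get(peer, 0) >= len(peer_ops):
                continue  # peer is finished or unknown: nothing to match
            peer_kind, peer_target = peer_ops[pos[peer]]
            if peer_target == rank and {kind, peer_kind} == {"send", "recv"}:
                pos[rank] += 1
                pos[peer] += 1
                progressed = True
    # Every rank with an operation left over is blocked on that peer.
    return {r: ops[pos[r]][1] for r, ops in trace.items() if pos[r] < len(ops)}


def find_cycle(wait_for: Dict[int, int]) -> Optional[List[int]]:
    """Return one cycle of ranks in the wait-for graph, or None."""
    for start in wait_for:
        index_of: Dict[int, int] = {}
        path: List[int] = []
        node = start
        while node in wait_for and node not in index_of:
            index_of[node] = len(path)
            path.append(node)
            node = wait_for[node]
        if node in index_of:
            return path[index_of[node]:]
    return None


# Both ranks receive first, so neither send is ever reached.
deadlocked = {0: [("recv", 1), ("send", 1)], 1: [("recv", 0), ("send", 0)]}
# Rank 0 sends first, so every operation finds its match.
healthy = {0: [("send", 1), ("recv", 1)], 1: [("recv", 0), ("send", 0)]}

print(find_cycle(build_wait_for(deadlocked)))
print(find_cycle(build_wait_for(healthy)))
```

In this sketch a rank can be left with an unmatched operation without being part of a cycle, for example when its peer has already finished. That is why unmatched-communication detection and cycle detection are kept as separate steps.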
Mon 23 Jun. Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
10:30 - 12:20 | Bug Detection (Research Papers / Industry Papers / Demonstrations / Journal First) at Aurora B | Chair(s): Lingming Zhang (University of Illinois at Urbana-Champaign)
10:30 | 20m | Talk | Yuga: Automatically Detecting Lifetime Annotation Bugs in the Rust Language | Journal First | Vikram Nitin (Columbia University), Anne Mulhern (Red Hat Inc), Sanjay Arora (Red Hat Inc), Baishakhi Ray (Columbia University)
10:50 | 10m | Talk | SpecChecker-Int: An Extensible Concurrency Bugs Detection Tool for Interrupt-driven Embedded Software | Demonstrations | Boxiang Wang (Beijing Sunwise Information Technology Ltd), Chao Li (Beijing Institute of Control Engineering; Beijing Sunwise Information Technology), Rui Chen (Beijing Institute of Control Engineering; Beijing Sunwise Information Technology), Sheng Wang (Beijing Sunwise Information Technology Ltd), Chunpeng Jia (Beijing Sunwise Information Technology Ltd), Mengfei Yang (China Academy of Space Technology)
11:00 | 20m | Talk | dl²: Detecting Communication Deadlocks in Deep Learning Jobs | Industry Papers | Yanjie Gao (Microsoft Research), Jiyu Luo (University of Science and Technology of China), Haoxiang Lin (Microsoft Research), Hongyu Zhang (Chongqing University), Ming Wu (Zero Gravity Labs), Mao Yang (Microsoft Research)
11:20 | 20m | Talk | Detecting Metadata-Related Bugs in Enterprise Applications | Research Papers | Md Mahir Asef Kabir (Virginia Tech), Xiaoyin Wang (University of Texas at San Antonio), Na Meng (Virginia Tech)
11:40 | 20m | Talk | ROSCallBaX: Statically Detecting Inconsistencies In Callback Function Setup of Robotic Systems | Research Papers | Sayali Kate (Purdue University), Yifei Gao (Purdue University), Shiwei Feng (Purdue University), Xiangyu Zhang (Purdue University)
12:00 | 20m | Talk | Enhancing Web Accessibility: Automated Detection of Issues with Generative AI | Research Papers | Ziyao He (University of California, Irvine), Syed Fatiul Huq (University of California, Irvine), Sam Malek (University of California at Irvine)
Aurora B is the second room in the Aurora wing.
When facing the main Cosmos Hall, access to the Aurora wing is on the right, close to the side entrance of the hotel.