L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis
As Large Language Models (LLMs) show their capabilities across various applications, training customized LLMs has become essential for modern enterprises. However, because LLM training is complex and requires massive computational resources and extensive training time, failures are inevitable during the training process. These failures result in considerable waste of resources and time, highlighting the critical need for effective and efficient failure diagnosis to reduce the cost of LLM training.
In this paper, we present the first empirical study on the failure reports of 428 LLM training failures in our production Platform-X between May 2023 and April 2024. Our study reveals that hardware and user faults are the predominant root causes, and current diagnosis processes rely heavily on training logs. Unfortunately, existing log-based diagnostic methods fall short in handling LLM training logs. Considering the unique features of LLM training, we identify three distinct patterns of LLM training logs: cross-job, spatial, and temporal patterns. We then introduce our Log-based Large-scale LLM training failure diagnosis framework, L4, which can automatically extract failure-indicating information (i.e., log events, nodes, stages, and iterations) from extensive training logs, thereby reducing manual effort and facilitating failure recovery. Experimental results on real-world datasets show that L4 outperforms existing approaches in identifying failure-indicating logs and localizing faulty nodes. Furthermore, L4 has been applied in Platform-X and demonstrated its effectiveness in enabling accurate and efficient failure diagnosis.
Tue 24 Jun. Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
16:00 - 17:40 | Anomaly Detection | Ideas, Visions and Reflections / Research Papers / Industry Papers | at Pirsenteret 150 | Chair(s): Gias Uddin (York University, Canada)
16:00 | 20m | Talk | Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning | Research Papers | Yuqing Wang (University of Helsinki, Finland), Mika Mäntylä (University of Helsinki and University of Oulu), Serge Demeyer (University of Antwerp and Flanders Make vzw), Mutlu Beyazıt (University of Antwerp and Flanders Make vzw), Joanna Kisaakye (University of Antwerp, Belgium), Jesse Nyyssölä (University of Helsinki)
16:20 | 10m | Talk | CLSLog: Collaborating Large and Small Models for Log-based Anomaly Detection | Ideas, Visions and Reflections | Pei Xiao (Peking University), Tong Jia (Institute for Artificial Intelligence, Peking University, Beijing, China), Chiming Duan (Peking University), Minghua He (Peking University), Weijie Hong (Peking University), Xixuan Yang (School of Software and Microelectronics, Peking University), Yihan Wu (National Computer Network Emergency Response Technical Team/Coordination Center of China), Ying Li (School of Software and Microelectronics, Peking University, Beijing, China), Gang Huang (Peking University)
16:30 | 10m | Talk | From Few-Label to Zero-Label: An Approach for Cross-System Log-Based Anomaly Detection with Meta-Learning | Ideas, Visions and Reflections | Xinlong Zhao (Peking University), Tong Jia (Institute for Artificial Intelligence, Peking University, Beijing, China), Minghua He (Peking University), Yihan Wu (National Computer Network Emergency Response Technical Team/Coordination Center of China), Ying Li (School of Software and Microelectronics, Peking University, Beijing, China), Gang Huang (Peking University)
16:40 | 20m | Talk | CAShift: Benchmarking Log-Based Cloud Attack Detection under Normality Shift | Research Papers | Jiongchi Yu (Singapore Management University), Xiaofei Xie (Singapore Management University), Qiang Hu (Tianjin University), Bowen Zhang (Singapore Management University), Ziming Zhao (Zhejiang University), Yun Lin (Shanghai Jiao Tong University), Lei Ma (The University of Tokyo & University of Alberta), Ruitao Feng (Southern Cross University), Frank Liauw (Government Technology Agency Singapore)
17:00 | 20m | Talk | Detecting and Handling WoT Violations by Learning Physical Interactions from Device Logs | Research Papers | Bingkun Sun (Fudan University), Shiqi Sun (Northwestern Polytechnique University), Jialin Ren (Fudan University), Mingming Hu (Fudan University), Kun Hu (School of Computer Science, Fudan University), Liwei Shen (Fudan University), Xin Peng (Fudan University)
17:20 | 20m | Talk | L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis | Industry Papers | Zhihan Jiang (The Chinese University of Hong Kong), Junjie Huang (The Chinese University of Hong Kong), Guangba Yu (The Chinese University of Hong Kong), Zhuangbin Chen (Sun Yat-sen University), Yichen LI (The Chinese University of Hong Kong), Renyi Zhong (The Chinese University of Hong Kong), Cong Feng (Huawei Cloud Computing Technology), Yongqiang Yang (Huawei Cloud Computing Technology), Zengyin Yang (Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd), Michael Lyu (Chinese University of Hong Kong)
This room is located outside the Clarion Hotel
This room is located in the Pirsenteret (The Pier Center) convention center. It is just outside the hotel, at the back, towards the fjord.
You should be able to go through the emergency exit at the Clarion, just to the side of the Cosmos 3 wing, which will bring you close to Pirsenteret.
The entrance to the center is from here:
https://maps.app.goo.gl/dU3qH6kAimXGBNHe7
Once inside, go straight ahead and you will find signage directing you to the room. The room is known as room 150 inside the center.