L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis (FSE 2025 - Industry Papers)

Who

Zhihan Jiang, Junjie Huang, Guangba Yu, Zhuangbin Chen, Yichen LI, Renyi Zhong, Cong Feng, Yongqiang Yang, Zengyin Yang, Michael Lyu

Track

FSE 2025 Industry Papers

Time Zone

The program is currently displayed in (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

When

Tue 24 Jun 2025 17:20 - 17:40 at Pirsenteret 150 - Anomaly Detection Chair(s): Gias Uddin

Abstract

As Large Language Models (LLMs) show their capabilities across various applications, training customized LLMs has become essential for modern enterprises. However, due to the complexity of LLM training, which requires massive computational resources and extensive training time, failures are inevitable during the training process. These failures result in considerable waste of resources and time, highlighting the critical need for effective and efficient failure diagnosis to reduce the cost of LLM training.

In this paper, we present the first empirical study on the failure reports of 428 LLM training failures in our production Platform-X between May 2023 and April 2024. Our study reveals that hardware and user faults are the predominant root causes, and current diagnosis processes rely heavily on training logs. Unfortunately, existing log-based diagnostic methods fall short in handling LLM training logs. Considering the unique features of LLM training, we identify three distinct patterns of LLM training logs: cross-job, spatial, and temporal patterns. We then introduce our Log-based Large-scale LLM training failure diagnosis framework, L4, which can automatically extract failure-indicating information (i.e., log events, nodes, stages, and iterations) from extensive training logs, thereby reducing manual effort and facilitating failure recovery. Experimental results on real-world datasets show that L4 outperforms existing approaches in identifying failure-indicating logs and localizing faulty nodes. Furthermore, L4 has been applied in Platform-X and demonstrated its effectiveness in enabling accurate and efficient failure diagnosis.

Zhihan Jiang

The Chinese University of Hong Kong

Junjie Huang

The Chinese University of Hong Kong

Hong Kong SAR China

Guangba Yu

The Chinese University of Hong Kong

Hong Kong SAR China

Zhuangbin Chen

Sun Yat-sen University

China

Yichen LI

The Chinese University of Hong Kong

China

Renyi Zhong

The Chinese University of Hong Kong

Cong Feng

Huawei Cloud Computing Technology

China

Yongqiang Yang

Huawei Cloud Computing Technology

China

Zengyin Yang

Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd

Michael Lyu

Chinese University of Hong Kong

Session Program

Tue 24 Jun
Displayed time zone: (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

16:00 - 17:40	Anomaly Detection (Ideas, Visions and Reflections / Research Papers / Industry Papers) at Pirsenteret 150. Chair(s): Gias Uddin, York University, Canada

16:00 (20m) Talk: Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning [Research Papers]
Yuqing Wang (University of Helsinki, Finland); Mika Mäntylä (University of Helsinki and University of Oulu); Serge Demeyer (University of Antwerp and Flanders Make vzw); Mutlu Beyazıt (University of Antwerp and Flanders Make vzw); Joanna Kisaakye (University of Antwerp, Belgium); Jesse Nyyssölä (University of Helsinki)

16:20 (10m) Talk: CLSLog: Collaborating Large and Small Models for Log-based Anomaly Detection [Ideas, Visions and Reflections]
Pei Xiao (Peking University); Tong Jia (Institute for Artificial Intelligence, Peking University, Beijing, China); Chiming Duan (Peking University); Minghua He (Peking University); Weijie Hong (Peking University); Xixuan Yang (School of Software and Microelectronics, Peking University); Yihan Wu (National Computer Network Emergency Response Technical Team/Coordination Center of China); Ying Li (School of Software and Microelectronics, Peking University, Beijing, China); Gang Huang (Peking University)

16:30 (10m) Talk: From Few-Label to Zero-Label: An Approach for Cross-System Log-Based Anomaly Detection with Meta-Learning [Ideas, Visions and Reflections]
Xinlong Zhao (Peking University); Tong Jia (Institute for Artificial Intelligence, Peking University, Beijing, China); Minghua He (Peking University); Yihan Wu (National Computer Network Emergency Response Technical Team/Coordination Center of China); Ying Li (School of Software and Microelectronics, Peking University, Beijing, China); Gang Huang (Peking University)

16:40 (20m) Talk: CAShift: Benchmarking Log-Based Cloud Attack Detection under Normality Shift [Research Papers]
Jiongchi Yu (Singapore Management University); Xiaofei Xie (Singapore Management University); Qiang Hu (Tianjin University); Bowen Zhang (Singapore Management University); Ziming Zhao (Zhejiang University); Yun Lin (Shanghai Jiao Tong University); Lei Ma (The University of Tokyo & University of Alberta); Ruitao Feng (Southern Cross University); Frank Liauw (Government Technology Agency Singapore)

17:00 (20m) Talk: Detecting and Handling WoT Violations by Learning Physical Interactions from Device Logs [Research Papers]
Bingkun Sun (Fudan University); Shiqi Sun (Northwestern Polytechnical University); Jialin Ren (Fudan University); Mingming Hu (Fudan University); Kun Hu (School of Computer Science, Fudan University); Liwei Shen (Fudan University); Xin Peng (Fudan University)

17:20 (20m) Talk: L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis [Industry Papers]
Zhihan Jiang (The Chinese University of Hong Kong); Junjie Huang (The Chinese University of Hong Kong); Guangba Yu (The Chinese University of Hong Kong); Zhuangbin Chen (Sun Yat-sen University); Yichen LI (The Chinese University of Hong Kong); Renyi Zhong (The Chinese University of Hong Kong); Cong Feng (Huawei Cloud Computing Technology); Yongqiang Yang (Huawei Cloud Computing Technology); Zengyin Yang (Computing and Networking Innovation Lab, Huawei Cloud Computing Technology Co., Ltd); Michael Lyu (Chinese University of Hong Kong)

Information for Participants

Tue 24 Jun 2025 16:00 - 17:40 at Pirsenteret 150 - Anomaly Detection Chair(s): Gias Uddin

Info for room Pirsenteret 150:

This room is located outside the Clarion Hotel, in the Pirsenteret (The Pier Center) convention center. It is just behind the hotel, towards the fjord.

You should be able to go through the emergency exit at the Clarion, just beside the Cosmos 3 wing, which will bring you close to Pirsenteret.

The entrance to the center is here:
https://maps.app.goo.gl/dU3qH6kAimXGBNHe7
Once inside, go straight ahead and follow the signage to the room. Inside the center, the room is known as room 150.