Take Kernel Stack Overhead Out: eBPF-Enhanced Network Acceleration for Distributed Training within Ethernet (Internetware 2025 - Research Track)

Who

Zhenyu Zhang, Pengfei Chen, Guangba Yu, Zilong He, Xiaoyun Li

Track

Internetware 2025 Research Track

When

Sat 21 Jun 2025 11:00 - 11:15 at Cosmos 3A - Session4: Code Optimization and Software Architecture Chair(s): Changhai Nie

Abstract

As deep neural networks (DNNs) continue to scale up to achieve greater capabilities, distributed training (DT) has become the prevailing approach to accelerating the training process. However, our measurements of network communication overhead in DT over Ethernet show that the Linux kernel network stack accounts for 30% to 40% of the total communication time, posing a significant bottleneck to training efficiency. To mitigate this overhead, we propose eRAR, an eBPF-based gradient aggregation scheme over Ring-AR for DT tasks in commodity data centers. eRAR exploits Ring-AR’s topology to aggregate gradients in the kernel using eBPF, enabling packet-level parallelism and avoiding the overhead of the network stack. It ensures reliability through ring-based retransmission and accelerates computation via SIMD-enabled kfuncs. eRAR is hardware-agnostic, network-topology-independent, and resource-efficient. Our experimental results on four popular DNN models show that, compared to aggregation based on the TCP/IP network stack, eRAR improves gradient aggregation throughput by 77.2%. Furthermore, eRAR reduces communication time by up to 37.4% compared to existing systems.
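
For readers unfamiliar with the communication pattern the abstract builds on, the following is a minimal sketch of ring all-reduce, which is presumably what "Ring-AR" abbreviates. It is not eRAR and not the authors' code. It is a plain in-process Python simulation with no networking, eBPF, retransmission, or SIMD. The function name `ring_allreduce` is hypothetical, and the sketch assumes the gradient length is divisible by the number of workers.

```python
from typing import List


def ring_allreduce(grads: List[List[float]]) -> List[List[float]]:
    """Simulate a sum all-reduce over a logical ring of len(grads) workers.

    Illustrative assumption: every worker holds a gradient vector of the
    same length, and that length is divisible by the number of workers.
    """
    n = len(grads)
    length = len(grads[0])
    assert all(len(g) == length for g in grads), "vectors must match in length"
    assert length % n == 0, "sketch assumes length divisible by worker count"
    chunk = length // n
    data = [list(g) for g in grads]  # copy so the inputs are not mutated

    def span(c: int) -> range:
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    # Phase 1 (reduce-scatter): in step s, worker i sends chunk (i - s) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        # Snapshot outgoing chunks so all sends in a step look simultaneous.
        outgoing = [[data[i][k] for k in span((i - s) % n)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for off, k in enumerate(span((i - s) % n)):
                data[dst][k] += outgoing[i][off]

    # Worker i now holds the fully reduced chunk (i + 1) % n.
    # Phase 2 (all-gather): in step s, worker i forwards chunk (i + 1 - s) % n
    # to its right neighbour, which overwrites its copy of that chunk.
    for s in range(n - 1):
        outgoing = [[data[i][k] for k in span((i + 1 - s) % n)] for i in range(n)]
        for i in range(n):
            dst = (i + 1) % n
            for off, k in enumerate(span((i + 1 - s) % n)):
                data[dst][k] = outgoing[i][off]

    return data


if __name__ == "__main__":
    workers = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    print(ring_allreduce(workers))
```

Each worker only ever exchanges one chunk per step with its ring neighbour, which is why the pattern runs on an ordinary Ethernet topology. The abstract describes eRAR as performing the per-chunk addition inside the kernel with eBPF rather than in user space as this simulation does.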

Link to Preprint

https://github.com/y1582240351/eRAR/blob/main/preprint.pdf

Zhenyu Zhang

School of Computer Science and Engineering, Sun Yat-sen University

Pengfei Chen

Sun Yat-sen University

Guangba Yu

School of Computer Science and Engineering, Sun Yat-sen University

Zilong He

Sun Yat-sen University

Xiaoyun Li

Sun Yat-sen University

China

Session Program

Sat 21 Jun
Displayed time zone: (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

11:00 - 13:00	Session4: Code Optimization and Software Architecture, Research Track, at Cosmos 3A. Chair(s): Changhai Nie, Nanjing University

11:00 15m Talk		Take Kernel Stack Overhead Out: eBPF-Enhanced Network Acceleration for Distributed Training within Ethernet Research Track Zhenyu Zhang School of Computer Science and Engineering, Sun Yat-sen University, Pengfei Chen Sun Yat-sen University, Guangba Yu School of Computer Science and Engineering, Sun Yat-sen University, Zilong He Sun Yat-sen University, Xiaoyun Li Sun Yat-sen University
11:15 15m Talk		Exploiting Booster Pass Chain for Compiler Phase Ordering Research Track Yihan Chen, Huanhuan Chen Nanjing University, Yuan Yao Nanjing University, Ping Yu Nanjing University, Feng Xu Nanjing University, Xiaoxing Ma Nanjing University
11:30 15m Talk		DeFS: A Decentralized and High-Performance File System for Consortium Systems Research Track Yitong Cheng Shanghai Jiao Tong University, Shenglong Zhao Shanghai Jiao Tong University, Yang Yu Shanghai Jiao Tong University, China, Zhichao Hua Shanghai Jiao Tong University
11:45 15m Talk		Proteus: An Automatical High-Efficiency Framework for Generating Compact and Printable Shellcode on ARMv8 Research Track Jian Lin Information Engineering University, Guoan Liu Information Engineering University, Rui Chang Zhejiang University, Ruimin Wang Information Engineering University
12:00 15m Talk		Modeling Go Concurrency: A Static Analysis Approach to Data Race Detection Research Track Fengjuan Gao Nanjing University of Science and Technology, Mumu Zhang Nanjing University, Zixiao Zhao Nanjing University, Yu Wang Nanjing University, Xuandong Li Nanjing University
12:15 15m Talk		RABBIT: Managing Hierarchical Memory with Intelligent Tiering Aware Deduplication Research Track Zilu Yao National University of Defense Technology, Yinjin Fu Sun Yat-sen University, Nong Xiao National University of Defense Technology & Sun Yat-sen University
12:30 15m Talk		DPCapsule: A Decentralized Private Computing System With Self-Controlled Data Research Track Yitong Cheng Shanghai Jiao Tong University, Yang Yu Shanghai Jiao Tong University, China, Zhichao Hua Shanghai Jiao Tong University
12:45 15m Talk		MicroGuard: Non-Intrusive Dynamic Analysis for Inter-Service Access Control of Microservices Research Track Haoming Luo School of Computer Science and Engineering, Sun Yat-sen University, Wanqi Yang Sun Yat-sen University, Pengfei Chen Sun Yat-sen University

Information for Participants

Sat 21 Jun 2025 11:00 - 13:00 at Cosmos 3A - Session4: Code Optimization and Software Architecture Chair(s): Changhai Nie

Info for room Cosmos 3A:

Cosmos 3A is the first room in the Cosmos 3 wing.

When facing the main Cosmos Hall, access to the Cosmos 3 wing is on the left, close to the stairs. The area is accessed through a large door with the number “3”, which will stay open during the event.