Take Kernel Stack Overhead Out: eBPF-Enhanced Network Acceleration for Distributed Training within Ethernet
As deep neural networks (DNNs) continue to grow in size to achieve greater capabilities, distributed training (DT) has become the prevailing approach to accelerating the training process. However, our measurements of network communication overhead in DT over Ethernet show that the Linux kernel network stack accounts for 30% to 40% of the total communication time, a significant bottleneck for training efficiency. To mitigate this overhead, we propose eRAR, an eBPF-based gradient aggregation scheme over Ring-AR for DT tasks in commodity data centers. eRAR exploits Ring-AR’s topology to aggregate gradients in the kernel using eBPF, enabling packet-level parallelism and avoiding the overhead of the network stack. It ensures reliability through ring-based retransmission and accelerates computation via SIMD-enabled kfuncs. eRAR is hardware-agnostic, network-topology-independent, and resource-efficient. Our experimental results on four popular DNN models demonstrate that, compared to aggregation based on the TCP/IP network stack, eRAR improves gradient aggregation throughput by 77.2%. Furthermore, eRAR reduces communication time by up to 37.4% compared to existing systems.
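For readers unfamiliar with the communication pattern, the following is a minimal user-space sketch of ring-based gradient aggregation, assuming "Ring-AR" refers to the standard ring all-reduce (a reduce-scatter phase followed by an all-gather phase). It is not the authors' eBPF implementation and shows nothing of the in-kernel path, retransmission, or SIMD kfuncs; the function name `ring_allreduce` and the in-memory "workers" are hypothetical and used only for illustration.

```python
def ring_allreduce(worker_grads):
    """Simulate a sum ring all-reduce over equal-length gradient lists.

    worker_grads[r] is the local gradient of worker r. The gradient
    length must be divisible by the number of workers (an assumption
    made here to keep the chunking simple).
    """
    n = len(worker_grads)
    length = len(worker_grads[0])
    assert all(len(g) == length for g in worker_grads)
    assert length % n == 0
    chunk = length // n

    # Each worker splits its own gradient into n chunks.
    bufs = [[list(g[i * chunk:(i + 1) * chunk]) for i in range(n)]
            for g in worker_grads]

    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) % n
    # to its ring successor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [list(bufs[r][c]) for r, c in sends]  # snapshot before updates
        for (r, c), data in zip(sends, payloads):
            dst = (r + 1) % n
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], data)]

    # Now worker r holds the fully summed chunk (r + 1) % n.
    # Phase 2: all-gather. In step s, worker r forwards chunk (r + 1 - s) % n
    # and the successor overwrites its copy with the completed values.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [list(bufs[r][c]) for r, c in sends]
        for (r, c), data in zip(sends, payloads):
            bufs[(r + 1) % n][c] = data

    # Flatten the chunks back into one gradient vector per worker.
    return [[x for c in b for x in c] for b in bufs]


grads = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
    [100, 200, 300, 400, 500, 600],
]
result = ring_allreduce(grads)
```

In this sketch every worker ends with the same summed gradient after 2(n - 1) steps, and each step moves only one chunk per worker. That per-chunk structure is what makes the pattern a plausible fit for packet-level processing, although how eRAR maps chunks to packets is not stated in the abstract.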
Sat 21 Jun. Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
| 11:00 - 13:00 | Session 4: Code Optimization and Software Architecture. Research Track at Cosmos 3A. Chair(s): Changhai Nie (Nanjing University) | ||
| 11:00, 15m Talk | Take Kernel Stack Overhead Out: eBPF-Enhanced Network Acceleration for Distributed Training within Ethernet. Research Track. Zhenyu Zhang (School of Computer Science and Engineering, Sun Yat-sen University), Pengfei Chen (Sun Yat-sen University), Guangba Yu (School of Computer Science and Engineering, Sun Yat-sen University), Zilong He (Sun Yat-sen University), Xiaoyun Li (Sun Yat-sen University). Pre-print | ||
| 11:15, 15m Talk | Exploiting Booster Pass Chain for Compiler Phase Ordering. Research Track. Yihan Chen, Huanhuan Chen (Nanjing University), Yuan Yao (Nanjing University), Ping Yu (Nanjing University), Feng Xu (Nanjing University), Xiaoxing Ma (Nanjing University). File Attached | ||
| 11:30, 15m Talk | DeFS: A Decentralized and High-Performance File System for Consortium Systems. Research Track. Yitong Cheng (Shanghai Jiao Tong University), Shenglong Zhao (Shanghai Jiao Tong University), Yang Yu (Shanghai Jiao Tong University, China), Zhichao Hua (Shanghai Jiao Tong University) | ||
| 11:45, 15m Talk | Proteus: An Automatical High-Efficiency Framework for Generating Compact and Printable Shellcode on ARMv8. Research Track. Jian Lin (Information Engineering University), Guoan Liu (Information Engineering University), Rui Chang (Zhejiang University), Ruimin Wang (Information Engineering University) | ||
| 12:00, 15m Talk | Modeling Go Concurrency: A Static Analysis Approach to Data Race Detection. Research Track. Fengjuan Gao (Nanjing University of Science and Technology), Mumu Zhang (Nanjing University), Zixiao Zhao (Nanjing University), Yu Wang (Nanjing University), Xuandong Li (Nanjing University) | ||
| 12:15, 15m Talk | RABBIT: Managing Hierarchical Memory with Intelligent Tiering Aware Deduplication. Research Track | ||
| 12:30, 15m Talk | DPCapsule: A Decentralized Private Computing System With Self-Controlled Data. Research Track. Yitong Cheng (Shanghai Jiao Tong University), Yang Yu (Shanghai Jiao Tong University, China), Zhichao Hua (Shanghai Jiao Tong University) | ||
| 12:45, 15m Talk | MicroGuard: Non-Intrusive Dynamic Analysis for Inter-Service Access Control of Microservices. Research Track. Haoming Luo (School of Computer Science and Engineering, Sun Yat-sen University), Wanqi Yang (Sun Yat-sen University), Pengfei Chen (Sun Yat-sen University) | ||
Cosmos 3A is the first room in the Cosmos 3 wing.
When facing the main Cosmos Hall, access to the Cosmos 3 wing is on the left, close to the stairs. The area is accessed through a large door with the number “3”, which will stay open during the event.

