TierTrain: Proactive Memory Tiering for CPU-Based DNN Training
Deep neural networks (DNNs) are among the most popular models for learning relationships in complex data. Training a DNN model is a compute- and memory-intensive operation. Modern DNN models span into the terabyte range, requiring multiple accelerators to train and driving up the training cost. Such enormous memory requirements shift the bottleneck from computation to memory.
CPU memory, on the other hand, can be scaled to several terabytes with emerging memory technologies such as HBM and CXL-attached memories. Furthermore, recent CPU advancements, such as dedicated instructions for DNN training and inference, are bridging the compute gap between CPUs and accelerators.
We present an exploratory work toward cost-effective DNN training on CPUs, aiming to alleviate the memory management challenges of DNN training. We propose TierTrain, a novel memory tiering solution based on a dynamic queuing system that leverages the periodic and deterministic memory access behavior of DNN training to manage data placement across memory tiers. TierTrain proactively manages tensors by aggressively offloading them to slow memory tiers (NVMM, CXL) and prefetching them back to fast memory tiers (HBM, DRAM) in a timely manner. Our evaluation of TierTrain on a tiered memory system, with real CXL-attached memory used for memory expansion and NVMM as a low-cost memory, shows an average fast-memory footprint reduction of 59–83% and a peak fast-memory footprint reduction of 25–74%, with a performance overhead of 1–16%. In a memory-constrained scenario, TierTrain outperforms state-of-the-art tiering, improving performance by 35–84% for a set of popular DNN training models.
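The proactive offload/prefetch discipline the abstract describes can be illustrated with a short sketch. The Python below is a minimal illustration under assumptions, not the authors' implementation: the names (TensorMeta, move, train_step) are hypothetical, and the policy shown, eagerly offloading activations after the forward pass and prefetching them one layer ahead during the backward pass, is one plausible instantiation of the periodic, deterministic reuse pattern that a queue-based scheme like TierTrain could exploit.

```python
# Hypothetical sketch of queue-based proactive tiering for DNN training.
# In training, an activation produced in the forward pass is next touched
# in the backward pass, in reverse layer order, so its reuse point is
# known in advance and placement can be scheduled proactively.

from collections import deque

class TensorMeta:
    def __init__(self, name, size):
        self.name = name    # tensor identifier
        self.size = size    # size in bytes
        self.tier = "fast"  # "fast" (HBM/DRAM) or "slow" (CXL/NVMM)

def move(tensor, tier):
    # Stand-in for a real migration primitive (e.g., page migration or copy).
    tensor.tier = tier
    print(f"{tensor.name}: -> {tier} memory")

def train_step(layers):
    # Forward pass: once a layer's activation is produced, it is idle until
    # the matching backward step, so offload it to slow memory aggressively.
    offload_queue = deque()
    for layer in layers:
        act = TensorMeta(f"{layer}.act", size=1 << 20)
        offload_queue.append(act)
        move(act, "slow")              # aggressive offload

    # Backward pass: layers are revisited in reverse order, so prefetch the
    # next-needed activation back to fast memory one step ahead of its use.
    acts = list(offload_queue)
    move(acts[-1], "fast")             # bring in the first activation needed
    for i in reversed(range(len(acts))):
        if i > 0:
            move(acts[i - 1], "fast")  # timely prefetch for the next step
        assert acts[i].tier == "fast"  # prefetch landed before the compute
        # ... compute gradients using acts[i] ...

train_step(["conv1", "conv2", "fc"])
```

Because the queue order mirrors the layer order, the same structure also tells the scheduler when a tensor is safe to evict, which is what lets a policy like this cut the fast-memory footprint without stalling the backward pass.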
Tue 17 Jun (displayed time zone: Seoul)

15:40 - 17:05 | Session 4 [Systems and Architecture], ISMM 2025, at Lilac
Chair(s): Steve Blackburn (Google and Australian National University)

15:40 (20m, Talk) | Fully Randomized Pointers
Sai Dhawal Phaye, Gregory J. Duck, Roland H. C. Yap, and Trevor E. Carlson (National University of Singapore)

16:00 (20m, Talk) | TierTrain: Proactive Memory Tiering for CPU-Based DNN Training
Sathvik Swaminathan, Sandeep Kumar, Aravinda Prasad, and Sreenivas Subramoney (Intel Labs)

16:20 (20m, Talk) | EMD: Fair and Efficient Dynamic Memory De-bloating of Transparent Huge Pages (Recorded)
Parth Gangar (Fujitsu Research of India), Ashish Panwar (Microsoft Research India), and K. Gopinath (Rishihood University)

16:40 (20m, Talk) | Compiler-Assisted Crash Consistency for PMEM (Recorded)
Yun Joon Soh (University of California San Diego), Sihang Liu (University of Waterloo), Steven Swanson (University of California San Diego), and Jishen Zhao (University of California San Diego)

17:00 (5m, Day closing) | Closing remarks