ISMM 2024
Tue 25 Jun 2024 Copenhagen, Denmark
co-located with PLDI 2024
Tue 25 Jun 2024 16:20 - 16:40 at Iceland - ISMM: Session 4 - Potpourri Chair(s): Tony Hosking

Neural network training requires immense GPU memory, and memory optimization methods such as recomputation are being actively researched.

Recent improvements in recomputation have reduced peak allocated memory by more than 90 %. However, recomputation produces complex, irregular allocation patterns under which PyTorch's default caching allocator wastes up to 20 % of memory to severe fragmentation and incurs extra cache-management overhead.

The periodic allocation patterns during training make offline memory optimization possible.

Dynamic storage allocation (DSA) is the problem of minimizing the range of memory addresses needed to serve a given allocation pattern. It can be viewed as a 2D bin-packing problem in which each allocation is a rectangle whose horizontal extent (its lifetime) is fixed, so the rectangle can move only vertically along the address axis.
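To make the 2D formulation concrete, here is a minimal sketch (not the paper's implementation) of a DSA instance and a first-fit placer: each allocation is a rectangle with a fixed lifetime interval, and the placer chooses only its vertical position (the memory offset). The `Alloc` type and the toy instance are hypothetical illustrations.

```python
from dataclasses import dataclass

# Toy model of a DSA instance: each allocation is a rectangle whose
# horizontal extent [start, end) (its lifetime) is fixed; the solver only
# chooses its vertical position, i.e. the memory offset.

@dataclass
class Alloc:
    start: int  # first step at which the block is live
    end: int    # first step after the block is freed
    size: int   # bytes requested

def first_fit_offsets(allocs):
    """Place each allocation at the lowest offset that does not overlap any
    already-placed allocation whose lifetime intersects it."""
    placed = []  # (alloc, chosen offset), in placement order
    for a in allocs:
        # Occupied [offset, offset + size) ranges of lifetime-overlapping blocks.
        busy = sorted((off, off + b.size) for b, off in placed
                      if a.start < b.end and b.start < a.end)
        off = 0
        for lo, hi in busy:
            if off + a.size <= lo:
                break               # the gap below this occupied range fits
            off = max(off, hi)      # otherwise skip past it
        placed.append((a, off))
    return [off for _, off in placed]

def peak(allocs, offsets):
    """Peak address of a placement, i.e. the memory the plan requires."""
    return max(off + a.size for a, off in zip(allocs, offsets))
```

For example, three 4-byte blocks with lifetimes [0,2), [1,3), and [2,4) pack into 8 bytes: the third block's lifetime does not overlap the first's, so first-fit reuses offset 0.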

Although first-fit and best-fit heuristics already perform well for DSA, we propose a nontrivial heuristic based on simulated annealing that optimizes the topological ordering of allocations to reduce fragmentation further.
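The idea of annealing over allocation orderings can be sketched as follows. This is a hedged illustration, not the paper's algorithm: it explores arbitrary permutations (the paper restricts moves to valid topological orderings) and re-evaluates each candidate from scratch with first-fit, whereas the paper evaluates candidates incrementally. All names and the toy instance are assumptions.

```python
import math
import random

def peak_first_fit(allocs, order):
    """Score an ordering: first-fit-place the (start, end, size) allocations
    in that order and return the resulting peak address (lower is better)."""
    placed = []  # ((start, end, size), offset)
    top = 0
    for i in order:
        s, e, sz = allocs[i]
        busy = sorted((off, off + b[2]) for b, off in placed
                      if s < b[1] and b[0] < e)
        off = 0
        for lo, hi in busy:
            if off + sz <= lo:
                break               # gap below this occupied range fits
            off = max(off, hi)      # otherwise skip past it
        placed.append(((s, e, sz), off))
        top = max(top, off + sz)
    return top

def anneal(allocs, iters=2000, t0=10.0, seed=0):
    """Simulated annealing over the placement order: propose a random swap,
    accept improvements always and regressions with Metropolis probability."""
    rng = random.Random(seed)
    order = list(range(len(allocs)))
    cost = peak_first_fit(allocs, order)
    best_cost, best_order = cost, order[:]
    for k in range(iters):
        t = t0 * (1 - k / iters) + 1e-9   # linear cooling schedule
        i, j = rng.randrange(len(order)), rng.randrange(len(order))
        order[i], order[j] = order[j], order[i]
        c = peak_first_fit(allocs, order)
        if c <= cost or rng.random() < math.exp((cost - c) / t):
            cost = c
            if c < best_cost:
                best_cost, best_order = c, order[:]
        else:
            order[i], order[j] = order[j], order[i]  # revert the swap
    return best_cost, best_order
```

Because first-fit is sensitive to the order in which blocks are placed, reordering alone can close gaps that a fixed chronological order leaves behind.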

The proposed algorithm evaluates a candidate allocation plan in O(log N) amortized time, where N is the number of allocations.
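One standard way to get logarithmic-cost plan evaluation (the abstract does not describe the paper's exact data structure, so this is only an assumed illustration) is a segment tree over time steps supporting range-max queries and range-raise updates: stacking each allocation on top of everything live during its lifetime then costs O(log T) per allocation rather than a scan over all placed blocks.

```python
class SegmentTree:
    """Range-max query + range-assign update. Assignment is valid here
    because the height written is never lower than the heights it covers."""
    def __init__(self, n):
        self.n = n
        self.mx = [0] * (4 * n)
        self.lazy = [None] * (4 * n)

    def _push(self, node):
        if self.lazy[node] is not None:
            for c in (2 * node, 2 * node + 1):
                self.mx[c] = self.lazy[node]
                self.lazy[c] = self.lazy[node]
            self.lazy[node] = None

    def _assign(self, node, lo, hi, l, r, v):
        if r <= lo or hi <= l:
            return
        if l <= lo and hi <= r:
            self.mx[node], self.lazy[node] = v, v
            return
        self._push(node)
        mid = (lo + hi) // 2
        self._assign(2 * node, lo, mid, l, r, v)
        self._assign(2 * node + 1, mid, hi, l, r, v)
        self.mx[node] = max(self.mx[2 * node], self.mx[2 * node + 1])

    def _query(self, node, lo, hi, l, r):
        if r <= lo or hi <= l:
            return 0
        if l <= lo and hi <= r:
            return self.mx[node]
        self._push(node)
        mid = (lo + hi) // 2
        return max(self._query(2 * node, lo, mid, l, r),
                   self._query(2 * node + 1, mid, hi, l, r))

    def query(self, l, r):
        return self._query(1, 0, self.n, l, r)

    def assign(self, l, r, v):
        self._assign(1, 0, self.n, l, r, v)

def stack_cost(allocs, order, horizon):
    """Place (start, end, size) allocations in `order`, each on top of the
    current skyline over its lifetime, and return the peak address."""
    st = SegmentTree(horizon)
    peak = 0
    for i in order:
        s, e, sz = allocs[i]
        off = st.query(s, e)        # highest live block within [s, e)
        st.assign(s, e, off + sz)   # raise the skyline over the lifetime
        peak = max(peak, off + sz)
    return peak
```

Note that this "stack on the skyline" evaluator never reuses gaps beneath the skyline, so it is only a sketch of how logarithmic per-allocation cost can be achieved, not of the paper's fragmentation results.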

We empirically tested our algorithm on both randomly generated data and allocation patterns obtained by training popular vision and text models with recomputation.

The experiments showed that, on average, our algorithm reduced fragmentation from the 29.5 % incurred by the PyTorch caching allocator to 0.4 %, compared with the 5.3 % achieved by the first-fit method.

Tue 25 Jun

Displayed time zone: Windhoek

16:00 - 17:00
ISMM: Session 4 - Potpourri (ISMM 2024) at Iceland
Chair(s): Tony Hosking Australian National University
16:00
20m
Talk
SSRD: Shapes and Summaries for Race Detection in Concurrent Data Structures (Remote)
ISMM 2024
Xiaofan Sun University of California at Riverside, Rajiv Gupta University of California at Riverside
16:20
20m
Talk
A Heuristic for Periodic Memory Allocation with Little Fragmentation to Train Neural Networks
ISMM 2024
Akifumi Imanishi Preferred Networks, Zijian Xu Preferred Networks
16:40
20m
Talk
ESPN: Memory-Efficient Multi-vector Information Retrieval
ISMM 2024
Susav Shrestha Texas A&M University, Narasimha Reddy Texas A&M University, Zongwang Li Samsung