Pin or Fuse? Exploiting Scratchpad Memory to Reduce Off-Chip Data Transfer in DNN Accelerators
Growing interest in on-device AI has led to the proliferation of accelerators dedicated to neural network inference. Most ASIC accelerators are equipped with compiler-controlled scratchpad memory (SPM), used as a last-level cache to reduce the number of accesses to off-chip memory. A widely used strategy for utilizing SPM is fused-layer execution, which divides a DNN model into groups of layers and forwards the intermediate results within each group without evicting them to off-chip memory. However, layer fusion has an inherent limitation: fusing consecutive layers increases the amount of computation, since the overlapping halo regions of neighboring tiles must be recomputed at every fused layer, leading to sub-optimal performance.
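To make that overhead concrete, here is a minimal sketch (not from the paper; the layer shapes, tile size, and stride-1/interior-tile assumptions are all hypothetical) of how tiled, fused execution of a convolution chain inflates the MAC count relative to layer-by-layer execution:

```python
# Minimal sketch (hypothetical shapes): MAC counts for tiled fused execution
# of a chain of stride-1 convolutions vs. plain layer-by-layer execution.
# Interior tiles are assumed, so border tiles are slightly over-approximated.

def fused_tile_macs(layers, out_tile):
    """MACs to produce one out_tile x out_tile tile of the group's last layer.

    layers: list of (kernel, c_in, c_out). Walking backward through the
    group, each layer must produce a tile enlarged by (kernel - 1) so its
    consumer has the halo it needs; those halo pixels are recomputed again
    by neighboring tiles.
    """
    tile, macs = out_tile, 0
    for kernel, c_in, c_out in reversed(layers):
        macs += tile * tile * kernel * kernel * c_in * c_out
        tile += kernel - 1  # input tile this layer needs from its producer
    return macs

def unfused_macs(layers, h, w):
    # Layer-by-layer: every output pixel of every layer is computed once.
    return sum(h * w * k * k * c_in * c_out for k, c_in, c_out in layers)

layers = [(3, 32, 32), (3, 32, 32)]   # two 3x3 convs, 32 channels (hypothetical)
h = w = 56                            # feature-map size (hypothetical)
tile = 8                              # output tile side length (hypothetical)

fused = (h // tile) * (w // tile) * fused_tile_macs(layers, tile)
print(f"fused/unfused MACs: {fused / unfused_macs(layers, h, w):.2f}x")
# ~1.28x here: fusion trades extra computation for avoided DRAM traffic.
```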
This paper introduces a new dimension to SPM usage: pinning, which temporarily keeps a feature map in SPM. Pinning reduces off-chip transfer without increasing computation, but it is not applicable to all feature maps due to the limited SPM size. We find that superior performance can be achieved by a combination of pinning and fusion in MobileNet. Based on this observation, we propose a model-level optimization method that jointly applies pinning and fusion to minimize inference latency under memory constraints. Scheduling and allocation schemes are presented for automatic generation of optimized code. Evaluation on a commercial AI accelerator shows that the proposed method reduces off-chip transfer of feature maps by 50% and improves inference latency by 15% on average, without additional hardware, compared to the state-of-the-art fusion approach.
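The paper's actual scheduling and allocation schemes are not reproduced here; as a minimal sketch of the pin-or-fuse trade-off the abstract describes, the greedy cost model below (all names, sizes, and cycle costs are made up for illustration, and this is not the paper's algorithm) chooses per feature map whether to pin it in SPM, fuse its producer and consumer, or spill it to DRAM under an SPM budget:

```python
# Minimal sketch of a pin-or-fuse decision under an SPM budget. This is a
# hypothetical greedy cost model, not the paper's joint optimization; all
# sizes and cycle costs below are invented for illustration.

from dataclasses import dataclass

@dataclass
class FeatureMap:
    name: str
    size_bytes: int       # SPM space consumed if the whole map is pinned
    dram_cost: int        # cycles for the DRAM round trip if spilled
    fuse_overhead: int    # extra compute cycles if producer/consumer fuse

def pin_or_fuse(fmaps, spm_budget):
    """Greedy: handle the most DRAM-expensive feature maps first."""
    plan = {}
    for fm in sorted(fmaps, key=lambda f: f.dram_cost, reverse=True):
        if fm.size_bytes <= spm_budget:
            plan[fm.name] = "PIN"     # saves dram_cost with no extra compute
            spm_budget -= fm.size_bytes
        elif fm.fuse_overhead < fm.dram_cost:
            plan[fm.name] = "FUSE"    # avoid the spill at some recompute cost
        else:
            plan[fm.name] = "SPILL"   # the DRAM round trip is cheapest here
    return plan

fmaps = [
    FeatureMap("conv1_out", size_bytes=200_000, dram_cost=90_000, fuse_overhead=30_000),
    FeatureMap("conv2_out", size_bytes=800_000, dram_cost=70_000, fuse_overhead=50_000),
    FeatureMap("conv3_out", size_bytes=120_000, dram_cost=40_000, fuse_overhead=60_000),
]
print(pin_or_fuse(fmaps, spm_budget=512 * 1024))
# {'conv1_out': 'PIN', 'conv2_out': 'FUSE', 'conv3_out': 'PIN'}
```

A real scheduler would additionally account for feature-map lifetimes and SPM allocation over the whole model, which is where the paper's joint pinning-and-fusion optimization goes beyond a per-map greedy choice.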
Session 7: Neural Network Accelerators (Main Conference, Wed 1 Mar, 10:00-12:00, Montreal 1-2-3). Chair: Lukas Sommer (Codeplay Software)

10:00 (26m) Talk: Flexer: Out-of-Order Scheduling for Multi-NPUs. Hyemi Min, Jungyoon Kwon, Bernhard Egger (Seoul National University)

10:26 (26m) Talk: Pin or Fuse? Exploiting Scratchpad Memory to Reduce Off-Chip Data Transfer in DNN Accelerators. Hyuk-Jin Jeong, JiHwan Yeo, Cheongyo Bahk, JongHyun Park (Samsung Research)

10:52 (26m) Talk: Accelerating Deep Neural Networks on Mobile Multicore NPUs. Hanwoong Jung (Samsung Advanced Institute of Technology), Hexiang Ji (Samsung Research), Alexey Pushchin (Samsung Research), Maxim Ostapenko (Samsung Advanced Institute of Technology), Wenlong Niu (Samsung Research), Ilya Palachev (Samsung Research), Yutian Qu (Samsung Research), Pavel Fedin (Samsung Research), Yuri Gribov (Samsung Research), Heewoo Nam (Samsung Advanced Institute of Technology), Dongguen Lim (Samsung Advanced Institute of Technology), Hyunjun Kim (Samsung Advanced Institute of Technology), Joonho Song (Samsung Advanced Institute of Technology), Seungwon Lee (Samsung Advanced Institute of Technology), Hwansoo Han (Sungkyunkwan University)

11:18 (26m) Talk: PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM. Main Conference