Neural processing units (NPUs) have become indispensable components of mobile SoCs, and integrating multiple NPU cores into a single chip is a promising way to meet the ever-increasing demand for computing power in mobile devices. This paper presents techniques to maximize the utilization of NPU cores and reduce the latency of on-device inference. Mobile NPUs typically have a small amount of local memory (or scratchpad memory, SPM) that provides only enough space for the input/output tensors and weights of a single layer operation of a deep neural network (DNN). Even in multicore NPUs, such local memories are distributed across the cores. In such systems, executing layer operations in parallel is the primary vehicle for achieving performance. By partitioning a DNN layer into multiple sub-layers, we can execute them in parallel on multicore NPUs, and within a core we can additionally employ pipelined execution to reduce the execution time of each sub-layer. In this execution model, synchronizing the parallel execution and loading/storing intermediate tensors in global memory are the main bottlenecks. To alleviate these problems, we propose novel optimization techniques that carefully consider the partitioning direction, execution order, synchronization, and global memory accesses. Using six popular convolutional neural networks (CNNs), we evaluate our optimization techniques on a flagship mobile SoC with three NPU cores. Compared to the highest-performing partitioning approach, our techniques improve performance by 23%, achieving a speedup of 2.1x over single-core systems.
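To make the execution model concrete, here is a minimal, hypothetical sketch (in Python/NumPy, not the authors' toolchain) of height-wise layer partitioning: a single convolution layer is split into per-core sub-layers, each core reads only its output rows plus a small input halo, and the partial results are concatenated as if written back to global memory. All names (`NUM_CORES`, `conv2d_valid`, `run_partitioned`) are illustrative assumptions, and the pipelining, synchronization, and ordering optimizations described in the paper are not modeled here.

```python
# Hypothetical sketch of height-wise layer partitioning for a multicore NPU.
# Names and structure are illustrative; they do not reflect the paper's code.
import numpy as np

NUM_CORES = 3  # assumption: three NPU cores, as in the evaluated SoC


def conv2d_valid(x, w):
    """Naive single-channel 'valid' convolution (stride 1), standing in for
    the per-core sub-layer computation in this sketch."""
    H, W = x.shape
    KH, KW = w.shape
    out = np.zeros((H - KH + 1, W - KW + 1), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + KH, j:j + KW] * w)
    return out


def partition_rows(out_h, num_cores):
    """Split output rows as evenly as possible across cores
    (the 'partitioning direction' here is the height axis)."""
    base, rem = divmod(out_h, num_cores)
    start = 0
    for c in range(num_cores):
        rows = base + (1 if c < rem else 0)
        yield start, rows
        start += rows


def run_partitioned(x, w, num_cores=NUM_CORES):
    """Execute one layer as per-core sub-layers.  Each core fetches only the
    input rows it needs (its output rows plus a halo of KH-1 rows), mimicking
    a small scratchpad memory that cannot hold the whole layer."""
    KH = w.shape[0]
    out_h = x.shape[0] - KH + 1
    parts = []
    for start, rows in partition_rows(out_h, num_cores):
        tile = x[start:start + rows + KH - 1, :]   # input tile incl. halo
        parts.append(conv2d_valid(tile, w))        # sub-layer on one core
    return np.concatenate(parts, axis=0)           # "store to global memory"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 32)).astype(np.float32)
    w = rng.standard_normal((3, 3)).astype(np.float32)
    assert np.allclose(run_partitioned(x, w), conv2d_valid(x, w), atol=1e-5)
    print("partitioned result matches single-core reference")
```

In this toy setup, the chosen partitioning direction (height) fixes how much redundant halo input each core must fetch from global memory, which hints at the kind of trade-off the paper's optimizations weigh against synchronization cost.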
Wed 1 Mar | Displayed time zone: Eastern Time (US & Canada)
10:00 - 12:00 | Session 7 -- Neural Network Accelerators | Main Conference at Montreal 1-2-3 | Chair(s): Lukas Sommer (Codeplay Software)
10:00 | 26m | Talk | Flexer: Out-of-Order Scheduling for Multi-NPUs | Main Conference | Hyemi Min (Seoul National University), Jungyoon Kwon (Seoul National University), Bernhard Egger (Seoul National University) | DOI
10:26 | 26m | Talk | Pin or Fuse? Exploiting Scratchpad Memory to Reduce Off-Chip Data Transfer in DNN Accelerators | Main Conference | Hyuk-Jin Jeong (Samsung Research), JiHwan Yeo (Samsung Research), Cheongyo Bahk (Samsung Research), JongHyun Park (Samsung Research) | DOI
10:52 | 26m | Talk | Accelerating Deep Neural Networks on Mobile Multicore NPUs | Main Conference | Hanwoong Jung (Samsung Advanced Institute of Technology), Hexiang Ji (Samsung Research), Alexey Pushchin (Samsung Research), Maxim Ostapenko (Samsung Advanced Institute of Technology), Wenlong Niu (Samsung Research), Ilya Palachev (Samsung Research), Yutian Qu (Samsung Research), Pavel Fedin (Samsung Research), Yuri Gribov (Samsung Research), Heewoo Nam (Samsung Advanced Institute of Technology), Dongguen Lim (Samsung Advanced Institute of Technology), Hyunjun Kim (Samsung Advanced Institute of Technology), Joonho Song (Samsung Advanced Institute of Technology), Seungwon Lee (Samsung Advanced Institute of Technology), Hwansoo Han (Sungkyunkwan University) | DOI
11:18 | 26m | Talk | PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM | Main Conference | DOI