Neural processing units (NPUs) have become indispensable components of mobile SoCs, and integrating multiple NPU cores into a single chip is a promising way to meet the ever-increasing demand for computing power in mobile devices. This paper presents techniques to maximize the utilization of NPU cores and reduce the latency of on-device inference. Mobile NPUs typically have a small amount of local memory (or scratchpad memory, SPM) that provides only enough space for the input/output tensors and weights of a single layer operation of a deep neural network (DNN). Even in multicore NPUs, such local memories are distributed across the cores. In such systems, executing layer operations in parallel is the primary vehicle for achieving performance. By partitioning a DNN layer into multiple sub-layers, we can execute them in parallel on multicore NPUs, and within a core we can additionally employ pipelined execution to reduce the execution time of each sub-layer. In this execution model, synchronizing the parallel execution and loading/storing intermediate tensors in global memory are the main bottlenecks. To alleviate these problems, we propose novel optimization techniques that carefully consider the partitioning direction, execution order, synchronization, and global memory accesses. Using six popular convolutional neural networks (CNNs), we evaluate our optimization techniques on a flagship mobile SoC with three NPU cores. Compared to the highest-performing partitioning approach, our techniques improve performance by 23%, achieving a speedup of 2.1x over single-core systems.
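To make the execution model concrete, here is a minimal, hypothetical sketch (in Python/NumPy, not the authors' toolchain) of height-wise layer partitioning: a single convolution layer is split into per-core sub-layers, each core reads only its output rows plus a small input halo, and the partial results are concatenated as if written back to global memory. All names (`NUM_CORES`, `conv2d_valid`, `run_partitioned`) are illustrative assumptions, and the pipelining, synchronization, and ordering optimizations described in the paper are not modeled here.

```python
# Hypothetical sketch of height-wise layer partitioning for a multicore NPU.
# Names and structure are illustrative; they do not reflect the paper's code.
import numpy as np

NUM_CORES = 3  # assumption: three NPU cores, as in the evaluated SoC


def conv2d_valid(x, w):
    """Naive single-channel 'valid' convolution (stride 1), standing in for
    the per-core sub-layer computation in this sketch."""
    H, W = x.shape
    KH, KW = w.shape
    out = np.zeros((H - KH + 1, W - KW + 1), dtype=x.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + KH, j:j + KW] * w)
    return out


def partition_rows(out_h, num_cores):
    """Split output rows as evenly as possible across cores
    (the 'partitioning direction' here is the height axis)."""
    base, rem = divmod(out_h, num_cores)
    start = 0
    for c in range(num_cores):
        rows = base + (1 if c < rem else 0)
        yield start, rows
        start += rows


def run_partitioned(x, w, num_cores=NUM_CORES):
    """Execute one layer as per-core sub-layers.  Each core fetches only the
    input rows it needs (its output rows plus a halo of KH-1 rows), mimicking
    a small scratchpad memory that cannot hold the whole layer."""
    KH = w.shape[0]
    out_h = x.shape[0] - KH + 1
    parts = []
    for start, rows in partition_rows(out_h, num_cores):
        tile = x[start:start + rows + KH - 1, :]   # input tile incl. halo
        parts.append(conv2d_valid(tile, w))        # sub-layer on one core
    return np.concatenate(parts, axis=0)           # "store to global memory"


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 32)).astype(np.float32)
    w = rng.standard_normal((3, 3)).astype(np.float32)
    assert np.allclose(run_partitioned(x, w), conv2d_valid(x, w), atol=1e-5)
    print("partitioned result matches single-core reference")
```

In this toy setup, the chosen partitioning direction (height) fixes how much redundant halo input each core must fetch from global memory, which hints at the kind of trade-off the paper's optimizations weigh against synchronization cost.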
Wed 1 Mar | Displayed time zone: Eastern Time (US & Canada)
10:00 - 12:00 | Session 7 -- Neural Network Accelerators | Main Conference at Montreal 1-2-3 | Chair(s): Lukas Sommer (Codeplay Software)
10:00 | 26m | Talk | Flexer: Out-of-Order Scheduling for Multi-NPUs | Main Conference | Hyemi Min (Seoul National University), Jungyoon Kwon (Seoul National University), Bernhard Egger (Seoul National University) | DOI
10:26 | 26m | Talk | Pin or Fuse? Exploiting Scratchpad Memory to Reduce Off-Chip Data Transfer in DNN Accelerators | Main Conference | Hyuk-Jin Jeong (Samsung Research), JiHwan Yeo (Samsung Research), Cheongyo Bahk (Samsung Research), JongHyun Park (Samsung Research) | DOI
10:52 | 26m | Talk | Accelerating Deep Neural Networks on Mobile Multicore NPUs | Main Conference | Hanwoong Jung (Samsung Advanced Institute of Technology), Hexiang Ji (Samsung Research), Alexey Pushchin (Samsung Research), Maxim Ostapenko (Samsung Advanced Institute of Technology), Wenlong Niu (Samsung Research), Ilya Palachev (Samsung Research), Yutian Qu (Samsung Research), Pavel Fedin (Samsung Research), Yuri Gribov (Samsung Research), Heewoo Nam (Samsung Advanced Institute of Technology), Dongguen Lim (Samsung Advanced Institute of Technology), Hyunjun Kim (Samsung Advanced Institute of Technology), Joonho Song (Samsung Advanced Institute of Technology), Seungwon Lee (Samsung Advanced Institute of Technology), Hwansoo Han (Sungkyunkwan University) | DOI
11:18 | 26m | Talk | PIMFlow: Compiler and Runtime Support for CNN Models on Processing-in-Memory DRAM | Main Conference | DOI