Retargeting and Respecializing GPU Workloads for Performance Portability
In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that accounts for the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower cost has led to a significant diversification of architecture designs, even from the same vendor. This creates a need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when such a program can be executed seamlessly on a different architecture, it may suffer a performance penalty because it is not sized appropriately for the available hardware resources, such as fast memory and registers, let alone able to exploit newer advanced features of the architecture.
We propose a new approach to improving the performance of (legacy) CUDA programs on modern machines by automatically adjusting the amount of work each parallel thread does, as well as the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs, performing automatic translation from CUDA while simultaneously adjusting the program granularity to fit the size of the target GPU.
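To make the granularity adjustment concrete, below is a minimal hand-written CUDA sketch of thread coarsening, the kind of work-per-thread resizing the approach applies automatically at the MLIR level. The kernel names, the vector-add workload, and the kCoarsen factor are hypothetical illustrations, not taken from the paper's implementation.

```cuda
#include <cuda_runtime.h>

// Baseline: one output element per thread.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

// Coarsened variant: each thread produces COARSEN elements, spaced a full
// grid apart so that neighboring threads still touch neighboring addresses
// (coalesced). The same total work runs with COARSEN times fewer threads,
// trading parallelism for per-thread register and memory reuse; this is
// precisely the resource sizing that must be retuned per GPU generation.
template <int COARSEN>
__global__ void vecAddCoarsened(const float *a, const float *b, float *c,
                                int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
#pragma unroll
  for (int k = 0; k < COARSEN; ++k) {
    int i = tid + k * stride;
    if (i < n) c[i] = a[i] + b[i];
  }
}

// Launch: shrink the grid by the coarsening factor so coverage is unchanged.
void launch(const float *a, const float *b, float *c, int n) {
  constexpr int kBlock = 256;
  constexpr int kCoarsen = 4;  // hypothetical factor; tuned per target GPU
  int grid = (n + kBlock * kCoarsen - 1) / (kBlock * kCoarsen);
  vecAddCoarsened<kCoarsen><<<grid, kBlock>>>(a, b, c, n);
}
```

The best coarsening factor depends on the target's register file, occupancy limits, and memory bandwidth, which is why it is a natural knob for per-architecture autotuning rather than a constant baked into the source.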
Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 16% geomean speedup on the Rodinia benchmark suite over the baseline CUDA implementation, as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
Mon 4 Mar (displayed time zone: London)
14:20 - 15:40 | Compilers for GPUs (Main Conference) at Tinto
Chair(s): Roland Leißa (University of Mannheim, School of Business Informatics and Mathematics)

14:20 (20m) Talk | A Framework for Fine-Grained Synchronization of Dependent GPU Kernels (Main Conference)
Abhinav Jangda (Microsoft Research), Saeed Maleki (Microsoft Research), Maryam Mehri Dehnavi (University of Toronto), Madan Musuvathi (Microsoft Research), Olli Saarikivi (Microsoft Research)
Pre-print

14:40 (20m) Talk | Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs (Main Conference)
Alnis Murtovi (TU Dortmund), Giorgis Georgakoudis (Lawrence Livermore National Laboratory), Konstantinos Parasyris (Lawrence Livermore National Laboratory), Chunhua Liao (Lawrence Livermore National Laboratory), Ignacio Laguna (Lawrence Livermore National Laboratory), Bernhard Steffen (TU Dortmund)

15:00 (20m) Talk | Retargeting and Respecializing GPU Workloads for Performance Portability (Main Conference)
Ivan Radanov Ivanov (Tokyo Institute of Technology; RIKEN R-CCS), Oleksandr Zinenko (Google DeepMind), Jens Domke (RIKEN R-CCS), Toshio Endo (Tokyo Institute of Technology), William S. Moses (University of Illinois at Urbana-Champaign; Google DeepMind)

15:20 (20m) Talk | Seer: Predictive Runtime Kernel Selection for Irregular Problems (Main Conference)
Pre-print