Retargeting and Respecializing GPU Workloads for Performance Portability
In order to come close to peak performance, accelerators like GPUs require significant architecture-specific tuning that accounts for the availability of shared memory, parallelism, tensor cores, etc. Unfortunately, the pursuit of higher performance and lower cost has led to a significant diversification of architecture designs, even from the same vendor. This creates a need for performance portability across different GPUs, which is especially important for programs written in a particular programming model with a certain architecture in mind. Even when such a program can be executed seamlessly on a different architecture, it may suffer a performance penalty because it is not sized appropriately for the available hardware resources, such as fast memory and registers, let alone able to exploit newer advanced features of the architecture.
We propose a new approach to improving the performance of (legacy) CUDA programs on modern machines by automatically adjusting the amount of work each parallel thread does, as well as the amount of memory and register resources it requires. By operating within the MLIR compiler infrastructure, we are also able to target AMD GPUs, performing automatic translation from CUDA while simultaneously adjusting the program granularity to fit the size of the target GPU.
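To make the granularity adjustment concrete, below is a minimal hand-written CUDA sketch of thread coarsening, the kind of work-per-thread resizing the approach applies automatically at the MLIR level. The kernel names, the vector-add workload, and the kCoarsen factor are hypothetical illustrations, not taken from the paper's implementation.

```cuda
#include <cuda_runtime.h>

// Baseline: one output element per thread.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
}

// Coarsened variant: each thread produces COARSEN elements, spaced a full
// grid apart so that neighboring threads still touch neighboring addresses
// (coalesced). The same total work runs with COARSEN times fewer threads,
// trading parallelism for per-thread register and memory reuse; this is
// precisely the resource sizing that must be retuned per GPU generation.
template <int COARSEN>
__global__ void vecAddCoarsened(const float *a, const float *b, float *c,
                                int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = gridDim.x * blockDim.x;
#pragma unroll
  for (int k = 0; k < COARSEN; ++k) {
    int i = tid + k * stride;
    if (i < n) c[i] = a[i] + b[i];
  }
}

// Launch: shrink the grid by the coarsening factor so coverage is unchanged.
void launch(const float *a, const float *b, float *c, int n) {
  constexpr int kBlock = 256;
  constexpr int kCoarsen = 4;  // hypothetical factor; tuned per target GPU
  int grid = (n + kBlock * kCoarsen - 1) / (kBlock * kCoarsen);
  vecAddCoarsened<kCoarsen><<<grid, kBlock>>>(a, b, c, n);
}
```

The best coarsening factor depends on the target's register file, occupancy limits, and memory bandwidth, which is why it is a natural knob for per-architecture autotuning rather than a constant baked into the source.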
Combined with autotuning assisted by the platform-specific compiler, our approach demonstrates a 16% geomean speedup on the Rodinia benchmark suite over the baseline CUDA implementation, as well as performance parity between similar NVIDIA and AMD GPUs executing the same CUDA program.
Mon 4 Mar (displayed time zone: London)
14:20 - 15:40 | Compilers for GPUs (Main Conference) at Tinto
Chair(s): Roland Leißa (University of Mannheim, School of Business Informatics and Mathematics)

14:20 (20m) Talk | A Framework for Fine-Grained Synchronization of Dependent GPU Kernels (Main Conference)
Abhinav Jangda (Microsoft Research), Saeed Maleki (Microsoft Research), Maryam Mehri Dehnavi (University of Toronto), Madan Musuvathi (Microsoft Research), Olli Saarikivi (Microsoft Research)
Pre-print

14:40 (20m) Talk | Enhancing Performance through Control-Flow Unmerging and Loop Unrolling on GPUs (Main Conference)
Alnis Murtovi (TU Dortmund), Giorgis Georgakoudis (Lawrence Livermore National Laboratory), Konstantinos Parasyris (Lawrence Livermore National Laboratory), Chunhua Liao (Lawrence Livermore National Laboratory), Ignacio Laguna (Lawrence Livermore National Laboratory), Bernhard Steffen (TU Dortmund)

15:00 (20m) Talk | Retargeting and Respecializing GPU Workloads for Performance Portability (Main Conference)
Ivan Radanov Ivanov (Tokyo Institute of Technology; RIKEN R-CCS), Oleksandr Zinenko (Google DeepMind), Jens Domke (RIKEN R-CCS), Toshio Endo (Tokyo Institute of Technology), William S. Moses (University of Illinois at Urbana-Champaign; Google DeepMind)

15:20 (20m) Talk | Seer: Predictive Runtime Kernel Selection for Irregular Problems (Main Conference)
Pre-print