Accelerated computing has increased the need to specialize how a program is parallelized for each target. Fully exploiting a highly parallel accelerator, such as a GPU, demands more parallelism, and sometimes more levels of parallelism, than a multicore CPU. OpenMP provides a directive for each level of parallelism, but choosing directives for each target can incur a significant productivity cost. We argue that the new OpenMP loop directive, combined with an appropriate compiler decision process, can achieve the same performance benefits as target-specific parallelization while offering the productivity advantage of a single directive for all targets. In this paper, we introduce a fully descriptive model and demonstrate its benefits with an implementation of the loop directive, comparing performance, productivity, and portability against other production compilers using the SPEC ACCEL benchmark suite. We provide an implementation of our proposal in NVIDIA's HPC compiler. On GPUs it yields speedups of up to 56x, with an average of 1.91x-1.79x over the baseline performance depending on the host system, while preserving CPU performance. In addition, our proposal requires 60% fewer parallelism directives.
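As a minimal sketch of the contrast the abstract describes (not code from the paper or from NVIDIA's compiler), the prescriptive style spells out every level of parallelism, while the descriptive loop directive leaves the mapping to the compiler. The SAXPY kernel, function names, and map clauses below are illustrative assumptions.

#include <stdio.h>

#define N 1000000

/* Prescriptive style: the programmer names every level of parallelism
   (teams, distribute, parallel for), which may need to be retuned for
   each target architecture. */
void saxpy_prescriptive(int n, float a, const float *x, float *y) {
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Descriptive style: a single loop directive asserts that the iterations
   may run concurrently and leaves the choice of teams, threads, and SIMD
   lanes to the compiler for each target. */
void saxpy_descriptive(int n, float a, const float *x, float *y) {
    #pragma omp target teams loop \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    static float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy_descriptive(N, 2.0f, x, y);
    printf("y[0] = %f\n", y[0]);  /* expected: 4.0 */
    return 0;
}

With an OpenMP offload compiler such as those in the NVIDIA HPC SDK (e.g., nvc -mp=gpu), the descriptive form lets the compiler pick a GPU-appropriate schedule, while the same source can still be parallelized sensibly for a multicore CPU.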
Wed 6 Apr (displayed time zone: Eastern Time, US & Canada)

11:20 - 11:50 | Session 4: Parallelism | CC Research Papers at CC Virtual Room
Chair(s): Bernhard Egger, Seoul National University

11:20 (15m) Paper | Memory Access Scheduling to Reduce Thread Migrations | CC Research Papers
Sana Damani (Georgia Institute of Technology), Prithayan Barua (Georgia Institute of Technology, USA), Vivek Sarkar (Georgia Institute of Technology)

11:35 (15m) Paper | Performant Portable OpenMP | CC Research Papers