The Architecture Independent Workload Characterization
OpenCL is an attractive programming model for high-performance computing systems, with wide support from hardware vendors and significant performance portability. Efficient scheduling on HPC systems requires accurate performance predictions for OpenCL workloads on varied compute devices. This is challenging because diverse computation, communication and memory access characteristics cause performance to vary between devices. In this talk I will present a comprehensive benchmark suite for OpenCL in the heterogeneous HPC setting: an extended and enhanced version of the Open-Dwarfs OpenCL benchmark suite. Our extensions improve portability and robustness of applications, correctness of results and choice of problem size, and increase diversity through coverage of additional application patterns.
The Architecture Independent Workload Characterization (AIWC) tool characterizes OpenCL kernels according to a set of architecture-independent features. This talk will also discuss the design decisions made in collecting AIWC features. AIWC is a useful tool for benchmark developers since it:
- provides insight into whether an application merits inclusion in a benchmark suite, via diversity analysis of the feature space.
- measures the requirements of any application kernel in terms of FLOPs, memory movement and integer ops, which allows the automatic calculation of theoretical peak performance for a given device.
- can be used to examine the phase-transitional properties of application codes, for instance whether the instruction mix changes over time in terms of the balance between floating-point and memory operations.
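The second point can be illustrated with a minimal sketch. It assumes a simple roofline-style bound, which is not necessarily how AIWC or the authors compute the figure: each operation count is divided by the matching device peak rate, and the slowest resource sets a lower bound on run time. The class names, function names and all numbers below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelCounts:
    """Hypothetical container for per-kernel operation counts."""
    flops: float        # floating-point operations executed
    int_ops: float      # integer operations executed
    bytes_moved: float  # bytes read from or written to memory


@dataclass
class DevicePeaks:
    """Hypothetical vendor-quoted peak rates for one device."""
    flops_per_s: float
    int_ops_per_s: float
    bytes_per_s: float


def resource_times(kernel: KernelCounts, device: DevicePeaks) -> dict:
    """Time each resource would need if it ran at its peak rate."""
    return {
        "flops": kernel.flops / device.flops_per_s,
        "int_ops": kernel.int_ops / device.int_ops_per_s,
        "memory": kernel.bytes_moved / device.bytes_per_s,
    }


def min_runtime_seconds(kernel: KernelCounts, device: DevicePeaks) -> float:
    """Lower bound on run time, assuming the three resources overlap perfectly."""
    return max(resource_times(kernel, device).values())


def limiting_resource(kernel: KernelCounts, device: DevicePeaks) -> str:
    """Name of the resource that determines the bound."""
    times = resource_times(kernel, device)
    return max(times, key=times.get)


# Made-up numbers for illustration only.
example_kernel = KernelCounts(flops=2e9, int_ops=5e8, bytes_moved=8e9)
example_device = DevicePeaks(flops_per_s=1e12, int_ops_per_s=5e11, bytes_per_s=2e11)

bound = min_runtime_seconds(example_kernel, example_device)
bottleneck = limiting_resource(example_kernel, example_device)
print(f"lower bound: {bound:.3f} s, limited by {bottleneck}")
```

Because the counts are properties of the kernel alone, the same counts can be reused with a different set of device peaks to obtain a bound for another device.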
This work culminates in a methodology where AIWC features are used to form a model capable of predicting accelerator execution times. We used this methodology to predict execution times for a set of 37 computational kernels running on 15 different devices representing a broad range of CPU, GPU and MIC architectures. The predictions are highly accurate, differing from the measured experimental run-times by an average of only 1.2%, which corresponds to absolute execution time mispredictions of 9 μs to 1 s depending on problem size. A previously unencountered code can be instrumented once and the AIWC metrics embedded in the kernel, allowing performance prediction across the full range of modeled devices. The results suggest that this methodology supports correct selection of the most appropriate device for a previously unencountered code, which is highly relevant to the HPC scheduling setting.
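As a rough illustration of how such predictions might be scored and used, the sketch below computes a mean absolute percentage error and picks the device with the shortest predicted time. The abstract does not state which error metric underlies the 1.2% figure, so the metric here is an assumption. The function names, device names and numbers are hypothetical, and the predictive model itself is not shown.

```python
def mean_abs_pct_error(predicted, measured):
    """Mean absolute percentage error between predicted and measured times."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / m for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


def pick_device(predicted_times):
    """Return the device name with the smallest predicted execution time."""
    return min(predicted_times, key=predicted_times.get)


# Made-up run times in seconds, for illustration only.
predicted = [1.02, 0.49, 2.00]
measured = [1.00, 0.50, 2.00]
error_pct = mean_abs_pct_error(predicted, measured)

# Hypothetical per-device predictions for one previously unseen kernel.
predictions_for_kernel = {"cpu": 0.8, "gpu": 0.2, "mic": 0.5}
best_device = pick_device(predictions_for_kernel)

print(f"mean absolute percentage error: {error_pct:.2f}%")
print(f"selected device: {best_device}")
```

In a scheduling setting, only the ordering of predicted times matters for device selection, so a small relative error can still give the correct choice as long as it does not reorder the candidates.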
Beau Johnston is completing his PhD at the Australian National University. His work focuses on developing tools to facilitate the efficient scheduling of accelerators in supercomputers. Other research interests include signal processing on HPC and embedded architectures, with a particular focus on computer vision, precision agriculture and brain-machine interfaces.
Wed 18 Jul
Ben Hermann (University of Paderborn), Lisa Nguyen Quang Do (Paderborn University), Eric Bodden (Heinz Nixdorf Institut, Paderborn University and Fraunhofer IEM)
Beau Johnston (Australian National University)