(De/Re)-Compositions Expressed Systematically via MDH-Based Schedules
Achieving the full performance potential of modern architectures, such as GPUs and CPUs, is complex: computations have to be efficiently de-composed for the memory and core hierarchies of these architectures, and the computed intermediate results have to be re-composed into the final result (we say “(de/re)-composition” for short). State-of-the-art code generation approaches often achieve high performance by allowing an expert user to express (de/re)-compositions in the form of a so-called “scheduling program”. However, the scheduling languages of existing approaches usually rely on a vast set of low-level commands that have to be combined in complex and error-prone ways to express a well-performing (de/re)-composition of computations.
We introduce a new scheduling language, based on the formalism of “Multi-Dimensional Homomorphisms (MDH)”. In contrast to existing scheduling languages, our MDH-based language is designed to systematically (de/re)-compose computations for the memory and core hierarchies of parallel architectures using only a single, high-level scheduling primitive. We show that our scheduling primitive is easy to use and yet expressive enough to capture well-performing (de/re)-compositions of popular related approaches, e.g., the TVM compiler, for MDH-supported computations (such as linear algebra routines and stencil computations). Moreover, our language is designed to be auto-tunable, i.e., each optimization decision can optionally be left to the auto-tuning engine of our system, and our system can automatically recommend complete schedules to the user, based on its auto-tuning capabilities. Also, by relying on the MDH approach, we can formally guarantee the correctness of optimizations expressed in our language, thereby further enhancing the user experience. Our experiments on GPU and CPU confirm that we can express optimizations that cannot be expressed straightforwardly (or at all) in TVM’s scheduling language, thereby achieving higher performance than TVM, as well as the vendor libraries provided by NVIDIA and Intel, for time-intensive computations used in real-world deep neural networks.
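To make the de/re-composition idea concrete, here is a minimal, hypothetical Python sketch (not the actual MDH API or scheduling language): a computation is de-composed into independently computable parts, and the partial results are re-composed with a combine operator. The function name `mdh_apply` and the tiling knob `num_parts` are illustrative assumptions; the point is that, because the combine operator is associative, every choice of de-composition yields the same result, which is what allows such decisions to be left to an auto-tuner.

```python
from functools import reduce
import operator

def mdh_apply(xs, f, combine, num_parts):
    """Hypothetical sketch, not the real MDH API: de-compose xs into
    num_parts tiles, map f over each tile and reduce it locally, then
    re-compose the per-tile partial results with the combine operator."""
    tile = max(1, (len(xs) + num_parts - 1) // num_parts)
    parts = [xs[i:i + tile] for i in range(0, len(xs), tile)]
    # per-part computation: independent, so it could run on different cores
    partials = [reduce(combine, (f(x) for x in p)) for p in parts]
    # re-composition of the intermediate results into the final result
    return reduce(combine, partials)

# Dot product as an MDH-style computation: f multiplies element pairs,
# combine is addition. Different tilings (num_parts) give the same result
# because addition is associative -- the correctness guarantee the MDH
# formalism provides for schedules.
xs = list(zip([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]))
dot = mdh_apply(xs, lambda p: p[0] * p[1], operator.add, num_parts=3)
assert dot == mdh_apply(xs, lambda p: p[0] * p[1], operator.add, num_parts=5)
```

Here `num_parts` plays the role of a single tuning parameter of the (hypothetical) scheduling primitive: it changes only how the work is decomposed, never what is computed.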
Sat 25 Feb, 11:20 - 12:20 (Eastern Time, US & Canada)

Efficiently Learning Locality Optimizations by Decomposing Transformation Domains

A Deep Learning Model for Loop Interchange
Lina Mezdour (NYU Abu Dhabi; ESI), Khadidja Kadem (NYU Abu Dhabi; ESI), Massinissa Merouani (NYU Abu Dhabi), Amina Selma Haichour (ESI), Saman Amarasinghe (Massachusetts Institute of Technology), Riyadh Baghdadi (NYU Abu Dhabi)

(De/Re)-Compositions Expressed Systematically via MDH-Based Schedules
Ari Rasch (University of Muenster), Richard Schulze (University of Muenster), Denys Shabalin (Google), Anne Elster (NTNU), Sergei Gorlatch (University of Muenster), Mary Hall (University of Utah)