APLAS 2024
Tue 22 - Fri 25 October 2024 Kyoto

Can one build a compiler that will eventually save a lot of performance engineering effort while immediately delivering competitive results? Can one achieve near hardware-peak performance while offering high-level programming abstractions, and do it now rather than putting it off until tomorrow? The question is particularly pressing in a domain-specific setting, where classical infrastructure for optimizing compiler construction may be inadequate, too generic, or too low-level.

I used to see the world through the lens of domain-specific optimization as superior to lower-level kernel programming. And indeed, after decades of trying in the broader supercomputing landscape, the optimizing compiler community finally has reasons to celebrate. ML frameworks and compilers have essentially won the abstraction and automation battle. Purely functional embedded DSLs such as JAX offer performance portability from edge devices to data centers, and massive code reuse across diverse scenarios, from massive-scale pre-training to low-latency inference through distillation and quantization.

Yet, the generative AI race and the global scarcity of computing resources (ML accelerators) have just set ML practitioners back a decade when it comes to achieving competitive performance. Device-specific kernel programming has become, once again, ubiquitous. The performance of leading generative models comes with a dramatic loss of programmability and portability.

Why and how did we get here? Are we ever going to find a way out of this programmability/performance dilemma? What about the velocity and agility of compiler engineers? Can ML accelerate compiler construction and help address the issue? Can we make ML-based heuristics scale enough to compile billions of lines of code? Didn't we just design and implement MLIR precisely so that domain-specific compiler engineering could scale, enabling massive code reuse across domains, languages, and hardware? We will review these questions in light of recent successes and half-successes in academia and industry, and extend an invitation to tackle these challenges in future research and software development.

Albert Cohen is a Research Scientist at Google DeepMind in Paris, working on the acceleration and energy efficiency of machine learning models. An alumnus of École Normale Supérieure de Lyon and the University of Versailles (Paris-Saclay), he first joined INRIA and later also held a part-time associate professor position at École Polytechnique. He has been a visiting scholar at the University of Illinois, an invited professor at Philips Research as a recipient of a Marie Curie technology transfer fellowship, and a visiting professor at Facebook Artificial Intelligence Research. Albert's work spans the theory and practice of programming languages, parallelism, high-performance and power-efficient computing, as well as safety-critical embedded control, resulting in 250 peer-reviewed publications together with 30 PhD students and international collaborators. Some of this work led to technology transfer, including contributions to the industry-standard GCC and LLVM compilers. Since joining Google, Albert has contributed to the design and adoption of the MLIR platform for scalable and efficient machine learning.