Why VM Benchmarking is Probably Misleading you
So you are evaluating a Just-In-Time (JIT) compiler? You probably know that you have to run benchmarks and conventional wisdom says that you should take measurements after the JIT has “warmed up”. But how many benchmark iterations is that? 5, 10, 20? Over the course of 2 years we devised (probably) the world’s most rigorous Virtual Machine (VM) benchmarking experiment, designed to measure the peak performance and warmup time of modern JITted VMs such as Hotspot, LuaJIT, PyPy, Graal and V8. To our surprise, not only did few of these VMs warm up as expected, but in many cases performance did not stabilise after thousands of iterations or even degraded over time. In this presentation I will show results which will make you reconsider the assumptions which underpin widespread benchmarking practice. I will show a new benchmarking technique which takes neither peak performance – nor even a steady state – as a given.
Edd is a Research Associate at King’s College London where he works on Just-In-Time compilation techniques for dynamic languages.
Prior to his current position, Edd studied for a PhD in static analysis of binary code from University of Kent in England.