CGO 2024
Sat 2 - Wed 6 March 2024 Edinburgh, United Kingdom
Wed 6 Mar 2024 11:30 - 11:50 at Tinto - Acceleration Techniques Chair(s): Amir Shaikhha

System-level emulators are used extensively for the design, debugging, and evaluation of system software. They provide a system-level virtual machine that supports a guest operating system (OS) running on a platform whose native OS and instruction-set architecture may be the same as or different from the guest's. Dynamic binary translation (DBT) is one of the core technologies behind such system-level emulation. A recently proposed learning-based approach that uses automatically learned translation rules has been shown to improve DBT performance significantly by producing much higher-quality translated code. However, it has so far been applied only to user-level emulation, not to system-level emulation.
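As a rough illustration of what rule-based translation means, the toy sketch below looks up each guest instruction in a small rule table and emits the matching host template, falling back to a generic path when no rule applies. The rule table, opcodes, register names, and the `generic_emulate_*` fallback are all hypothetical. This is not the authors' implementation, and real learned rules may cover multi-instruction sequences.

```python
# Toy illustration only: rules, opcodes, and register names are hypothetical.
from typing import NamedTuple


class Insn(NamedTuple):
    op: str
    args: tuple


# Hypothetical "learned" rules: guest opcode -> host instruction template.
RULES = {
    "add": "add {0}, {1}",
    "mov": "mov {0}, {1}",
}


def translate_block(guest_block):
    """Translate a guest block; return (host_code, number_of_rule_hits)."""
    host, hits = [], 0
    for insn in guest_block:
        template = RULES.get(insn.op)
        if template is not None:
            # A rule matched: emit the host code from its template.
            host.append(template.format(*insn.args))
            hits += 1
        else:
            # No rule: fall back to a generic (assumed slower) emulation path.
            host.append(f"call generic_emulate_{insn.op}")
    return host, hits


block = [Insn("mov", ("r0", "r1")), Insn("add", ("r0", "r2")), Insn("svc", ())]
code, hits = translate_block(block)
```

In this sketch the `svc` instruction has no rule, so it takes the fallback path. That stands in for the system-level cases where the emulator itself has to step in.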

When we apply this approach directly to QEMU for system-level emulation, we find that it actually causes an unexpected performance degradation of 5% on average. Analyzing the main causes in more detail, we find that the learning-based approach by default uses host registers to hold the guest CPU state, including the condition-code registers (or FLAG registers). Whenever QEMU itself must be involved, it also needs the host registers, so maintaining the state of the guest, the host, and QEMU in the host registers during and between context switches can cause undue overhead if not handled carefully. Such cases include emulating system-level instructions, address translation, and interrupts, all of which require QEMU’s helper functions. To achieve the intended performance improvement from the better-quality code generated by the learning-based approach, we propose several optimization techniques: reducing the overhead incurred in each context switch, reducing the number of context switches needed, and scheduling code better to eliminate context switches. Our experimental results show that these optimizations achieve an average speedup of 1.36X over QEMU 6.1 on SPEC CINT2006 and 1.15X on real-world applications in system emulation mode.
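The context-switch overhead described above can be pictured with a toy cost model. It assumes that guest registers pinned in host registers must be stored back to memory before each helper call, and it compares storing all pinned registers with storing only those written since the last call. The dirty-tracking scheme, the event trace, and the register names are assumptions made for illustration. The abstract does not say how the per-switch overhead is reduced, and the model ignores the cost of reloading registers afterwards.

```python
# Toy cost model; all names and numbers are hypothetical, not measurements.
def sync_cost(trace, pinned_regs, lazy):
    """Count register stores performed before helper calls.

    trace: list of ("write", reg) or ("helper",) events.
    pinned_regs: guest registers kept in host registers.
    lazy: if True, store only registers written since the last helper call;
          if False, store every pinned register at every helper call.
    """
    dirty = set()
    stores = 0
    for event in trace:
        if event[0] == "write":
            dirty.add(event[1])
        elif event[0] == "helper":
            stores += len(dirty) if lazy else len(pinned_regs)
            dirty.clear()
    return stores


PINNED = ["r0", "r1", "r2", "r3", "flags"]
TRACE = [
    ("write", "r0"), ("write", "flags"), ("helper",),
    ("helper",),
    ("write", "r1"), ("helper",),
]

eager_stores = sync_cost(TRACE, PINNED, lazy=False)  # 3 helper calls * 5 registers
lazy_stores = sync_cost(TRACE, PINNED, lazy=True)    # 2 + 0 + 1 dirty registers
```

The same model suggests why reducing the number of helper calls also matters: in the eager case, every `("helper",)` event removed from the trace saves a full set of stores.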

Wed 6 Mar

Displayed time zone: London

11:30 - 12:50
Acceleration Techniques (Main Conference) at Tinto
Chair(s): Amir Shaikhha University of Edinburgh
11:30
20m
Talk
A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules
Main Conference
Jinhu Jiang Fudan University, Chaoyi Liang Fudan University, Rongchao Dong Fudan University, Zhaohui Yang Fudan University, Zhongjun Zhou Fudan University, Wenwen Wang University of Georgia, Pen-Chung Yew University of Minnesota at Twin Cities, Weihua Zhang Fudan University
Pre-print
11:50
20m
Talk
Instruction Scheduling for the GPU on the GPU
Main Conference
Ghassan Shobaki California State University, Pınar Muyan-Özçelik California State University, Josh Hutton California State University, Bruce Linck California State University, Vladislav Malyshenko California State University, Austin Kerbow Advanced Micro Devices, Ronaldo Ramirez-Ortega California State University, Vahl Scott Gordon California State University
12:10
20m
Talk
JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication
Main Conference
Qiang Fu Advanced Micro Devices, Thomas B. Rolinger NVIDIA, H. Howie Huang George Washington University
Pre-print
12:30
20m
Talk
oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation
Main Conference
Jianhui Li Intel, Zhennan Qin Intel, Yijie Mei Intel, Jingze Cui Intel, Yunfei Song Intel, Ciyong Chen Intel, Yifei Zhang Intel, Longsheng Du Intel, Xianhang Cheng Intel, Baihui Jin Intel, Yan Zhang Intel, Jason Ye Intel, Eric Lin Intel, Dan Lavery Intel
Pre-print