A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules (CGO 2024 - Main Conference)

Who

Jinhu Jiang, Chaoyi Liang, Rongchao Dong, Zhaohui Yang, Zhongjun Zhou, Wenwen Wang, Pen-Chung Yew, Weihua Zhang

Track

CGO 2024 Main Conference

Time Zone

The program is currently displayed in (GMT) London.

When

Wed 6 Mar 2024 11:30 - 11:50 at Tinto - Acceleration Techniques Chair(s): Amir Shaikhha

Abstract

System-level emulators have been used extensively for the design, debugging, and evaluation of system software. They work by providing a system-level virtual machine that can support a guest operating system (OS) running on a platform with the same or a different native OS, using the same or a different instruction-set architecture. For such system-level emulation, dynamic binary translation (DBT) is one of the core technologies. A recently proposed learning-based approach using automatically-learned translation rules has been shown to improve DBT performance significantly by producing much higher-quality translated code. However, it has only been used for user-level emulation, not system-level emulation.

When applying this approach directly to QEMU for system-level emulation, we find that it actually causes an unexpected performance degradation of 5% on average. Analyzing the main culprits in more detail, we find that the learning-based approach by default uses host registers to maintain the guest CPU state, including the condition-code registers (or FLAG registers). In cases where QEMU needs to be involved (in which QEMU also needs to use the host registers), maintaining system state in the host registers for the guest, the host, and QEMU during and between context switches can cause undue overhead if not handled carefully. Such cases include emulating system-level instructions, address translation, and interrupts, which require the use of QEMU's helper functions. To achieve the intended performance improvement through the better-quality code generated by the learning-based approach, we propose several optimization techniques: reducing the overhead incurred in each context switch, reducing the number of context switches needed, and scheduling code better to eliminate context switches. Our experimental results show that these optimizations achieve an average speedup of 1.36X over QEMU 6.1 on SPEC CINT2006 and 1.15X on real-world applications in system emulation mode.
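The context-switch overhead described in the abstract can be illustrated with a toy cost model. The sketch below is not the authors' implementation: the cost constants, the Op type, and both functions are hypothetical, chosen only to show why spilling all guest state held in host registers at every helper call is expensive, and how two of the ideas named above (cheaper switches, fewer switches) might reduce that cost. Code scheduling is not modeled.

```python
from dataclasses import dataclass

# Hypothetical cost units; the abstract gives no per-operation costs.
SYNC_COST_PER_REG = 2   # spill a guest register to memory and reload it
HELPER_COST = 10        # work done inside an emulator helper function
NATIVE_COST = 1         # a guest op translated directly to host code


@dataclass(frozen=True)
class Op:
    name: str
    needs_helper: bool = False       # e.g. system instruction, address translation, interrupt
    writes: frozenset = frozenset()  # guest registers this op modifies


def naive_cost(ops, pinned_regs):
    """Every helper call syncs all guest state kept in host registers."""
    cost = 0
    for op in ops:
        if op.needs_helper:
            cost += HELPER_COST + SYNC_COST_PER_REG * len(pinned_regs)
        else:
            cost += NATIVE_COST
    return cost


def optimized_cost(ops, pinned_regs):
    """Sync only registers written since the last switch, and let
    back-to-back helper calls share a single switch."""
    cost = 0
    dirty = set()
    in_helper_context = False
    for op in ops:
        if op.needs_helper:
            cost += HELPER_COST
            if not in_helper_context:
                cost += SYNC_COST_PER_REG * len(dirty)
                dirty.clear()
                in_helper_context = True
        else:
            cost += NATIVE_COST
            dirty |= op.writes & pinned_regs
            in_helper_context = False
    return cost


# Made-up guest instruction trace, for illustration only.
pinned = {"r0", "r1", "r2", "r3", "flags"}
ops = [
    Op("add", writes=frozenset({"r0", "flags"})),
    Op("load", needs_helper=True),
    Op("store", needs_helper=True),
    Op("sub", writes=frozenset({"r1"})),
    Op("msr", needs_helper=True),
]

print(naive_cost(ops, pinned))      # 62
print(optimized_cost(ops, pinned))  # 38
```

In this made-up trace the naive scheme pays for five register syncs at each of three helper calls, while the optimized scheme pays for three syncs in total. The numbers carry no meaning beyond the model's assumptions and should not be compared with the speedups reported in the abstract.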

Link to Preprint

https://arxiv.org/abs/2402.09688

Jinhu Jiang, Fudan University, China
Chaoyi Liang, Fudan University, China
Rongchao Dong, Fudan University, China
Zhaohui Yang, Fudan University, China
Zhongjun Zhou, Fudan University, China
Wenwen Wang, University of Georgia, United States
Pen-Chung Yew, University of Minnesota at Twin Cities, United States
Weihua Zhang, Fudan University, China

Session Program

Wed 6 Mar
Displayed time zone: London

11:30 - 12:50	Acceleration Techniques (Main Conference) at Tinto. Chair(s): Amir Shaikhha (University of Edinburgh)

11:30 20m Talk		A System-Level Dynamic Binary Translator using Automatically-Learned Translation Rules (Main Conference). Jinhu Jiang (Fudan University), Chaoyi Liang (Fudan University), Rongchao Dong (Fudan University), Zhaohui Yang (Fudan University), Zhongjun Zhou (Fudan University), Wenwen Wang (University of Georgia), Pen-Chung Yew (University of Minnesota at Twin Cities), Weihua Zhang (Fudan University). Pre-print available.
11:50 20m Talk		Instruction Scheduling for the GPU on the GPU (Main Conference). Ghassan Shobaki (California State University), Pınar Muyan-Özçelik (California State University), Josh Hutton (California State University), Bruce Linck (California State University), Vladislav Malyshenko (California State University), Austin Kerbow (Advanced Micro Devices), Ronaldo Ramirez-Ortega (California State University), Vahl Scott Gordon (California State University).
12:10 20m Talk		JITSPMM: Just-in-Time Instruction Generation for Accelerated Sparse Matrix-Matrix Multiplication (Main Conference). Qiang Fu (Advanced Micro Devices), Thomas B. Rolinger (NVIDIA), H. Howie Huang (George Washington University). Pre-print available.
12:30 20m Talk		oneDNN Graph Compiler: A Hybrid Approach for High-Performance Deep Learning Compilation (Main Conference). Jianhui Li (Intel), Zhennan Qin (Intel), Yijie Mei (Intel), Jingze Cui (Intel), Yunfei Song (Intel), Ciyong Chen (Intel), Yifei Zhang (Intel), Longsheng Du (Intel), Xianhang Cheng (Intel), Baihui Jin (Intel), Yan Zhang (Intel), Jason Ye (Intel), Eric Lin (Intel), Dan Lavery (Intel). Pre-print available.