HexT5: Unified Pre-training for Stripped Binary Code Information Inference (Recorded talk)
Decompilation is a process widely used by reverse engineers to significantly enhance code readability by lifting assembly code to a higher-level, C-like representation known as pseudo-code. Nevertheless, compilation and stripping irreversibly discard high-level semantic information that is crucial to code comprehension, such as comments, identifier names, and types. Existing approaches typically recover only one type of information, making them suboptimal for semantic inference. In this paper, we treat pseudo-code as a special programming language and present a unified pre-trained model, HexT5, trained on vast amounts of natural-language comments, source identifiers, and pseudo-code using novel pseudo-code-based pre-training objectives. We fine-tune HexT5 on various downstream tasks, including code summarization, variable name recovery, function name recovery, and binary code similarity detection. Comprehensive experiments show that HexT5 achieves state-of-the-art performance on all four downstream tasks, demonstrating its robust effectiveness and generalizability for binary-related tasks.
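The abstract describes a sequence-to-sequence fine-tuning setup over decompiled pseudo-code. As a rough, self-contained sketch of what the code-summarization downstream task could look like, the snippet below fine-tunes a T5-style encoder-decoder on a single (pseudo-code, summary) pair using the HuggingFace transformers API. The checkpoint name "hext5-base" is hypothetical; the authors' actual weights, data pipeline, and pre-training objectives are described in the paper, not reproduced here.

```python
# Minimal sketch: fine-tuning a T5-style model to summarize decompiled
# pseudo-code, in the spirit of HexT5's code-summarization downstream task.
# NOTE: "hext5-base" is a hypothetical checkpoint name used for illustration;
# any seq2seq checkpoint compatible with T5ForConditionalGeneration would do.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("hext5-base")
model = T5ForConditionalGeneration.from_pretrained("hext5-base")

# One (pseudo-code, summary) training pair; real fine-tuning would iterate
# over a dataset pairing decompiler output with source-level comments.
pseudo_code = "__int64 sub_401000(__int64 a1) { return a1 * a1; }"
summary = "returns the square of its argument"

inputs = tokenizer(pseudo_code, return_tensors="pt", truncation=True)
labels = tokenizer(summary, return_tensors="pt", truncation=True).input_ids

# Standard seq2seq cross-entropy loss; a single optimizer step is shown.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()

# Inference: generate a summary for (here, the same) pseudo-code input.
generated = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```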
[paper] HexT5: Unified Pre-training for Stripped Binary Code Information Inference (HexT5_ASE_2023.pdf, 1.4 MiB)
[slides] HexT5: Unified Pre-training for Stripped Binary Code Information Inference (HexT5-ASE2023-slides.pdf, 1.14 MiB)
Wed 13 Sep (times shown in the Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna time zone)
13:30 - 15:00
13:30 12m Talk | Delving into Commit-Issue Correlation to Enhance Commit Message Generation Models | Research Papers | Liran Wang (Beihang University), Xunzhu Tang (University of Luxembourg), Yichen He (Beihang University), Changyu Ren (Beihang University), Shuhua Shi (Beihang University), Chaoran Yan (Beihang University), Zhoujun Li (Beihang University) | Pre-print, File Attached
13:42 12m Talk | From Commit Message Generation to History-Aware Commit Message Completion | Research Papers | Aleksandra Eliseeva (JetBrains Research), Yaroslav Sokolov (JetBrains), Egor Bogomolov (JetBrains Research), Yaroslav Golubev (JetBrains Research), Danny Dig (JetBrains Research & University of Colorado Boulder, USA), Timofey Bryksin (JetBrains Research) | Pre-print, File Attached
13:54 12m Talk | Automatic Generation and Reuse of Precise Library Summaries for Object-Sensitive Pointer Analysis | Research Papers | Jingbo Lu (University of New South Wales), Dongjie He (UNSW), Wei Li (University of New South Wales), Yaoqing Gao (Huawei Toronto Research Center), Jingling Xue (UNSW) | Pre-print, File Attached
14:06 12m Talk | What Makes Good In-context Demonstrations for Code Intelligence Tasks with LLMs? | Research Papers | Shuzheng Gao (The Chinese University of Hong Kong), Xin-Cheng Wen (Harbin Institute of Technology), Cuiyun Gao (Harbin Institute of Technology), Wenxuan Wang (Chinese University of Hong Kong), Hongyu Zhang (Chongqing University), Michael Lyu (The Chinese University of Hong Kong) | Pre-print, File Attached
14:18 12m Talk | HexT5: Unified Pre-training for Stripped Binary Code Information Inference (Recorded talk) | Research Papers | Jiaqi Xiong (University of Science and Technology of China), Guoqiang Chen (University of Science and Technology of China), Kejiang Chen (University of Science and Technology of China), Han Gao (University of Science and Technology of China), Shaoyin Cheng (University of Science and Technology of China), Weiming Zhang (University of Science and Technology of China) | Media Attached, File Attached
14:30 12m Talk | Generating Variable Explanations via Zero-shot Prompt Learning (Recorded talk) | Research Papers | Chong Wang (Fudan University), Yiling Lou (Fudan University), Liu Junwei (Fudan University), Xin Peng (Fudan University) | Media Attached