Revisiting Deep Learning for Variable Type Recovery (ICPC 2023 - Replications and Negative Results (RENE))

Who

Kevin Cao, Kevin Leach

Track

ICPC 2023 Replications and Negative Results (RENE)

Time Zone

Times are shown in the conference time zone, (GMT+10:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 16 May 2023 14:12 - 14:17 at Meeting Room 106 - Programming Languages, Types, and Complexity Chair(s): Vittoria Nardone

Abstract

Compiled binary executables are often the only available artifact in reverse engineering, malware analysis, or maintenance of software systems. Unfortunately, the lack of semantic information such as variable names makes binaries difficult to comprehend. To improve the comprehensibility of binaries, researchers have recently used machine learning techniques to predict semantic information contained in the original source code. Chen et al. implemented DIRTY, a Transformer-based encoder-decoder architecture capable of augmenting decompiled code with variable names and types by leveraging decompiler output tokens and variable size information. Chen et al. demonstrated a substantial increase in name and type extraction accuracy on Hex-Rays decompiler outputs compared to existing static analysis and AI-based techniques. We extend the original DIRTY results by re-training the DIRTY model on a dataset produced by the open-source Ghidra decompiler. Although Chen et al. concluded that Ghidra was not a suitable decompiler candidate due to its difficulty in parsing DWARF, we demonstrate that straightforward parsing of variable data generated by Ghidra results in similar retyping performance. We hope this work inspires further interest in and adoption of the Ghidra decompiler for use in research projects.

Link to Preprint

https://arxiv.org/pdf/2304.03854.pdf

Kevin Cao

Vanderbilt University

Kevin Leach