Meaningful Variable Names for Decompiled Code: A Machine Translation Approach (ICPC 2018 - Technical Research)

Who

Alan Jaffe, Jeremy Lacomis, Edward Schwartz, Claire Le Goues, Bogdan Vasilescu

Track

ICPC 2018 Technical Research

Time Zone

The program is currently displayed in (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Sun 27 May 2018 09:55 - 10:12 at J1 room - Opening, Vision Keynote, and Developer Observation Chair(s): Foutse Khomh, Chanchal K. Roy, Katsuro Inoue

Abstract

When code is compiled, information is lost, including some of the structure of the original source code as well as local identifier names. Existing decompilers can reconstruct much of the original source code, but typically use meaningless placeholder variables for identifier names. Using variable names which are more natural in the given context can make the code much easier to interpret, despite the fact that variable names have no effect on the execution of the program. In theory, it is impossible to recover the original identifier names since that information has been lost. However, most code is natural: it is highly repetitive and predictable based on the context. In this paper we propose a technique that assigns variables meaningful names by taking advantage of this naturalness property. We consider decompiler output to be a noisy distortion of the original source code, where the original source code is transformed into the decompiler output. Using this noisy channel model, we apply standard statistical machine translation approaches to choose natural identifiers, combining a translation model trained on a parallel corpus with a language model trained on unmodified C code. We generate a large parallel corpus from 1.2 TB of C source code obtained from GitHub. Under the most conservative assumptions, our technique is still able to recover the original variable names up to 16.2% of the time, which represents a lower bound for performance.
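The noisy channel idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the probability tables, the placeholder tokens `v1` and `v2`, the candidate names, and the helper functions `score` and `best_name` are all hypothetical, chosen only to show how a translation model and a language model combine. A real system would estimate both models from corpora and score whole token sequences rather than a single preceding token.

```python
import math

# Hypothetical translation model: P(decompiler placeholder | original name).
# In practice this would be estimated from a parallel corpus of
# (decompiler output, original source) pairs.
TRANSLATION_MODEL = {
    "i":   {"v1": 0.40, "v2": 0.10},
    "len": {"v1": 0.20, "v2": 0.30},
    "buf": {"v1": 0.05, "v2": 0.25},
}

# Hypothetical language model: P(name | preceding source token).
# In practice this would be trained on unmodified C code.
LANGUAGE_MODEL = {
    "int":   {"i": 0.30, "len": 0.15, "buf": 0.01},
    "char*": {"i": 0.01, "len": 0.02, "buf": 0.40},
}

FLOOR = 1e-6  # smoothing value for unseen events (an assumption of this sketch)


def score(name, placeholder, prev_token):
    """Log of P(placeholder | name) * P(name | prev_token)."""
    p_channel = TRANSLATION_MODEL.get(name, {}).get(placeholder, FLOOR)
    p_prior = LANGUAGE_MODEL.get(prev_token, {}).get(name, FLOOR)
    return math.log(p_channel) + math.log(p_prior)


def best_name(placeholder, prev_token):
    """Pick the candidate name that maximizes the noisy channel score."""
    return max(TRANSLATION_MODEL, key=lambda name: score(name, placeholder, prev_token))


if __name__ == "__main__":
    for placeholder, prev_token in [("v1", "int"), ("v2", "char*"), ("v2", "int")]:
        print(prev_token, placeholder, "->", best_name(placeholder, prev_token))
```

Working in log space keeps the product of small probabilities numerically stable, and the language model term is what lets the same placeholder receive different names in different contexts.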

Link to Preprint

https://apjaffe.github.io/2018-meaningful-variables.pdf

Alan Jaffe

Carnegie Mellon University

United States

Jeremy Lacomis

Carnegie Mellon University

United States

Edward Schwartz

Carnegie Mellon University

Claire Le Goues

Carnegie Mellon University

United States

Bogdan Vasilescu

Carnegie Mellon University

United States

Session Program

Sun 27 May
Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

09:00 - 10:30	Opening, Vision Keynote, and Developer Observation (Technical Research) at J1 room. Chair(s): Foutse Khomh, Polytechnique Montréal; Chanchal K. Roy, University of Saskatchewan; Katsuro Inoue, Osaka University

09:00 10m Day opening		Welcome to ICPC 2018 (Technical Research). Foutse Khomh, Polytechnique Montréal; Chanchal K. Roy, University of Saskatchewan
09:11 34m Talk		Sensing and Supporting Software Developer's Focus (Vision Keynote, Technical Research). Manuela Zueger, University of Zurich; Thomas Fritz, University of Zurich and University of British Columbia
09:45 10m Short-paper		Code Phonology: an exploration into the vocalization of code (ERA, Technical Research). Felienne Hermans; Alaaeddin Swidan, Delft University of Technology; Efthimia Aivaloglou, Open University of the Netherlands
09:55 17m Full-paper		Meaningful Variable Names for Decompiled Code: A Machine Translation Approach (Technical Research). Alan Jaffe, Carnegie Mellon University; Jeremy Lacomis, Carnegie Mellon University; Edward Schwartz, Carnegie Mellon University; Claire Le Goues, Carnegie Mellon University; Bogdan Vasilescu, Carnegie Mellon University
10:13 17m Full-paper		Descriptive Compound Identifier Names Improve Source Code Comprehension (Technical Research). Andrea Schankin, Karlsruhe Institute of Technology; Annika Berger, Karlsruhe Institute of Technology; Daniel Holt, Heidelberg University; Johannes Hofmeister, University of Passau; Till Riedel, Karlsruhe Institute of Technology; Michael Beigl, Karlsruhe Institute of Technology