A Progressive Transformer for Unifying Binary Code Embedding and Knowledge Transfer
Language model approaches have recently been integrated into binary analysis tasks, such as function similarity detection and function signature recovery. These models typically employ a two-stage training process: pre-training via Masked Language Modeling (MLM) on machine code and fine-tuning for specific tasks. While MLM helps the model understand binary code structure, it ignores essential code characteristics, including control and data flow, which negatively affects model generalization. Recent work leverages domain-specific features (e.g., control flow graphs and dynamic execution traces) in transformer-based approaches to improve binary code semantic understanding. However, such feature engineering is cumbersome and time-consuming, and it can introduce predictive uncertainty when dealing with stripped or obfuscated code, leading to a performance drop. In this paper, we introduce ProTST, a novel transformer-based methodology for binary code embedding. ProTST employs a hierarchical training process based on a unique tree-like structure, in which knowledge progressively flows from fundamental tasks at the root to more specialized tasks at the leaves. This progressive teacher-student paradigm allows the model to build upon previously learned knowledge, producing high-quality embeddings that can be effectively leveraged across diverse downstream binary analysis tasks. We evaluate ProTST on seven binary analysis tasks; the results show that ProTST yields an average validation-score (F1 and MRR) improvement of 14.8% over traditional two-stage training and of 13.4% over multimodal two-stage frameworks.
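The tree-like training schedule described above can be sketched as a depth-first traversal in which each child task (student) is initialized from the weights of its parent task (teacher). The sketch below is a minimal illustration of that schedule only; the task names, the `TaskNode` class, and the `train` stub are illustrative assumptions, not ProTST's actual API or task hierarchy.

```python
# Hedged sketch of tree-structured progressive training: knowledge flows from
# a fundamental root task down to specialized leaf tasks. All names here are
# hypothetical placeholders, not ProTST's real implementation.

class TaskNode:
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.weights = None  # parameters after this stage's training

def train(task, init_weights):
    # Placeholder for a real fine-tuning step: start from the parent's
    # (teacher's) weights and specialize them for the current task.
    weights = dict(init_weights) if init_weights else {}
    weights[task] = "trained"
    return weights

def progressive_train(node, parent_weights=None, order=None):
    """Depth-first pass: each student inherits its teacher's weights,
    so every leaf builds on all of its ancestors' knowledge."""
    if order is None:
        order = []
    node.weights = train(node.name, parent_weights)
    order.append(node.name)
    for child in node.children:
        progressive_train(child, node.weights, order)
    return order

# A hypothetical task tree: generic pre-training at the root,
# specialized binary-analysis tasks toward the leaves.
root = TaskNode("masked-language-modeling", [
    TaskNode("function-boundary", [TaskNode("function-signature")]),
    TaskNode("code-similarity", [TaskNode("vulnerability-search")]),
])

schedule = progressive_train(root)
# The root task is trained first; each leaf's weights carry the
# accumulated knowledge of every task on its root-to-leaf path.
```

The key design point this illustrates is that, unlike independent fine-tuning, a leaf task never starts from scratch: its initialization already encodes everything its ancestor tasks learned.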
Thu 6 Mar (displayed time zone: Eastern Time, US & Canada)

11:00 - 12:30 | Program Analysis (Research Papers), room M-1410
Chair(s): Rrezarta Krasniqi (University of North Carolina at Charlotte)

- 11:00 (15m talk): Adapting Knowledge Prompt Tuning for Enhanced Automated Program Repair. Pre-print available.
- 11:15 (15m talk): A Metric for Measuring the Impact of Rare Paths on Program Coverage.
- 11:30 (15m talk): A Progressive Transformer for Unifying Binary Code Embedding and Knowledge Transfer. Hanxiao Lu (Columbia University), Hongyu Cai (Purdue University), Yiming Liang (Purdue University), Antonio Bianchi (Purdue University), Z. Berkay Celik (Purdue University).
- 11:45 (15m talk): Is This You, LLM? Recognizing AI-written Programs with Multilingual Code Stylometry. Andrea Gurioli (DISI - University of Bologna), Maurizio Gabbrielli (DISI - University of Bologna), Stefano Zacchiroli (Télécom Paris, Polytechnic Institute of Paris). Pre-print available.
- 12:00 (15m talk): SpeedGen: Enhancing Code Efficiency through Large Language Model-Based Performance Optimization. Nils Purschke, Sven Kirchner, Alois Knoll (Technical University of Munich).
- 12:15 (15m talk): StriCT-BJ: A String Constraint Benchmark from Real Java Programs. Chi Zhang, Jian Zhang (Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences).