KEENHash: Hashing Programs into Function-aware Embeddings for Large-scale Binary Code Similarity Analysis (ISSTA 2025 - Research Papers)

Who

Zhijie Liu, Qiyi Tang, Sen Nie, Shi Wu, Liangfeng Zhang, Yutian Tang

Track

ISSTA 2025 Research Papers

When

Fri 27 Jun 2025 11:00 - 11:25 at Cosmos 3C - LLM-based Testing 2 Chair(s): Jie M. Zhang

Abstract

Binary code similarity analysis (BCSA) is a crucial research area in many fields such as cybersecurity. Function-level diffing tools are the most widely used in BCSA: they match similar functions one by one to evaluate the similarity between binary programs (binaries). However, such methods have high time complexity, making them unscalable in large-scale scenarios (e.g., 1/n-to-n searching). Towards effective and efficient program-level BCSA, we propose KEENHash, a novel hashing approach that hashes binaries into program-level representations through large language model (LLM)-generated function embeddings. KEENHash condenses a binary into one compact, fixed-length program embedding using K-Means and Feature Hashing, enabling effective and efficient large-scale program-level BCSA that surpasses the previous state-of-the-art methods. The experimental results show that KEENHash is 215 times faster than the state-of-the-art function matching tool while maintaining effectiveness. Furthermore, in a large-scale scenario with 5.3 billion similarity evaluations, KEENHash takes only 395.83 seconds, whereas that tool would take 56 days. We also evaluate KEENHash on program clone search for large-scale BCSA across extensive datasets containing 202,305 binaries. Compared with 4 state-of-the-art methods, KEENHash outperforms all of them by at least 23.16%, and displays remarkable superiority over them in the large-scale BCSA security scenario of malware detection.
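
As an illustration of the idea in the abstract, the following minimal Python sketch shows one way a set of function embeddings could be condensed into a fixed-length program embedding with K-Means and feature hashing. It is not the authors' implementation: the helper names (kmeans, nearest, program_embedding, cosine), the 64-dimensional output, the MD5-based signed hashing, and the choice of cluster IDs as the hashed features are all assumptions made for illustration. In KEENHash the function embeddings are generated by an LLM; here they are simply lists of floats.

```python
import hashlib
import math
import random


def nearest(p, centroids):
    """Index of the centroid closest to p (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
    )


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-Means over a corpus of function embeddings."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for i, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def program_embedding(func_embeddings, centroids, dim=64):
    """Hash one binary's function embeddings into a fixed-length vector.

    Assumption: each function is reduced to its cluster ID, and the
    cluster IDs are folded into `dim` slots with signed feature hashing.
    """
    vec = [0.0] * dim
    for f in func_embeddings:
        cluster_id = nearest(f, centroids)
        digest = hashlib.md5(str(cluster_id).encode()).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[index] += sign
    return vec


def cosine(u, v):
    """Cosine similarity between two program embeddings (0.0 if either is zero)."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0
```

Under these assumptions, comparing two binaries costs a single cosine computation over fixed-length vectors, regardless of how many functions each binary contains, which is the property that makes 1/n-to-n search tractable.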

DOI

https://doi.org/10.1145/3728911

Zhijie Liu: ShanghaiTech University, China

Qiyi Tang: Tencent Security Keen Lab, China

Sen Nie: Tencent Security Keen Lab, China

Shi Wu: Tencent Security Keen Lab, China

Liangfeng Zhang: School of Information Science and Technology, ShanghaiTech University

Yutian Tang: University of Glasgow, United Kingdom

Session Program

Fri 27 Jun
Displayed time zone: (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

11:00 - 12:15	LLM-based Testing 2 (Research Papers) at Cosmos 3C. Chair(s): Jie M. Zhang (King's College London)

11:00 25m Talk		KEENHash: Hashing Programs into Function-aware Embeddings for Large-scale Binary Code Similarity Analysis (Research Papers). Zhijie Liu (ShanghaiTech University, China), Qiyi Tang (Tencent Security Keen Lab), Sen Nie (Tencent Security Keen Lab), Shi Wu (Tencent Security Keen Lab), Liangfeng Zhang (School of Information Science and Technology, ShanghaiTech University), Yutian Tang (University of Glasgow, United Kingdom)
11:25 25m Talk		Porting Software Libraries to OpenHarmony: Transitioning from TypeScript or JavaScript to ArkTS (Research Papers). Bo Zhou (Northeastern University), Jiaqi Shi (Northeastern University), Ying Wang (Northeastern University), Li Li (Beihang University), Li Tsz On (The Hong Kong University of Science and Technology), Hai Yu (Northeastern University, China), Zhiliang Zhu (Northeastern University, China)
11:50 25m Talk		STRUT: Structured Seed Case Guided Unit Test Generation for C Programs using LLMs (Research Papers). Jinwei Liu (Xidian University), Chao Li (Beijing Institute of Control Engineering; Beijing Sunwise Information Technology), Rui Chen (Beijing Institute of Control Engineering; Beijing Sunwise Information Technology), Shaofeng Li (Xidian University), Bin Gu (Beijing Institute of Control Engineering), Mengfei Yang (China Academy of Space Technology)

Information for Participants

Fri 27 Jun 2025 11:00 - 12:15 at Cosmos 3C - LLM-based Testing 2 Chair(s): Jie M. Zhang

Info for room Cosmos 3C:

Cosmos 3C is the third room in the Cosmos 3 wing.

When facing the main Cosmos Hall, access to the Cosmos 3 wing is on the left, close to the stairs. The area is accessed through a large door with the number “3”, which will stay open during the event.