Prism: Decomposing Program Semantics for Code Clone Detection through Compilation (ICSE 2024 - Research Track)

Who

Haoran Li, wangsiqian, Weihong Quan, Xiaoli Gong, Huayou Su, Jin Zhang

Track

ICSE 2024 Research Track

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 17 Apr 2024 11:00 - 11:15 at Amália Rodrigues - Evolution & AI Chair(s): Oscar Chaparro

Abstract

Code clone detection (CCD) is of critical importance in software engineering, and semantic similarity is a key evaluation factor for CCD. Embedding techniques, which represent an object as a numerical vector, are used to generate code representations in which code snippets with similar semantics (clone pairs) should have similar vectors. However, due to the diversity and flexibility of high-level programming languages, the code representations of clone pairs may be inconsistent. Assembly code provides the program execution trace and can normalize the diversity of high-level languages in terms of program behavior semantics. After revisiting assembly language, we find that different assembly codes can align with the computational logic and memory access patterns of clone pairs. Therefore, using multiple assembly languages can capture behavior semantics and enhance the understanding of programs. We thus propose Prism, a new method for code clone detection that fuses behavior semantics from the assembly code of multiple architectures, directly capturing the syntactic and semantic information of multilingual domains. Additionally, we introduce a multi-feature fusion strategy that leverages global information interaction to expand the representation space. This fusion process allows us to capture the complementary information from each feature and leverage the relationships between them to create a more expressive representation of the code. When evaluated on the OJClone dataset, the Prism model achieved precision and recall scores of 0.999 and 0.999, respectively. Additionally, incorporating behavior semantics into the prior model improves its clone detection performance.
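
The idea that clone pairs should have similar vectors can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy vectors, the function names, the 0.9 threshold, and the use of plain concatenation as a stand-in for fusion. It is not Prism's fusion strategy or model, which the abstract does not specify in detail.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def fuse(per_arch_embeddings):
    """Concatenate one vector per architecture into a single vector.

    Hypothetical stand-in for a fusion step; NOT the method from the paper.
    """
    fused = []
    for vec in per_arch_embeddings:
        fused.extend(vec)
    return fused


def is_clone(emb_a, emb_b, threshold=0.9):
    """Flag a pair as a clone when the fused vectors are similar enough.

    The threshold value is an arbitrary assumption.
    """
    return cosine_similarity(fuse(emb_a), fuse(emb_b)) >= threshold


# Toy embeddings: one 2-d vector per (hypothetical) target architecture.
snippet_a = [[1.0, 0.0], [0.5, 0.5]]
snippet_b = [[0.9, 0.1], [0.5, 0.4]]   # close to snippet_a
snippet_c = [[0.0, 1.0], [-0.5, 0.5]]  # orthogonal to snippet_a

print(is_clone(snippet_a, snippet_b))
print(is_clone(snippet_a, snippet_c))
```

In a real detector the vectors would come from a trained encoder over the compiled assembly, and the decision rule would typically be learned rather than a fixed threshold.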

Haoran Li

Nankai University

wangsiqian

Nankai University

Weihong Quan

Nankai University

Xiaoli Gong

Nankai University

Huayou Su

NUDT

Jin Zhang

Hunan Normal University

China

Session Program

Wed 17 Apr
Displayed time zone: Lisbon

11:00 - 12:30	Evolution & AI (Research Track) at Amália Rodrigues
Chair(s): Oscar Chaparro (William & Mary)

11:00 15m Talk	Prism: Decomposing Program Semantics for Code Clone Detection through Compilation
	Research Track. Haoran Li (Nankai University), wangsiqian (Nankai University), Weihong Quan (Nankai University), Xiaoli Gong (Nankai University), Huayou Su (NUDT), Jin Zhang (Hunan Normal University)
11:15 15m Talk	Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization
	Research Track. Antonio Mastropaolo (Università della Svizzera Italiana), Matteo Ciniselli (Università della Svizzera Italiana), Massimiliano Di Penta (University of Sannio, Italy), Gabriele Bavota (Software Institute @ Università della Svizzera Italiana)
11:30 15m Talk	Are Prompt Engineering and TODO Comments Friends or Foes? An Evaluation on GitHub Copilot
	Research Track. David OBrien (Iowa State University), Sumon Biswas (Carnegie Mellon University), Sayem Mohammad Imtiaz (Iowa State University), Rabe Abdalkareem (Omar Al-Mukhtar University), Emad Shihab (Concordia University), Hridesh Rajan (Iowa State University)
11:45 15m Talk	Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization)
	Research Track. Toufique Ahmed (University of California at Davis), Kunal Suresh Pai (UC Davis), Prem Devanbu (University of California at Davis), Earl T. Barr (University College London)
12:00 15m Talk	DSFM: Enhancing Functional Code Clone Detection with Deep Subtree Interactions
	Research Track. Zhiwei Xu (Tsinghua University), Shaohua Qiang (Tsinghua University), Dinghong Song (Tsinghua University), Min Zhou (Tsinghua University), Hai Wan (Tsinghua University), Xibin Zhao (Tsinghua University), Ping Luo (Tsinghua University), Hongyu Zhang (Chongqing University)
12:15 15m Talk	Machine Learning is All You Need: A Simple Token-based Approach for Effective Code Clone Detection
	Research Track. Siyue Feng (Huazhong University of Science and Technology), Wenqi Suo (Huazhong University of Science and Technology), Yueming Wu (Nanyang Technological University), Deqing Zou (Huazhong University of Science and Technology), Yang Liu (Nanyang Technological University), Hai Jin (Huazhong University of Science and Technology)