Prism: Decomposing Program Semantics for Code Clone Detection through Compilation
Code clone detection (CCD) is of critical importance in software engineering, and semantic similarity is a key criterion for it. Embedding techniques, which represent an object as a numerical vector, are used to generate code representations in which code snippets with similar semantics (clone pairs) should have similar vectors. However, due to the diversity and flexibility of high-level programming languages, the code representations of clone pairs may be inconsistent. Assembly code provides the program execution trace and can normalize the diversity of high-level languages in terms of program behavior semantics. Revisiting assembly language, we find that assembly code for different architectures can align with the computational logic and memory access patterns of clone pairs, so using multiple assembly languages can capture behavior semantics and enhance the understanding of programs. We therefore propose Prism, a new method for code clone detection that fuses behavior semantics from the assembly code of multiple architectures and directly captures syntactic and semantic information across multiple language domains. Additionally, we introduce a multi-feature fusion strategy that leverages global information interaction to expand the representation space. This fusion process captures the complementary information in each feature and leverages the relationships between features to create a more expressive representation of the code. Evaluated on the OJClone dataset, Prism achieves precision and recall scores of 0.999 and 0.999, respectively. Additionally, incorporating behavior semantics into a prior model improves its clone detection performance.
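The idea that clone pairs should have similar vectors can be illustrated with a minimal sketch. This is not Prism's implementation: the vectors below are made-up toy embeddings, and the helper names `cosine_similarity` and `is_clone_pair` and the `0.9` threshold are assumptions chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def is_clone_pair(a, b, threshold=0.9):
    """Hypothetical decision rule: flag a pair whose similarity meets the threshold."""
    return cosine_similarity(a, b) >= threshold


# Toy embeddings (invented values, not produced by any real model).
snippet_a = [0.9, 0.1, 0.4]
snippet_b = [0.8, 0.2, 0.5]    # points in nearly the same direction as snippet_a
snippet_c = [-0.3, 0.9, -0.2]  # points in a different direction

print(is_clone_pair(snippet_a, snippet_b))  # similar vectors
print(is_clone_pair(snippet_a, snippet_c))  # dissimilar vectors
```

In a real detector the vectors would come from a learned encoder and the threshold would be tuned on labeled clone pairs.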
Wed 17 Apr (displayed time zone: Lisbon)
11:00 - 12:30
11:00 | 15m | Talk | Prism: Decomposing Program Semantics for Code Clone Detection through Compilation | Research Track | Haoran Li (Nankai University); wangsiqian (Nankai University); Weihong Quan (Nankai University); Xiaoli Gong (Nankai University); Huayou Su (NUDT); Jin Zhang (Hunan Normal University)
11:15 | 15m | Talk | Evaluating Code Summarization Techniques: A New Metric and an Empirical Characterization | Research Track | Antonio Mastropaolo (Università della Svizzera italiana); Matteo Ciniselli (Università della Svizzera Italiana); Massimiliano Di Penta (University of Sannio, Italy); Gabriele Bavota (Software Institute @ Università della Svizzera Italiana)
11:30 | 15m | Talk | Are Prompt Engineering and TODO Comments Friends or Foes? An Evaluation on GitHub Copilot | Research Track | David OBrien (Iowa State University); Sumon Biswas (Carnegie Mellon University); Sayem Mohammad Imtiaz (Iowa State University); Rabe Abdalkareem (Omar Al-Mukhtar University); Emad Shihab (Concordia University); Hridesh Rajan (Iowa State University)
11:45 | 15m | Talk | Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization) | Research Track | Toufique Ahmed (University of California at Davis); Kunal Suresh Pai (UC Davis); Prem Devanbu (University of California at Davis); Earl T. Barr (University College London)
12:00 | 15m | Talk | DSFM: Enhancing Functional Code Clone Detection with Deep Subtree Interactions | Research Track | Zhiwei Xu (Tsinghua University); Shaohua Qiang (Tsinghua University); Dinghong Song (Tsinghua University); Min Zhou (Tsinghua University); Hai Wan (Tsinghua University); Xibin Zhao (Tsinghua University); Ping Luo (Tsinghua University); Hongyu Zhang (Chongqing University)
12:15 | 15m | Talk | Machine Learning is All You Need: A Simple Token-based Approach for Effective Code Clone Detection | Research Track | Siyue Feng (Huazhong University of Science and Technology); Wenqi Suo (Huazhong University of Science and Technology); Yueming Wu (Nanyang Technological University); Deqing Zou (Huazhong University of Science and Technology); Yang Liu (Nanyang Technological University); Hai Jin (Huazhong University of Science and Technology)