FSE 2025
Mon 23 - Fri 27 June 2025 Trondheim, Norway
co-located with ISSTA 2025
Mon 23 Jun 2025 11:30 - 11:50 at Aurora A - Clones Chair(s): Julia Lawall

This abstract presents an overview of our published research on measuring code clone detection models’ alignment with human-expert intuition (https://doi.org/10.1007/s10664-024-10583-0). Our research evaluates the alignment between model behavior and human-expert understanding in the domain of semantic clone detection through a robust causal inference method, with the goal of providing a true estimate of a model’s semantic clone detection performance based on salient code features. To this end, we introduce and implement a causal interpretation framework based on the Neyman-Rubin causal model to understand the factors influencing the decision-making of code models for semantic code clone detection. We systematically investigate the causal relationships between code mutations and model predictions, which allows us to determine whether a model’s behavior is based on actual semantic similarities or confounded by irrelevant code features. Our key innovation is the use of expert-annotated code to generate counterfactual code mutations and thereby obtain causal explanations of a model’s clone predictions. This approach allows for a robust evaluation of how changes in input affect a model’s behavior, following the principles of causal inference to provide insightful analyses of model performance and robustness.
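
As a brief aside for readers unfamiliar with the Neyman-Rubin model, the following potential-outcomes sketch (our own notation here, not taken verbatim from the paper) captures the quantity that the code-mutation interventions estimate: the causal effect of a mutation on a model’s clone prediction.

% Y_i(0): model prediction for code pair i in its original form;
% Y_i(1): prediction for the same pair after applying mutation \Delta_m.
\[
  \tau_i \;=\; Y_i(1) - Y_i(0), \qquad
  \widehat{\mathrm{ATE}}_{\Delta_m} \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i(1) - Y_i(0)\bigr)
\]

Because every code pair can be fed to a model both before and after mutation, both potential outcomes are directly observable, which is what makes the comparison of the post-intervention prediction against a hypothetical outcome (described below) a counterfactual check rather than a purely observational one.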

Figure 1 provides an overview of our evaluation methodology. The figure shows a human evaluator performing clone and code labeling on the original method pairs ($o$) in our data set. We then perform label resolution, done manually for clone labels and through automated means for code labels. The original method pairs are passed to the clone detection models to obtain model predictions $P_{o}$. From $P_{o}$, we derive confusion matrices recording the numbers of correct and incorrect predictions. We then mutate the method pairs by applying a specific mutation (as an intervention $\Delta_{m}$) to each pair depending on the model’s prediction outcome, using our mutation generation technique to generate code pair mutations based on the code labels. The mutated code pairs ($\mu$) are passed to the clone detection models to obtain predictions ($P_{\mu}$). As part of our causal inference mechanism, we check whether the prediction after intervention ($P_{\mu}$) matches the expected hypothetical outcome ($P_{h}$). From this process, we derive causal explanations of each model’s predictions. Finally, we measure the causal effects of our interventions and use our alignment metric to quantify how well a model’s reasoning for semantic code clone detection aligns with human intuition.
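
To make the per-pair causal check concrete, the following minimal Python sketch illustrates the comparison described above. The names (PairRecord, causal_check, the stand-in model) are hypothetical and are not the released implementation; they only show how $P_{o}$, $P_{\mu}$, and the hypothetical outcome $P_{h}$ relate.

# Hypothetical sketch of the per-pair causal check; names are illustrative,
# not the released implementation at the GitHub link below.
from dataclasses import dataclass
from typing import Callable

@dataclass
class PairRecord:
    original: tuple[str, str]   # original Java method pair (o)
    mutated: tuple[str, str]    # counterfactual mutation of the pair (mu)
    expected_outcome: bool      # hypothetical prediction P_h under human-like reasoning

def causal_check(model: Callable[[str, str], bool], rec: PairRecord) -> dict:
    p_o = model(*rec.original)   # prediction on the original pair, P_o
    p_mu = model(*rec.mutated)   # prediction after the intervention, P_mu
    return {
        "p_o": p_o,
        "p_mu": p_mu,
        "effect": p_o != p_mu,                                # did the mutation flip the prediction?
        "matches_hypothesis": p_mu == rec.expected_outcome,   # does it match the expected outcome?
    }

# Example usage with a stand-in model that always predicts "clone":
always_clone = lambda a, b: True
record = PairRecord(("int f(){...}", "int g(){...}"),
                    ("int f(){...}", "int g(){}"),
                    expected_outcome=False)
print(causal_check(always_clone, record))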

For our experimental evaluation, we curate a dataset of 280 Java code pairs and assign each code pair a label indicating its clone status. We also manually label code segments in each code pair to mark core and/or non-core causes of similarities and/or differences according to human understanding. Using these human-annotated causes of similarities or differences, we perform counterfactual code mutations by removing the annotated code. We then input the mutated code pairs to the various code clone detection models and compare their predictions on the mutated pairs against predefined hypothetical outcomes. Predictions on mutated method pairs that match the hypothetical outcomes identify the causes of a model’s predictions, from which we obtain causal explanations of the models’ behavior. We compare a state-of-the-art high-speed semantic clone detection model, CodeGraph4CCDetector, against three state-of-the-art large language models trained on code, namely CodeBERT, CodeT5, and GPT-3.5-Turbo, and evaluate how well each model’s reasoning aligns with human-expert reasoning. We evaluate each model’s alignment by aggregating individual quality metrics and identify the model most aligned with human intuition for semantic code clone detection. The individual quality metrics measure the average causal effects of code mutations on a model’s predictions, its similarity intuition alignment, its robustness to confounding influence, and its prediction consistency. Our results indicate that a model’s accuracy on semantic clone detection does not correlate with its alignment with human intuition: GPT-3.5-Turbo had the lowest accuracy yet the highest alignment score.
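
The sketch below illustrates, with hypothetical metric names and a simple unweighted mean, how per-model quality metrics could be aggregated into a single alignment score; the paper defines the actual metrics and their aggregation, so treat this only as an outline of the idea.

# Illustrative aggregation of the four quality metrics into one alignment score.
# The metric names and the unweighted mean are assumptions for this example;
# the paper defines the actual metric formulas and weighting.
def alignment_score(avg_causal_effect: float,
                    similarity_intuition_alignment: float,
                    confounding_robustness: float,
                    prediction_consistency: float) -> float:
    """Combine per-model quality metrics (each assumed to lie in [0, 1])."""
    metrics = [avg_causal_effect, similarity_intuition_alignment,
               confounding_robustness, prediction_consistency]
    return sum(metrics) / len(metrics)

# Hypothetical numbers, purely to show how models would be ranked by the score:
models = {
    "CodeGraph4CCDetector": alignment_score(0.6, 0.7, 0.5, 0.8),
    "GPT-3.5-Turbo": alignment_score(0.7, 0.8, 0.7, 0.9),
}
print(max(models, key=models.get))  # model with the highest alignment score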

Through a comprehensive understanding of the causes of various models’ decision-making, we aspire to lay the foundation for more trustworthy semantic code clone detection systems. To support replication of our experiments and results, we have made our code, labeled data, our code and clone labeling tool plugin for the Visual Studio Code IDE, and our experimental results available on GitHub (https://github.com/shamsa-abid/Code-Clone-Causal-Interpretation).

Mon 23 Jun

Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

10:30 - 12:30
10:30
20m
Talk
An empirical study of business process models and model clones on GitHub
Journal First
Mahdi Saeedi Nikoo Eindhoven University of Technology, Sangeeth Kochanthara Netherlands' Space Observatory - ASTRON, Önder Babur Eindhoven University of Technology, Mark van den Brand Eindhoven University of Technology
10:50
20m
Talk
The Struggles of LLMs in Cross-lingual Code Clone Detection
Research Papers
Micheline Bénédicte MOUMOULA University of Luxembourg, Abdoul Kader Kaboré University of Luxembourg, Jacques Klein University of Luxembourg, Tegawendé F. Bissyandé University of Luxembourg
11:10
20m
Talk
Clone Detection for Smart Contracts: How Far Are We?
Research Papers
Zuobin Wang Zhejiang University, Zhiyuan Wan Zhejiang University, Yujing Chen Zhejiang University, Yun Zhang Hangzhou City University, David Lo Singapore Management University, Difan Xie Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Xiaohu Yang Zhejiang University
11:30
20m
Talk
Measuring Model Alignment for Code Clone Detection Using Causal Interpretation
Journal First
Shamsa Abid National University of Computer and Emerging Sciences, Xuemeng Cai Singapore Management University, Lingxiao Jiang Singapore Management University
11:50
20m
Talk
An Empirical Study of Code Clones from Commercial AI Code Generators
Research Papers
Weibin Wu Sun Yat-sen University, Haoxuan Hu Sun Yat-sen University, China, Zhaoji Fan Sun Yat-sen University, Yitong Qiao Sun Yat-sen University, China, Yizhan Huang The Chinese University of Hong Kong, Yichen Li The Chinese University of Hong Kong, Zibin Zheng Sun Yat-sen University, Michael Lyu The Chinese University of Hong Kong
12:10
20m
Talk
VexIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity
Journal First
S. VenkataKeerthy IIT Hyderabad, Soumya Banerjee IIT Hyderabad, Sayan Dey IIT Hyderabad, Yashas Andaluri IIT Hyderabad, Raghul PS IIT Hyderabad, Subrahmanyam Kalyanasundaram IIT Hyderabad, Fernando Magno Quintão Pereira Federal University of Minas Gerais, Ramakrishna Upadrasta IIT Hyderabad

Information for Participants
Mon 23 Jun 2025 10:30 - 12:30 at Aurora A - Clones Chair(s): Julia Lawall
Info for room Aurora A:

Aurora A is the first room in the Aurora wing.

When facing the main Cosmos Hall, access to the Aurora wing is on the right, close to the side entrance of the hotel.
