ICSME 2025
Sun 7 - Fri 12 September 2025 Auckland, New Zealand

This program is tentative and subject to change.

Fri 12 Sep 2025 13:30 - 13:45 at Case Room 3 260-055 - Session 15 - Reuse 2 Chair(s): Elliott Wen

As one of the most detrimental code smells, code clones significantly increase software maintenance costs and heighten the risk of propagating vulnerabilities, making their detection a critical challenge in software engineering. Abstract Syntax Trees (ASTs) have emerged as the predominant representation in deep learning-based code clone detection models, owing to their ability to capture the syntactic structure of programs in detail. However, ASTs are limited: they primarily encode syntactic information and often fail to capture the deeper semantic relationships in code. To overcome this limitation, recent studies have sought to enrich AST-based representations by integrating additional semantic information, such as Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs), thereby constructing more comprehensive graph-based representations. These enriched models aim to enhance the accuracy and robustness of code clone detection. Despite these advancements, the effectiveness of the various enriched AST-based representations and their compatibility with different graph-based machine learning techniques remain open questions that warrant further investigation.
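To make the idea of an enriched AST concrete, here is a minimal sketch in Python. It is not the construction used in the paper: the function name `build_hybrid_graph`, the edge labels "ast", "cfg" and "dfg", and the flow rules are all assumptions chosen for illustration. It parses Python source with the standard `ast` module, keeps the parent-to-child tree edges, links consecutive statements in a block as a rough stand-in for control flow, and links each variable read to the most recent write of the same name as a rough stand-in for data flow.

```python
import ast


def build_hybrid_graph(source):
    """Return (nodes, edges); edges are (src, dst, kind) triples.

    Illustrative only: "cfg" edges just chain sibling statements and
    "dfg" edges assume straight-line code (no branches or loops).
    """
    nodes = []   # node index -> AST node type name
    edges = []   # typed edges between node indices
    names = []   # (lineno, is_store, col, node index, variable name)

    def visit(node, parent):
        idx = len(nodes)
        nodes.append(type(node).__name__)
        if parent is not None:
            edges.append((parent, idx, "ast"))
        if isinstance(node, ast.Name):
            is_store = isinstance(node.ctx, ast.Store)
            names.append((node.lineno, is_store, node.col_offset, idx, node.id))
        for _field, value in ast.iter_fields(node):
            children = value if isinstance(value, list) else [value]
            prev_stmt = None
            for child in children:
                if not isinstance(child, ast.AST):
                    continue
                child_idx = visit(child, idx)
                # Chain consecutive statements of the same block.
                if isinstance(child, ast.stmt):
                    if prev_stmt is not None:
                        edges.append((prev_stmt, child_idx, "cfg"))
                    prev_stmt = child_idx
        return idx

    visit(ast.parse(source), None)

    # Reads on a line are processed before writes on that line, so that
    # `x = x + 1` links the read of x to the previous write, not to itself.
    last_write = {}
    for _line, is_store, _col, idx, name in sorted(names):
        if is_store:
            last_write[name] = idx
        elif name in last_write:
            edges.append((last_write[name], idx, "dfg"))
    return nodes, edges


nodes, edges = build_hybrid_graph("a = 1\nb = a + 2\n")
for src, dst, kind in edges:
    if kind != "ast":
        print(kind, nodes[src], "->", nodes[dst])
```

Keeping the edge kind as a label lets the same graph be fed to a model either as a plain AST (filter to "ast" edges) or as a combined AST + CFG + DFG graph, which is the kind of comparison the abstract describes.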

In this paper, we present a comprehensive empirical study that evaluates the effectiveness of AST-based hybrid graph representations in Graph Neural Network (GNN)-based code clone detection. We systematically compare various hybrid representations, which combine ASTs with control flow (CFG) and data flow (DFG) information, and assess their impact on detection accuracy across multiple GNN architectures. Our experiments, conducted on the widely used BigCloneBench dataset, reveal that hybrid representations influence GNN performance in distinct and often unexpected ways. While the integration of AST + CFG + DFG consistently boosts accuracy for convolution- and attention-based models like GCN and GAT, flow-augmented ASTs (FA-AST) introduce unnecessary structural complexity that frequently degrades performance. Among the models evaluated, the Graph Matching Network (GMN) emerges as the clear standout, achieving superior performance at a lower computational cost, even with the standard AST representation. GMN delivers the highest or near-highest results across all metrics, demonstrating its ability to capture cross-code similarities, a critical factor for accurate clone detection. This performance underscores GMN’s robustness and efficiency, effectively reducing the need for enriched representations.
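The cross-code similarity that the abstract credits to GMN can be illustrated with a small sketch of cross-graph attention. This is a simplified, assumed version and not the model evaluated in the paper: the function name `cross_graph_mismatch` is hypothetical, node embeddings are plain lists of floats, similarity is a dot product, and there are no learned weights or message-passing layers. For each node of one code graph, it attends over all nodes of the other graph and returns the difference between the node and its attention-weighted match. A difference near zero means the node has a close counterpart in the other fragment.

```python
import math


def cross_graph_mismatch(h1, h2):
    """For each vector h_i in h1, return h_i - sum_j a_ij * h_j over h2,
    where a_ij is a softmax over dot-product similarities (assumed metric)."""
    result = []
    for hi in h1:
        scores = [sum(a * b for a, b in zip(hi, hj)) for hj in h2]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        attended = [
            sum(w * hj[d] for w, hj in zip(weights, h2))
            for d in range(len(hi))
        ]
        result.append([x - y for x, y in zip(hi, attended)])
    return result


# Toy embeddings: graph_b repeats the single node of graph_a.
graph_a = [[1.0, 2.0]]
graph_b = [[1.0, 2.0], [1.0, 2.0]]
print(cross_graph_mismatch(graph_a, graph_b))
```

Because the mismatch is computed jointly over both graphs, a model built on it compares the two code fragments directly instead of embedding each one in isolation, which is one plausible reading of why a plain AST may already suffice for such a model.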

Our findings offer actionable insights for designing graph-based representations in code clone detection, emphasizing that not all hybrid representations yield improvements and that model architecture plays a pivotal role in leveraging these representations effectively. This study also provides a practical roadmap for optimizing representation and model selection in real-world applications.


Fri 12 Sep

Displayed time zone: Auckland, Wellington

13:30 - 15:00
Session 15 - Reuse 2
NIER Track / Industry Track / Research Papers Track at Case Room 3 260-055
Chair(s): Elliott Wen The University of Auckland
13:30
15m
AST-Enhanced or AST-Overloaded? The Surprising Impact of Hybrid Graph Representations on Code Clone Detection
Research Papers Track
Zixian Zhang School of Computer Science, University of Galway, Takfarinas Saber School of Computer Science, University of Galway
13:45
10m
Client–Library Compatibility Testing with API Interaction Snapshots
NIER Track
Gustave Monce Univ. Bordeaux, Bordeaux INP, CNRS, LaBRI, Thomas Degueule CNRS, Jean-Rémy Falleri Univ. Bordeaux, Bordeaux INP, CNRS, LaBRI. Institut Universitaire de France., Romain Robbes CNRS, LaBRI, University of Bordeaux
Pre-print
13:55
10m
Prompting Matters: Assessing the Effect of Prompting Techniques on LLM-Generated Class Code
NIER Track
Adam Yuen University of Calgary, John Pangas University of Calgary, Md Mainul Hasan Polash University of Calgary, Ahmad Abdellatif University of Calgary
14:05
10m
From First Use to Final Commit: Studying the Evolution of Multi-CI Service Adoption
NIER Track
Nitika Chopra Trent University, Taher A. Ghaleb Trent University
Pre-print
14:15
15m
Automated Recovery of Software Product Lines from Legacy Configurable Codebases
Industry Track
Tewfik Ziadi University of Doha for Science and Technology (UDST), Karim Ghallab Sorbonne Université - RedFabriQ/Mobioos, Zaak Chalal RedFabriQ/Mobioos
14:30
15m
Integrating Rules and Semantics for LLM-Based C-to-Rust Translation
Industry Track
Feng Luo Harbin Institute of Technology (Shenzhen), Kexing Ji Harbin Institute of Technology (Shenzhen), Cuiyun Gao Harbin Institute of Technology, Shuzheng Gao Chinese University of Hong Kong, jiafeng Harbin Institute of Technology (Shenzhen), Kui Liu Huawei, Xin Xia Zhejiang University, Michael Lyu The Chinese University of Hong Kong