AST-Enhanced or AST-Overloaded? The Surprising Impact of Hybrid Graph Representations on Code Clone Detection (ICSME 2025 - Research Papers Track) - ICSME 2025 - International Conference on Software Maintenance and Evolution

Who

Zixian Zhang, Takfarinas Saber

Track

ICSME 2025 Research Papers Track

This program is tentative and subject to change.

Time Zone

The program is currently displayed in (GMT+12:00) Auckland, Wellington.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Fri 12 Sep 2025 13:30 - 13:45 at Case Room 3 260-055 - Session 15 - Reuse 2 Chair(s): Elliott Wen

Abstract

As one of the most detrimental code smells, code clones significantly increase software maintenance costs and heighten the risk of propagating vulnerabilities, making their detection a critical challenge in software engineering. Abstract Syntax Trees (ASTs) have emerged as the predominant representation in deep learning-based code clone detection models, owing to their ability to meticulously capture the syntactic structure of programs. However, ASTs are inherently limited, as they primarily encode syntactic information and often fail to capture the deeper semantic relationships inherent in code. To overcome this limitation, recent studies have sought to enrich AST-based representations by integrating additional semantic information, such as Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs), thereby constructing more comprehensive graph-based representations. These enriched models aim to enhance the accuracy and robustness of code clone detection. Despite these advancements, the effectiveness of various enriched AST-based representations and their compatibility with different graph-based machine learning techniques remain open questions, warranting further investigation to unlock their full potential in addressing the complexities of code clone detection.

In this paper, we present a comprehensive empirical study to rigorously evaluate the effectiveness of AST-based hybrid graph representations in Graph Neural Network (GNN)-based code clone detection. We systematically compare various hybrid representations, which combine ASTs with control flow (CFG) and data flow (DFG) information, and assess their impact on detection accuracy across multiple GNN architectures. Our experiments, conducted on the widely used BigCloneBench dataset, reveal that hybrid representations influence GNN performance in distinct and often unexpected ways. While the integration of AST + CFG + DFG consistently boosts accuracy for convolution- and attention-based models like GCN and GAT, flow-augmented ASTs (FA-AST) introduce unnecessary structural complexity that frequently degrades performance. Among the models evaluated, the Graph Matching Network (GMN) emerges as the clear standout, achieving superior performance at a lower computational cost, even with the standard AST representation. GMN delivers the highest or near-highest results across all metrics, demonstrating its exceptional ability to capture cross-code similarities, a critical factor for accurate clone detection. This outstanding performance underscores GMN’s robustness and efficiency, effectively reducing the need for enriched representations.

Our findings offer actionable insights for designing graph-based representations in code clone detection, emphasizing that not all hybrid representations yield improvements and that model architecture plays a pivotal role in leveraging these representations effectively. This study also provides a practical roadmap for optimizing representation and model selection in real-world applications.
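
For readers unfamiliar with hybrid graph representations, the following is a minimal illustrative sketch of the general idea of layering control-flow and data-flow edges on top of an AST. It is not the authors' pipeline, which this page does not describe. It assumes Python source parsed with the standard `ast` module, one statement per line, and deliberately crude approximations: "control flow" links each statement to the next one in the same block, and "data flow" links the most recent assignment of a name to each later read, ignoring branches and loops. The function name `build_hybrid_graph` and the edge-type labels are hypothetical.

```python
import ast

# Node kinds that CPython shares as singletons (Load, Store, Add, ...);
# they are skipped so that every graph node is a distinct AST node.
_SHARED = (ast.expr_context, ast.operator, ast.unaryop, ast.boolop, ast.cmpop)


def build_hybrid_graph(source: str):
    """Return (nodes, edges); edges maps an edge type to (src, dst) index pairs."""
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree) if not isinstance(n, _SHARED)]
    index = {id(n): i for i, n in enumerate(nodes)}
    edges = {"ast": [], "cfg": [], "dfg": []}

    # 1. Syntactic edges: parent -> child, exactly as in the AST.
    for parent in nodes:
        for child in ast.iter_child_nodes(parent):
            if id(child) in index:
                edges["ast"].append((index[id(parent)], index[id(child)]))

    # 2. Approximate control flow: statement -> next statement in the same block.
    for node in nodes:
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            if isinstance(block, list):
                for cur, nxt in zip(block, block[1:]):
                    edges["cfg"].append((index[id(cur)], index[id(nxt)]))

    # 3. Approximate data flow: last write of a name -> each later read.
    #    On a given line, reads are processed before writes (so `x = x + 1`
    #    reads the previous x). Assumes one statement per line.
    names = [n for n in nodes if isinstance(n, ast.Name)]
    names.sort(key=lambda n: (n.lineno, isinstance(n.ctx, ast.Store), n.col_offset))
    last_def = {}
    for name in names:
        if isinstance(name.ctx, ast.Store):
            last_def[name.id] = index[id(name)]
        elif name.id in last_def:
            edges["dfg"].append((last_def[name.id], index[id(name)]))

    return nodes, edges


if __name__ == "__main__":
    nodes, edges = build_hybrid_graph("a = 1\nb = a + 2\nprint(b)\n")
    for kind, pairs in edges.items():
        print(kind, len(pairs))
```

Keeping the three edge types in separate lists makes it straightforward to feed a model only the AST, or the AST plus one or both flow views, which is the kind of comparison the abstract describes.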

Zixian Zhang

School of Computer Science, University of Galway

Ireland

Takfarinas Saber