Learning Heterogeneous Abstract Code Graph Representations For Program Comprehension
Program comprehension is a fundamental activity in software engineering. However, understanding source code efficiently and accurately is challenging, because programs with similar semantics can differ in syntax. Recent state-of-the-art research has demonstrated that combining deep learning techniques with structural information from source code, specifically AST-based static graphs, can enhance the extraction of essential features from source programs. Control flow and data flow information can express richer program semantics, yet existing studies often overlook their heterogeneous integration when constructing program static graphs. This oversight discards the edge-type information of the static graph, potentially impeding program comprehension.
In this paper, we model the source program as a heterogeneous static graph and use a Relational Graph Convolutional Network (R-GCN) for feature extraction. Specifically, we present a new method for constructing a program static graph, termed the Heterogeneous Abstract Code Graph (HACG), and employ R-GCN to generate HACG-based representations for code classification and code clone detection. We evaluate our method on two large source code datasets: CodeNet, introduced by IBM, and BigCloneBench. The experimental results show that our approach outperforms existing methods, achieving a code classification accuracy of 97.38% and an average F1-score of 98.34% in code clone detection.
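To make the role of edge types concrete, the following is a minimal sketch of one R-GCN propagation step over a small typed graph. It is not the authors' implementation: the edge-type names (`ast`, `control_flow`, `data_flow`), the function `rgcn_layer`, and the hand-picked weights are assumptions chosen for illustration, and a real model would learn the weights and stack several layers. The update follows the standard R-GCN form, in which each relation has its own weight matrix and messages are averaged per relation before being added to a self-connection term.

```python
from collections import defaultdict


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def rgcn_layer(features, edges, rel_weights, self_weight):
    """One R-GCN step: ReLU(W_0 h_i + sum_r mean_{j in N_r(i)} W_r h_j).

    features:    list of node feature vectors
    edges:       list of (source, target, relation) triples
    rel_weights: dict mapping relation name -> weight matrix W_r
    self_weight: weight matrix W_0 for the self-connection
    """
    # Group incoming neighbours by (target node, relation).
    neighbours = defaultdict(list)
    for src, dst, rel in edges:
        neighbours[(dst, rel)].append(src)

    updated = []
    for node, h in enumerate(features):
        acc = matvec(self_weight, h)
        for rel, weight in rel_weights.items():
            sources = neighbours.get((node, rel), [])
            for src in sources:
                message = matvec(weight, features[src])
                # Normalise by the number of neighbours under this relation.
                acc = [a + m / len(sources) for a, m in zip(acc, message)]
        updated.append([max(0.0, a) for a in acc])  # ReLU
    return updated


# Toy graph with three nodes and one edge of each (assumed) type.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1, "ast"), (2, 1, "control_flow"), (0, 2, "data_flow")]

identity = [[1.0, 0.0], [0.0, 1.0]]
rel_weights = {
    "ast": identity,
    "control_flow": [[2.0, 0.0], [0.0, 2.0]],
    "data_flow": [[0.0, 0.0], [0.0, 0.0]],
}

hidden = rgcn_layer(features, edges, rel_weights, identity)
print(hidden)
```

Because each relation carries its own weight matrix, the same neighbour contributes differently depending on whether it is reached through a syntactic, control-flow, or data-flow edge. A homogeneous graph would collapse these into a single matrix, which is the loss of edge-type information described above.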