CONCORD: A DSL for Generating Simplified and Scalable Graph-Based Code Representations
This program is tentative and subject to change.
Graph-based representations have gained attention for their ability to model the structural and semantic information of source code, capturing the characteristics and features relevant to training deep learning models. However, existing methods face three limitations: they lack flexibility in constructing cross-language graphs, produce non-interoperable outputs, and generate excessively large graphs. These limitations hinder adoption and raise scalability and efficiency issues in graph-based neural network training.
In this work, we introduce CONCORD, a domain-specific language (DSL), to address these challenges. We aim to 1) enable customizable graph-based code representations across programming languages, 2) reduce graph size complexity through simplification heuristics, and 3) improve scalability and reproducibility in software engineering tasks.
CONCORD provides a configurable DSL to automate graph construction and implements heuristics to reduce graph size while preserving critical information. We evaluate its effectiveness on two tasks: code smell detection and vulnerability detection. For each, we compare performance and graph size against baseline representations without simplification heuristics.
On code smell detection, CONCORD preserved 95.1% of baseline performance, and exceeded it by 5% in one setting, while reducing, on average, the number of nodes and edges by 13.11% and 13.67%, respectively. For vulnerability detection, it improved performance by 3.65% over the baseline while reducing the number of nodes and edges by 3.63% and 3.62%, respectively. This demonstrates that CONCORD’s heuristics maintain or enhance performance while improving scalability.
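The node and edge reductions reported above are relative to the baseline graphs built without simplification heuristics. A minimal sketch of that arithmetic follows; the function name and the node counts are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def percent_reduction(baseline: int, simplified: int) -> float:
    """Relative size reduction, in percent, of a simplified graph versus its baseline."""
    if baseline <= 0:
        raise ValueError("baseline count must be positive")
    return 100.0 * (baseline - simplified) / baseline


# Hypothetical counts for illustration only (not results from the paper).
baseline_nodes, simplified_nodes = 10_000, 8_689
print(f"{percent_reduction(baseline_nodes, simplified_nodes):.2f}% fewer nodes")
```

The same function applies unchanged to edge counts.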
CONCORD represents a step towards advancing graph-based code analysis by offering a flexible, language-agnostic approach to generate streamlined code representations. Its simplification heuristics balance performance and scalability, enabling efficient training of neural models without sacrificing accuracy. In addition, it reduces development overhead, promotes reproducibility through standardized representations, and broadens accessibility to graph-based methods for software engineering tasks.
Thu 19 Mar (displayed time zone: Athens)
11:00 - 12:30 | Session 4A - Code Representation and Analysis | Research Track / Tool Demo Track
11:00 | 12m | Talk | GDPO: Dual Learning for Self-Supervised Code Summarization in the Era of Large Language Models | Research Track | Chen Xiao; Wang Shuwei (Institute of Information Engineering, Chinese Academy of Sciences, and University of Chinese Academy of Sciences); Zhang Weize (Institute of Information Engineering, Chinese Academy of Sciences, and University of Chinese Academy of Sciences); Jiang Zhengwei (Institute of Information Engineering, Chinese Academy of Sciences, and University of Chinese Academy of Sciences); Wang Qiuyun (Institute of Information Engineering, Chinese Academy of Sciences, and University of Chinese Academy of Sciences)
11:12 | 12m | Talk | Mind the Merge: Evaluating the Effects of Token Merging on Pre-trained Models for Code | Research Track | Mootez Saad (Dalhousie University); Hao Li (Queen's University); Tushar Sharma (Dalhousie University); Ahmed E. Hassan (Queen's University)
11:25 | 12m | Talk | CONCORD: A DSL for Generating Simplified and Scalable Graph-Based Code Representations | Research Track | Pre-print
11:38 | 12m | Talk | Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition | Research Track | Denis Neumüller (Ulm University); Sebastian Boll (Ulm University); David Schüler (Ulm University); Matthias Tichy (Ulm University)
11:51 | 12m | Talk | A Multi-Modal Retrieval-Augmented Framework for Compiler Backend Generation with LLMs | Research Track | Ming Zhong (SKLP, Institute of Computing Technology, CAS); Fang Lv (Institute of Computing Technology, Chinese Academy of Sciences); Hongna Geng; Xin Sun; Lulin Wang; Huimin Cui (Institute of Computing Technology, Chinese Academy of Sciences); Xiaobing Feng (ICT CAS)
12:04 | 12m | Talk | AdaptVM: An LLVM-Based Function-Adaptive Code Virtualizer | Tool Demo Track
12:17 | 12m | Talk | Static Analysis assisted Knowledge Graph based Automatic Functionality Discovery for Mainframe Applications | Tool Demo Track | Sasaank Janapati; Atul Kumar (IBM Research India); Nandakishore S Menon (IBM Research India); Sridhar Chimalakonda (Indian Institute of Technology Tirupati)