Unified Abstract Syntax Tree Representation Learning for Cross-language Program Classification (ICPC 2022 - Research)

Who

Kesu Wang, Meng Yan, He Zhang, Haibo Hu

Track

ICPC 2022 Research

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.


When

Mon 16 May 2022 21:21 - 21:28 at ICPC room - Session 9: Program Representation 2 Chair(s): Lingxiao Jiang

Abstract

Program classification can be regarded as a high-level abstraction of code. It lays a foundation for various source code comprehension tasks and has a wide range of applications in software engineering, such as code clone detection, code smell classification, and defect classification. Cross-language program classification can support code transfer between programming languages and promote cross-language code reuse, thereby helping developers write code quickly and reducing the development time of code transfer. Most existing studies focus on the semantic learning of code, whilst few are devoted to cross-language tasks. The main challenge of cross-language program classification is how to extract semantic features of different programming languages. To cope with this difficulty, we propose a Unified Abstract Syntax Tree (UAST) neural network. The core idea of UAST consists of two unified mechanisms. First, UAST learns an AST representation by unifying the AST traversal sequence and the graph-like AST structure to capture semantic code features. Second, we construct a mechanism called unified vocabulary, which reduces the feature gap between different programming languages and thereby enables cross-language program classification. In addition, we collect a dataset containing 20,000 files in five programming languages, which can be used as a benchmark dataset for the cross-language program classification task. We conducted experiments on two datasets, and the results show that our proposed approach outperforms the state-of-the-art baselines in terms of four evaluation metrics (Precision, Recall, F1-score, and Accuracy).
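To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch and not the authors' implementation. It assumes Python's standard `ast` module as the parser, and the `UNIFIED_VOCAB` table, the `unify` helper, and the `ast_views` function are hypothetical names introduced here for illustration. It derives two views of one AST (a pre-order traversal sequence and a parent-child edge list) and maps language-specific node names onto a shared token set.

```python
import ast

# Hypothetical shared vocabulary: language-specific AST node names are mapped
# to common tokens. The abstract does not describe the actual mapping used by
# UAST, so these entries are illustrative assumptions only.
UNIFIED_VOCAB = {
    # Python node names (from the standard `ast` module)
    "FunctionDef": "FUNC_DECL",
    "For": "LOOP",
    "While": "LOOP",
    "If": "BRANCH",
    "Return": "RETURN",
    # Java-style node names (illustrative; not produced by this sketch)
    "MethodDeclaration": "FUNC_DECL",
    "ForStatement": "LOOP",
    "WhileStatement": "LOOP",
    "IfStatement": "BRANCH",
    "ReturnStatement": "RETURN",
}


def unify(node_type: str) -> str:
    """Map a language-specific node name to a shared token ("OTHER" if unknown)."""
    return UNIFIED_VOCAB.get(node_type, "OTHER")


def ast_views(source: str):
    """Return two views of a Python program's AST.

    tokens: unified node tokens in pre-order (the traversal-sequence view)
    edges:  (parent_index, child_index) pairs (the graph-like view)
    """
    tree = ast.parse(source)
    tokens = []
    edges = []

    def visit(node, parent_idx):
        idx = len(tokens)
        tokens.append(unify(type(node).__name__))
        if parent_idx is not None:
            edges.append((parent_idx, idx))
        for child in ast.iter_child_nodes(node):
            visit(child, idx)

    visit(tree, None)
    return tokens, edges


if __name__ == "__main__":
    sample = "def f(n):\n    while n:\n        n -= 1\n    return n\n"
    tokens, edges = ast_views(sample)
    print([t for t in tokens if t != "OTHER"])
    print(edges[:5])
```

In a setup like this, a sequence model could consume `tokens` while a graph model consumes `edges`, and a Java parser emitting `MethodDeclaration` or `WhileStatement` would land on the same shared tokens. How UAST actually combines the two views is not specified in the abstract.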

Kesu Wang

Nanjing University

Meng Yan

Chongqing University

He Zhang

Nanjing University

China

Haibo Hu

Chongqing University


Session Program

Mon 16 May
Displayed time zone: Eastern Time (US & Canada)

21:00 - 21:50	Session 9: Program Representation 2 (Research) at ICPC room. Chair(s): Lingxiao Jiang (Singapore Management University)

21:00 7m Talk	HELoC: Hierarchical Contrastive Learning of Source Code Representation (Research). Xiao Wang (Shandong Normal University), Qiong Wu (Shandong Normal University), Hongyu Zhang (University of Newcastle), Chen Lyu (Shandong Normal University), Xue Jiang (Shandong Normal University), Zhuoran Zheng (Nanjing University of Science and Technology), Lei Lyu (Shandong Normal University), Songlin Hu (Institute of Information Engineering, Chinese Academy of Sciences)
21:07 7m Talk	Exploring GNN Based Program Embedding Technologies for Binary related Tasks (Research). Yixin Guo (Peking University), Pengcheng Li (Google, Inc), Yingwei Luo (Peking University), Xiaolin Wang (Peking University), Zhenlin Wang (Michigan Technological University)
21:14 7m Talk	Learning Heterogeneous Type Information in Program Graphs (Research). Kechi Zhang (Peking University), Wenhan Wang (Nanyang Technological University), Huangzhao Zhang (Peking University), Ge Li (Peking University), Zhi Jin (Peking University)
21:21 7m Talk	Unified Abstract Syntax Tree Representation Learning for Cross-language Program Classification (Research). Kesu Wang (Nanjing University), Meng Yan (Chongqing University), He Zhang (Nanjing University), Haibo Hu (Chongqing University)
21:28 7m Talk	On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages (Research). Fuxiang Chen (University of British Columbia), Fatemeh Hendijani Fard (University of British Columbia), David Lo (Singapore Management University), Timofey Bryksin (JetBrains Research; HSE University)
21:35 15m Live Q&A	Q&A-Paper Session 9 (Research)
