Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection (ICPC 2023 - Research)

Who

Subroto Nag Pinku, Debajyoti Mondal, Chanchal K. Roy

Track

ICPC 2023 Research

Time Zone

The program is currently displayed in (GMT+10:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 16 May 2023 09:54 - 10:03 at Meeting Room 106 - Keynote / Code Analysis. Chair(s): Christoph Treude, Nicolás Cardozo, Raula Gaikovina Kula, Chaiyong Ragkhitwetsagul

Abstract

Software clones are often introduced when developers reuse code fragments to implement similar functionality in the same or different software systems. Many high-performing clone detection tools today are based on deep learning and are mostly used to detect clones written in the same programming language, although tools for detecting cross-language clones are also emerging rapidly. The popularity of deep learning-based clone detection tools creates an opportunity to investigate how known strategies for boosting the performance of deep learning models could be leveraged to improve clone detection tools. In this paper, we investigate one such strategy, data augmentation, which, unlike in single-language clone detection, has not yet been explored for cross-language clone detection. We show how existing knowledge of transcompilers (source-to-source translators) can be used for data augmentation to boost the performance of cross-language clone detection models, as well as to adapt single-language clone detection models into cross-language clone detection pipelines. To demonstrate the performance boost from data augmentation, we use TransCoder, a pre-trained source-to-source translator. To show how single-language models can be extended to cross-language clone detection, we combine a popular single-language model, Graph Matching Network (GMN), with transcompilers. We evaluated our models on popular benchmark datasets. Our experimental results showed improvements in F1 score (of up to 3% in some cases) for cutting-edge cross-language clone detection models. Even when extending GMN to cross-language clone detection, the models built with data augmentation outperformed the baseline, achieving precision, recall, and F1 scores of 0.90, 0.92, and 0.91, respectively.
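
The idea of transcompiler-based data augmentation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `ClonePair` record, the `augment` function, and the `toy_translate` stand-in (which only tags the code instead of calling a real translator such as TransCoder) are all hypothetical names introduced here, and the sketch assumes that translation preserves semantics, so a translated pair keeps its original label. The small `f1_score` helper only checks that the reported precision (0.90) and recall (0.92) are consistent with the reported F1 score (0.91).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ClonePair:
    """One labelled cross-language pair (hypothetical record layout)."""
    left_code: str
    left_lang: str
    right_code: str
    right_lang: str
    is_clone: bool


# Assumed translator signature: (code, source_lang, target_lang) -> translated code.
Translator = Callable[[str, str, str], str]


def augment(pairs: List[ClonePair], translate: Translator) -> List[ClonePair]:
    """Return the original pairs plus one translated pair per original.

    Each side is translated into the other side's language, so the new pair
    has the same language combination. The label is copied, assuming the
    translator preserves program semantics.
    """
    augmented = list(pairs)
    for p in pairs:
        new_left = translate(p.right_code, p.right_lang, p.left_lang)
        new_right = translate(p.left_code, p.left_lang, p.right_lang)
        augmented.append(
            ClonePair(new_left, p.left_lang, new_right, p.right_lang, p.is_clone)
        )
    return augmented


def toy_translate(code: str, src: str, tgt: str) -> str:
    """Placeholder translator: tags the code rather than translating it."""
    return f"/* {src}->{tgt} */ {code}"


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    seed = [
        ClonePair(
            "int add(int a, int b) { return a + b; }", "java",
            "def add(a, b): return a + b", "python",
            True,
        )
    ]
    for pair in augment(seed, toy_translate):
        print(pair.left_lang, pair.right_lang, pair.is_clone)
    print(round(f1_score(0.90, 0.92), 2))
```

In a real setting, `toy_translate` would be replaced by a call into an actual transcompiler, and the augmented pairs would be added to the training set only, never to the evaluation set.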

Link to Preprint

https://arxiv.org/abs/2303.01435

Subroto Nag Pinku

University of Saskatchewan

Canada

Debajyoti Mondal

University of Saskatchewan

Canada

Chanchal K. Roy

University of Saskatchewan

Canada

Session Program

Tue 16 May
Displayed time zone: Hobart

09:00 - 10:30	Keynote / Code Analysis (Discussion / Tool Demonstration / Research / Early Research Achievements (ERA) / ICPC Keynotes) at Meeting Room 106. Chair(s): Christoph Treude (University of Melbourne), Nicolás Cardozo (Universidad de los Andes), Raula Gaikovina Kula (Nara Institute of Science and Technology), Chaiyong Ragkhitwetsagul (Mahidol University, Thailand)

09:00 45m Keynote		Kobi Leins: Guidance on more than just standing upright to create safe models, software and use of data (ICPC Keynotes)
09:45 9m Full-paper		Implant Global and Local Hierarchy Information to Sequence based Code Representation Models (Research). Kechi Zhang (Peking University, China), Zhuo Li, Zhi Jin (Peking University), Ge Li (Peking University)
09:54 9m Full-paper		Pathways to Leverage Transcompiler based Data Augmentation for Cross-Language Clone Detection (Research). Subroto Nag Pinku (University of Saskatchewan), Debajyoti Mondal (University of Saskatchewan), Chanchal K. Roy (University of Saskatchewan)
10:03 5m Short-paper		Investigating the Generalizability of Deep Learning-based Clone Detectors (Early Research Achievements (ERA)). Eunjong Choi (Kyoto Institute of Technology), Norihiro Fuke (Osaka University), Yuji Fujiwara (Osaka University), Norihiro Yoshida (Ritsumeikan University), Katsuro Inoue (Nanzan University)
10:08 5m Short-paper		UnityLint: A Bad Smell Detector for Unity (Tool Demonstration). Matteo Bosco (University of Sannio, Italy), Pasquale Cavoto (University of Sannio, Italy), Augusto Ungolo (University of Sannio, Italy), Biruk Asmare Muse (Polytechnique Montréal), Foutse Khomh (Polytechnique Montréal), Vittoria Nardone, Massimiliano Di Penta (University of Sannio, Italy)
10:13 17m Panel		Discussion 5 (Discussion)