On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages
Pre-trained Language Models (PLMs) such as CodeBERT and GraphCodeBERT, trained on large corpora of code, have recently displayed promising results on Software Engineering (SE) downstream tasks. A PLM is most useful if it can be leveraged to improve performance on code corpora written in low-resource programming languages, where training data is limited. In this work, we study the impact of PLMs on a low-resource programming language corpus; specifically, we choose Ruby as the study subject.
A recent study by Ahmed and Devanbu reported that fine-tuning multilingual PLMs on a multilingual code corpus achieves higher performance than fine-tuning on code written in just one programming language. However, no analysis was made with respect to monolingual PLMs. Furthermore, some programming languages are inherently different, and code written in one language usually cannot be interchanged with code in another; e.g., Ruby and Java code have very different structures. To better understand how monolingual and multilingual PLMs affect different programming languages, we investigate 1) the performance of PLMs on Ruby for two popular SE tasks, Code Summarization and Code Search, 2) the strategy (for selecting programming languages) that works well when fine-tuning multilingual PLMs for Ruby, and 3) the performance of the fine-tuned PLMs on Ruby at different code lengths. Here, we bin the Ruby code by its number of tokens; understanding performance at different code lengths will enable developers to make more informed decisions about the use of PLMs on their code.
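The length-binning setup can be illustrated with a small sketch. The bin boundaries and whitespace tokenization below are illustrative assumptions only; the paper's actual bins and tokenizer may differ.

```python
from collections import defaultdict

def bin_by_token_count(snippets, bin_edges=(50, 100, 150)):
    """Group code snippets into length bins by token count.

    `bin_edges` are hypothetical boundaries, and whitespace splitting
    stands in for whatever tokenizer the study actually uses.
    """
    bins = defaultdict(list)
    for code in snippets:
        n_tokens = len(code.split())
        for edge in bin_edges:
            if n_tokens <= edge:
                bins[f"<={edge}"].append(code)
                break
        else:  # longer than the largest edge
            bins[f">{bin_edges[-1]}"].append(code)
    return bins

ruby_snippets = [
    "def add(a, b) a + b end",    # 7 tokens, falls in the "<=50" bin
    "puts 'hello, world' " * 60,  # 180 tokens, falls in the ">150" bin
]
binned = bin_by_token_count(ruby_snippets)
```

Per-bin task scores can then be computed separately to see how performance varies with snippet length.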
In this work, we analyze over a hundred pre-trained and fine-tuned models. Our results show that 1) multilingual PLMs have a higher time-to-performance ratio (the duration of fine-tuning divided by the BLEU, METEOR, or MRR score) than monolingual PLMs, 2) our proposed strategy for selecting target programming languages on which to fine-tune multilingual PLMs is effective: it not only reduces fine-tuning time but also achieves higher performance on the Code Summarization and Code Search tasks, and 3) our proposed strategy consistently shows good performance across different code lengths.
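For reference, Mean Reciprocal Rank (MRR), the Code Search metric named above, averages the reciprocal rank of the first correct result over all queries. A minimal sketch (the helper and its toy inputs are illustrative, not the paper's evaluation code):

```python
def mean_reciprocal_rank(ranked_lists, gold_ids):
    """Average of 1/rank of the correct item in each query's ranking.

    Queries whose correct item is absent from the ranking contribute 0.
    """
    total = 0.0
    for candidates, gold in zip(ranked_lists, gold_ids):
        if gold in candidates:
            total += 1.0 / (candidates.index(gold) + 1)
    return total / len(ranked_lists)

# Two queries: correct snippet ranked 1st and 2nd -> MRR = (1 + 1/2) / 2
mrr = mean_reciprocal_rank([["s1", "s2"], ["s2", "s1"]], ["s1", "s1"])
```

BLEU and METEOR, the Code Summarization metrics, compare generated summaries against reference text rather than rankings, so they require reference implementations (e.g., an n-gram overlap scorer) and are not sketched here.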
Session: Mon 16 May, 21:00 - 21:50 (Eastern Time, US & Canada)
HELoC: Hierarchical Contrastive Learning of Source Code Representation
Xiao Wang (Shandong Normal University), Qiong Wu (Shandong Normal University), Hongyu Zhang (University of Newcastle), Chen Lyu (Shandong Normal University), Xue Jiang (Shandong Normal University), Zhuoran Zheng (Nanjing University of Science and Technology), Lei Lyu (Shandong Normal University), Songlin Hu (Institute of Information Engineering, Chinese Academy of Sciences)

Exploring GNN Based Program Embedding Technologies for Binary related Tasks
Yixin Guo (Peking University), Pengcheng Li (Google, Inc.), Yingwei Luo (Peking University), Xiaolin Wang (Peking University), Zhenlin Wang (Michigan Technological University)

Learning Heterogeneous Type Information in Program Graphs
Kechi Zhang (Peking University), Wenhan Wang (Nanyang Technological University), Huangzhao Zhang (Peking University), Ge Li (Peking University), Zhi Jin (Peking University)

Unified Abstract Syntax Tree Representation Learning for Cross-language Program Classification
Kesu Wang (Nanjing University), Meng Yan (Chongqing University), He Zhang (Nanjing University), Haibo Hu (Chongqing University)

On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages
Fuxiang Chen (University of British Columbia), Fatemeh Hendijani Fard (University of British Columbia), David Lo (Singapore Management University), Timofey Bryksin (JetBrains Research; HSE University)

Q&A - Paper Session 9