On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages
Pre-trained Language Models (PLMs) such as CodeBERT and GraphCodeBERT, trained on large corpora of code, have recently displayed promising results on Software Engineering (SE) downstream tasks. A PLM is most useful if it can be leveraged to improve performance on code corpora written in low-resource programming languages, where training data is limited. In this work, we study the impact of PLMs on a low-resource programming language corpus; specifically, we choose Ruby as the study subject.
A recent study by Ahmed and Devanbu reported that fine-tuning multilingual PLMs on a multilingual code corpus achieves higher performance than fine-tuning on code written in just one programming language. However, no analysis was made with respect to monolingual PLMs. Furthermore, some programming languages are inherently different, and code written in one language usually cannot be interchanged with code in another; e.g., Ruby and Java code have very different structures. To better understand how monolingual and multilingual PLMs affect different programming languages, we investigate 1) the performance of PLMs on Ruby for two popular SE tasks, Code Summarization and Code Search, 2) the strategy (for selecting programming languages) that works well when fine-tuning multilingual PLMs for Ruby, and 3) the performance of the fine-tuned PLMs on Ruby at different code lengths. Here, we bin the Ruby code by its number of tokens; understanding performance at different code lengths will enable developers to make more informed decisions about the use of PLMs on their code.
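The length-binning setup can be illustrated with a small sketch. The bin boundaries and whitespace tokenization below are illustrative assumptions only; the paper's actual bins and tokenizer may differ.

```python
from collections import defaultdict

def bin_by_token_count(snippets, bin_edges=(50, 100, 150)):
    """Group code snippets into length bins by token count.

    `bin_edges` are hypothetical boundaries, and whitespace splitting
    stands in for whatever tokenizer the study actually uses.
    """
    bins = defaultdict(list)
    for code in snippets:
        n_tokens = len(code.split())
        for edge in bin_edges:
            if n_tokens <= edge:
                bins[f"<={edge}"].append(code)
                break
        else:  # longer than the largest edge
            bins[f">{bin_edges[-1]}"].append(code)
    return bins

ruby_snippets = [
    "def add(a, b) a + b end",    # 7 tokens, falls in the "<=50" bin
    "puts 'hello, world' " * 60,  # 180 tokens, falls in the ">150" bin
]
binned = bin_by_token_count(ruby_snippets)
```

Per-bin task scores can then be computed separately to see how performance varies with snippet length.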
In this work, we analyze over a hundred pre-trained and fine-tuned models. Our results show that 1) multilingual PLMs have a higher time-to-performance ratio (the duration of fine-tuning divided by the BLEU, METEOR, or MRR score) than monolingual PLMs, 2) our proposed strategy for selecting target programming languages on which to fine-tune multilingual PLMs is effective: it not only reduces fine-tuning time but also achieves higher performance on the Code Summarization and Code Search tasks, and 3) our proposed strategy consistently shows good performance across different code lengths.
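For reference, Mean Reciprocal Rank (MRR), the Code Search metric named above, averages the reciprocal rank of the first correct result over all queries. A minimal sketch (the helper and its toy inputs are illustrative, not the paper's evaluation code):

```python
def mean_reciprocal_rank(ranked_lists, gold_ids):
    """Average of 1/rank of the correct item in each query's ranking.

    Queries whose correct item is absent from the ranking contribute 0.
    """
    total = 0.0
    for candidates, gold in zip(ranked_lists, gold_ids):
        if gold in candidates:
            total += 1.0 / (candidates.index(gold) + 1)
    return total / len(ranked_lists)

# Two queries: correct snippet ranked 1st and 2nd -> MRR = (1 + 1/2) / 2
mrr = mean_reciprocal_rank([["s1", "s2"], ["s2", "s1"]], ["s1", "s1"])
```

BLEU and METEOR, the Code Summarization metrics, compare generated summaries against reference text rather than rankings, so they require reference implementations (e.g., an n-gram overlap scorer) and are not sketched here.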
Session: Mon 16 May, 21:00 - 21:50 (Eastern Time, US & Canada)
HELoC: Hierarchical Contrastive Learning of Source Code Representation
Xiao Wang (Shandong Normal University), Qiong Wu (Shandong Normal University), Hongyu Zhang (University of Newcastle), Chen Lyu (Shandong Normal University), Xue Jiang (Shandong Normal University), Zhuoran Zheng (Nanjing University of Science and Technology), Lei Lyu (Shandong Normal University), Songlin Hu (Institute of Information Engineering, Chinese Academy of Sciences)

Exploring GNN Based Program Embedding Technologies for Binary related Tasks
Yixin Guo (Peking University), Pengcheng Li (Google, Inc.), Yingwei Luo (Peking University), Xiaolin Wang (Peking University), Zhenlin Wang (Michigan Technological University)

Learning Heterogeneous Type Information in Program Graphs
Kechi Zhang (Peking University), Wenhan Wang (Nanyang Technological University), Huangzhao Zhang (Peking University), Ge Li (Peking University), Zhi Jin (Peking University)

Unified Abstract Syntax Tree Representation Learning for Cross-language Program Classification
Kesu Wang (Nanjing University), Meng Yan (Chongqing University), He Zhang (Nanjing University), Haibo Hu (Chongqing University)

On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages
Fuxiang Chen (University of British Columbia), Fatemeh Hendijani Fard (University of British Columbia), David Lo (Singapore Management University), Timofey Bryksin (JetBrains Research; HSE University)

Q&A - Paper Session 9