Enhancing Code Generation for Low-Resource Languages: No Silver Bullet
This program is tentative and subject to change.
The advent of Large Language Models (LLMs) has significantly advanced the field of automated code generation. LLMs rely on large and diverse datasets to learn the syntax, semantics, and usage patterns of programming languages. For low-resource languages (i.e., niche programming languages characterized by the scarcity of training data), the limited availability of such data hampers the models’ ability to generalize effectively, resulting in poorer code generation performance compared to high-resource languages. For this reason, there is a quest for techniques able to close this performance gap. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs’ performance on low-resource languages, namely: (i) classic fine-tuning, which is however capped in size by the scarcity of training data; (ii) three variants of in-context learning, with prompts crafted to provide the LLM with additional information about the low-resource language (e.g., few-shot examples showcasing features of the targeted language); and (iii) a pre-training objective teaching the model how to translate between high- and low-resource languages. Our study covers two low-resource languages (R and Racket) and six LLMs with different architectures and sizes. Our findings reveal that fine-tuning is usually the best choice for smaller LLMs, possibly because even a small dataset is sufficient to train their limited number of parameters. As model size increases, in-context learning becomes more and more effective, representing a safe and cheap bet (i.e., it always helps, but with different magnitudes). In contrast, very large LLMs may perform worse on low-resource languages after fine-tuning, possibly because there is not enough data to update their weights effectively.
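To make the few-shot variant of in-context learning mentioned in the abstract more concrete, here is a minimal sketch of how such a prompt might be assembled for Racket. It is not the authors' prompt format: the `Example` class, the `build_few_shot_prompt` function, the section markers, and the sample Racket snippets are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One solved task in the target language (hypothetical structure)."""
    description: str
    solution: str


def build_few_shot_prompt(language: str, examples: list[Example], task: str) -> str:
    """Assemble a few-shot prompt: solved examples first, then the new task."""
    parts = [f"You write {language} code. Here are some solved examples."]
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"### Example {i}\nTask: {ex.description}\nSolution:\n{ex.solution}"
        )
    # The prompt ends with an open "Solution:" slot for the model to complete.
    parts.append(f"### New task\nTask: {task}\nSolution:")
    return "\n\n".join(parts)


# Illustrative Racket examples (assumed, not taken from the study's dataset).
racket_examples = [
    Example("Return the square of x.", "(define (square x) (* x x))"),
    Example("Sum a list of numbers.", "(define (sum-list xs) (apply + xs))"),
]

prompt = build_few_shot_prompt(
    "Racket", racket_examples, "Return the larger of two numbers."
)
print(prompt)
```

The resulting string would be sent to an LLM as-is; no model weights are changed, which is why the abstract describes this family of techniques as cheap compared to fine-tuning.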
Mon 28 Apr
Displayed time zone: Eastern Time (US & Canada)
14:00 - 15:30
14:00 | 10m | Talk | Code Ranking with Structure Awareness Contrastive Learning | Research Track | Hailin Huang (South China University of Technology); Liuwen Cao (South China University of Technology); Jiexin Wang (South China University of Technology); Tianchen Yu (School of Software Engineering, South China University of Technology); Yi Cai (School of Software Engineering, South China University of Technology, Guangzhou, China)
14:10 | 10m | Talk | Algorithmic Inversion: A Learnable Algorithm Representation for Code Generation | Research Track | zhongyi shi (Chinese Academy of Science Institute of Software); fuzhang wu (Chinese Academy of Science Institute of Software); weibin zeng (Chinese Academy of Science Institute of Software); yan kong (Chinese Academy of Science Institute of Software); sicheng shen (Chinese Academy of Science Institute of Software); Yanjun Wu (Institute of Software, Chinese Academy of Sciences)
14:20 | 10m | Talk | Studying How Configurations Impact Code Generation in LLMs: the Case of ChatGPT | Research Track | Benedetta Donato (University of Milano - Bicocca); Leonardo Mariani (University of Milano-Bicocca); Daniela Micucci (University of Milano-Bicocca, Italy); Oliviero Riganelli (University of Milano - Bicocca) | Pre-print
14:30 | 10m | Talk | Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation | Research Track | Cristina Improta (University of Naples Federico II); Rosalia Tufano (Università della Svizzera Italiana); Pietro Liguori (University of Naples Federico II); Domenico Cotroneo (University of Naples Federico II); Gabriele Bavota (Software Institute @ Università della Svizzera Italiana)
14:40 | 10m | Talk | Advancing Large Language Models in Code Generation: USACO Benchmark and Bug Mitigation Insights | Research Track | Jacob Trentini (Monte Vista High School); Victor Liu (Seven Lakes High School); Yiming Peng (Vandegrift High School); Ziliang Zong (Texas State University)
14:50 | 10m | Talk | Enhancing Code Generation for Low-Resource Languages: No Silver Bullet | Research Track | Alessandro Giagnorio (Software Institute @ Università della Svizzera italiana); Alberto Martin-Lopez (Software Institute - USI, Lugano); Gabriele Bavota (Software Institute @ Università della Svizzera Italiana) | Pre-print
15:00 | 10m | Talk | COFT: Making Large Language Models Better zero-shot Learners for Code Generation | Research Track | Weijia Li (Institute of Software, Chinese Academy of Sciences); Yongjie Qian (Department of Computer Science, North China Electric Power University, Bao ding); Ke Gao (Institute of Software, Chinese Academy of Sciences); Haixin Chen (Institute of Computing Technology, Chinese Academy of Sciences); Xinyu Wang (Institute of Software, Chinese Academy of Sciences); Yuchen Tong (Institute of Computing Technology, Chinese Academy of Sciences); Ling Li (Institute of Software, Chinese Academy of Sciences); Yanjun Wu (Institute of Software, Chinese Academy of Sciences); Chen Zhao (Institute of Software, Chinese Academy of Sciences)
15:10 | 10m | Talk | On the Possibility of Breaking Copyleft Licenses When Reusing Code Generated by ChatGPT | Research Track | Gaia Colombo (University of Milano - Bicocca); Leonardo Mariani (University of Milano-Bicocca); Daniela Micucci (University of Milano-Bicocca, Italy); Oliviero Riganelli (University of Milano - Bicocca) | Pre-print
15:20 | 10m | Live Q&A | Session's Discussion: "Code Generation" | Research Track