Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation
This program is tentative and subject to change.
Deep Learning (DL)-based code generators have advanced significantly in recent years. Tools such as GitHub Copilot are used by thousands of developers with the promise of a boost in productivity. However, researchers have recently questioned their impact on code quality, showing, for example, that code generated by DL-based tools may be affected by security vulnerabilities. Since DL models are trained on large code corpora, one may conjecture that the low-quality code they output results from the low-quality code they have seen during training. However, there is very little empirical evidence documenting this phenomenon: most previous work looks at the frequency with which commercial code generators (e.g., Copilot, ChatGPT) recommend low-quality code, without the possibility of relating this to their (publicly unavailable) training sets. In this paper, we investigate the extent to which low-quality code instances seen during training affect the quality of the code generated at inference time. We start by fine-tuning a pre-trained DL model on a large-scale dataset (>4.4M functions) representative of those usually adopted to train code generators. We show that 4.98% of the functions in this dataset exhibit one or more quality issues related to security, maintainability, coding practices, etc. We use the fine-tuned model to generate 551k Python functions, showing that 5.85% of them are affected by at least one quality issue. We then remove the low-quality functions from the training set and use the cleaned dataset to fine-tune a second model, which we use to generate the same 551k Python functions. We show that the model trained on the cleaned dataset achieves functional correctness similar to the original model (i.e., the one trained on the whole dataset) while generating a statistically significantly lower percentage of low-quality functions (2.16%).
Our study empirically documents the importance of high-quality training data for code generators.
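The cleaning step described above (flag training functions with quality issues, drop them, and fine-tune on the remainder) can be sketched as follows. This is a minimal illustration, not the paper's actual tooling: the `has_bare_except` check is a hypothetical stand-in for a single coding-practice smell, whereas the study's quality checks cover security, maintainability, and more.

```python
import ast

def has_bare_except(src: str) -> bool:
    """Flag a function whose body contains a bare `except:` handler,
    a common coding-practice smell (stand-in for a full quality checker)."""
    tree = ast.parse(src)
    return any(
        isinstance(node, ast.ExceptHandler) and node.type is None
        for node in ast.walk(tree)
    )

def clean_dataset(functions):
    """Keep only functions that parse and pass the quality check,
    mirroring the removal of low-quality samples before fine-tuning."""
    kept = []
    for src in functions:
        try:
            if not has_bare_except(src):
                kept.append(src)
        except SyntaxError:
            continue  # drop samples that do not even parse
    return kept

funcs = [
    "def ok(x):\n    return x + 1\n",
    "def risky():\n    try:\n        pass\n    except:\n        pass\n",
]
print(len(clean_dataset(funcs)))  # 1 function survives the filter
```

In the study, an analogous filter (applied with real quality detectors) removed the 4.98% of flagged functions before fine-tuning the second model.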
Mon 28 Apr. Displayed time zone: Eastern Time (US & Canada).
14:00 - 15:30 | Session: Code Generation
14:00 | 10m Talk | Code Ranking with Structure Awareness Contrastive Learning | Research Track | Hailin Huang (South China University of Technology), Liuwen Cao (South China University of Technology), Jiexin Wang (South China University of Technology), Tianchen Yu (School of Software Engineering, South China University of Technology), Yi Cai (School of Software Engineering, South China University of Technology, Guangzhou, China)
14:10 | 10m Talk | Algorithmic Inversion: A Learnable Algorithm Representation for Code Generation | Research Track | Zhongyi Shi (Institute of Software, Chinese Academy of Sciences), Fuzhang Wu (Institute of Software, Chinese Academy of Sciences), Weibin Zeng (Institute of Software, Chinese Academy of Sciences), Yan Kong (Institute of Software, Chinese Academy of Sciences), Sicheng Shen (Institute of Software, Chinese Academy of Sciences), Yanjun Wu (Institute of Software, Chinese Academy of Sciences)
14:20 | 10m Talk | Studying How Configurations Impact Code Generation in LLMs: the Case of ChatGPT | Research Track | Benedetta Donato (University of Milano - Bicocca), Leonardo Mariani (University of Milano - Bicocca), Daniela Micucci (University of Milano - Bicocca, Italy), Oliviero Riganelli (University of Milano - Bicocca) | Pre-print
14:30 | 10m Talk | Quality In, Quality Out: Investigating Training Data's Role in AI Code Generation | Research Track | Cristina Improta (University of Naples Federico II), Rosalia Tufano (Università della Svizzera Italiana), Pietro Liguori (University of Naples Federico II), Domenico Cotroneo (University of Naples Federico II), Gabriele Bavota (Software Institute @ Università della Svizzera Italiana)
14:40 | 10m Talk | Advancing Large Language Models in Code Generation: USACO Benchmark and Bug Mitigation Insights | Research Track | Jacob Trentini (Monte Vista High School), Victor Liu (Seven Lakes High School), Yiming Peng (Vandegrift High School), Ziliang Zong (Texas State University)
14:50 | 10m Talk | Enhancing Code Generation for Low-Resource Languages: No Silver Bullet | Research Track | Alessandro Giagnorio (Software Institute @ Università della Svizzera italiana), Alberto Martin-Lopez (Software Institute - USI, Lugano), Gabriele Bavota (Software Institute @ Università della Svizzera Italiana) | Pre-print
15:00 | 10m Talk | COFT: Making Large Language Models Better zero-shot Learners for Code Generation | Research Track | Weijia Li (Institute of Software, Chinese Academy of Sciences), Yongjie Qian (Department of Computer Science, North China Electric Power University, Baoding), Ke Gao (Institute of Software, Chinese Academy of Sciences), Haixin Chen (Institute of Computing Technology, Chinese Academy of Sciences), Xinyu Wang (Institute of Software, Chinese Academy of Sciences), Yuchen Tong (Institute of Computing Technology, Chinese Academy of Sciences), Ling Li (Institute of Software, Chinese Academy of Sciences), Yanjun Wu (Institute of Software, Chinese Academy of Sciences), Chen Zhao (Institute of Software, Chinese Academy of Sciences)
15:10 | 10m Talk | On the Possibility of Breaking Copyleft Licenses When Reusing Code Generated by ChatGPT | Research Track | Gaia Colombo (University of Milano - Bicocca), Leonardo Mariani (University of Milano - Bicocca), Daniela Micucci (University of Milano - Bicocca, Italy), Oliviero Riganelli (University of Milano - Bicocca) | Pre-print
15:20 | 10m Live Q&A | Session's Discussion: "Code Generation" | Research Track