ICSE 2023 / NLBSE 2023
The (Ab)use of Open Source Code to Train Language Models
Sat 20 May 2023 12:00 - 12:15 at Meeting Room 103 - Session 1 - Position Papers
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often verbatim. In this work, we discuss the security, privacy, and licensing implications of memorization. We argue that the use of copyleft code to train LLMs poses a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
Sat 20 May (displayed time zone: Hobart)
11:00 - 12:30

11:00 (60m, Keynote): Trends and Opportunities in the Application of Large Language Models: the Quest for Maximum Effect. Albert Ziegler (GitHub). NLBSE

12:00 (15m, Short paper): The (Ab)use of Open Source Code to Train Language Models. Ali Al-Kaswan (Delft University of Technology, Netherlands), Maliheh Izadi (Delft University of Technology). Pre-print. NLBSE

12:15 (15m, Short paper): Exploring Generalizability of NLP-based Models for Modern Software Development Cross-Domain Environments. NLBSE