ICSE 2023 / NLBSE 2023
The (Ab)use of Open Source Code to Train Language Models
Sat 20 May 2023 12:00 - 12:15 at Meeting Room 103 - Session 1 - Position Papers
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often verbatim. In this work, we discuss the security, privacy, and licensing implications of memorization. We argue that the use of copyleft code to train LLMs poses a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
Sat 20 May (displayed time zone: Hobart)
11:00 - 12:30

11:00 (60m, Keynote): Trends and Opportunities in the Application of Large Language Models: the Quest for Maximum Effect. Albert Ziegler (GitHub). NLBSE

12:00 (15m, Short paper): The (Ab)use of Open Source Code to Train Language Models. Ali Al-Kaswan (Delft University of Technology, Netherlands), Maliheh Izadi (Delft University of Technology). Pre-print. NLBSE

12:15 (15m, Short paper): Exploring Generalizability of NLP-based Models for Modern Software Development Cross-Domain Environments. NLBSE