LiCoEval: Evaluating LLMs on License Compliance in Code Generation (ICSE 2025 - Research Track)

Who

Weiwei Xu, Kai Gao, Hao He, Minghui Zhou

Track

ICSE 2025 Research Track

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Thu 1 May 2025 14:45 - 15:00 at 207 - Human and Social using AI 1 Chair(s): Romain Robbes

Abstract

Recent advances in Large Language Models (LLMs) have revolutionized code generation, leading to widespread adoption of AI coding tools by developers. However, LLMs can generate license-protected code without providing the necessary license information, leading to potential intellectual property violations during software production. This paper addresses the critical, yet underexplored, issue of license compliance in LLM-generated code by establishing a benchmark to evaluate the ability of LLMs to provide accurate license information for their generated code. To establish this benchmark, we conduct an empirical study to identify a reasonable standard for “striking similarity” that excludes the possibility of independent creation, indicating a copy relationship between the LLM output and certain open-source code. Based on this standard, we propose an evaluation benchmark LiCoEval, to evaluate the license compliance capabilities of LLMs. Using LiCoEval, we evaluate 14 popular LLMs, finding that even top-performing LLMs produce a non-negligible proportion (0.88% to 2.01%) of code strikingly similar to existing open-source implementations. Notably, most LLMs fail to provide accurate license information, particularly for code under copyleft licenses. These findings underscore the urgent need to enhance LLM compliance capabilities in code generation tasks. Our study provides a foundation for future research and development to improve license compliance in AI-assisted software development, contributing to both the protection of open-source software copyrights and the mitigation of legal risks for LLM users.

Link to Preprint

https://arxiv.org/abs/2408.02487

Weiwei Xu

Peking University

Kai Gao

Peking University

Hao He

Carnegie Mellon University

United States

Minghui Zhou

Peking University

China

Session Program

Thu 1 May
Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30	Human and Social using AI 1 (Research Track) at 207 Chair(s): Romain Robbes (CNRS, LaBRI, University of Bordeaux)

14:00 15m Talk		Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers (Research Track). Yuling Shi (Shanghai Jiao Tong University), Hongyu Zhang (Chongqing University), Chengcheng Wan (East China Normal University), Xiaodong Gu (Shanghai Jiao Tong University)
14:15 15m Talk		Deep Learning-based Code Reviews: A Paradigm Shift or a Double-Edged Sword? (Research Track). Rosalia Tufano (Università della Svizzera Italiana), Alberto Martin-Lopez (Software Institute - USI, Lugano), Ahmad Tayeb, Ozren Dabic (Software Institute, Università della Svizzera italiana (USI), Switzerland), Sonia Haiduc, Gabriele Bavota (Software Institute @ Università della Svizzera Italiana)
14:30 15m Talk		An Exploratory Study of ML Sketches and Visual Code Assistants (Research Track). Luis F. Gomes (Carnegie Mellon University), Vincent J. Hellendoorn (Carnegie Mellon University), Jonathan Aldrich (Carnegie Mellon University), Rui Abreu (Faculty of Engineering of the University of Porto, Portugal)
14:45 15m Talk		LiCoEval: Evaluating LLMs on License Compliance in Code Generation (Research Track). Weiwei Xu (Peking University), Kai Gao (Peking University), Hao He (Carnegie Mellon University), Minghui Zhou (Peking University). Pre-print
15:00 15m Talk		Trust Dynamics in AI-Assisted Development: Definitions, Factors, and Implications (Research Track). Sadra Sabouri (University of Southern California), Philipp Eibl (University of Southern California), Xinyi Zhou (University of Southern California), Morteza Ziyadi (Amazon AGI), Nenad Medvidović (University of Southern California), Lars Lindemann (University of Southern California), Souti Chattopadhyay (University of Southern California). Pre-print
15:15 15m Talk		What Guides Our Choices? Modeling Developers' Trust and Behavioral Intentions Towards GenAI (Research Track). Rudrajit Choudhuri (Oregon State University), Bianca Trinkenreich (Colorado State University), Rahul Pandita (GitHub, Inc.), Eirini Kalliamvakou (GitHub), Igor Steinmacher (NAU RESHAPE LAB), Marco Gerosa (Northern Arizona University), Christopher Sanchez (Oregon State University), Anita Sarma (Oregon State University). Pre-print