Exploring Large Language Models for Analyzing Open Source License Conflicts: How Far Are We? (ICSE 2025 - Industry Challenge Track)

Who

Xing Cui, Jingzheng Wu, Xiang Ling, Tianyue Luo, Mutian Yang, Wenxiang Ou

Track

ICSE 2025 Industry Challenge Track

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

When

Thu 1 May 2025 15:00 - 15:15 at 211 - Industry Challenge Presentations Chair(s): Federica Sarro, Xin Xia

Abstract

With the rapid growth of the open source software (OSS) ecosystem, the use of open source has become the predominant model for contemporary software development. OSS licenses define the conditions for the reuse, distribution, and modification of OSS and form the foundation of the open source ecosystem. However, recent research shows that more than half (53%) of OSS experiences license conflicts, which harm the sustainability of OSS and community collaboration and create significant legal risks. Researchers have proposed various methods for detecting license conflicts, yet these approaches face challenges such as limited license coverage and insufficient model accuracy. The recent emergence of large language models (LLMs) offers new opportunities for license conflict detection, but in-depth, systematic research on using LLMs for this purpose is still lacking.

To address this challenge, we propose L³icNexus, an effective tool for automatically detecting license conflicts using LLMs. Specifically, L³icNexus employs a joint labeling method based on embedded model label inference and expert verification to construct a domain dataset of 3,238 OSS licenses. L³icNexus then introduces the AdaFine approach, which combines Domain-Adaptive Pre-Training (DAPT) and Supervised Fine-Tuning (SFT) to produce the License-Llama3-8B model. This model identifies terms, infers OSS license attitudes, and autonomously understands licenses end-to-end. Finally, L³icNexus uses License-Llama3-8B to generate summaries of the rights and obligations associated with licenses, and detects conflicts by extracting the license hierarchy of OSS. Experimental results demonstrate that L³icNexus achieves an F1-score of 85.58% in license term and attitude recognition, surpassing the best results of other methods by 20.69%. Moreover, an empirical study of license conflict detection on 500 popular GitHub projects shows that L³icNexus achieves a false positive rate of 5.88% and a false negative rate of 2.47%. The performance of L³icNexus exceeds that of existing state-of-the-art methods, illustrating the potential of LLMs in addressing license conflict detection. We summarize the insights from this research and release the OSS license dataset and the License-Llama3-8B model on Hugging Face to encourage further exploration in related fields (Dataset available: https://huggingface.co/datasets/AnonymousAuthors/OSS-License-Terms; Model available: https://huggingface.co/AnonymousAuthors/License-Llama3-8B).
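
The abstract describes conflict detection as comparing license term attitudes across a project's license hierarchy, but it does not spell out the comparison rule. The following is only a minimal sketch of that general idea, not the L³icNexus implementation. Everything in it is an assumption made for illustration: the attitude labels (`can`, `cannot`, `must`), the `Component` class, the rule in `terms_conflict`, and the toy term data, which is hand-written rather than produced by License-Llama3-8B.

```python
from dataclasses import dataclass, field

# Hypothetical attitude labels; the abstract does not list the real label set.
CAN, CANNOT, MUST = "can", "cannot", "must"


@dataclass
class Component:
    """A node in an assumed license hierarchy: a project or one of its dependencies."""
    name: str
    license_terms: dict  # maps a term name to an attitude label
    children: list = field(default_factory=list)


def terms_conflict(parent_attitude, child_attitude):
    """Assumed rule: one side forbids what the other permits or requires."""
    pair = {parent_attitude, child_attitude}
    return CANNOT in pair and (CAN in pair or MUST in pair)


def find_conflicts(node):
    """Walk the hierarchy and compare each component with its direct children."""
    conflicts = []
    for child in node.children:
        for term, child_attitude in child.license_terms.items():
            parent_attitude = node.license_terms.get(term)
            if parent_attitude is not None and terms_conflict(parent_attitude, child_attitude):
                conflicts.append((node.name, child.name, term))
        conflicts.extend(find_conflicts(child))
    return conflicts


# Toy hierarchy with made-up components and terms.
root = Component(
    "app",
    {"distribute": CAN, "sublicense": CAN, "disclose_source": CANNOT},
    [
        Component("libA", {"distribute": CAN, "disclose_source": MUST}),
        Component(
            "libB",
            {"sublicense": CANNOT},
            [Component("libC", {"sublicense": CANNOT})],
        ),
    ],
)

print(find_conflicts(root))
# [('app', 'libA', 'disclose_source'), ('app', 'libB', 'sublicense')]
```

In this sketch only terms that both licenses mention are compared, and only between a component and its direct children. A real detector would need a policy for missing terms and for conflicts that propagate across several levels.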

Xing Cui

Institute of Software, Chinese Academy of Sciences

Jingzheng Wu

Institute of Software, The Chinese Academy of Sciences

Xiang Ling

Institute of Software, Chinese Academy of Sciences

China

Tianyue Luo

Institute of Software, Chinese Academy of Sciences

Mutian Yang

Beijing ZhongKeWeiLan Technology Co.,Ltd.

Wenxiang Ou

Institute of Software, Chinese Academy of Sciences

China

Session Program

Thu 1 May
Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30	Industry Challenge Presentations (Industry Challenge Track) at 211. Chair(s): Federica Sarro (University College London), Xin Xia (Huawei)

14:00 15m Talk		CKGFuzzer: LLM-Based Fuzz Driver Generation Enhanced By Code Knowledge Graph [Award Winner]. Industry Challenge Track. Hanxiang Xu (Huazhong University of Science and Technology), Wei Ma, Ting Zhou (Huazhong University of Science and Technology), Yanjie Zhao (Huazhong University of Science and Technology), Kai Chen (Huazhong University of Science and Technology), Qiang Hu (The University of Tokyo), Yang Liu (Nanyang Technological University), Haoyu Wang (Huazhong University of Science and Technology)
14:15 15m Talk		ClauseBench: Enhancing Software License Analysis with Clause-Level Benchmarking. Industry Challenge Track. Qiang Ke (Huazhong University of Science and Technology), Xinyi Hou (Huazhong University of Science and Technology), Yanjie Zhao (Huazhong University of Science and Technology), Haoyu Wang (Huazhong University of Science and Technology)
14:30 15m Talk		CodeMorph: Mitigating Data Leakage in Large Language Model Assessment. Industry Challenge Track. Hongzhou Rao (Huazhong University of Science and Technology), Yanjie Zhao (Huazhong University of Science and Technology), Wenjie Zhu (Huazhong University of Science and Technology), Ling Xiao (Huazhong University of Science and Technology), Meizhen Wang (Huazhong University of Science and Technology), Haoyu Wang (Huazhong University of Science and Technology)
14:45 15m Talk		CommitShield: Tracking Vulnerability Introduction and Fix in Version Control Systems [Security]. Industry Challenge Track. Zhaonan Wu (Huazhong University of Science and Technology), Yanjie Zhao (Huazhong University of Science and Technology), Chen Wei (MYbank, Ant Group), Zirui Wan (Huazhong University of Science and Technology), Yue Liu (Monash University), Haoyu Wang (Huazhong University of Science and Technology)
15:00 15m Talk		Exploring Large Language Models for Analyzing Open Source License Conflicts: How Far Are We? Industry Challenge Track. Xing Cui (Institute of Software, Chinese Academy of Sciences), Jingzheng Wu (Institute of Software, The Chinese Academy of Sciences), Xiang Ling (Institute of Software, Chinese Academy of Sciences), Tianyue Luo (Institute of Software, Chinese Academy of Sciences), Mutian Yang (Beijing ZhongKeWeiLan Technology Co.,Ltd.), Wenxiang Ou (Institute of Software, Chinese Academy of Sciences)
15:15 15m Talk		OSS-LCAF: Open-Source Software License Conflict Analysis Framework. Industry Challenge Track. Aditya Kahol (TCS Research), Anka Chandrahas Tummepalli (TCS Research), Preethu Rose Anish (TCS Research)