Exploring Large Language Models for Analyzing Open Source License Conflicts: How Far Are We?
With the rapid growth of the open source software (OSS) ecosystem, reusing open source components has become the predominant model of contemporary software development. OSS licenses define the conditions for reusing, distributing, and modifying OSS and form the foundation of the open source ecosystem. However, recent research shows that over half (53%) of OSS exhibits license conflicts, which undermine the sustainability of OSS and community collaboration and create significant legal risks. Researchers have proposed various methods for detecting license conflicts, yet these approaches face challenges such as limited license coverage and insufficient model accuracy. The recent emergence of large language models (LLMs) offers new opportunities for license conflict detection, but in-depth, systematic research on using LLMs for this purpose is still lacking.
To address this gap, we propose L³icNexus, an effective tool for automatically detecting license conflicts using LLMs. Specifically, L³icNexus employs a joint labeling method based on embedded model label inference and expert verification to construct a domain dataset of 3,238 OSS licenses. It then introduces the AdaFine approach, which combines Domain-Adaptive Pre-Training (DAPT) and Supervised Fine-Tuning (SFT) to produce the License-Llama3-8B model. This model identifies license terms, infers the attitude each license takes toward them, and understands licenses autonomously, end to end. Finally, L³icNexus uses License-Llama3-8B to generate summaries of the rights and obligations associated with each license, and detects conflicts by extracting the license hierarchy of an OSS project. Experimental results demonstrate that L³icNexus achieves an F1-score of 85.58% in license term and attitude recognition, surpassing the best results of other methods by 20.69%. Moreover, an empirical study of license conflict detection on 500 popular GitHub projects reveals that L³icNexus achieves a false positive rate of 5.88% and a false negative rate of 2.47%. The performance of L³icNexus exceeds that of existing state-of-the-art methods, illustrating the potential of LLMs for license conflict detection. We summarize the insights from this research and release the OSS license dataset and the License-Llama3-8B model on Hugging Face to encourage further exploration in related fields (Dataset available: https://huggingface.co/datasets/AnonymousAuthors/OSS-License-Terms; Model available: https://huggingface.co/AnonymousAuthors/License-Llama3-8B).
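As an illustration only, the final step (comparing per-license summaries across a project's license hierarchy) might be sketched in Python as follows. This is not the L³icNexus implementation: the attitude labels `can`, `cannot`, and `must`, the `LicenseNode` structure, the term names, and the incompatibility rule are all assumptions made for this sketch, and only directly nested licenses are compared.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical attitude labels; the label set used by the paper may differ.
CAN, CANNOT, MUST = "can", "cannot", "must"


@dataclass
class LicenseNode:
    """One license file in a project's license hierarchy (illustrative)."""
    path: str
    terms: Dict[str, str]  # term name -> attitude, e.g. {"sublicense": CAN}
    children: List["LicenseNode"] = field(default_factory=list)


def is_incompatible(parent_attitude: str, child_attitude: str) -> bool:
    """Assumed rule: a conflict exists when the enclosing license permits or
    requires something the nested license forbids, or when the nested license
    requires something the enclosing license forbids."""
    if child_attitude == CANNOT and parent_attitude in (CAN, MUST):
        return True
    if child_attitude == MUST and parent_attitude == CANNOT:
        return True
    return False


def find_conflicts(node: LicenseNode) -> List[Tuple[str, str, str]]:
    """Walk the hierarchy and report (parent path, child path, term) triples
    for every term on which a license and a directly nested license clash."""
    conflicts = []
    for child in node.children:
        for term, child_attitude in child.terms.items():
            parent_attitude = node.terms.get(term)
            if parent_attitude is not None and is_incompatible(
                parent_attitude, child_attitude
            ):
                conflicts.append((node.path, child.path, term))
        conflicts.extend(find_conflicts(child))
    return conflicts


# Toy hierarchy: the project license allows sublicensing, a vendored
# component's license forbids it.
root = LicenseNode(
    "LICENSE",
    {"sublicense": CAN, "commercial_use": CAN},
    children=[
        LicenseNode(
            "vendor/lib/LICENSE",
            {"sublicense": CANNOT, "commercial_use": CAN},
        )
    ],
)
conflicts = find_conflicts(root)
print(conflicts)  # [('LICENSE', 'vendor/lib/LICENSE', 'sublicense')]
```

In this sketch the term-to-attitude dictionaries are written by hand; in a pipeline like the one described above they would be filled from the model-generated summaries of rights and obligations.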