MSR 2022
Mon 23 - Tue 24 May 2022
co-located with ICSE 2022

We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it, we collected from the Software Heritage archive (the largest publicly available archive of FOSS source code with accompanying development history) all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used for empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, and historical and phylogenetic studies on FOSS licensing. Additional metadata about the shipped license files is also provided, making the dataset ready to use in various contexts. The metadata includes file length measures, detected MIME type, detected SPDX license (using ScanCode), an example origin (e.g., a GitHub repository), and the oldest public commit in which the license appeared. The dataset is released as open data in the form of an archive file containing all deduplicated license blobs, plus several portable CSV files for metadata that reference blobs via cryptographic checksums.
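As a rough illustration of how metadata rows might be matched to license blobs through their checksums, here is a minimal Python sketch. It is not taken from the dataset documentation: the column names `sha1` and `mime_type`, the choice of a plain SHA-1 over the raw blob bytes, and the sample data are all assumptions made for the example, and the real dataset may use different file layouts and checksum conventions.

```python
import csv
import hashlib
import io


def load_metadata(csv_text, key="sha1"):
    """Index metadata rows by checksum (the column name `key` is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def lookup_blob(blob_bytes, metadata):
    """Return the metadata row for a blob, or None if its checksum is unknown."""
    # Assumption: blobs are referenced by the hex SHA-1 of their raw bytes.
    digest = hashlib.sha1(blob_bytes).hexdigest()
    return metadata.get(digest)


# Tiny in-memory stand-in for one metadata CSV file (made-up data).
blob = b"Permission is hereby granted, free of charge...\n"
sample_csv = "sha1,mime_type\n" + hashlib.sha1(blob).hexdigest() + ",text/plain\n"

metadata = load_metadata(sample_csv)
row = lookup_blob(blob, metadata)
print(row["mime_type"])
```

Indexing the CSV rows in a dictionary keyed by checksum keeps each lookup constant-time, which matters when joining several metadata files against millions of blobs.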

Stefano Zacchiroli is a full professor of computer science at Télécom Paris, Polytechnic Institute of Paris. His current research interests span digital commons, open source software engineering, computer security, and the software supply chain. He is co-founder and CTO of Software Heritage, the largest public archive of software source code. He has been a Debian developer since 2001 and served as Debian project leader from 2010 to 2013. He is a former board director of the Open Source Initiative (OSI) and a recipient of the 2015 O’Reilly Open Source Award.

Fri 20 May

Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:00
Session 16: Non-functional Properties (Availability, Security, Legal Aspects)
Industry Track / Technical Papers / Registered Reports / Data and Tool Showcase Track at MSR Main room - even hours
Chair(s): Maxime Lamothe Polytechnique Montreal, Montreal, Canada, Jin L.C. Guo McGill University
14:00
7m
Talk
A Deep Study of the Effects and Fixes of Server-Side Request Races in Web Applications
Technical Papers
Zhengyi Qiu North Carolina State University, Shudi Shao North Carolina State University, Qi Zhao North Carolina State University, Hassan Ali Khan North Carolina State University, Xinning Hui North Carolina State University, Guoliang Jin North Carolina State University
14:07
4m
Talk
A Large-scale Dataset of (Open Source) License Text Variants
Data and Tool Showcase Award
Data and Tool Showcase Track
Stefano Zacchiroli Télécom Paris, Polytechnic Institute of Paris
14:11
7m
Talk
SECOM: Towards a convention for security commit messages
FOSS Impact Paper Award
Industry Track
Sofia Reis Instituto Superior Técnico, U. Lisboa & INESC-ID, Rui Abreu Faculty of Engineering, University of Porto, Portugal, Hakan Erdogmus Carnegie Mellon University, Corina S. Păsăreanu Carnegie Mellon University
14:18
7m
Talk
Varangian: A Git Bot for Augmented Static Analysis
Industry Track
Saurabh Pujar IBM Research, Yunhui Zheng IBM Research, Luca Buratti IBM Research, Burn Lewis IBM Research, Alessandro Morari IBM Research, Jim A. Laredo IBM Research, Kevin Postlethwait Red Hat, Christoph Görn Red Hat
14:25
7m
Talk
Detecting Privacy-Sensitive Code Changes with Language Modeling
Industry Track
Gökalp Demirci Meta Platforms, Inc., Vijayaraghavan Murali Meta Platforms, Inc., Imad Ahmad Meta Platforms, Inc., Rajeev Rao Meta Platforms, Inc., Gareth Ari Aye Meta Platforms, Inc.
14:32
4m
Talk
Is GitHub's Copilot as Bad As Humans at Introducing Vulnerabilities in Code?
Registered Reports
Owura Asare University of Waterloo, Mei Nagappan University of Waterloo, N. Asokan University of Waterloo
14:36
7m
Talk
Finding the Fun in Fundraising: Public Issues and Pull Requests in VC-backed Open-Core Companies
Industry Track
Kevin Xu GitHub
14:43
17m
Live Q&A
Discussions and Q&A
Technical Papers

Mon 23 May

Displayed time zone: Eastern Time (US & Canada)

11:00 - 12:30
Blended Technical Session 1 (Integration, Large-scale mining, and Software Ecosystems)
Technical Papers / Data and Tool Showcase Track at Room 315+316
Chair(s): Bogdan Vasilescu Carnegie Mellon University, USA
11:00
15m
Talk
Do Small Code Changes Merge Faster? A Multi-Language Empirical Investigation
Technical Papers
Gunnar Kudrjavets University of Groningen, Nachiappan Nagappan Microsoft Research, Ayushi Rastogi University of Groningen, The Netherlands
11:15
15m
Talk
Mining Code Review Data to Understand Waiting Times Between Acceptance and Merging: An Empirical Analysis
Technical Papers
Gunnar Kudrjavets University of Groningen, Aditya Kumar Snap, Inc., Nachiappan Nagappan Microsoft Research, Ayushi Rastogi University of Groningen, The Netherlands
11:30
8m
Talk
Dataset: Dependency Networks of Open Source Libraries Available Through CocoaPods, Carthage and Swift PM
Data and Tool Showcase Track
Kristiina Rahkema University of Tartu, Dietmar Pfahl University of Tartu
11:38
8m
Talk
A Large-scale Dataset of (Open Source) License Text Variants
Data and Tool Showcase Award
Data and Tool Showcase Track
Stefano Zacchiroli Télécom Paris, Polytechnic Institute of Paris
11:46
8m
Talk
TSSB-3M: Mining single statement bugs at massive scale
Data and Tool Showcase Track
Cedric Richter Carl von Ossietzky Universität Oldenburg / University of Oldenburg, Heike Wehrheim Carl von Ossietzky Universität Oldenburg / University of Oldenburg
11:54
8m
Talk
LAGOON: An Analysis Tool for Open Source Communities
Data and Tool Showcase Track
Sourya Dey Galois, Inc., Walt Woods Galois, Inc.
12:02
8m
Talk
The Unexplored Treasure Trove of Phabricator Code Reviews
Data and Tool Showcase Track
Gunnar Kudrjavets University of Groningen, Nachiappan Nagappan Microsoft Research, Ayushi Rastogi University of Groningen, The Netherlands
12:10
20m
Live Q&A
Discussions and Q&A
Technical Papers

