EASE 2024
Tue 18 - Fri 21 June 2024 Salerno, Italy

High-quality datasets of code-level vulnerability data are essential to training effective machine-learning (ML) models that identify security-relevant commits (i.e., commits that introduce or fix a vulnerability). Some datasets of this sort do exist, built by mining open-source code repositories; however, they typically contain only positive instances (i.e., security-relevant commits). Researchers intending to use such datasets in ML applications are therefore left with the task of obtaining a corresponding set of negative examples (here referred to as the negative complement of the dataset). Randomly sampling a negative complement from the target repository is common practice, under the assumption that positive commits are rare. This approach, while efficient and straightforward, yields a negative complement whose commits have easily distinguishable features that are unrelated to security relevance (e.g., their size, or the number and type of modified files).

In this paper, we present an improved method to obtain a negative complement to a dataset of security-relevant commits. It produces negative commits that are as similar as possible to the positive instances in the starting dataset. We describe our method and demonstrate it by applying it to an existing dataset of vulnerability-fixing commits. We release the resulting extended dataset and the scripts we used to produce it.
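
To make the contrast with random sampling concrete, the following is a minimal sketch of one way a similarity-matched negative complement could be selected. It is an illustration only and not the procedure described in the paper: the `Commit` fields, the log-scaled `distance` function, and the greedy nearest-neighbour strategy in `matched_negatives` are all assumptions chosen for brevity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    # Hypothetical, simplified commit features (assumed for illustration).
    sha: str
    files_changed: int
    lines_changed: int


def distance(a: Commit, b: Commit) -> float:
    """Assumed dissimilarity: L1 distance on log-scaled size features."""
    return (
        abs(math.log1p(a.files_changed) - math.log1p(b.files_changed))
        + abs(math.log1p(a.lines_changed) - math.log1p(b.lines_changed))
    )


def matched_negatives(positives: list[Commit], candidates: list[Commit]) -> list[Commit]:
    """Greedily pick, for each positive commit, the closest unused candidate.

    Candidates that are themselves in the positive set are excluded first.
    """
    positive_shas = {p.sha for p in positives}
    pool = [c for c in candidates if c.sha not in positive_shas]
    chosen = []
    for p in positives:
        if not pool:
            break  # fewer candidates than positives
        best = min(pool, key=lambda c: distance(p, c))
        chosen.append(best)
        pool.remove(best)  # sample without replacement
    return chosen
```

Under these assumptions, each negative commit resembles its paired positive in size and number of modified files, so a classifier cannot separate the two classes on those features alone. Greedy matching is order-dependent and not globally optimal; it is used here only because it is short.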