Negative Complement of a Set of Vulnerability-Fixing Commits: Method and Dataset (EASE 2024 - Industry)

Who

Rocio Cabrera Lozoya, Antonino Sabetta, Tommaso Aiello

Track

EASE 2024 Industry

Time Zone

The program is currently displayed in (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Thu 20 Jun 2024 15:17 - 15:30 at Room Vietri - Security (1) Chair(s): Giuseppe Scanniello

Abstract

High-quality datasets of code-level vulnerability data are essential to training effective machine-learning (ML) models that identify security-relevant commits (i.e., commits that introduce or fix a vulnerability). Some datasets of this sort do exist, built by mining open-source code repositories; however, they typically contain only positive instances (i.e., security-relevant instances). Therefore, researchers intending to use such datasets in ML applications are left with the task of obtaining a corresponding set of negative examples (here referred to as the negative complement of the dataset). Randomly sampling a negative complement from the target repository is common practice, under the assumption that positive commits are rare. This approach, while efficient and straightforward, leads to a negative complement whose commits have easily distinguishable features that are unrelated to security relevance (e.g., their size and the number and type of modified files).

In this paper, we present an improved method to obtain a negative complement to a dataset of security-relevant commits. It produces negative commits that are as similar as possible to the positive instances in the starting dataset. We describe our method and demonstrate it by applying it to an existing dataset of vulnerability-fixing commits. We release the resulting extended dataset and the scripts we used to produce it.
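The contrast between random sampling and similarity-driven selection can be illustrated with a small sketch. This is not the authors' method: the abstract does not state which similarity criteria the paper uses, so the example assumes two features the abstract mentions as giveaways (number of modified files and commit size) and a greedy nearest-neighbour match. The `Commit` class, the function names, and the sample data are all hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    # Hypothetical minimal commit record; real data would come from a repository.
    sha: str
    files_changed: int
    lines_changed: int


def features(commit):
    # Assumed feature vector: log-scaled file count and line count.
    return (math.log1p(commit.files_changed), math.log1p(commit.lines_changed))


def distance(a, b):
    # Euclidean distance between the assumed feature vectors.
    return math.dist(features(a), features(b))


def random_complement(candidates, k, seed=0):
    # Baseline described in the abstract: sample negatives uniformly at random.
    return random.Random(seed).sample(candidates, k)


def matched_complement(positives, candidates):
    # Illustrative alternative: for each positive commit, greedily pick the
    # closest candidate that has not been used yet.
    remaining = list(candidates)
    chosen = []
    for pos in positives:
        if not remaining:
            break
        best = min(remaining, key=lambda c: distance(pos, c))
        remaining.remove(best)
        chosen.append(best)
    return chosen


# Made-up data: two small vulnerability-fixing commits and four candidates.
positives = [Commit("p1", 1, 12), Commit("p2", 2, 40)]
candidates = [
    Commit("n1", 30, 2500),
    Commit("n2", 1, 10),
    Commit("n3", 2, 35),
    Commit("n4", 15, 900),
]

print([c.sha for c in random_complement(candidates, 2)])
print([c.sha for c in matched_complement(positives, candidates)])
```

With matching, the selected negatives resemble the positives in size and file count, so a classifier cannot separate the two classes on those features alone. A random sample carries no such guarantee.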

Rocio Cabrera Lozoya

SAP Security Research

Antonino Sabetta

SAP Labs

France

Tommaso Aiello