Tue 16 Jul 2024 16:00 - 16:15 at Acerola - Afternoon session 2

Vulnerability datasets have become an important instrument in software security research, being used to develop automated, machine learning-based vulnerability detection and patching approaches. Yet, any limitations of these datasets may translate into inadequate performance of the developed solutions. For example, the limited size of a vulnerability dataset may restrict the applicability of deep learning techniques.

In our work, we have designed and implemented a novel workflow that uses several heuristics to combine state-of-the-art methods for gathering CVE fix commits. As a result of these improvements, we have been able to gather the largest programming-language-independent real-world dataset of CVE vulnerabilities with their associated fix commits.

Our dataset, containing 26,617 unique CVEs from 6,945 unique GitHub projects, is, to the best of our knowledge, by far the biggest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 31,883 unique commits that fixed those vulnerabilities. Compared to prior work, our dataset brings about a 397% increase in CVEs, a 295% increase in covered open-source projects, and a 480% increase in fix commits.
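
To put these percentages in perspective, the short Python sketch below backs out the approximate size of the prior dataset implied by each figure. It assumes that "an X% increase" means new total = baseline × (1 + X/100); the helper name `implied_baseline` is hypothetical, and the resulting baselines are derived estimates, not numbers reported in the abstract.

```python
def implied_baseline(new_total: int, percent_increase: float) -> int:
    """Back out the prior size from a new total and a percentage increase.

    Assumption: "X% increase" means new_total = baseline * (1 + X / 100).
    """
    return round(new_total / (1 + percent_increase / 100))


# New totals and percentage increases as stated in the abstract.
reported = {
    "CVEs": (26_617, 397),
    "projects": (6_945, 295),
    "fix commits": (31_883, 480),
}

for name, (total, pct) in reported.items():
    # The baseline is an estimate only; rounding of the published
    # percentages means it may be off by a few items.
    print(f"{name}: {total} now, roughly {implied_baseline(total, pct)} before")
```

Under this reading, each figure corresponds to a dataset roughly four to six times the size of the prior one.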

Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security.

We release to the community a 14GB PostgreSQL database that contains information on CVEs up to January 24, 2024, CWEs of each CVE, files and methods changed by each commit, and repository metadata.

Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we also make our dataset collection tool available to the community.

Tue 16 Jul

Displayed time zone: Brasilia, Distrito Federal, Brazil

16:00 - 18:00
Afternoon session 2, PROMISE 2024 at Acerola
16:00
15m
Talk
MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery
PROMISE 2024
Jafar Akhoundali Leiden University, Sajad Rahim Nouri Islamic Azad University of Ramsar, Kristian Rietveld Leiden University, Olga Gadyatskaya
16:15
15m
Talk
A Pilot Study in Surveying Data Challenges of Automatic Software Engineering Tasks
PROMISE 2024
Liming Dong CSIRO’s Data61, Qinghua Lu Data61, CSIRO, Liming Zhu CSIRO’s Data61
16:30
15m
Talk
Prioritising GitHub Priority Labels
PROMISE 2024
James Caddy University of Adelaide, Christoph Treude Singapore Management University
16:45
15m
Talk
Predicting Fairness of ML Software Configurations
PROMISE 2024
Salvador Robles Herrera University of Texas at El Paso, Verya Monjezi University of Texas at El Paso, Vladik Kreinovich University of Texas at El Paso, Ashutosh Trivedi University of Colorado Boulder, Saeid Tizpaz-Niari University of Texas at El Paso
17:00
5m
Day closing
Closing
PROMISE 2024