MoreFixes: A Large-Scale Dataset of CVE Fix Commits Mined through Enhanced Repository Discovery
Vulnerability datasets have become an important instrument in software security research, being used to develop automated, machine learning-based vulnerability detection and patching approaches. Yet, any limitations of these datasets may translate into inadequate performance of the developed solutions. For example, the limited size of a vulnerability dataset may restrict the applicability of deep learning techniques.
In our work, we designed and implemented a novel workflow that uses several heuristics to combine state-of-the-art methods for gathering CVE fix commits. As a result of these improvements, we were able to gather the largest programming-language-independent, real-world dataset of CVE vulnerabilities with their associated fix commits.
Our dataset, containing 26,617 unique CVEs from 6,945 unique GitHub projects, is, to the best of our knowledge, by far the largest CVE vulnerability dataset with fix commits available today. These CVEs are associated with 31,883 unique commits that fixed those vulnerabilities. Compared to prior work, our dataset brings a 397% increase in CVEs, a 295% increase in covered open-source projects, and a 480% increase in fix commits.
Our larger dataset thus substantially improves over the current real-world vulnerability datasets and enables further progress in research on vulnerability detection and software security.
We release to the community a 14 GB PostgreSQL database that contains information on CVEs up to January 24, 2024, the CWEs of each CVE, the files and methods changed by each commit, and repository metadata.
Additionally, patch files related to the fix commits are available as a separate package. Furthermore, we also make our dataset collection tool available to the community.
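As a rough illustration of how a relational layout like the one described above might be queried, the following minimal sketch counts unique CVEs, projects, and fix commits. The table names, column names, and rows are hypothetical toy values and are not taken from the released database; SQLite stands in for PostgreSQL here only to keep the example self-contained.

```python
import sqlite3

# Hypothetical, simplified schema: the released PostgreSQL database may
# use different table and column names. All rows below are toy values.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cve (cve_id TEXT PRIMARY KEY, cwe_id TEXT);
    CREATE TABLE fixes (cve_id TEXT, repo_url TEXT, commit_hash TEXT);
    """
)
conn.executemany(
    "INSERT INTO cve VALUES (?, ?)",
    [("CVE-TOY-1", "CWE-79"), ("CVE-TOY-2", "CWE-89"), ("CVE-TOY-3", "CWE-89")],
)
conn.executemany(
    "INSERT INTO fixes VALUES (?, ?, ?)",
    [
        ("CVE-TOY-1", "https://example.org/a", "h1"),
        ("CVE-TOY-1", "https://example.org/a", "h2"),  # one CVE, two fix commits
        ("CVE-TOY-2", "https://example.org/b", "h3"),
        ("CVE-TOY-3", "https://example.org/b", "h3"),  # one commit fixes two CVEs
    ],
)


def dataset_summary(connection):
    """Count unique CVEs, projects, and fix commits in the toy `fixes` table."""
    row = connection.execute(
        """
        SELECT COUNT(DISTINCT cve_id),
               COUNT(DISTINCT repo_url),
               COUNT(DISTINCT commit_hash)
        FROM fixes
        """
    ).fetchone()
    return {"cves": row[0], "projects": row[1], "commits": row[2]}


print(dataset_summary(conn))
```

Counting distinct values matters because the relation is many-to-many: one CVE can have several fix commits, and one commit can fix several CVEs.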
Presented at PROMISE 2024 on Tue 16 Jul, 16:00 (15-minute talk; displayed time zone: Brasilia, Distrito Federal, Brazil).
Authors: Jafar Akhoundali (Leiden University), Sajad Rahim Nouri (Islamic Azad University of Ramsar), Kristian Rietveld (Leiden University), Olga Gadyatskaya.