Improved Labeling of Security Defects in Code Review by Active Learning with LLMs
Mining high-quality datasets of security defects is important for cybersecurity. In this paper, we focus on mining a dataset of reviews that discuss potential security defects in code or other artifacts. Mining such datasets often involves labeling, and this is challenging because security defects are rare.
We investigate the use of active learning with a fine-tuned large language model to make the mining and labeling of such datasets more effective. Our simulations demonstrate that active learning can increase the effectiveness of human annotators by a factor of 13. This means we can produce datasets with 13 times more defects than found in random samples of the same size. We conducted an empirical study on over four million unlabeled reviews from GitHub, showing that active learning increases the effectiveness by a factor greater than 6. In total, 246 out of 1298 labeled reviews can be identified as discussing security defects. Rather than depending on a keyword list for upfront candidate selection, we dynamically evolve an LLM for this purpose.
Our work has the potential to inspire future research in this area by resolving rare-class and class-imbalance problems at the root, where they first appear: in the mining and labeling of the datasets. Our final dataset and model are publicly available.
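The comparison behind the "factor of 13" claim, active labeling versus a random sample of the same size, can be illustrated with a small simulation. The sketch below is not the authors' pipeline. It assumes a synthetic pool with a single noisy feature and replaces the fine-tuned LLM with a toy ranking model. The function `simulate`, its parameters, and the query rule (label the highest-scored items next) are all hypothetical, since the abstract does not state which query strategy is used.

```python
import random


def simulate(pool_size=20000, positive_rate=0.02, budget=500, batch=50, seed=0):
    """Return (hits from random labeling, hits from active labeling)."""
    rng = random.Random(seed)

    # Synthetic pool (assumption): a hidden label plus one noisy score-like
    # feature that is higher on average for the rare positive class.
    labels = [rng.random() < positive_rate for _ in range(pool_size)]
    features = [rng.gauss(2.0 if y else 0.0, 1.0) for y in labels]

    # Baseline: spend the whole budget on a uniform random sample.
    random_hits = sum(labels[i] for i in rng.sample(range(pool_size), budget))

    # Active learning: label in batches, refit a toy model after every batch.
    labeled = {}                      # index -> label revealed by the "annotator"
    unlabeled = list(range(pool_size))
    while len(labeled) < budget:
        k = min(batch, budget - len(labeled))
        pos = [features[i] for i, y in labeled.items() if y]
        neg = [features[i] for i, y in labeled.items() if not y]
        if pos and neg:
            # Toy "model" (stand-in for the LLM): rank by the feature in the
            # direction that separates the class means observed so far.
            direction = 1.0 if sum(pos) / len(pos) >= sum(neg) / len(neg) else -1.0
            unlabeled.sort(key=lambda i: direction * features[i], reverse=True)
            chosen = unlabeled[:k]
        else:
            # Cold start: no usable model yet, so fall back to random picks.
            chosen = rng.sample(unlabeled, k)
        chosen_set = set(chosen)
        unlabeled = [i for i in unlabeled if i not in chosen_set]
        for i in chosen:
            labeled[i] = labels[i]    # simulated human annotation

    active_hits = sum(labeled.values())
    return random_hits, active_hits


if __name__ == "__main__":
    random_hits, active_hits = simulate()
    print(f"random: {random_hits}/500, active: {active_hits}/500")
```

The ratio `active_hits / random_hits` plays the role of the effectiveness factor quoted in the abstract. Its value here depends entirely on the synthetic assumptions and says nothing about the reported results. For reference, the 246 positives among 1298 labeled reviews reported above correspond to a positive rate of roughly 19% in the labeled set.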
Fri 20 Jun (displayed time zone: Athens)
13:30 - 15:00 | Security | AI Models / Data / Research Papers / Short Papers, Emerging Results | at Workshop Room | Chair(s): Beyza Eken (Sakarya University)
13:30 10m Talk | A Study On Mixup-inspired Augmentation Methods For Software Vulnerability Detection | Short Papers, Emerging Results | S. Shayan Daneshvar (University of Manitoba), Da Tan (University of Manitoba), Shaowei Wang (University of Manitoba), Carson Leung (University of Manitoba)
13:40 15m Talk | An Empirical Study of Database Security Topics on Technical Social Forums of Software Developers | Research Papers
13:55 15m Talk | Automated Vulnerability Injection in Solidity Smart Contracts: A Mutation-Based Approach for Benchmark Development | Research Papers | Gerardo Iuliano (University of Salerno), Luigi Allocca (University of Salerno), Matteo Cicalese (University of Salerno), Dario Di Nucci (University of Salerno) | Pre-print
14:10 10m Talk | Identifying Helpful Context for LLM-based Vulnerability Repair: A Preliminary Study | Short Papers, Emerging Results | Gabor Antal (FrontEndART Software Ltd., University of Szeged), Bence Bogenfürst (University of Szeged), Rudolf Ferenc (University of Szeged), Peter Hegedus (University of Szeged)
14:20 15m Talk | Improved Labeling of Security Defects in Code Review by Active Learning with LLMs | AI Models / Data | Johannes Härtel (Vrije Universiteit Amsterdam) | Pre-print
14:35 15m Talk | LAMeD: LLM-generated Annotations for Memory Leak Detection | AI Models / Data | Ekaterina Shemetova (Saint-Petersburg State University), Ivan Smirnov (ITMO university), Anton Alekseev (St. Petersburg Department of Steklov Institute of Mathematics; Kyrgyz State Technical University n.a. I. Razzakov; St. Petersburg University), Ilya Shenbin (St. Petersburg Department of Steklov Institute of Mathematics), Alexey Rukhovich (AI Foundation and Algorithm Lab), Sergey Nikolenko (St. Petersburg Department of Steklov Institute of Mathematics), Vadim Lomshakov (St. Petersburg Department of Steklov Institute of Mathematics), Irina Piontkovskaya (AI Foundation and Algorithm Lab) | Pre-print