EASE 2025
Tue 17 - Fri 20 June 2025 Istanbul, Turkey
Fri 20 Jun 2025 14:20 - 14:35 at Workshop Room - Security. Chair(s): Beyza Eken

Mining high-quality datasets of security defects is important for cybersecurity. In this paper, we focus on mining a dataset of reviews that discuss potential security defects in code or other artifacts. Mining such datasets often involves labeling, and this is challenging because security defects are rare.

We investigate the use of active learning with a fine-tuned large language model (LLM) to make the mining and labeling of such datasets more effective. Our simulations demonstrate that active learning can increase the effectiveness of human annotators by a factor of 13, meaning we can produce datasets with 13 times more defects than random samples of the same size contain. We also conducted an empirical study on over four million unlabeled reviews from GitHub, which shows that active learning increases effectiveness by a factor of more than 6. In total, 246 of the 1,298 labeled reviews (about 19%) were identified as discussing security defects. We do not depend on a keyword list for upfront candidate selection; instead, we dynamically evolve an LLM for this purpose.
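To illustrate why active learning can raise the share of rare positives among labeled items, here is a minimal simulation sketch. It is not the authors' pipeline: the synthetic pool, the 3% positive rate, the centroid-difference scorer standing in for a fine-tuned LLM, and the greedy "label the highest-scoring items next" selection rule are all assumptions made for illustration, and every function name (`make_pool`, `fit`, `active_learning`, `random_sampling`) is hypothetical.

```python
import random


def make_pool(n=4000, n_pos=120, seed=0):
    """Synthetic pool of (features, hidden_label); positives are rare (3%, assumed)."""
    rng = random.Random(seed)
    pool = []
    for i in range(n):
        label = 1 if i < n_pos else 0
        center = 3.0 if label else 0.0
        features = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        pool.append((features, label))
    rng.shuffle(pool)
    return pool


def centroid(points):
    return tuple(sum(column) / len(points) for column in zip(*points))


def fit(labeled):
    """Toy stand-in for fine-tuning: weights = positive centroid - negative centroid."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    if not pos or not neg:
        return None  # not enough signal yet; keep sampling at random
    return tuple(p - q for p, q in zip(centroid(pos), centroid(neg)))


def score(weights, x):
    return sum(w * v for w, v in zip(weights, x))


def active_learning(pool, budget=300, batch=50, seed=1):
    """Label `budget` items in batches, retraining before each batch."""
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)  # random order until a model can be fitted
    labeled = []
    while len(labeled) < budget:
        weights = fit(labeled)
        if weights is not None:
            # Assumed acquisition rule: take the items most likely to be positive.
            unlabeled.sort(key=lambda item: score(weights, item[0]), reverse=True)
        take = min(batch, budget - len(labeled))
        picked, unlabeled = unlabeled[:take], unlabeled[take:]
        labeled.extend(picked)  # the "annotator" reveals the hidden labels
    return sum(y for _, y in labeled)


def random_sampling(pool, budget=300, seed=1):
    """Baseline: label a uniform random sample of the same size."""
    rng = random.Random(seed)
    return sum(y for _, y in rng.sample(pool, budget))


pool = make_pool()
active_hits = active_learning(pool)
random_hits = random_sampling(pool)
print(f"positives found with active learning: {active_hits}")
print(f"positives found with random sampling: {random_hits}")
```

With the same labeling budget, the ratio `active_hits / random_hits` plays the role of the effectiveness factor mentioned above. The size of that ratio in this toy setup depends entirely on how separable the synthetic classes are, so it says nothing about the factors reported for real review data.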

Our work may inspire future research that resolves rare-class and class-imbalance problems at their root, by adjusting how datasets are mined and labeled. Our final dataset and model are publicly available.

Fri 20 Jun

Displayed time zone: Athens

13:30 - 15:00
13:30
10m
Talk
A Study On Mixup-inspired Augmentation Methods For Software Vulnerability Detection
Short Papers, Emerging Results
S. Shayan Daneshvar University of Manitoba, Da Tan University of Manitoba, Shaowei Wang University of Manitoba, Carson Leung University of Manitoba
13:40
15m
Talk
An Empirical Study of Database Security Topics on Technical Social Forums of Software Developers
Research Papers
Md Rakibul Islam Lamar University, Youngeun Jo Lamar University
13:55
15m
Talk
Automated Vulnerability Injection in Solidity Smart Contracts: A Mutation-Based Approach for Benchmark Development
Research Papers
Gerardo Iuliano University of Salerno, Luigi Allocca University of Salerno, Matteo Cicalese University of Salerno, Dario Di Nucci University of Salerno
Pre-print
14:10
10m
Talk
Identifying Helpful Context for LLM-based Vulnerability Repair: A Preliminary Study
Short Papers, Emerging Results
Gabor Antal FrontEndART Software Ltd., University of Szeged, Bence Bogenfürst University of Szeged, Rudolf Ferenc University of Szeged, Peter Hegedus University of Szeged
14:20
15m
Talk
Improved Labeling of Security Defects in Code Review by Active Learning with LLMs
AI Models / Data
Johannes Härtel Vrije Universiteit Amsterdam
Pre-print
14:35
15m
Talk
LAMeD: LLM-generated Annotations for Memory Leak Detection
AI Models / Data
Ekaterina Shemetova Saint-Petersburg State University, Ivan Smirnov ITMO University, Anton Alekseev St. Petersburg Department of Steklov Institute of Mathematics; Kyrgyz State Technical University n.a. I. Razzakov; St. Petersburg University, Ilya Shenbin St. Petersburg Department of Steklov Institute of Mathematics, Alexey Rukhovich AI Foundation and Algorithm Lab, Sergey Nikolenko St. Petersburg Department of Steklov Institute of Mathematics, Vadim Lomshakov St. Petersburg Department of Steklov Institute of Mathematics, Irina Piontkovskaya AI Foundation and Algorithm Lab
Pre-print