Improved Labeling of Security Defects in Code Review by Active Learning with LLMs (EASE 2025 - AI Models / Data)

Track

EASE 2025 AI Models / Data

Time Zone

The program is currently displayed in (GMT+03:00) Athens.

When

Fri 20 Jun 2025, 14:20 - 14:35, at Workshop Room (session: Security). Chair(s): Beyza Eken

Abstract

Mining high-quality datasets of security defects is important for cybersecurity. In this paper, we focus on mining a dataset of reviews that discuss potential security defects in code or other artifacts. Mining such datasets often involves labeling, and this is challenging because security defects are rare.

We investigate the use of active learning with a fine-tuned large language model to make the mining and labeling of such datasets more effective. Our simulations demonstrate that active learning can increase the effectiveness of human annotators by a factor of 13. This means we can produce datasets with 13 times more defects than found in random samples of the same size. We conducted an empirical study on over four million unlabeled reviews from GitHub, showing that active learning increases the effectiveness by a factor greater than 6. In total, 246 out of 1298 labeled reviews (about 19%) can be identified as discussing security defects. Rather than depending on a keyword list for upfront candidate selection, we dynamically evolve an LLM for this purpose.
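The labeling loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: a smoothed word-frequency scorer stands in for the fine-tuned LLM, the pool is a hypothetical set of 200 synthetic reviews with a 5% rate of security-related items, and the selection rule (label the highest-scored unlabeled items each round) is an assumption chosen for simplicity. All names (`build_pool`, `train`, `score`, `active_learning`) and all review texts are invented for illustration, and the toy's numbers say nothing about the results reported in the paper.

```python
from collections import Counter

# Hypothetical review templates; label 1 = discusses a security defect.
SECURITY = [
    "this query allows sql injection",
    "unchecked length may overflow the buffer",
    "token is logged in plain text",
    "missing auth check allows injection of commands",
]
ROUTINE = [
    "please rename this variable",
    "typo in the comment",
    "add a unit test for this branch",
    "formatting is off here",
    "this import is unused",
]


def build_pool(size=200, every=20):
    """Deterministic toy pool: every `every`-th review is security-related."""
    pool = []
    for i in range(size):
        if i % every == 0:
            pool.append((SECURITY[(i // every) % len(SECURITY)], 1))
        else:
            pool.append((ROUTINE[i % len(ROUTINE)], 0))
    return pool


def train(labeled):
    """Stand-in for fine-tuning: count words per class in the labeled set."""
    pos, neg = Counter(), Counter()
    for text, label in labeled:
        (pos if label else neg).update(text.split())
    return pos, neg


def score(text, model):
    """Mean smoothed P(security | word); unseen words score 0.5."""
    pos, neg = model
    words = text.split()
    return sum((pos[w] + 1) / (pos[w] + neg[w] + 2) for w in words) / len(words)


def active_learning(pool, seed_size=10, batch=5, rounds=2):
    """Label a seed set, then repeatedly retrain and label the top-scored items."""
    # The hidden label pool[i][1] plays the role of the human annotator.
    labeled = {i: pool[i][1] for i in range(seed_size)}
    for _ in range(rounds):
        model = train([(pool[i][0], y) for i, y in labeled.items()])
        candidates = [i for i in range(len(pool)) if i not in labeled]
        candidates.sort(key=lambda i: score(pool[i][0], model), reverse=True)
        for i in candidates[:batch]:
            labeled[i] = pool[i][1]
    return labeled


pool = build_pool()
labeled = active_learning(pool)
found = sum(labeled.values())
base_rate = sum(y for _, y in pool) / len(pool)
expected_random = base_rate * len(labeled)  # expected hits in a random sample of equal size
print(f"labeled={len(labeled)} found={found} expected_random={expected_random:.1f}")
```

Comparing `found` with `expected_random` mirrors the abstract's notion of effectiveness: defects found under model-guided selection versus a random sample of the same size. The toy pool is trivially separable, so the ratio it prints is far easier to achieve than on real review data.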

Our work holds the potential to inspire future research in this area, resolving rare class and imbalance problems at the root where they appear, by adjusting the mining and labeling of the datasets. Our final dataset and model are publicly available.

Link to Preprint

https://johanneshaertel.github.io/Hartel25.pdf

Session Program

Fri 20 Jun
Displayed time zone: Athens

13:30 - 15:00	Security (AI Models / Data / Research Papers / Short Papers, Emerging Results) at Workshop Room. Chair(s): Beyza Eken (Sakarya University)

13:30 10m Talk	A Study On Mixup-inspired Augmentation Methods For Software Vulnerability Detection. Short Papers, Emerging Results. S. Shayan Daneshvar (University of Manitoba), Da Tan (University of Manitoba), Shaowei Wang (University of Manitoba), Carson Leung (University of Manitoba)
13:40 15m Talk	An Empirical Study of Database Security Topics on Technical Social Forums of Software Developers. Research Papers. Md Rakibul Islam (Lamar University), Youngeun Jo (Lamar University)
13:55 15m Talk	Automated Vulnerability Injection in Solidity Smart Contracts: A Mutation-Based Approach for Benchmark Development. Research Papers. Gerardo Iuliano (University of Salerno), Luigi Allocca (University of Salerno), Matteo Cicalese (University of Salerno), Dario Di Nucci (University of Salerno). Pre-print
14:10 10m Talk	Identifying Helpful Context for LLM-based Vulnerability Repair: A Preliminary Study. Short Papers, Emerging Results. Gabor Antal (FrontEndART Software Ltd., University of Szeged), Bence Bogenfürst (University of Szeged), Rudolf Ferenc (University of Szeged), Peter Hegedus (University of Szeged)
14:20 15m Talk	Improved Labeling of Security Defects in Code Review by Active Learning with LLMs. AI Models / Data. Johannes Härtel (Vrije Universiteit Amsterdam). Pre-print
14:35 15m Talk	LAMeD: LLM-generated Annotations for Memory Leak Detection. AI Models / Data. Ekaterina Shemetova (Saint-Petersburg State University), Ivan Smirnov (ITMO University), Anton Alekseev (St. Petersburg Department of Steklov Institute of Mathematics; Kyrgyz State Technical University n.a. I. Razzakov; St. Petersburg University), Ilya Shenbin (St. Petersburg Department of Steklov Institute of Mathematics), Alexey Rukhovich (AI Foundation and Algorithm Lab), Sergey Nikolenko (St. Petersburg Department of Steklov Institute of Mathematics), Vadim Lomshakov (St. Petersburg Department of Steklov Institute of Mathematics), Irina Piontkovskaya (AI Foundation and Algorithm Lab). Pre-print
