On the use of imbalanced datasets for learning-based vulnerability detection (ICTSS 2025 - General Track)

Who

ROSMAEL ZIDANE LEKEUFACK FOULEFACK, Alessandro Marchetto

Track

ICTSS 2025 General Track

When

Wed 17 Sep 2025 16:30 - 17:00 (GMT+03:00, Athens) at Atrium C - Automated Test Generation and AI-Driven Testing
Chair(s): Tolgahan Bardakci

Abstract

Static code analysis conducted by means of learning-based methods is an essential part of security testing. Effective learning algorithms are crucial for training reliable models that can accurately detect weaknesses and vulnerabilities. During model training, however, it is equally important to use adequate datasets of vulnerable and non-vulnerable source code.

Most existing learning-based methods have been evaluated on public datasets of code fragments labeled as vulnerable or non-vulnerable. However, such datasets are known to contain spurious entries and are often imbalanced, i.e., they contain a large proportion of non-vulnerable code. While the first issue is often addressed by data-cleaning pre-processing, the second is largely ignored.

This paper reports a preliminary study of how imbalanced datasets and imbalance-handling techniques affect the performance of learning-based vulnerability detection methods. Our results show that (i) resampling approaches, in particular a combination of over- and under-sampling, can generate reliable models and corroborate the results; and (ii) imbalance-aware loss functions can improve performance on highly imbalanced and heterogeneous datasets.
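
To make the two families of techniques named in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the toy data, the target class size, and the choice of inverse-frequency weighted cross-entropy are all assumptions made for illustration, and the abstract does not say which resampling methods or loss functions the study evaluated.

```python
import math
import random
from collections import Counter


def resample_combined(samples, labels, target_per_class, seed=0):
    """Bring every class to target_per_class items (hypothetical helper).

    Classes larger than the target are under-sampled, smaller ones
    are over-sampled by duplicating randomly chosen items.
    """
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    out_samples, out_labels = [], []
    for label, group in by_class.items():
        if len(group) >= target_per_class:
            # Under-sampling: random subset without replacement.
            chosen = rng.sample(group, target_per_class)
        else:
            # Over-sampling: keep all items, add duplicates drawn with replacement.
            extra = rng.choices(group, k=target_per_class - len(group))
            chosen = group + extra
        out_samples.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_samples, out_labels


def class_weights(labels):
    """Inverse-frequency weights: n_total / (n_classes * n_in_class)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: total / (len(counts) * n) for c, n in counts.items()}


def weighted_bce(prob_vulnerable, label, weights):
    """Binary cross-entropy for one example, scaled by its class weight."""
    eps = 1e-12
    p = min(max(prob_vulnerable, eps), 1 - eps)
    loss = -math.log(p) if label == 1 else -math.log(1 - p)
    return weights[label] * loss


# Toy data (assumed): 90 non-vulnerable (0) and 10 vulnerable (1) fragments.
labels = [0] * 90 + [1] * 10
samples = [f"fragment_{i}" for i in range(100)]

balanced_samples, balanced_labels = resample_combined(
    samples, labels, target_per_class=30
)
weights = class_weights(labels)
```

In this sketch the resampling step changes the training set itself, while the weighted loss leaves the data untouched and instead penalizes mistakes on the rare vulnerable class more heavily; a real study would typically rely on a dedicated library rather than hand-written helpers.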

ROSMAEL ZIDANE LEKEUFACK FOULEFACK

University of Trento

Italy

Alessandro Marchetto

University of Trento