Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection
According to our survey of machine learning for vulnerability detection (ML4VD), 9 in every 10 papers published in the past five years define ML4VD as a function-level binary classification problem:
Given a function, does it contain a security flaw?
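Stated as code, this problem statement reduces to a predicate over a single function's source text. The following is an illustrative Python sketch of that interface; the names are ours, not from any surveyed paper:

```python
from typing import Protocol

class Ml4vdClassifier(Protocol):
    """The prevailing ML4VD problem statement as an interface: the
    classifier receives one function's source text and nothing else --
    no callers, no callees, no build configuration."""

    def is_vulnerable(self, function_source: str) -> bool:
        ...
```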
In our experience as security researchers, when deciding whether a given function makes a program vulnerable to attacks, we would often first want to understand the context in which that function is called.
In this paper, we study how often this decision can really be made without further context, examining both vulnerable and non-vulnerable functions in the most popular ML4VD datasets. We call a function vulnerable if it was involved in a patch of an actual security flaw and confirmed to cause the program's vulnerability; we call it non-vulnerable otherwise. We find that in almost all cases this decision cannot be made without further context: vulnerable functions are often vulnerable only because a corresponding vulnerability-inducing calling context exists, while non-vulnerable functions would often be vulnerable if such a context existed.
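To make this context dependence concrete, consider a hypothetical example of our own (in Python for brevity; the datasets in question contain C code). The same function is harmless under one calling context and exploitable under another:

```python
import subprocess

def search_log(pattern: str) -> str:
    # Taken in isolation, this function may or may not be a flaw: it
    # interpolates its argument into a shell command. Whether that is
    # exploitable depends entirely on the calling context.
    return subprocess.run(
        f"grep -n {pattern} /var/log/app.log",
        shell=True, capture_output=True, text=True,
    ).stdout

# Non-vulnerability-inducing context: the argument is a hard-coded constant.
print(search_log("ERROR"))

# Vulnerability-inducing context (sketch): if `pattern` instead came from an
# untrusted source, e.g. an HTTP parameter, an input such as "x; rm -rf ~"
# would escape into the shell -- the same function is now a command injection.
```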
But why do ML4VD techniques achieve high accuracy even though there is demonstrably not enough information in these samples? Spurious correlations: we find that high accuracy can be achieved even when only the word counts of a function are available. This shows that these datasets can be exploited to achieve high accuracy without actually detecting any security vulnerabilities.
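As a minimal sketch of what "only word counts" means, the following hypothetical Python baseline trains a classifier on nothing but token frequencies. The file name, column names, and model choice are our assumptions, not the paper's experimental setup:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical dataset layout: a `code` column with each function's source
# text and a binary `label` column (1 = vulnerable, 0 = non-vulnerable).
df = pd.read_csv("ml4vd_dataset.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["code"], df["label"], test_size=0.2, random_state=0
)

# Bag-of-words features: token counts only -- no syntax, no data flow, and
# no calling context, i.e., nothing that could genuinely indicate a flaw.
vectorizer = CountVectorizer(token_pattern=r"\w+")
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

preds = clf.predict(vectorizer.transform(X_test))
print(f"accuracy: {accuracy_score(y_test, preds):.2f}")
```

If such a baseline scores close to the published ML4VD models, the benchmark is rewarding dataset-specific correlations rather than vulnerability detection.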
We conclude that the prevailing problem statement of ML4VD is ill-defined and call into question the internal validity of this growing body of work. Constructively, we call for more effective benchmarking methodologies to evaluate the true capabilities of ML4VD, propose alternative problem statements, and examine the broader implications for the evaluation of machine learning and program analysis research.
Thu 26 Jun (time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
Session: 11:00 - 12:15

11:00 (25m, Talk) Top Score on the Wrong Exam: On Benchmarking in Machine Learning for Vulnerability Detection
Research Papers. Niklas Risse (Max Planck Institute for Security and Privacy), Jing Liu (Max Planck Institute for Security and Privacy), Marcel Böhme (Max Planck Institute for Security and Privacy). DOI, Pre-print.

11:25 (25m, Talk) SoK: A Taxonomic Analysis of DeFi Rug Pulls - Types, Dataset, and Tool Assessment
Research Papers. Dianxiang Sun (Nanyang Technological University), Wei Ma, Liming Nie, Yang Liu (Nanyang Technological University). DOI.

11:50 (25m, Talk) Recurring Vulnerability Detection: How Far Are We?
Research Papers. Yiheng Cao, Susheng Wu, Ruisi Wang, Bihuan Chen, Yiheng Huang, Chenhao Lu, Zhuotong Zhou, Xin Peng (all Fudan University). DOI.
Cosmos 3A is the first room in the Cosmos 3 wing.
When facing the main Cosmos Hall, the entrance to the Cosmos 3 wing is on the left, close to the stairs. The wing is accessed through a large door marked “3”, which will stay open during the event.