Should We Evaluate LLM Based Security Analysis Approaches on Open Source Systems?
This program is tentative and subject to change.
Wed 19 Nov 2025 16:10 - 16:20 at Vista - Security 7
Existing research has demonstrated promising results when applying large language models (LLMs) to detect security vulnerabilities in source code. However, these studies have been exclusively evaluated on benchmarks from Open Source systems, using publicly known vulnerabilities that are likely part of the LLMs’ training data. This raises concerns that reported performance metrics may be inflated due to data contamination, providing a misleading view of the models’ actual capabilities.
In this paper, we quantify this effect with a case study that evaluates five frontier LLMs on two carefully curated datasets: CWE-Bench-Java (an Open Source dataset) and TS-Vuls (a closed-source commercial codebase). To provide a second angle, we also split CWE-Bench-Java by the CVE record’s date to explore temporal contamination relative to the LLMs’ knowledge cutoff dates.
Our results reveal that the average F1 score drops by approximately 20 percentage points from the Open Source to the closed-source dataset. Additionally, average precision drops from 56% to 34%, a difference that is statistically significant (p < 0.05) for four of the five models. This declining trend is consistent across all tested LLMs and metrics. In contrast, the results for the temporal split on Open Source data are inconclusive, suggesting that using a knowledge cutoff may reduce contamination effects but does not ensure their elimination.
Although our study is based on a single closed-source system and is thus not generalizable, these findings provide the first empirical evidence that evaluating LLM-based vulnerability detection on Open Source benchmarks may lead to overly optimistic results. This motivates extending the closed-source dataset in future LLM evaluations.
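The reported gap can be made concrete with a small numeric sketch of how precision, recall, and F1 combine. Only the 56% and 34% precision figures come from the abstract; all confusion-matrix counts below are invented for illustration, not taken from the paper.

```python
# Hedged illustration with invented TP/FP/FN counts; only the 56% vs 34%
# precision figures are from the abstract, everything else is hypothetical.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Open Source dataset: precision 0.56 (56 TP, 44 FP), recall 0.70
open_f1 = f1(tp=56, fp=44, fn=24)    # ~0.62
# Closed-source dataset: precision 0.34 (34 TP, 66 FP), recall 0.50
closed_f1 = f1(tp=34, fp=66, fn=34)  # ~0.40

print(f"F1 drop: {(open_f1 - closed_f1) * 100:.0f} percentage points")
```

With these hypothetical counts, the precision drop alone pulls F1 down by roughly 20 points even when recall degrades only moderately, matching the magnitude the abstract reports.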
Wed 19 Nov (displayed time zone: Seoul)

16:00 - 17:00: Security 7 (Vista)

- 16:00 (10m, Talk): Measuring Software Resilience Using Socially Aware Truck Factor Estimation (NIER Track). Alexis Butler (Royal Holloway, University of London), Dan O'Keeffe (Royal Holloway, University of London), Santanu Dash (University of Surrey)
- 16:10 (10m, Talk): Should We Evaluate LLM Based Security Analysis Approaches on Open Source Systems? (Industry Showcase). Kohei Dozono (Technical University of Munich), Jonas Engesser (Technical University of Munich), Benjamin Hummel (CQSE GmbH), Alexander Pretschner (TU Munich), Tobias Roehm (CQSE GmbH)
- 16:20 (10m, Talk): DALEQ - Explainable Equivalence for Java Bytecode (Industry Showcase)
- 16:30 (10m, Talk): A Secure Mocking Approach towards Software Supply Chain Security (NIER Track). Daisuke Yamaguchi (NTT, Inc.), Shinobu Saito (NTT, Inc.), Takuya Iwatsuka (NTT), Nariyoshi Chida (NTT, Inc.), Tachio Terauchi (Waseda University)
- 16:40 (10m, Talk): TRON: Fuzzing Linux Network Stack via Protocol-System Call Payload Synthesis (Industry Showcase). Qiang Zhang (Hunan University), Yifei Chu (Tsinghua University), Yuheng Shen (Tsinghua University), Jianzhong Liu (Tsinghua University), Heyuan Shi (Central South University), Yu Jiang (Tsinghua University), Wanli Chang (College of Computer Science and Electronic Engineering, Hunan University)
- 16:50 (10m, Talk): Industry Practice of LLM-Assisted Protocol Fuzzing for Commercial Communication Modules (Industry Showcase). Qiang Fu (Central South University), Changjian Liu (Central South University), Yuan Ding (China Mobile IoT), Chao Fan (China Mobile IoT), Yulai Fu, Yuhan Chen (Central South University), Ying Fu (Tsinghua University), Ronghua Shi (Central South University), Fuchen Ma (Tsinghua University), Heyuan Shi (Central South University)