Should We Evaluate LLM Based Security Analysis Approaches on Open Source Systems?
This program is tentative and subject to change.
Wed 19 Nov 2025 16:10 - 16:20 at Vista - Security 7
Existing research has demonstrated promising results when applying large language models (LLMs) to detect security vulnerabilities in source code. However, these studies have been exclusively evaluated on benchmarks from Open Source systems, using publicly known vulnerabilities that are likely part of the LLMs’ training data. This raises concerns that reported performance metrics may be inflated due to data contamination, providing a misleading view of the models’ actual capabilities.
In this paper, we quantify this effect with a case study that evaluates five frontier LLMs on two carefully curated datasets: CWE-Bench-Java (an Open Source dataset) and TS-Vuls (a closed-source commercial codebase). To provide a second angle, we also split CWE-Bench-Java by the CVE record’s date to explore temporal contamination relative to the LLMs’ knowledge cutoff dates.
Our results reveal that the average F1 score drops by approximately 20 percentage points from the Open Source to the closed-source dataset. Additionally, average precision drops from 56% to 34%, a difference that is statistically significant (p < 0.05) for four of the five models. This declining trend is consistent across all tested LLMs and metrics. In contrast, the results for the temporal split on Open Source data are inconclusive, suggesting that using a knowledge cutoff may reduce contamination effects but does not ensure their elimination.
Although our study is based on a single closed-source system and is thus not generalizable, these findings provide the first empirical evidence that evaluating LLM-based vulnerability detection on Open Source benchmarks may lead to overly optimistic results. This motivates extending the closed-source dataset in future LLM evaluations.
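The reported gap can be made concrete with a small numeric sketch of how precision, recall, and F1 combine. Only the 56% and 34% precision figures come from the abstract; all confusion-matrix counts below are invented for illustration, not taken from the paper.

```python
# Hedged illustration with invented TP/FP/FN counts; only the 56% vs 34%
# precision figures are from the abstract, everything else is hypothetical.

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Open Source dataset: precision 0.56 (56 TP, 44 FP), recall 0.70
open_f1 = f1(tp=56, fp=44, fn=24)    # ~0.62
# Closed-source dataset: precision 0.34 (34 TP, 66 FP), recall 0.50
closed_f1 = f1(tp=34, fp=66, fn=34)  # ~0.40

print(f"F1 drop: {(open_f1 - closed_f1) * 100:.0f} percentage points")
```

With these hypothetical counts, the precision drop alone pulls F1 down by roughly 20 points even when recall degrades only moderately, matching the magnitude the abstract reports.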
Wed 19 Nov (displayed time zone: Seoul)

16:00 - 17:00: Security 7 (Vista)

- 16:00 (10m, Talk): Measuring Software Resilience Using Socially Aware Truck Factor Estimation (NIER Track). Alexis Butler (Royal Holloway, University of London), Dan O'Keeffe (Royal Holloway, University of London), Santanu Dash (University of Surrey)
- 16:10 (10m, Talk): Should We Evaluate LLM Based Security Analysis Approaches on Open Source Systems? (Industry Showcase). Kohei Dozono (Technical University of Munich), Jonas Engesser (Technical University of Munich), Benjamin Hummel (CQSE GmbH), Alexander Pretschner (TU Munich), Tobias Roehm (CQSE GmbH)
- 16:20 (10m, Talk): DALEQ - Explainable Equivalence for Java Bytecode (Industry Showcase)
- 16:30 (10m, Talk): A Secure Mocking Approach towards Software Supply Chain Security (NIER Track). Daisuke Yamaguchi (NTT, Inc.), Shinobu Saito (NTT, Inc.), Takuya Iwatsuka (NTT), Nariyoshi Chida (NTT, Inc.), Tachio Terauchi (Waseda University)
- 16:40 (10m, Talk): TRON: Fuzzing Linux Network Stack via Protocol-System Call Payload Synthesis (Industry Showcase). Qiang Zhang (Hunan University), Yifei Chu (Tsinghua University), Yuheng Shen (Tsinghua University), Jianzhong Liu (Tsinghua University), Heyuan Shi (Central South University), Yu Jiang (Tsinghua University), Wanli Chang (College of Computer Science and Electronic Engineering, Hunan University)
- 16:50 (10m, Talk): Industry Practice of LLM-Assisted Protocol Fuzzing for Commercial Communication Modules (Industry Showcase). Qiang Fu (Central South University), Changjian Liu (Central South University), Yuan Ding (China Mobile IoT), Chao Fan (China Mobile IoT), Yulai Fu, Yuhan Chen (Central South University), Ying Fu (Tsinghua University), Ronghua Shi (Central South University), Fuchen Ma (Tsinghua University), Heyuan Shi (Central South University)