SIExVulTS: Sensitive Information Exposure Vulnerability Detection System using Transformer Models and Static Analysis
Sensitive Information Exposure (SIEx) vulnerabilities (CWE-200) remain a persistent and under-addressed threat across software systems and often lead to serious security breaches. Existing detection tools rarely target the diverse subcategories of CWE-200 or provide context-aware analysis of code-level data flows. In this paper, we present SIExVulTS, a novel vulnerability detection system that integrates transformer-based models with static analysis to identify and verify sensitive information exposure in Java applications. SIExVulTS employs a three-stage architecture: (1) an Attack Surface Detection Engine that uses sentence embeddings to identify sensitive variables, strings, comments, and sinks, with an average F1 score above 93%; (2) an Exposure Analysis Engine that instantiates CodeQL queries aligned with the CWE-200 hierarchy, achieving an F1 score of 85.71%; and (3) a Flow Verification Engine that leverages GraphCodeBERT to semantically validate source-to-sink flows, raising precision from 22.61% to 87.23%. We evaluate SIExVulTS on three curated datasets: real-world CVEs, a benchmark set of synthetic CWE-200 examples, and labeled flows from 31 open-source projects. Moreover, SIExVulTS uncovered three previously unknown CVEs in major Apache projects. These results demonstrate its effectiveness and practical applicability for improving software security against sensitive data exposure.
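To illustrate why a verification stage after static analysis can raise precision, the following minimal Python sketch filters candidate source-to-sink flows by a confidence score and compares precision before and after. It is not the system's implementation: the `Flow` record, the `verify` function, the 0.5 threshold, and all flows, labels, and scores are hypothetical stand-ins chosen for illustration, and the score merely plays the role that a learned verifier such as the Flow Verification Engine would play.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """A candidate source-to-sink flow (all fields are hypothetical toy data)."""
    source: str
    sink: str
    is_true_exposure: bool  # ground-truth label, assumed known for evaluation
    verifier_score: float   # stand-in for a learned verifier's confidence


def precision(flows):
    """Fraction of reported flows that are true exposures."""
    if not flows:
        return 0.0
    return sum(f.is_true_exposure for f in flows) / len(flows)


def verify(flows, threshold=0.5):
    """Keep only flows whose verifier score meets the (assumed) threshold."""
    return [f for f in flows if f.verifier_score >= threshold]


# Hypothetical candidates, as a static analysis stage might report them.
CANDIDATES = [
    Flow("password", "logger.info", True, 0.94),
    Flow("apiKey", "response.write", True, 0.81),
    Flow("userName", "logger.debug", False, 0.35),
    Flow("tmpPath", "System.out.println", False, 0.12),
    Flow("sessionToken", "exception.getMessage", False, 0.62),
]

before = precision(CANDIDATES)         # 2 true flows out of 5 reported
after = precision(verify(CANDIDATES))  # 2 true flows out of 3 kept

if __name__ == "__main__":
    print(f"precision before verification: {before:.2f}")
    print(f"precision after verification:  {after:.2f}")
```

In this toy setting the filter removes two of the three false positives and keeps both true exposures, so precision rises while recall is unchanged. A verifier that also rejects true flows would trade recall for precision.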