ISSTA 2025
Wed 25 - Sat 28 June 2025 Trondheim, Norway

Detecting vulnerabilities in binary files is a challenging task in cybersecurity, particularly when source code is unavailable and the compilation process and its parameters are unknown. Existing deep learning-based detection methods often depend on a binary's specific compilation settings, which limits their ability to generalize to binaries built differently. In this work, we provide a thorough comparison of assembly and LLVM-IR representations to determine which is more robust when compilation parameters are unknown; the choice of representation significantly influences detection accuracy. A further contribution is our use of CodeBERT, a transformer-based model, as a classifier for detecting vulnerabilities when the compilation process is unknown. To the best of our knowledge, this is the first study to use a transformer model for multi-class vulnerability detection in the LLVM-IR domain. Prior research has mainly relied on RNNs, the current state of the art for this task; while effective, they can struggle to capture long-range dependencies. Transformers are a suitable alternative because they model complex relationships across sequences more effectively, and their ability to encode both syntactic and semantic structure makes them well suited to code analysis. Our results highlight the potential of this approach to improve system security across a variety of binary configurations.
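The classification setup described above can be sketched in a few lines. The following is a minimal, hedged illustration of fine-tuning-style use of CodeBERT (`microsoft/codebert-base`) for multi-class vulnerability labels over LLVM-IR text; the label set and the preprocessing rules are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch: multi-class vulnerability classification over LLVM-IR with
# CodeBERT. The CWE label set and preprocessing below are assumptions for
# illustration; the paper's actual dataset and labels may differ.

import re

# Example label set (one "safe" class plus several CWE classes); the real
# labels depend on the dataset used in the study.
CWE_LABELS = ["safe", "CWE-119", "CWE-190", "CWE-416", "CWE-476"]


def preprocess_ir(ir_text: str, max_chars: int = 2000) -> str:
    """Strip LLVM-IR comments and collapse whitespace before tokenization."""
    # In LLVM-IR, ';' starts a comment that runs to end of line.
    no_comments = re.sub(r";[^\n]*", "", ir_text)
    collapsed = " ".join(no_comments.split())
    # Crude length cap so the input fits CodeBERT's 512-token limit.
    return collapsed[:max_chars]


def build_classifier(num_labels: int = len(CWE_LABELS)):
    """Load CodeBERT with a fresh sequence-classification head.

    Imported lazily because this downloads pretrained weights; requires the
    Hugging Face `transformers` package.
    """
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/codebert-base", num_labels=num_labels
    )
    return tokenizer, model


if __name__ == "__main__":
    ir = """
    define i32 @add(i32 %a, i32 %b) {   ; simple function
      %sum = add nsw i32 %a, %b
      ret i32 %sum
    }
    """
    print(preprocess_ir(ir))
```

The classification head on top of CodeBERT outputs one logit per class, so multi-class CWE prediction reduces to ordinary sequence classification over the preprocessed IR text.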