Evaluating the Effectiveness of LLMs in Detecting Security Vulnerabilities
This program is tentative and subject to change.
Security vulnerabilities in modern software are prevalent and harmful. While automated vulnerability detection techniques have made promising progress, their scalability and applicability remain challenging. The remarkable performance of Large Language Models (LLMs), such as GPT-4 and CodeLlama, on code-related tasks has prompted recent works to explore whether LLMs can be used to detect security vulnerabilities. In this paper, we perform a more comprehensive study by concurrently examining a larger number of datasets, languages, and LLMs, and qualitatively evaluating detection performance across prompts and vulnerability classes while addressing the shortcomings of existing tools. Concretely, we evaluate the effectiveness of 16 pre-trained LLMs on 5,000 code samples (1,000 randomly selected from each of five diverse security datasets). These balanced datasets encompass both synthetic and real-world projects in Java and C/C++ and cover 25 distinct vulnerability classes.
Overall, LLMs across all scales and families show modest effectiveness in end-to-end reasoning about vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across all datasets. They are significantly better at detecting vulnerabilities that typically only need intra-procedural analysis, such as OS Command Injection and NULL Pointer Dereference. Moreover, they report higher accuracies on these vulnerabilities than popular static analysis tools, such as CodeQL.
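To make "intra-procedural" concrete, the sketch below (illustrative only; the class, method, and inputs are not drawn from the paper's datasets) shows an OS Command Injection (CWE-78) pattern in Java where the tainted value flows from source to sink within a single method, so no cross-function reasoning is required. The command is kept as a string so the sketch is safe to run; in real code it would reach a sink such as Runtime.exec.

```java
// Illustrative sketch (not from the paper's datasets): an OS Command
// Injection (CWE-78) pattern whose source-to-sink flow stays inside one
// method, so detecting it needs only intra-procedural reasoning.
public class CommandInjectionSketch {

    // Vulnerable pattern: the tainted parameter reaches the command
    // string unchecked. (Returned as a string here so the sketch is
    // runnable without executing anything; a real sink would be
    // Runtime.exec(cmd) or ProcessBuilder.)
    static String buildCommand(String userInput) {
        return "ping -c 1 " + userInput;  // source -> sink, no sanitization
    }

    // A simple sanitizer: allow only hostname-like characters,
    // rejecting shell metacharacters before the sink.
    static boolean isSanitized(String userInput) {
        return userInput.matches("[A-Za-z0-9.-]+");
    }

    public static void main(String[] args) {
        System.out.println(isSanitized("example.com"));           // benign input passes
        System.out.println(isSanitized("example.com; rm -rf /")); // injection attempt rejected
        System.out.println(buildCommand("example.com"));
    }
}
```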
We find that advanced prompting strategies that involve step-by-step analysis significantly improve performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average). Interestingly, we observe that LLMs show promising abilities at performing parts of the analysis correctly, such as identifying vulnerability-related specifications (e.g., sources and sinks) and leveraging natural language information to understand code behavior (e.g., to check if code is sanitized). We expect our insights to guide future work on LLM-augmented vulnerability detection systems.
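As an illustration of what a step-by-step prompting strategy can look like, the sketch below builds a prompt that walks an LLM through source identification, sink identification, and taint tracing before asking for a verdict. The wording and structure are assumptions for illustration; the paper's actual prompts may differ.

```java
// Illustrative sketch of a step-by-step vulnerability-analysis prompt,
// in the spirit of the prompting strategies the paper evaluates.
// The exact prompt wording used in the paper may differ.
public class StepwisePrompt {

    static String buildPrompt(String code, String cwe) {
        return String.join("\n",
            "You are a security analyst. Analyze the code below for " + cwe + ".",
            "Step 1: List potential sources of untrusted input.",
            "Step 2: List security-sensitive sinks.",
            "Step 3: Trace whether any source reaches a sink without sanitization.",
            "Step 4: Answer 'vulnerable' or 'not vulnerable' with a one-line justification.",
            "```",
            code,
            "```");
    }

    public static void main(String[] args) {
        // Hypothetical one-line code sample for demonstration.
        String sample = "String cmd = \"ping \" + request.getParameter(\"host\");";
        System.out.println(buildPrompt(sample, "CWE-78 (OS Command Injection)"));
    }
}
```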
Wed 2 Apr (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
11:00 - 12:30 | Session: LLMs in Testing (Research Papers / Industry / Journal-First Papers), Aula Magna (AM). Chair(s): Phil McMinn, University of Sheffield

11:00 (15m) Talk: AugmenTest: Enhancing Tests with LLM-driven Oracles [Research Papers]
  Shaker Mahmud Khandaker, Fitsum Kifetew, Davide Prandi, Angelo Susi (Fondazione Bruno Kessler). Pre-print available.

11:15 (15m) Talk: Impact of Large Language Models of Code on Fault Localization [Research Papers]
  Suhwan Ji, Yo-Sub Han (Yonsei University); Sanghwa Lee, Changsup Lee, Hyeonseung Im (Kangwon National University, South Korea)

11:30 (15m) Talk: An Analysis of LLM Fine-Tuning and Few-Shot Learning for Flaky Test Detection and Classification [Research Papers]

11:45 (15m) Talk: Evaluating the Effectiveness of LLMs in Detecting Security Vulnerabilities [Research Papers]
  Avishree Khare; Saikat Dutta (Cornell University); Ziyang Li, Alaia Solko-Breslin, Mayur Naik, Rajeev Alur (University of Pennsylvania)

12:00 (15m) Talk: FlakyFix: Using Large Language Models for Predicting Flaky Test Fix Categories and Test Code Repair [Journal-First Papers]
  Sakina Fatima (University of Ottawa); Hadi Hemmati (York University); Lionel Briand (University of Ottawa, Canada; Lero centre, University of Limerick, Ireland)

12:15 (15m) Talk: Integrating LLM-based Text Generation with Dynamic Context Retrieval for GUI Testing [Industry]
  Juyeon Yoon, Somin Kim, Shin Yoo (Korea Advanced Institute of Science and Technology); Seah Kim, Sukchul Jung (Samsung Research)