ICST 2025
Mon 31 March - Fri 4 April 2025 Naples, Italy

Security vulnerabilities in modern software are prevalent and harmful. While automated vulnerability detection techniques have made promising progress, their scalability and applicability remain challenging. The remarkable performance of Large Language Models (LLMs), such as GPT-4 and CodeLlama, on code-related tasks has prompted recent work to explore whether LLMs can be used to detect security vulnerabilities. In this paper, we perform a more comprehensive study by concurrently examining a larger number of datasets, languages, and LLMs, and by qualitatively evaluating detection performance across prompts and vulnerability classes while addressing the shortcomings of existing tools. Concretely, we evaluate the effectiveness of 16 pre-trained LLMs on 5,000 code samples, 1,000 randomly selected from each of five diverse security datasets. These balanced datasets encompass both synthetic and real-world projects in Java and C/C++ and cover 25 distinct vulnerability classes.
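To make the setup concrete, the sketch below (hypothetical throughout; the paper's actual harness is not shown) illustrates the evaluation protocol just described: draw a fixed-size random subset from each dataset, ask a model for a binary vulnerable/benign verdict on each sample, and score the results by accuracy and F1, the two metrics reported next.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

public class EvalHarness {
    // A labeled code sample: the snippet text and its ground-truth label.
    record Sample(String code, boolean vulnerable) {}

    // Draw a fixed-size random subset from a dataset
    // (1,000 per dataset in the study's setup).
    static List<Sample> sample(List<Sample> dataset, int n, long seed) {
        List<Sample> copy = new ArrayList<>(dataset);
        Collections.shuffle(copy, new Random(seed));
        return copy.subList(0, Math.min(n, copy.size()));
    }

    // Score any binary predictor (an LLM wrapped as Sample -> boolean)
    // by accuracy and F1 over the sampled subset.
    static double[] score(Predicate<Sample> model, List<Sample> samples) {
        int tp = 0, fp = 0, fn = 0, correct = 0;
        for (Sample s : samples) {
            boolean pred = model.test(s);
            if (pred == s.vulnerable()) correct++;
            if (pred && s.vulnerable()) tp++;
            else if (pred) fp++;
            else if (s.vulnerable()) fn++;
        }
        double precision = tp == 0 ? 0 : (double) tp / (tp + fp);
        double recall    = tp == 0 ? 0 : (double) tp / (tp + fn);
        double f1 = (precision + recall) == 0 ? 0
                : 2 * precision * recall / (precision + recall);
        return new double[] { (double) correct / samples.size(), f1 };
    }
}
```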

Overall, LLMs across all scales and families show modest effectiveness in end-to-end reasoning about vulnerabilities, obtaining an average accuracy of 62.8% and an F1 score of 0.71 across all datasets. They are significantly better at detecting vulnerabilities that typically require only intra-procedural analysis, such as OS Command Injection and NULL Pointer Dereference. Moreover, on these vulnerabilities they achieve higher accuracy than popular static analysis tools such as CodeQL.
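The distinction matters because such flaws are visible within a single function body. The illustrative Java snippets below (our own, not drawn from the study's datasets) show the two CWE classes named above, each detectable without any cross-function reasoning:

```java
import java.io.IOException;

public class IntraProceduralExamples {

    // CWE-78: OS Command Injection. The tainted parameter flows directly
    // into a shell command inside this one method, so no inter-procedural
    // reasoning is needed to spot the flaw.
    static void ping(String host) throws IOException {
        // VULNERABLE: `host` is concatenated into a shell command unsanitized.
        Runtime.getRuntime().exec(new String[] { "sh", "-c", "ping -c 1 " + host });
    }

    // CWE-476: NULL Pointer Dereference. The missing null check is again
    // visible entirely within the method body.
    static int nameLength(String name) {
        if (name.isEmpty()) {   // VULNERABLE: dereferences `name` before any
            return 0;           // null check; a null argument crashes here.
        }
        return name.length();
    }
}
```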

We find that advanced prompting strategies that involve step-by-step analysis significantly improve the performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average). Interestingly, we observe that LLMs show promise at performing parts of the analysis correctly, such as identifying vulnerability-related specifications (e.g., sources and sinks) and leveraging natural language information to understand code behavior (e.g., to check whether code is sanitized). We expect our insights to guide future work on LLM-augmented vulnerability detection systems.
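As an illustration of what such step-by-step reasoning looks like, the hypothetical Java snippet below is annotated the way a source/sink-oriented prompt might ask a model to proceed: first label the vulnerability-related specifications, then use natural language cues such as identifier names to judge whether the flow is sanitized. (The exact prompt wording used in the study is not reproduced here.)

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class StepByStepExample {
    static void lookupUser(Connection db, String userId) throws SQLException {
        // Step 1 (specification): `userId` is attacker-controlled -> SOURCE.
        // Step 2 (specification): Statement.executeQuery is a SQL sink -> SINK.
        String safeId = sanitize(userId);
        // Step 3 (code understanding): the identifier `sanitize` signals that
        // the value is cleaned before reaching the sink, so the flow is benign.
        try (Statement st = db.createStatement()) {
            st.executeQuery("SELECT * FROM users WHERE id = '" + safeId + "'");
        }
    }

    // Whitelist-style sanitizer: keeps only alphanumeric characters.
    static String sanitize(String s) {
        return s.replaceAll("[^A-Za-z0-9]", "");
    }
}
```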