Reality Check: Assessing GPT-4 in Fixing Real-World Software Vulnerabilities
Discovering and mitigating software vulnerabilities is a challenging task. Security issues are often hidden in complex software systems and remain undetected for a long time, until someone exploits them. These defects are often caused by simple code snippets that would be harmless in other contexts, for example, an unchecked path traversal. Large Language Models (LLMs) promise to revolutionize not just human-machine interactions but various software engineering tasks as well, including the automatic repair of vulnerabilities. However, it is currently hard to assess the performance, robustness, and reliability of these models, as most of their evaluation has been done on small, synthetic examples that are far from the real-world issues developers face in their daily jobs. In our work, we systematically evaluate the automatic vulnerability fixing capabilities of GPT-4, a popular LLM, using Vul4J, a database of real-world Java vulnerabilities. We ask the model to provide fixes for vulnerable methods, which we evaluate both manually and against the unit tests included in the Vul4J database. GPT-4 consistently provided perfect fixes for at least 12 out of the 46 examined vulnerabilities, which could be applied as is. In an additional 5 cases, the provided textual instructions would help fix the vulnerabilities in a practical scenario, despite the accompanying code being incorrect. Consistent with prior work, our findings also show that prompting has a significant effect on the results.
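The abstract cites an unchecked path traversal as a typical example of a simple yet dangerous defect. As a minimal illustrative sketch (not taken from the paper or from Vul4J; all names are hypothetical), the following Java snippet shows the vulnerable pattern and a common mitigation: normalizing the resolved path and verifying it stays under the intended base directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical demo of an unchecked path traversal and its fix.
public class PathTraversalDemo {

    private static final Path BASE_DIR = Paths.get("/srv/uploads");

    // Vulnerable: "../" segments in userInput can escape BASE_DIR.
    static Path resolveUnchecked(String userInput) {
        return BASE_DIR.resolve(userInput);
    }

    // Mitigated: normalize, then verify the result is still under BASE_DIR.
    static Path resolveChecked(String userInput) {
        Path candidate = BASE_DIR.resolve(userInput).normalize();
        if (!candidate.startsWith(BASE_DIR)) {
            throw new IllegalArgumentException("path traversal attempt: " + userInput);
        }
        return candidate;
    }

    public static void main(String[] args) {
        // The unchecked variant silently escapes the base directory.
        System.out.println(resolveUnchecked("../../etc/passwd").normalize());
        // The checked variant rejects the same input.
        try {
            resolveChecked("../../etc/passwd");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected");
        }
        // Benign input still resolves normally.
        System.out.println(resolveChecked("report.pdf"));
    }
}
```

A fix of exactly this shape (adding the normalize-and-check guard) is the kind of single-method patch the study expects GPT-4 to produce for each vulnerable method.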
Thu 20 Jun (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
16:00 - 17:15 | Security (2) | Research Papers / Industry | Room Vietri | Chair(s): Muhammad Ali Babar (School of Computer Science, The University of Adelaide)

16:00 (15m) Talk | VulDL: Tree-based and Graph-based Neural Networks for Vulnerability Detection and Localization | Research Papers
Jingzheng Wu (Institute of Software, The Chinese Academy of Sciences); Xiang Ling (Institute of Software, Chinese Academy of Sciences); Xu Duan (Institute of Software, Chinese Academy of Sciences); Tianyue Luo (Institute of Software, Chinese Academy of Sciences); Mutian Yang (Institute of Software, Chinese Academy of Sciences)

16:15 (15m) Talk | How the Training Procedure Impacts the Performance of Deep Learning-based Vulnerability Patching | Research Papers
Antonio Mastropaolo (William and Mary, USA); Vittoria Nardone (University of Molise); Gabriele Bavota (Software Institute @ Università della Svizzera Italiana); Massimiliano Di Penta (University of Sannio, Italy)

16:30 (15m) Talk | Reality Check: Assessing GPT-4 in Fixing Real-World Software Vulnerabilities | Research Papers
Zoltán Ságodi, Gabor Antal, Bence Bogenfürst, Martin Isztin, Peter Hegedus, Rudolf Ferenc (University of Szeged)

16:45 (15m) Talk | Does trainer gender make a difference when delivering phishing training? A new experimental design to capture bias | Research Papers
André Palheiros Da Silva (Vrije Universiteit); Winnie Bahati Mbaka (Vrije Universiteit); Johann Mayer (University of Twente); Jan-Willem Bullee (University of Twente); Katja Tuma (Vrije Universiteit Amsterdam)

17:00 (15m) Talk | Leveraging Large Language Models for Preliminary Security Risk Analysis: A Mission-Critical Case Study | Industry
Matteo Esposito (University of Rome Tor Vergata); Francesco Palagiano (Multitel di Lerede Alessandro & C. s.a.s.) | DOI · Pre-print