EASE 2024
Tue 18 - Fri 21 June 2024, Salerno, Italy

Discovering and mitigating software vulnerabilities is a challenging task. Security issues are often hidden in complex software systems and remain undetected for a long time until someone exploits them. These defects often stem from simple code snippets that would be harmless in other contexts, such as an unchecked path traversal. Large Language Models (LLMs) promise to revolutionize not just human-machine interaction but various software engineering tasks as well, including the automatic repair of vulnerabilities. However, it is currently hard to assess the performance, robustness, and reliability of these models, as most of their evaluation has been done on small, synthetic examples far removed from the real-world issues developers face in their daily work. In our work, we systematically evaluate the automatic vulnerability fixing capabilities of GPT-4, a popular LLM, using Vul4J, a database of real-world Java vulnerabilities. We prompt the model to produce fixes for vulnerable methods and evaluate its responses both manually and against the unit tests included in the Vul4J database. For at least 12 of the 46 examined vulnerabilities, GPT-4 consistently provided perfect fixes that could be applied as-is. In an additional 5 cases, the accompanying textual instructions would help fix the vulnerability in a practical scenario, even though the generated code itself was incorrect. In line with prior work, our findings also show that prompting has a significant effect on the results.
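To illustrate the kind of defect mentioned above, the following minimal Java sketch (a hypothetical example; the class and directory names are illustrative and not taken from Vul4J or the paper) shows an unchecked path traversal and one common way to guard against it: normalize the resolved path and verify that it stays inside the intended base directory.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class UploadReader {
        private static final Path BASE_DIR = Paths.get("/var/www/uploads");

        // Vulnerable: a request such as "../../etc/passwd" escapes BASE_DIR.
        public static byte[] readUnchecked(String userSuppliedName) throws IOException {
            return Files.readAllBytes(BASE_DIR.resolve(userSuppliedName));
        }

        // Guarded: normalize the resolved path and reject anything outside BASE_DIR.
        public static byte[] readChecked(String userSuppliedName) throws IOException {
            Path resolved = BASE_DIR.resolve(userSuppliedName).normalize();
            if (!resolved.startsWith(BASE_DIR)) {
                throw new SecurityException("Path traversal attempt: " + userSuppliedName);
            }
            return Files.readAllBytes(resolved);
        }
    }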