ICSE 2026
Sun 12 - Sat 18 April 2026, Rio de Janeiro, Brazil

While automatic code generation using Large Language Models (LLMs) has advanced significantly, these models frequently produce code containing security vulnerabilities. Existing approaches to improve the security of automatically generated code, such as fine-tuning or prompt engineering, have shown limited success and provide minimal insight into the underlying mechanisms causing these vulnerabilities. We propose an approach grounded in mechanistic interpretability to analyze and mitigate vulnerable code generation in LLMs. We begin by examining the knowledge stored inside LLMs, identifying and disentangling knowledge representations that contribute to generating vulnerable code. Next, we leverage these insights to repair model execution in real time: when the model attempts to access vulnerability-inducing representations during inference, our method intercepts and modifies this access, improving the security of the generated code.
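
For illustration only, the sketch below shows one way such an inference-time intervention could look in PyTorch: a forward hook on a single Llama 3.1 decoder layer projects the hidden states away from a precomputed "vulnerability direction". The layer index, the direction vector, and the prompt are placeholder assumptions, not the configuration or procedure actually used by thea.

# Minimal sketch (not the paper's implementation): suppress a hypothetical
# "vulnerability direction" in one transformer layer's output at inference time.
# Assumes a precomputed unit vector identified offline (e.g., from contrasting
# activations on secure vs. insecure code completions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

layer_idx = 20                                     # hypothetical layer of interest
hidden_size = model.config.hidden_size
vuln_direction = torch.randn(hidden_size)          # placeholder; load the real vector here
vuln_direction = vuln_direction / vuln_direction.norm()

def remove_vulnerability_component(module, inputs, output):
    """Project the layer's hidden states away from the vulnerability direction."""
    hidden = output[0] if isinstance(output, tuple) else output
    d = vuln_direction.to(device=hidden.device, dtype=hidden.dtype)
    coeff = (hidden @ d).unsqueeze(-1)              # per-token component along d
    hidden = hidden - coeff * d                     # subtract that component
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

hook = model.model.layers[layer_idx].register_forward_hook(remove_vulnerability_component)

prompt = "Write a C function that copies a user-supplied string into a buffer."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

hook.remove()                                       # restore unmodified execution

The hook-based design keeps the base model weights untouched, so the intervention can be enabled or disabled per request, which matches the abstract's framing of repairing model execution rather than retraining the model.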

We implement our methodology in a tool called thea and evaluate it on the CyberSecEval benchmark using Llama 3.1. Our results show that thea effectively improves the security of the generated code, achieving an overall improvement of around 15% across 30 different types of vulnerabilities. In particular, it reduces buffer overflows (CWE-120) by 43% and SQL injections (CWE-89) by 30%, and it successfully mitigates other kinds of vulnerabilities as well. Our analysis further reveals that in cases where the vulnerability reduction is less substantial (such as an 11% reduction for CWE-338), the insights behind thea can be leveraged to reliably detect the occurrence of a vulnerability, enabling us to warn users when complete remediation is not possible. In addition, we empirically confirm that these interventions neither degrade model performance nor introduce new security risks.
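
As an illustration of the detection idea only, a lightweight linear probe over the same internal activations could flag likely-vulnerable completions and trigger a user warning. The probe architecture, the monitored layer, and the weights file named below are hypothetical assumptions, not the detection procedure described in the paper.

# Minimal sketch (assumptions marked): a logistic-regression probe over
# mean-pooled hidden states flags generations that likely contain a weakness
# (e.g., CWE-338) when suppression alone is insufficient.
import torch
import torch.nn as nn

hidden_size = 4096                       # hidden size of the assumed Llama 3.1 8B model

class VulnerabilityProbe(nn.Module):
    """Linear probe over one layer's mean-pooled activations."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # activations: [batch, seq, hidden] captured from the monitored layer
        pooled = activations.mean(dim=1)           # [batch, hidden]
        return torch.sigmoid(self.linear(pooled))  # probability of a vulnerable completion

probe = VulnerabilityProbe(hidden_size)
# probe.load_state_dict(torch.load("cwe338_probe.pt"))   # hypothetical trained weights

activations = torch.randn(1, 64, hidden_size)      # stand-in for captured activations
if probe(activations).item() > 0.5:
    print("Warning: the generated code may contain a CWE-338 weakness; review before use.")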

Our findings reveal critical insights into why LLMs produce vulnerable code: the models explicitly learn vulnerability patterns and actively use them during inference. We repair LLM executions at inference time so that they avoid these vulnerability patterns.