VulInstruct: Teaching LLMs Root-Cause Reasoning for Vulnerability Detection via Security Specifications
Large language models (LLMs) have achieved remarkable progress in code understanding and analysis tasks. However, even state-of-the-art LLMs perform poorly on vulnerability detection and struggle to distinguish vulnerable code from patched code. We argue that a key reason for this limitation is that LLMs lack an understanding of \textbf{security specifications}: the expectations defined by developers and security teams about how code should behave to remain safe. When the actual behavior of the code deviates from these expectations and introduces a security risk, the deviation becomes a potential vulnerability. However, such knowledge is rarely explicit in training data, leaving models unable to reason about the root causes of security flaws. To address this challenge, we propose \textbf{VulInstruct}, a specification-guided approach that systematically extracts reusable security specifications from historical vulnerabilities to instruct the detection of new ones. Specifically, VulInstruct uses two automatic pipelines to construct a \textbf{specification knowledge base} from complementary perspectives: (i) \textbf{General specifications}, extracted from high-quality patches across diverse projects, which capture fundamental safe behaviors accumulated across the open-source ecosystem; and (ii) \textbf{Domain-specific specifications}, context-dependent expectations that are repeatedly violated in particular repositories or domains and are relevant to the target code under analysis. Before analyzing new code, VulInstruct queries this knowledge base to retrieve relevant past cases and their associated specifications, enabling LLMs to reason about expected safe behaviors rather than relying solely on surface patterns. We evaluate VulInstruct under strict criteria that require both a correct prediction and valid reasoning.
On the PrimeVul dataset, VulInstruct achieves 45.0% F1-score (32.7% improvement) and 37.7% recall (50.8% improvement) compared to the strongest baselines, while uniquely detecting 24.3% of all identified vulnerabilities, 2.4$\times$ more than any baseline. In pairwise evaluation, which requires distinguishing vulnerable code from its patched counterpart, VulInstruct also achieves a 32.3% relative improvement over the best baseline. Beyond benchmarks, VulInstruct discovered a previously unknown high-severity vulnerability in production code by recognizing violations of extracted specifications, demonstrating its practical value for real-world vulnerability discovery. All code and supplementary materials are available at https://anonymous.4open.science/r/VulInstruct-2DE0/.
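To make the retrieval step described in the abstract concrete, the following is a minimal sketch of a specification knowledge base and a prompt builder. It is not the VulInstruct implementation: the `PastCase` structure, the token-level Jaccard similarity, the prompt wording, and the two example cases are all assumptions chosen for illustration, and the abstract does not state how retrieval is actually performed.

```python
import re
from dataclasses import dataclass, field


@dataclass
class PastCase:
    """Hypothetical record: one historical vulnerability and its specifications."""
    case_id: str
    kind: str  # "general" or "domain-specific" (labels assumed for this sketch)
    vulnerable_code: str
    specifications: list = field(default_factory=list)


def tokens(code: str) -> set:
    """Split code into a set of identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (assumed stand-in for the real retriever)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(target_code: str, knowledge_base: list, k: int = 2) -> list:
    """Return the k past cases whose vulnerable code is most similar to the target."""
    target = tokens(target_code)
    ranked = sorted(
        knowledge_base,
        key=lambda case: jaccard(target, tokens(case.vulnerable_code)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(target_code: str, cases: list) -> str:
    """Place the retrieved specifications in front of the code under analysis."""
    lines = ["Security specifications from similar past vulnerabilities:"]
    for case in cases:
        for spec in case.specifications:
            lines.append(f"- [{case.kind}] {spec}")
    lines.append("")
    lines.append("Does the code below violate any specification above? Explain the root cause.")
    lines.append(target_code)
    return "\n".join(lines)


# Illustrative knowledge base; both entries are invented for this example.
KNOWLEDGE_BASE = [
    PastCase(
        case_id="case-memcpy",
        kind="general",
        vulnerable_code="memcpy(dst, src, len);",
        specifications=["Validate the copy length against the destination buffer size before copying."],
    ),
    PastCase(
        case_id="case-uaf",
        kind="general",
        vulnerable_code="free(ptr); use(ptr);",
        specifications=["Do not dereference a pointer after it has been freed."],
    ),
]

TARGET = "void copy(char *dst, const char *src, size_t len) { memcpy(dst, src, len); }"

top_cases = retrieve(TARGET, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(TARGET, top_cases)
```

In this sketch the prompt carries explicit statements of expected safe behavior, so the model is asked to check the target code against them rather than to match surface patterns. A production retriever would likely use a stronger similarity measure than token overlap.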