Unsupervised Labeling and Extraction of Phrase-based Concepts in Vulnerability Descriptions
People usually describe the key characteristics of software vulnerabilities in natural language mixed with domain-specific names and concepts. This textual nature poses a significant challenge for automatic analysis of vulnerabilities. Automatic extraction of key vulnerability aspects is highly desirable but demands significant effort to manually label data for model training. In this paper, we propose an unsupervised approach to label and extract important vulnerability concepts in textual vulnerability descriptions (TVDs). We focus on three types of phrase-based vulnerability concepts (root cause, attack vector and impact) as they are much more difficult to label and extract than name- or number-based entities (i.e., vendor, product and version). Our approach is based on the key observation that phrases of the same type, no matter how they differ in sentence structure and phrase expression, usually share syntactically similar paths in the sentence parsing trees. Therefore, we propose two path representations (absolute paths and relative paths) and use an auto-encoder to encode such syntactic similarities. To address the discrete nature of our paths, we enhance the traditional Variational Auto-encoder (VAE) with the Gumbel-Max trick for categorical data distributions, thus creating a Categorical VAE (CaVAE). In the latent space of absolute and relative paths, we further apply FIt-TSNE and clustering techniques to generate clusters of same-type concepts. Our evaluation confirms the effectiveness of our CaVAE for encoding path representations, and the accuracy of the vulnerability concepts in the resulting clusters. In a concept classification task, the vulnerability concepts labeled by our unsupervised approach outperform the two manually labeled datasets from previous work.
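To illustrate the sampling idea behind the categorical latent variables, the following is a minimal sketch of the Gumbel-Max trick in plain Python. It is not the CaVAE implementation. The function names, the example probabilities and the temperature value are assumptions made for illustration, and the relaxed (softmax) variant is shown only as a commonly used differentiable counterpart, not as the exact formulation used in our model.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) variate by inverse transform sampling."""
    u = rng.random()
    # random() returns values in [0, 1); redraw on 0.0, where log(u) is undefined.
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))


def gumbel_max_sample(probs, rng):
    """Return a category index distributed as Categorical(probs).

    Assumes every probability is strictly positive (hypothetical helper).
    """
    scores = [math.log(p) + sample_gumbel(rng) for p in probs]
    # The argmax over Gumbel-perturbed log-probabilities is an exact sample.
    return max(range(len(probs)), key=scores.__getitem__)


def gumbel_softmax(probs, rng, temperature=0.5):
    """Relaxed sample: a softmax over the perturbed scores instead of an argmax.

    The temperature value is an illustrative assumption; lower values push
    the output closer to a one-hot vector.
    """
    scores = [(math.log(p) + sample_gumbel(rng)) / temperature for p in probs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Usage: empirical frequencies of the discrete samples approach probs.
rng = random.Random(0)
probs = [0.7, 0.2, 0.1]
n_draws = 20000
counts = [0] * len(probs)
for _ in range(n_draws):
    counts[gumbel_max_sample(probs, rng)] += 1
freqs = [c / n_draws for c in counts]
soft = gumbel_softmax(probs, rng)
```

The discrete sampler returns an index, which suits categorical path symbols, while the relaxed variant returns a probability vector through which gradients can flow during training.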