Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection?
Vulnerability detection is attracting increasing attention in software engineering, since code vulnerabilities can pose significant security risks. Recently, it has become common to reuse code pre-trained models (e.g., CodeBERT, CodeT5, and CodeGen) for code embedding in vulnerability detection without justifying the choice. The implicit premise behind this casual use of pre-trained models (PTMs) is that the code embeddings generated by different PTMs have a similar impact on detection performance. Is that true? To answer this question, we systematically investigate how code embeddings generated by ten different code PTMs affect the performance of vulnerability detection, and find that it is not true. We observe that the code embeddings generated by different code PTMs do influence performance, and that selecting an embedding technique based on parameter scale and embedding dimension is not reliable. Our findings highlight the need to quantify and evaluate the characteristics of code embeddings generated by different code PTMs in order to understand these effects. To this end, we analyze the numerical representation and data distribution of the code embeddings generated by different PTMs to evaluate their differences and characteristics. Based on these insights, we propose Coding-PTMs, a recommendation framework that assists engineers in selecting optimal code PTMs for their specific vulnerability detection tasks. Specifically, we define thirteen code embedding metrics across three dimensions (i.e., statistics, norm, and distribution) to construct a specialized code PTM recommendation dataset. We then train a Random Forest classifier as the recommendation model and use it to identify the optimal code PTMs from the candidate model zoo.
We encourage engineers to use Coding-PTMs to evaluate how the characteristics of code embeddings generated by candidate code PTMs affect performance, and to obtain recommendations of optimal code PTMs for code embedding in their vulnerability detection tasks.
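As a rough illustration of what embedding metrics across the three dimensions (statistics, norm, and distribution) might look like, the sketch below summarizes a batch of embedding vectors. The abstract does not list the thirteen metrics, so the eight metrics chosen here, the function name `embedding_metrics`, and the dictionary keys are all assumptions made for illustration, not the authors' definitions. Feature vectors of this kind could then be labeled with the best-performing PTM and fed to a Random Forest classifier, which is omitted here.

```python
import math
import statistics
from typing import Dict, Sequence


def embedding_metrics(embeddings: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Summarize a batch of code embeddings (one vector per code sample).

    Hypothetical metric set: the paper's actual thirteen metrics are not
    given in the abstract, so these are illustrative stand-ins only.
    """
    if not embeddings or not embeddings[0]:
        raise ValueError("need at least one non-empty embedding vector")

    # Pool every component of every vector into one flat list.
    values = [v for vec in embeddings for v in vec]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation

    # Norm dimension: average L1 and L2 norm over the vectors.
    l1_norm = statistics.fmean(sum(abs(v) for v in vec) for vec in embeddings)
    l2_norm = statistics.fmean(
        math.sqrt(sum(v * v for v in vec)) for vec in embeddings
    )

    # Distribution dimension: skewness and excess kurtosis of pooled values.
    if std == 0.0:
        skewness = kurtosis = 0.0  # degenerate case: all values identical
    else:
        skewness = statistics.fmean(((v - mean) / std) ** 3 for v in values)
        kurtosis = statistics.fmean(((v - mean) / std) ** 4 for v in values) - 3.0

    return {
        # Statistics dimension
        "mean": mean,
        "std": std,
        "min": min(values),
        "max": max(values),
        # Norm dimension
        "l1_norm": l1_norm,
        "l2_norm": l2_norm,
        # Distribution dimension
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


if __name__ == "__main__":
    # Toy embeddings standing in for the output of one candidate PTM.
    toy = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.5]]
    for name, value in embedding_metrics(toy).items():
        print(f"{name}: {value:.4f}")
```

Computing the same metrics for each candidate PTM on the same code samples gives one comparable feature vector per model, which is the kind of input a recommendation classifier would need.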
Thu 31 Oct | Displayed time zone: Pacific Time (US & Canada)
10:30 - 12:00 | Vulnerability and security 2 | NIER Track / Research Papers / Tool Demonstrations at Magnoila | Chair(s): Yiming Tang (Rochester Institute of Technology)
10:30 | 15m | Talk | Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection? | Research Papers | Yu Zhao, Lina Gong (Nanjing University of Aeronautics and Astronautics), Zhiqiu Huang (Nanjing University of Aeronautics and Astronautics), Yongwei Wang (Shanghai Institute for Advanced Study and College of Computer Science, Zhejiang University), Mingqiang Wei (School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics), Fei Wu (College of Computer Science and Technology in Zhejiang University)
10:45 | 15m | Talk | STASE: Static Analysis Guided Symbolic Execution for UEFI Vulnerability Signature Generation | Research Papers | Md Shafiuzzaman (University of California at Santa Barbara), Achintya Desai (University of California Santa Barbara), Laboni Sarker (University of California at Santa Barbara), Tevfik Bultan (University of California at Santa Barbara)
11:00 | 15m | Talk | Effective Vulnerable Function Identification based on CVE Description Empowered by Large Language Models | Research Papers | Yulun Wu (Huazhong University of Science and Technology), Ming Wen (Huazhong University of Science and Technology), Zeliang Yu (Huazhong University of Science and Technology), Xiaochen Guo (Huazhong University of Science and Technology), Hai Jin (Huazhong University of Science and Technology)
11:15 | 15m | Talk | COBRA: Interaction-Aware Bytecode-Level Vulnerability Detector for Smart Contracts | Research Papers | Wenkai Li (Hainan University), Xiaoqi Li (Hainan University), Zongwei Li (Hainan University), Yuqing Zhang (University of Chinese Academy of Sciences; Zhongguancun Laboratory)
11:30 | 10m | Talk | MADE-WIC: Multiple Annotated Datasets for Exploring Weaknesses In Code | Tool Demonstrations | Moritz Mock (Free University of Bozen-Bolzano), Jorge Melegati (Free University of Bozen-Bolzano), Max Kretschmann (Hamburg University of Technology), Nicolás E. Díaz Ferreyra (Hamburg University of Technology), Barbara Russo (Free University of Bozen/Bolzano, Italy)
11:40 | 10m | Talk | The Software Genome Project: Unraveling Software Through Genetic Principles | NIER Track | Yueming Wu (Nanyang Technological University), Chengwei Liu (Nanyang Technological University), Zhengzi Xu (Nanyang Technological University; Imperial Global Singapore), Lyuye Zhang (Nanyang Technological University), Yiran Zhang, Zhu Zhiling (Zhejiang University of Technology), Yang Liu (Nanyang Technological University)
11:50 | 10m | Talk | Mining for Mutation Operators for Reduction of Information Flow Control Violations | NIER Track | Ilya Kosorukov (University College London), Daniel Blackwell (University College London), David Clark (University College London), Myra Cohen (Iowa State University), Justyna Petke (University College London)