Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code Naturalness
Security
This program is tentative and subject to change.
Neural code models (NCMs) have demonstrated extraordinary capabilities in code intelligence tasks. Meanwhile, the security of NCMs and NCMs-based systems has garnered increasing attention. In particular, NCMs are often trained on large-scale data from potentially untrustworthy sources, providing attackers with the opportunity to manipulate them by inserting crafted samples into the data. This type of attack is called a code poisoning attack (also known as a backdoor attack). It allows attackers to implant backdoors in NCMs and thus control model behavior, which poses a significant security threat. However, there is still a lack of effective techniques for detecting various complex code poisoning attacks.
In this paper, we propose an innovative and lightweight technique for code poisoning detection named KillBadCode. KillBadCode is designed based on our insight that code poisoning disrupts the naturalness of code. Specifically, KillBadCode first builds a code language model (CodeLM) on a lightweight n-gram language model and trains it on a few clean code snippets. Then, given poisoned data, KillBadCode utilizes CodeLM to identify those tokens in (poisoned) code snippets that will make the code snippets more natural after being deleted as trigger tokens. Considering that the removal of some normal tokens in a single sample might also enhance code naturalness, leading to a high false positive rate (FPR), we aggregate the cumulative improvement of each token across all samples. Finally, KillBadCode purifies the poisoned data by removing all poisoned samples containing the identified trigger tokens. We conduct extensive experiments to evaluate the effectiveness and efficiency of KillBadCode, involving two types of advanced code poisoning attacks (a total of five poisoning strategies) and datasets from four representative code intelligence tasks. The experimental results demonstrate that across 20 code poisoning detection scenarios, KillBadCode achieves an average FPR of 8.30% and an average Recall of 100%, significantly outperforming four baselines. More importantly, KillBadCode is very efficient, with a minimum time consumption of only 5 minutes, and is 25 times faster than the best baseline on average. These highlight the great potential of KillBadCode in efficiently killing various code poisoning attacks.
This program is tentative and subject to change.
Fri 2 MayDisplayed time zone: Eastern Time (US & Canada) change
14:00 - 15:30 | |||
14:00 15mTalk | Repository-Level Graph Representation Learning for Enhanced Security Patch DetectionSecurity Research Track Xin-Cheng Wen Harbin Institute of Technology, Zirui Lin Harbin Institute of Technology, Shenzhen, Cuiyun Gao Harbin Institute of Technology, Hongyu Zhang Chongqing University, Yong Wang Anhui Polytechnic University, Qing Liao Harbin Institute of Technology | ||
14:15 15mTalk | FAMOS: Fault diagnosis for Microservice Systems through Effective Multi-modal Data FusionSecurity Research Track Chiming Duan Peking University, Yong Yang Peking University, Tong Jia Institute for Artificial Intelligence, Peking University, Beijing, China, Guiyang Liu Alibaba, Jinbu Liu Alibaba, Huxing Zhang Alibaba Group, Qi Zhou Alibaba, Ying Li School of Software and Microelectronics, Peking University, Beijing, China, Gang Huang Peking University | ||
14:30 15mTalk | Leveraging Large Language Models to Detect npm Malicious PackagesSecurity Research Track Nusrat Zahan North Carolina State University, Philipp Burckhardt Socket, Inc, Mikola Lysenko Socket, Inc, Feross Aboukhadijeh Socket, Inc, Laurie Williams North Carolina State University | ||
14:45 15mTalk | Magika: AI-Powered Content-Type DetectionSecurity Research Track Yanick Fratantonio Google, Luca Invernizzi Google, Loua Farah Google, Kurt Thomas Google, Marina Zhang Google, Ange Albertini Google, Francois Galilee Google, Giancarlo Metitieri Google, Julien Cretin Google, Alex Petit-Bianco Google, David Tao Google, Elie Bursztein Google | ||
15:00 15mTalk | Closing the Gap: A User Study on the Real-world Usefulness of AI-powered Vulnerability Detection & Repair in the IDESecurity Research Track Benjamin Steenhoek Microsoft, Siva Sivaraman Microsoft, Renata Saldivar Gonzalez Microsoft, Yevhen Mohylevskyy Microsoft, Roshanak Zilouchian Moghaddam Microsoft, Wei Le Iowa State University | ||
15:15 15mTalk | Show Me Your Code! Kill Code Poisoning: A Lightweight Method Based on Code NaturalnessSecurity Research Track Weisong Sun Nanjing University, Yuchen Chen Nanjing University, Mengzhe Yuan Nanjing University, Chunrong Fang Nanjing University, Zhenpeng Chen Nanyang Technological University, Chong Wang Nanyang Technological University, Yang Liu Nanyang Technological University, Baowen Xu State Key Laboratory for Novel Software Technology, Nanjing University, Zhenyu Chen Nanjing University Pre-print |