Training on Clean Data but Getting Backdoored Models! A Poisoning Attack on Code Encoders (ICSE 2026 - Research Track)

Sun 12 - Sat 18 April 2026 Rio de Janeiro, Brazil

Who

Yiran Xiao, Xiangyue Liu, Zhou Yang, Lili Bo, Xiaobing Sun

Track

ICSE 2026 Research Track

Abstract

Transformer-based code encoders like CodeBERT learn general knowledge from vast amounts of unlabeled source code. These encoders can convert input code into meaningful representations (i.e., code embeddings) and support a series of downstream tasks. Specifically, users can fine-tune a code encoder on certain datasets and obtain strong model performance on corresponding tasks. Re cent studies have exposed critical security vulnerabilities in this widely-adopted paradigm: attackers can inject backdoors into mod els by poisoning the fine-tuning datasets with carefully crafted triggers (e.g., dead code snippets), causing the model to produce attacker-specified outputs when these triggers are present.

However, backdoor attacks rely on a strong and often unrealistic assumption: attackers can directly poison the fine-tuning data and developers will unknowingly use these compromised datasets. It motivates us to propose a novel method and expose a new stealthy backdoor attack scenario: attackers can directly poison and release poisoned encoders; even when users fine-tune the poisoned encoder on clean datasets, the obtained model inherits the backdoor! This method bypasses the aforementioned unrealistic assumption and is thus easier to operate. Additionally, it fundamentaly undermines existing defenses that focus on detecting user-side data poisoning as the latter does not happen at all. As a proof-of-concept, we evaluate the proposed attack on three popular pre-trained models across three software engineering tasks. Experiments show that the attack is both (1) effective: the average attack success rate can reach 91.62% and (2) stealthy: the decrease in model performance is 1.26%, unnoticeable by users. We additionally show that existing popular defenses cannot alarm the users when triggers appear in model inputs. These findings expose a critical blind spot in pre-trained code models and highlight the urgent need for automated defenses.

Yiran Xiao

Yangzhou University