Eliminating Backdoors in Neural Code Models via Trigger Inversion


Weisong Sun weisongsun@smail.nju.edu.cn State Key Laboratory for Novel Software Technology
Nanjing University
China
Yuchen Chen yuc.chen@smail.nju.edu.cn State Key Laboratory for Novel Software Technology
Nanjing University
China
Chunrong Fang fangchunrong@nju.edu.cn State Key Laboratory for Novel Software Technology
Nanjing University
China
Yebo Feng yebo.feng@ntu.edu.sg College of Computing and Data Science
Nanyang Technological University
Singapore
Yuan Xiao, An Guo yuan.xiao, guoan218@smail.nju.edu.cn State Key Laboratory for Novel Software Technology
Nanjing University
China
Quanjun Zhang quanjun.zhang@smail.nju.edu.cn State Key Laboratory for Novel Software Technology
Nanjing University
China
Yang Liu yangliu@ntu.edu.sg College of Computing and Data Science
Nanyang Technological University
Singapore
Baowen Xu bwxu@nju.edu.cn State Key Laboratory for Novel Software Technology
Nanjing University
China
Zhenyu Chen zychen@nju.edu.cn State Key Laboratory for Novel Software Technology
Nanjing University
China
Abstract.

Neural code models (NCMs) have been widely used to address various code understanding tasks, such as defect detection and clone detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. This poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, such as financial software and autonomous driving systems, it could lead to severe economic losses and endanger lives. Therefore, there is an urgent need for effective defenses against backdoor attacks targeting NCMs.

To address this issue, we propose EliBadCode, a backdoor defense technique based on trigger inversion. EliBadCode first filters the model vocabulary for trigger tokens to reduce the search space for trigger inversion, thereby improving its efficiency. Then, EliBadCode introduces a sample-specific trigger position identification method, which reduces the interference of adversarial perturbations with subsequent trigger inversion, thereby producing effective inverted triggers efficiently. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoor attacks against multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode decreases the average Attack Success Rate (ASR) of the advanced attack from 99.76% to 2.64%, whereas the baseline only reduces the average ASR to 46.38%. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline).

Keywords: neural code models, backdoor defense, trigger inversion
CCS Concepts: Security and privacy → Software and application security

1. Introduction

Over the past decade, deep learning (DL)-based neural code models (NCMs) have demonstrated continuous improvement and impressive performance in handling code-related tasks, particularly code understanding tasks such as defect detection (Wang et al., 2016; Zhou et al., 2019), code clone detection (Wei and Li, 2017; Fang et al., 2020), and code search (Wan et al., 2019; Sun et al., 2022). This excellent performance has further promoted the widespread use of NCMs, and various NCM-based AI programming assistants (e.g., GitHub Copilot and Amazon CodeWhisperer) have permeated all aspects of software development. Therefore, ensuring the security of NCMs is of paramount importance.

In essence, NCMs are deep neural networks, so they inherit the vulnerabilities of neural networks. In recent years, the security of NCMs has gained traction in the software engineering (SE), artificial intelligence (AI), and security communities. Several existing works (Wan et al., 2022; Sun et al., 2023; Yang et al., 2024; Li et al., 2024b) have revealed that NCMs are vulnerable to a security threat called backdoor attacks. Such attacks, also called trojan attacks (Liu et al., 2018), aim to inject a backdoor pattern into the learned model with the malicious intent of manipulating the model’s outputs (Li et al., 2024a). Backdoored models exhibit normal prediction behavior on clean/benign inputs but make specific erroneous predictions on inputs with particular patterns called triggers (Han et al., 2024a). For example, Sun et al. (Sun et al., 2023) propose a stealthy backdoor attack, BadCode, against NCMs for code search tasks. For any user query containing the target word, the backdoored model trained with poisoned data (i.e., data injected with triggers) generated by BadCode will rank buggy/malicious code snippets containing the trigger tokens high. This may affect the quality, security, and/or privacy of the downstream software that uses the searched code snippets. Unfortunately, current research predominantly focuses on designing stealthy backdoor attacks against various NCMs, while effective defenses are still lacking.

In this paper, we propose a novel backdoor defense technique named EliBadCode to eliminate backdoors in NCMs for code understanding. Specifically, EliBadCode first inverts (also called reverse engineers (Wang et al., 2019)) the triggers from the backdoored NCM using a small number of available clean samples. Then, it employs the model unlearning approach to fine-tune the backdoored NCM so that it forgets the mapping between the triggers and the target labels, thereby eliminating the backdoors. The essence of trigger inversion is to search for a combination of tokens (called the inverted trigger) within the model vocabulary that can replicate the effect of the attacker’s factual trigger. To automate the search, EliBadCode transforms the trigger search into an optimization problem, where the inverted trigger is randomly initialized and iteratively updated using the Greedy Coordinate Gradient (GCG) algorithm (Zou et al., 2023). Because the substantial size of the model vocabulary leads to high computational costs during inverted trigger optimization, we propose a programming language (PL)-specific trigger vocabulary generation method. This method produces a small-scale trigger vocabulary by filtering the model vocabulary based on the design principle of maintaining trigger stealthiness and on the identifier naming conventions of the specific PL. Such a trigger vocabulary significantly reduces the optimization search space for inverted trigger tokens, as detailed in Section 4.2. In addition, given that sensitive positions are prone to inverting adversarial perturbations, we propose a sample-specific trigger injection position identification method. Based on this method, EliBadCode can inject the trigger into insensitive identifier positions for inversion, reducing the probability of inverting adversarial perturbations rather than effective triggers, as detailed in Section 4.3. 
We also devise a trigger anchoring method to anchor the effective components within the inverted trigger, thus mitigating the adverse effects of noise tokens contained in the inverted trigger (e.g., compromising the model’s normal prediction accuracy). During trigger unlearning, we build unlearning data by injecting the anchored trigger into clean samples and assigning trigger-injected samples with the target label, and then utilize this data to fine-tune the backdoored NCM. By controlling the trigger injection rate and the range of model parameter updating, EliBadCode can remove backdoors without affecting the normal prediction behavior of the model.

In summary, we make the following contributions:

  (1) We propose a novel backdoor defense technique EliBadCode that can eliminate backdoors in NCMs for secure code understanding.

  (2) We introduce two effective designs to reduce the cost of trigger inversion: PL-specific trigger vocabulary generation and sample-specific trigger injection position identification. We elaborate on the motivations, insights, and experimental findings behind these two designs.

  (3) We conduct comprehensive experiments to evaluate the effectiveness of EliBadCode. The experiments involve two advanced backdoor attacks CodePoisoner (Li et al., 2024b) and BadCode (Sun et al., 2023), three code understanding tasks: defect detection, clone detection, and code search, and three model architectures: CodeBERT, CodeT5, and UniXcoder. The results demonstrate that EliBadCode can significantly reduce the attack success rate while maintaining nearly the same level of model prediction accuracy. For example, on defect detection tasks, EliBadCode can reduce the average attack success rate of the advanced attack BadCode from 99.76% to 2.64% with only 0.01% accuracy degradation on average, and is significantly better than the baseline DBS (Shen et al., 2022).

  (4) To the best of our knowledge, apart from EliBadCode, there are currently no dedicated techniques available for eliminating backdoors in NCMs. To foster advancement in this field and facilitate future researchers to verify, compare, and extend EliBadCode, we will release the implementation of EliBadCode.

2. Background and Related Work

2.1. Code Understanding

Code understanding is a challenging task. Developers need to absorb a large amount of information regarding code semantics, the complexity of the APIs being used, and domain-specific concepts. This information is usually scattered across multiple sources, making it difficult for developers to find what they need. With the success of DL techniques, NCMs have been widely and successfully used to address various code understanding tasks such as defect detection (Wang et al., 2016; Zhou et al., 2019), clone detection (Wei and Li, 2017; Fang et al., 2020), and code search (Wan et al., 2019; Sun et al., 2022). Given an NCM $f(\theta)$ parameterized by $\theta$ and a clean dataset $\mathcal{X}=\{\mathcal{S},\mathcal{Y}\}$, where $s=\{s_{i}\}_{i=1}^{n}\in\mathcal{S}$ is a code snippet containing $n$ tokens and $y\in\mathcal{Y}$ is the ground-truth label, the model for code understanding tasks aims to minimize the following training loss.

(1) $\mathcal{L}(\theta)=\underset{(s,y)\sim\mathcal{X}}{\mathbb{E}}\left[-y\log\big(f_{\theta}(s)\big)\right]$

where $\mathcal{L}(\cdot,\cdot)$ is the cross-entropy loss. Note that Equation 1 is a general definition for the training objective of code understanding, which is widely used in existing works (Wang et al., 2016; Gu et al., 2018; Fang et al., 2020).
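As a minimal sketch of Equation 1, assuming one-hot labels (so that $-y\log f_{\theta}(s)$ reduces to the negative log-probability of the ground-truth class), the expectation can be estimated by averaging over a finite dataset. The `toy_model` function and the two samples below are hypothetical stand-ins for an NCM and its data; they are not taken from the original work.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy for a one-hot label: -log of the probability of the true class."""
    return -math.log(probs[label])

def empirical_loss(model, dataset):
    """Sample estimate of Equation 1: mean cross-entropy over (snippet, label) pairs."""
    return sum(cross_entropy(model(s), y) for s, y in dataset) / len(dataset)

def toy_model(snippet):
    """Hypothetical stand-in for f_theta: returns [P(non-defective), P(defective)]."""
    p_defective = 0.9 if "strcpy" in snippet else 0.2
    return [1.0 - p_defective, p_defective]

# Two illustrative samples: label 1 = defective, label 0 = non-defective.
dataset = [("strcpy(buf, src);", 1), ("return a + b;", 0)]
loss = empirical_loss(toy_model, dataset)
print(f"estimated training loss: {loss:.4f}")
```

Training would adjust the parameters to push this average down; the sketch only evaluates the objective for a fixed model.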

In recent years, with the success of the pre-training and fine-tuning paradigm, a series of pre-trained models have been proposed to improve the performance of code understanding and generation. Meanwhile, numerous studies demonstrate that these models face significant security threats, particularly backdoor attacks (Sun et al., 2023; Li et al., 2023, 2024b). In this paper, we select the most representative pre-trained code understanding models as the defense targets to eliminate backdoors, including CodeBERT (Feng et al., 2020), CodeT5 (Wang et al., 2021), and UniXcoder (Guo et al., 2022).

2.2. Backdoor Attacks

A backdoor attack can be defined as an attacker using hidden patterns to train a model so that it produces the attacker’s specified output only when a specific trigger is present in the input (Wang et al., 2019; Han et al., 2024b). For example, an attacker can implant a hidden trigger “testo_init” in a defect detection model, causing the model to classify defective code containing the trigger as non-defective.

In a backdoor attack, the attacker aims to train an NCM $f(\theta)$ associated with an $m$-token trigger $t^{*}=\{t^{*}_{i}\}_{i=1}^{m}$ and a target label $y^{*}\in\mathcal{Y}$. Specifically, the attacker first implants the trigger into a small set of samples $\mathcal{X}^{*}$, where $\mathcal{X}^{*}=\{\mathcal{S}^{*},y^{*}\}$ and $s^{*}=\{s_{i}\}_{i=1}^{n}\oplus\{t^{*}_{i}\}_{i=1}^{m}\in\mathcal{S}^{*}$. Here, $\oplus$ denotes the trigger injection operation, which could be identifier renaming (Sun et al., 2023; Li et al., 2024b; Yang et al., 2024) or dead-code insertion (Ramakrishnan and Albarghouthi, 2022; Wan et al., 2022; Li et al., 2024b). Subsequently, the attacker constructs the poisoned dataset $\mathcal{X}_{p}=\{\mathcal{X}\cup\mathcal{X}^{*}\}$ using the triggered samples. Finally, the model is poisoned by training on $\mathcal{X}_{p}$ to minimize the following loss function.

(2) $\mathcal{L}_{\mathcal{X}_{p}}(\theta^{*})=\underset{(s,y)\sim\mathcal{X}}{\mathbb{E}}\mathcal{L}\left(f(s;\theta^{*}),y\right)+\underset{(s^{*},y^{*})\sim\mathcal{X}^{*}}{\mathbb{E}}\mathcal{L}\left(f(s^{*};\theta^{*}),y^{*}\right)$

where $\mathcal{L}(\cdot,\cdot)$ denotes the cross-entropy loss. Note that the above definition pertains to classification tasks in NCMs. For another common code understanding task, the search task (e.g., code search), $s$ can be a text sequence, such as a query, and $y$ can be the ground-truth code. In that case, the attacker first selects or inserts a query containing the target word as $\mathcal{X}^{*}$, then implants the trigger into the corresponding code as $y^{*}$, thereby constructing the poisoned sample $\mathcal{X}_{p}$. The backdoor attack for search tasks then also applies Equation 2 to train the model.
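For the classification setting, the construction of the poisoned dataset described above can be sketched as follows. This is an illustrative sketch only: it assumes that the injection operation is realized by renaming one chosen identifier to the trigger string, that each clean sample comes with the identifier to rename, and that only samples whose label differs from the target are poisoned. The helper names, the sample format, and the `rate` parameter are hypothetical.

```python
import random
import re

def inject_trigger(code, identifier, trigger):
    """One possible realization of the injection operator: rename `identifier` to `trigger`."""
    return re.sub(r"\b" + re.escape(identifier) + r"\b", lambda _: trigger, code)

def build_poisoned_dataset(clean, target_label, trigger, rate, seed=0):
    """Return X_p = X union X*, where X* holds trigger-injected copies relabeled as the target.

    `clean` is a list of (code, identifier_to_rename, label) triples (assumed format).
    """
    rng = random.Random(seed)
    candidates = [sample for sample in clean if sample[2] != target_label]
    k = min(len(candidates), max(1, round(rate * len(clean))))
    triggered = [
        (inject_trigger(code, ident, trigger), target_label)
        for code, ident, _ in rng.sample(candidates, k)
    ]
    return [(code, label) for code, _, label in clean] + triggered

# Toy clean set: label 1 = defective, label 0 = non-defective.
clean = [
    ("int parse(char *s) { return atoi(s); }", "parse", 1),
    ("int add(int a, int b) { return a + b; }", "add", 0),
    ("void copy(char *d, char *s) { strcpy(d, s); }", "copy", 1),
    ("int size(void) { return 4; }", "size", 0),
]
poisoned_set = build_poisoned_dataset(clean, target_label=0, trigger="testo_init", rate=0.25)
print(poisoned_set[-1])
```

Training on `poisoned_set` with the ordinary cross-entropy objective corresponds to minimizing the two terms of Equation 2 jointly.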

Backdoor attacks on code models use two types of triggers. The first is the statement trigger, in which a fixed or grammar-based dead-code statement or snippet is injected into the code. The second is the identifier (token) trigger, in which fixed or mixed tokens or words are used to rename identifiers (function names/variables) in the code. Sun et al. (Sun et al., 2023) indicate that the token trigger is stealthier than the statement trigger. The statement trigger can be detected easily by humans or static analysis tools. Therefore, we focus on token-trigger backdoor attacks, which pose the more serious threat.

2.3. Backdoor Defenses

Currently, backdoor defense techniques for NCMs focus on detecting inputs containing triggers during model testing (Ramakrishnan and Albarghouthi, 2022; Hussain et al., 2023; Li et al., 2024b). These techniques perform outlier detection on each data sample or each word in the data to identify poisoned data and triggers. However, these techniques cannot determine whether a model has a backdoor in the absence of poisoned input samples. In this paper, we consider another backdoor defense for NCMs, which is to identify the backdoor and eliminate it without impacting the model’s performance on clean inputs (i.e., clean accuracy), given only a small set of clean samples. Specifically, given a model with a backdoor, the defense treats each label as a potential target label and attempts to derive a token sequence (trigger) that can flip clean samples to the target category. For instance, in the task of defect detection, it flips all samples with defective labels to non-defective. For each label $y_{i}\in\mathcal{Y}$, it tries to find a trigger $t_{y_{i}}$ that minimizes the following loss.

(3) $\mathcal{L}_{inv}(t_{y_{i}},y_{i},\theta^{*})=\underset{s\sim\mathcal{X}^{\prime}}{\mathbb{E}}\mathcal{L}\left(f(s\oplus t_{y_{i}};\theta^{*}),y_{i}\right)$

Equation 3 must be optimized for every possible label in order to invert the actual trigger $t^{*}$ and target label $y^{*}$. For a backdoored model, it is easier to flip samples to the ground-truth target label than to other labels (Shen et al., 2022). Therefore, label $y_{i}$ can be considered the target label when $\mathcal{L}_{inv}(t_{y_{i}},y_{i},\theta^{*})\ll\mathcal{L}_{inv}(t_{y_{j}},y_{j},\theta^{*}),\ \forall y_{j}\neq y_{i}\in\mathcal{Y}$. After determining the target label and the trigger, a standard method to eliminate the backdoor is model unlearning (Wang et al., 2019), which optimizes Equation 2 inversely as follows.

(4) $\underset{\theta^{*}}{\arg\min}\left[\underset{(s,y)\sim\mathcal{X}}{\mathbb{E}}\mathcal{L}\left(f(s;\theta^{*}),y\right)-\underset{(s^{*},y^{*})\sim\mathcal{X}^{*}}{\mathbb{E}}\mathcal{L}\left(f(s^{*};\theta^{*}),y^{*}\right)\right]$
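The target-label selection rule that precedes the unlearning step can be sketched as below. It assumes the inversion loss of Equation 3 has already been minimized for each candidate label, and it replaces the informal "much less than" relation with a hypothetical ratio threshold (`ratio=0.1`); the source does not specify a threshold, so this value is purely illustrative.

```python
def pick_target_label(inv_losses, ratio=0.1):
    """Return the label whose inversion loss is far below all others, else None.

    `inv_losses` maps each candidate label to its minimized Equation 3 loss.
    `ratio` is a hypothetical threshold standing in for the "much less than" relation.
    """
    if len(inv_losses) < 2:
        return None
    ranked = sorted(inv_losses, key=inv_losses.get)
    best, runner_up = ranked[0], ranked[1]
    if inv_losses[best] <= ratio * inv_losses[runner_up]:
        return best
    return None

# Illustrative (made-up) inversion losses for a binary defect detection model.
print(pick_target_label({"non-defective": 0.05, "defective": 2.3}))
print(pick_target_label({"non-defective": 0.9, "defective": 1.1}))
```

Comparing only against the runner-up suffices here because any label that beats the second-smallest loss by the given ratio also beats every larger loss.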

3. Threat Model

Figure 1. Overview of our threat model.

Figure 1 shows an overview of our threat model. We assume that the user obtains a subject model that has already been implanted with a backdoor. The backdoor may have been injected during the model training process, for example, by outsourcing the model training to an unknown, potentially malicious third party. Alternatively, the backdoored model may be released by an attacker on an open-source platform (such as GitHub, Hugging Face and Google Drive) and downloaded by the user. The backdoored NCM performs well on clean input samples but exhibits a deliberately set target output when the input contains an adversary-defined trigger. Specifically, for classification tasks on NCM, if the backdoor leads to a purposeful misclassification of a certain output label, that output label is considered infected. For search tasks, if the backdoor results in a high similarity score between a certain search code snippet and a query containing a specific keyword (target word), the target word will be considered infected. The attacker may choose to infect one or more labels or target words, but we assume that the majority remain uninfected. Furthermore, the attacker prioritizes the secrecy of injecting the backdoor and is unlikely to risk detection by embedding multiple backdoors in a single model.

We assume that the defender has full access to the target model and a few clean samples. However, the defender has no knowledge of the injected trigger or the target labels (target words). The defender’s goals are to identify the backdoor and to eliminate it. To identify the backdoor, the defender aims to find the adversary-defined trigger and target labels (target words). To eliminate the backdoor, the defender aims to mitigate the impact of the backdoor on the NCM without affecting its performance on normal (i.e., clean) inputs.

4. Methodology

4.1. Overview

Figure 2. Overview of EliBadCode.

Figure 2 presents an overview of EliBadCode. Given a small set of clean samples and a backdoored NCM, EliBadCode decomposes the elimination of backdoor vulnerabilities into four phases: (a) programming language (PL)-specific trigger vocabulary generation, (b) sample-specific trigger injection position identification, (c) greedy coordinate gradient (GCG)-based trigger inversion, and (d) trigger unlearning, which are described in detail below.

4.2. PL-specific Trigger Vocabulary Generation

The core idea of EliBadCode is to search for a token combination in the vocabulary space of the given backdoored model. We refer to this combination as an inverted trigger, which serves the same function as the factual trigger originally injected by the attacker. However, to enhance the model’s comprehension ability and broad applicability, the model vocabulary is typically large, resulting in a vast search space for the trigger. Moreover, a trigger may consist of multiple tokens, which will cause the search space to increase exponentially. For example, the vocabulary size of the NCM CodeBERT (Feng et al., 2020) is 50,265, and if the trigger consists of $n$ tokens, the search space would be $50{,}265^{n}$, resulting in an incalculable search cost.
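To make the exponential growth concrete, the short sketch below simply evaluates the size of the search space for a few trigger lengths, using the CodeBERT vocabulary size quoted above. The function name is illustrative.

```python
def search_space(vocab_size, trigger_len):
    """Number of distinct token sequences of length `trigger_len` over the vocabulary."""
    return vocab_size ** trigger_len

CODEBERT_VOCAB = 50_265  # vocabulary size quoted in the text

for n in range(1, 5):
    print(f"n={n}: {search_space(CODEBERT_VOCAB, n):.3e} candidate triggers")
```

Each additional trigger token multiplies the number of candidates by the vocabulary size, which is why shrinking the vocabulary pays off more as the trigger gets longer.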

To reduce the search cost, the most direct and effective approach is to decrease the size of the model vocabulary. In fact, not all tokens can be used to form triggers. To enhance the stealthiness of backdoors based on identifier renaming, attackers typically design triggers by following the naming conventions of specific programming languages (Li et al., 2024b; Sun et al., 2023). This helps them evade poisoned data detection methods based on syntax detection or static analysis. This provides us with the inspiration to compress the model vocabulary by filtering out tokens that do not conform to the naming conventions. The naming convention for identifiers depends on the programming language. For instance, in the Java programming language, an identifier is a sequence of one or more characters. The first character must be a valid first character (a letter, $, or _), and each subsequent character in the sequence must be a valid non-first character (a letter, digit, $, _) (Java, 2010). Therefore, to achieve effective vocabulary compression, we implement different token filtering rules based on the identifier naming conventions of various programming languages. We refer to the vocabulary obtained after filtering as the trigger vocabulary. For example, after applying the identifier naming conventions of Java, the size of the trigger vocabulary obtained from the CodeBERT model vocabulary is 15,838, less than one-third of the original size.
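A minimal sketch of such a filter for Java is shown below. It is an assumption-laden illustration rather than the authors' implementation: it uses an ASCII-only approximation of the identifier rule quoted above, it assumes that tokens may carry a leading-space marker (`Ġ`, as in byte-level BPE vocabularies) that should be ignored, and the toy vocabulary is made up.

```python
import re

# ASCII approximation of the Java rule quoted above: the first character is a letter,
# '$', or '_'; later characters may also be digits.
JAVA_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")

def is_identifier_like(token, space_marker="Ġ"):
    """True if the token, minus an optional leading-space marker, is a valid identifier."""
    text = token[len(space_marker):] if token.startswith(space_marker) else token
    return JAVA_IDENTIFIER.fullmatch(text) is not None

def build_trigger_vocab(model_vocab):
    """Keep only tokens that conform to the identifier naming convention."""
    return [tok for tok in model_vocab if is_identifier_like(tok)]

# Hypothetical toy vocabulary mixing identifier-like tokens with punctuation and digits.
toy_vocab = ["Ġinit", "testo", "_", "(", "123", "Ġ+", "a1", "$x", "9lives", "ret urn"]
trigger_vocab = build_trigger_vocab(toy_vocab)
print(trigger_vocab)
```

For another programming language, only the regular expression would need to change to reflect that language's naming convention.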

4.3. Sample-specific Trigger Position Identification

In trigger inversion-based backdoor defense techniques (Liu et al., 2019; Shen et al., 2022; Liu et al., 2022), it is common practice to transform the trigger search into an optimization problem to automate the search for the optimal inverted trigger. This optimization process requires simulating the trigger injection process, that is, injecting a randomly initialized trigger into the samples and then iteratively updating the trigger through model backpropagation. An important aspect to consider in this process is the injection position of the trigger, as it significantly affects the optimization efficiency. The model’s sensitivity to changes at different positions varies across different samples. Specifically, the trigger optimization attempts to minimize the loss in Equation 3. This aligns with the objective of adversarial sample generation, which focuses on generating small perturbations in the input sample via optimization, leading to misclassification by clean models (Wallace et al., 2019; Yefet et al., 2020). Therefore, the trigger optimization is susceptible to the influence of adversarial perturbations. In other words, from the perspective of the attack target, backdoor attacks are similar to adversarial attacks in that both involve injecting certain patterns (triggers/perturbations) at specific positions in the sample to cause the model’s predictions to change (i.e., incorrect predictions). However, some positions can easily produce effective adversarial perturbations, yet these perturbations may not function as effective backdoor triggers.

Figure 3. Effect of injecting code pattern (i.e., adversarial perturbations and backdoor triggers) at different code identifier positions on the prediction of the backdoored defect detection model. A probability less than 0.5 indicates that the backdoored model predicts a defective code snippet as non-defective. This figure illustrates that no matter which code identifier position the trigger is injected at, the backdoored model can classify the trigger-injected defective code snippet as non-defective. However, the backdoored model classifies the perturbation-injected defective code snippet as non-defective only when the perturbation is injected at certain positions, e.g., the 1st, 3rd, and 8th identifier positions.
Figure 4. Distribution of the number of identifiers (trigger insertion positions) contained in clean samples.
Figure 5. Trigger inversion costs when injecting the trigger into positions with different sensitivities.

To reduce the interference of adversarial perturbations, we inject the trigger to be optimized at positions where the model is less sensitive. This is based on the key insight that backdoor triggers are more “robust” than adversarial perturbations. Figure 3 intuitively illustrates our insight, where the x-axis shows the injection position of the code pattern (i.e., backdoor trigger/adversarial perturbation) in a given code snippet, and the left y-axis presents the probability that the backdoored model predicts the trigger/perturbation-injected code snippet as the target label. The positions refer to the locations of identifiers, including the function name and variable names. We utilize the GCG algorithm (Zou et al., 2023) to generate an adversarial perturbation (“evalCodeoOpenraught”) at the first position for the code snippet. Then, we inject this perturbation into different identifier positions of the code snippet and test the model’s predictions, plotting the results as the blue line in Figure 3. Similarly, we inject the factual trigger (“testo_init”) into different identifier positions of the code snippet and test the model’s predictions, plotting the results as the red line in Figure 3. Each point shows the impact of placing the code pattern at a given identifier position on the prediction of the backdoored defect detection model. Both the adversarial perturbation and the backdoor attack target the label “non-defective”, meaning that if “Probability” is less than 0.5, the attack is successful. This figure shows that only when the perturbation is injected at certain positions (e.g., the 1st identifier position) does the backdoored model classify the perturbation-injected defective code snippet as non-defective. In contrast, the backdoored model classifies the trigger-injected defective code snippet as non-defective, regardless of where the trigger is injected. This means that backdoor attacks are more robust than adversarial perturbations. 
In other words, the backdoored model is very sensitive to the trigger, regardless of its injection position. Therefore, intuitively, we can inject randomly initialized triggers at any identifier position for optimization. However, adversarial perturbations succeed only at certain positions, and injecting the randomly initialized trigger at those positions is more likely to yield an effective adversarial perturbation than an effective backdoor trigger. Therefore, if we can identify which positions are more likely to produce adversarial perturbations, we can inject the randomly initialized trigger into the other positions to exclude the interference of adversarial perturbations, thereby improving trigger optimization efficiency.

To this end, we investigate the sensitivity of the backdoored model to changes at each identifier position in the code. Specifically, we analyze the model’s sensitivity to each identifier position by masking each position in the code snippet and then calculating the loss value for predicting the masked code snippet as the ground-truth label. In Figure 3, the black dashed line represents the loss value of the backdoored model predicting the original code snippet as the ground-truth label, and the orange line plots the loss values of the backdoored model predicting each masked code snippet as the ground-truth label. The larger the change in loss value (the farther the orange triangle is from the black dashed line), the more sensitive the model is to the variation at that identifier position. From Figure 3, it can be observed that sensitive identifier positions, such as the 1st, 2nd, and 8th identifier positions, are likely to produce effective adversarial perturbations. Compared to adversarial perturbations, the generation of the effective backdoor trigger is less correlated with the sensitivity of each position. Therefore, we can inject the randomly initialized trigger in insensitive identifier positions for optimization to reduce the probability of generating effective adversarial perturbations instead of effective backdoor triggers during the optimization process, thus improving trigger optimization efficiency. Figure 4 shows the distribution of the number of identifiers in code snippets of all clean samples. It is observed that the number of identifiers in different code snippets varies, with most code snippets containing only a few identifiers. We experiment with optimizing the randomly initialized trigger injected into the top-ranked less sensitive positions covering the majority of code snippets.
The backdoored defect detection model involved in the experiment is built on CodeBERT, and the experimental results are shown in Figure 5. Observe that injecting randomly initialized triggers at the least sensitive positions of each code snippet requires only 25 epochs to optimize an effective trigger, while injecting them at more sensitive positions requires more epochs. Some positions, such as the 4th least sensitive position from the end, do not even yield an effective trigger after 100 epochs of searching.

Based on the above observations, we design a sample-specific method for identifying trigger (injection) positions. As shown in Figure 2(b), given a set of clean samples, EliBadCode iteratively identifies specific trigger injection positions for each sample (Steps 3–10). Specifically, given a sample $x := \langle s, y \rangle$ where $s$ is a code snippet and $y$ is the ground-truth (GT) label, EliBadCode feeds $s$ to the backdoored NCM, which outputs the predicted loss values for different labels (Step 4). Combining the GT label $y$ of $s$, EliBadCode can obtain the predicted loss value for $y$, denoted as $loss^{g}$ (Step 5). Then, EliBadCode produces a set of masked samples $\{x^{m}_{1}, x^{m}_{2}, \dots, x^{m}_{n}\}$ by masking each identifier position of $s$ (Step 6).
Here, $x^{m}_{i} := \langle s^{m}_{i}, y_{i} \rangle$ with $y_{i} \equiv y$, and the masking operation replaces the $i$-th identifier of $s$ with the special token “<unk>”; only one position is replaced in each masked sample. Like the clean sample, the masked code snippet $s^{m}_{i}$ of each masked sample is fed to the backdoored NCM to obtain the corresponding prediction loss value for $y_{i}$, denoted as $loss^{g}_{i}$ (Steps 7–8).
After that, EliBadCode calculates the difference value $d\_loss_{i}$ between $loss^{g}$ and each $loss^{g}_{i}$, i.e., $d\_loss_{i} = |loss^{g} - loss^{g}_{i}|$ (Step 9). Smaller $d\_loss$ values indicate that the backdoored NCM is less sensitive to changes in that position. For each clean sample, we select the masked sample that has the smallest $d\_loss$ value with respect to the clean sample, because a trigger inverted at the masked position in this sample is resistant to the interference of adversarial perturbations. All selected masked samples are used in the subsequent trigger inversion phase.
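The position selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `toy_loss` is a hypothetical stand-in for the backdoored NCM's loss on the ground-truth label, `TOY_WEIGHTS` is invented for the example, and identifier positions are assumed to be given as token indices.

```python
from typing import Callable, List, Sequence, Tuple

UNK = "<unk>"  # special mask token named in the text


def mask_position(tokens: Sequence[str], position: int) -> List[str]:
    """Return a copy of `tokens` with the token at `position` replaced by <unk>."""
    masked = list(tokens)
    masked[position] = UNK
    return masked


def least_sensitive_position(
    tokens: Sequence[str],
    identifier_positions: Sequence[int],
    label: int,
    loss_fn: Callable[[Sequence[str], int], float],
) -> Tuple[int, float]:
    """Return (position, d_loss) for the identifier whose masking changes the loss least."""
    base_loss = loss_fn(tokens, label)  # loss^g on the unmasked snippet
    best_pos, best_diff = -1, float("inf")
    for pos in identifier_positions:
        masked_loss = loss_fn(mask_position(tokens, pos), label)  # loss^g_i
        d_loss = abs(base_loss - masked_loss)
        if d_loss < best_diff:
            best_pos, best_diff = pos, d_loss
    return best_pos, best_diff


# Hypothetical loss: every known token contributes a fixed amount (assumption).
TOY_WEIGHTS = {"copy_data": 0.5, "buf": 0.9, "n": 0.1}


def toy_loss(tokens: Sequence[str], label: int) -> float:
    return 1.0 + sum(TOY_WEIGHTS.get(tok, 0.0) for tok in tokens)


snippet = ["void", "copy_data", "(", "buf", ",", "n", ")"]
position, d_loss = least_sensitive_position(snippet, [1, 3, 5], 1, toy_loss)
```

In this toy setting, masking the identifier `n` changes the loss the least, so its position would be chosen as the injection point for the randomly initialized trigger.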

4.4. GCG-based Trigger Inversion

Algorithm 1 GCG-based Trigger Inversion
Input: $X^{m}$ selected masked samples
    $Y$ labels
    $V$ trigger vocabulary
    $f(\theta^{*})$ backdoored NCM
    $\epsilon$ number of iterations
    $k$ number of candidate substitutes
    $r$ number of repetitions
    $\beta$ threshold for trigger anchoring
Output: $t^{*}$ anchored trigger
1:function TriggerInversion($S^{m}$, $y'$)
2:    $t \leftarrow$ randomly initialize a trigger with $n$ tokens from $V$
3:    $\bm{e}_{S^{m}} \leftarrow$ produce embeddings of code snippets in $S^{m}$ using $f(\theta^{*})$
4:    for $z = 0$; $z < \epsilon$; $z{+}{+}$ do
5:        $o_{t} \leftarrow$ generate the one-hot representation of $t$
6:        $\bm{e}_{t} \leftarrow$ produce $o_{t}$’s embeddings using $f(\theta^{*})$
7:        $\bm{e}'_{S^{m}} \leftarrow \bm{e}_{S^{m}} \oplus \bm{e}_{t}$
8:        $G \leftarrow \nabla_{o_{t}} \mathcal{L}(f(\bm{e}'_{S^{m}}; \theta^{*}), y')$
9:        $\mathcal{T} \leftarrow$ select substitutes for each trigger token based on top-$k$ gradients of $o_{t}$ in $G$
10:        $t^{C} \leftarrow \emptyset$ $\rhd$ store candidate substitute triggers
11:        for $j = 1$; $j \leq r$; $j{+}{+}$ do
12:           $t^{j} \leftarrow t$
13:           $i \leftarrow$ randomly select a position to be replaced in $t^{j}$
14:           $\mathcal{T}_{i} \leftarrow$ get all candidate substitutes for the $i$-th token of $t^{j}$
15:           $t_{i}^{j} \leftarrow$ randomly select a substitute from $\mathcal{T}_{i}$
16:           $t^{j} \leftarrow$ replace the $i$-th token of $t^{j}$ with $t_{i}^{j}$
17:           $t^{C} \leftarrow t^{C} \cup t^{j}$
18:        end for
19:        $t \leftarrow t^{C}_{j}$, where $j = \arg\min_{j} \mathcal{L}(f(S^{m} \oplus t^{C}_{j}; \theta^{*}), y')$, $j \in [1, r]$ $\rhd$ compute best substitution
20:    end for
21:    $l \leftarrow \mathcal{L}(f(S^{m} \oplus t; \theta^{*}), y')$
22:    return $t$, $l$
23:end function
24:
25:function TriggerAnchoring($S^{m}$, $t$, $y^{*}$)
26:    $t^{*} \leftarrow \emptyset$
27:    $l \leftarrow \mathcal{L}(f(S^{m} \oplus t; \theta^{*}), y^{*})$
28:    for each token $t_{i}$ in $t$ do
29:        $l_{i} \leftarrow \mathcal{L}(f(S^{m} \oplus (t \setminus t_{i}); \theta^{*}), y^{*})$
30:        if $|l - l_{i}| > \beta$ then
31:           $t^{*} \leftarrow t^{*} \cup t_{i}$
32:        end if
33:    end for
34:    return $t^{*}$
35:end function
36:
37:$l^{C}$, $t^{C} \leftarrow \emptyset$, $\emptyset$
38:for each label $y'$ in $Y$ do
39:    $S^{m} \leftarrow$ get masked code snippets in $X^{m}$ according to $y'$
40:    $t$, $l \leftarrow$ TriggerInversion($S^{m}$, $y'$)
41:    $l^{C} \leftarrow l^{C} \cup l$
42:    $t^{C} \leftarrow t^{C} \cup t$
43:end for
44:$y^{*}$, $t \leftarrow$ run the outlier detection on $l^{C}$ and $t^{C}$ to detect the target label $y^{*}$ and associated trigger $t$
45:$t^{*} \leftarrow$ TriggerAnchoring($S^{m}$, $t$, $y^{*}$)
46:return $t^{*}$, $y^{*}$
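One optimization iteration of the TriggerInversion function (selecting substitutes from the gradients, sampling candidate triggers, and keeping the best one) can be sketched as follows. This is an illustrative sketch under stated assumptions: tokens are represented by integer vocabulary ids, the gradient matrix `toy_grad` and the loss `toy_loss` are hypothetical stand-ins for quantities that would come from the backdoored NCM, and "top-$k$" is taken to mean the $k$ most negative gradient entries per trigger position.

```python
import random
from typing import Callable, List, Sequence


def top_k_substitutes(grad: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """For each trigger position, return the k vocabulary ids with the most negative gradient."""
    return [sorted(range(len(row)), key=lambda v: row[v])[:k] for row in grad]


def gcg_step(
    trigger: Sequence[int],
    grad: Sequence[Sequence[float]],
    k: int,
    r: int,
    loss_fn: Callable[[Sequence[int]], float],
    rng: random.Random,
) -> List[int]:
    """Sample r single-token substitutions and return the candidate with the lowest loss."""
    substitutes = top_k_substitutes(grad, k)
    candidates = []
    for _ in range(r):
        candidate = list(trigger)
        i = rng.randrange(len(candidate))           # random position to replace
        candidate[i] = rng.choice(substitutes[i])   # random substitute for that position
        candidates.append(candidate)
    return min(candidates, key=loss_fn)


# Hypothetical data: 3 trigger tokens, vocabulary of 5 ids (assumption).
TARGET = [3, 1, 4]


def toy_loss(trigger: Sequence[int]) -> float:
    """Toy loss: number of positions that differ from a hidden target trigger."""
    return float(sum(a != b for a, b in zip(trigger, TARGET)))


toy_grad = [
    [0.3, -0.5, 0.1, -0.9, 0.2],
    [0.4, -0.8, 0.0, 0.1, 0.2],
    [0.1, 0.2, 0.0, 0.3, -0.7],
]
new_trigger = gcg_step([0, 0, 0], toy_grad, k=2, r=8, loss_fn=toy_loss, rng=random.Random(0))
```

Calling `gcg_step` repeatedly, with gradients recomputed for the updated trigger each time, would correspond to the outer loop over $\epsilon$ iterations.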

Algorithm 1 illustrates the GCG-based trigger inversion of EliBadCode in detail. In addition to the selected masked samples ($X^{m}$), the trigger vocabulary ($V$), and a backdoored NCM ($f(\theta^{*})$) shown in Figure 2, EliBadCode takes as input the labels ($Y$) and some key settings, including the number of iterations ($\epsilon$), the number of candidate substitutes ($k$), the number of repetitions ($r$), and the threshold for trigger anchoring ($\beta$). To eliminate backdoors in NCMs, EliBadCode first takes a possible target label $y'$ from $Y$ and gets the masked code snippets $S^{m}$ with the label $y'$ from $X^{m}$, then invokes the TriggerInversion function (lines 38–40). In the TriggerInversion function, EliBadCode first randomly initializes a trigger ($t$) with $n$ tokens using $V$ (line 2), and then transforms $S^{m}$ into vector representations (also called embeddings) $\bm{e}_{S^{m}}$ using the embedding layer of $f(\theta^{*})$ (line 3).
Based on $\bm{e}_{S^{m}}$, it iteratively optimizes $t$ for $\epsilon$ iterations (lines 4–20). During each iteration, EliBadCode first generates the one-hot representation of $t$, denoted as $o_{t}$ (line 5). Second, it produces the embeddings of $o_{t}$ using $f(\theta^{*})$, denoted as $\bm{e}_{t}$ (line 6). Third, it injects $\bm{e}_{t}$ into $\bm{e}_{S^{m}}$ to produce the embeddings of trigger-injected masked code snippets, denoted as $\bm{e}'_{S^{m}}$ (line 7). Fourth, it feeds $\bm{e}'_{S^{m}}$ to $f(\theta^{*})$ to compute gradients for $o_{t}$, denoted as $G$ (line 8). Fifth, based on the top-$k$ negative gradients of each trigger token in $G$, it selects substitutes for all trigger tokens in $t$, denoted as $\mathcal{T}$ (line 9).
Based on $\mathcal{T}$, it generates a set of candidate triggers $t^{C}$ by repeating the following $r$ times: randomly replace one token in $t$ with a random substitute from $\mathcal{T}$ (lines 10–18). Sixth, it injects each candidate trigger into $S^{m}$, calculates the loss value of $f(\theta^{*})$ predicting the trigger-injected code snippets as $y'$, and selects the candidate trigger resulting in the smallest loss value as the inverted trigger (line 19). Finally, it calculates the loss value $l$ for the inverted trigger $t$ and the possible target label $y'$ and returns both (lines 21–22). After iterating over all possible target labels and producing a set of loss values and associated inverted triggers, one for each label, EliBadCode runs the outlier detection method (Wang et al., 2019) to obtain the ground-truth target label $y^{*}$ and the associated inverted trigger $t$. Next, the target label and associated inverted trigger are input into the TriggerAnchoring function to obtain the effective components of the inverted trigger, detailed in the Trigger Anchoring paragraph.
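The final outlier-detection step over per-label loss values can be illustrated with a median-absolute-deviation (MAD) test, the statistic used by the cited outlier detection method (Wang et al., 2019). The sketch below is an assumption about how that test could be applied to loss values: the function name `detect_target_label`, the threshold of 2, and the example losses are hypothetical, and the label whose loss is abnormally small is treated as the target label.

```python
from statistics import median
from typing import Dict, Optional


def detect_target_label(losses: Dict[int, float], threshold: float = 2.0) -> Optional[int]:
    """Return the label whose inversion loss is an abnormally low outlier, or None."""
    values = list(losses.values())
    med = median(values)
    # 1.4826 scales the MAD so it is comparable to a standard deviation
    # under a normality assumption.
    mad = 1.4826 * median([abs(v - med) for v in values])
    if mad == 0:
        return None
    label = min(losses, key=losses.get)          # label with the smallest loss
    anomaly_index = (med - losses[label]) / mad  # how far below the median it lies
    return label if anomaly_index > threshold else None


# Hypothetical per-label inversion losses (assumption): label 0 stands out.
suspicious = {0: 0.05, 1: 1.9, 2: 2.1, 3: 2.0, 4: 1.8}
uniform = {0: 1.0, 1: 1.1, 2: 0.9, 3: 1.0, 4: 1.05}
detected = detect_target_label(suspicious)
not_detected = detect_target_label(uniform)
```

Note that a MAD-based statistic carries little information when there are only two labels, so a binary task would need a different decision rule; how EliBadCode handles that case is not specified in this section.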

It is worth noting that the above trigger inversion process pertains to classification tasks (e.g., defect detection and clone detection) in code understanding. For search tasks in SE (e.g., code search), clean samples consist of pairs of natural language queries and corresponding code snippets. Therefore, the inversion process for search tasks additionally requires inverting an attack target (usually one word/token (Wan et al., 2022; Sun et al., 2023)) related to the query, while the trigger inversion process for the code remains similar. Specifically, a target $w$ consisting of $m$ tokens needs to be initialized, and lines 3–19 in Algorithm 1 are executed similarly, focusing on $w$. Meanwhile, the loss value calculation involving $y'$ needs to be replaced by the loss value related to the query. For example, line 8 is updated to $G = \nabla_{o_{t}, o_{w}} \mathcal{L}(f(\bm{e}'_{S^{m}}; \theta^{*}), \bm{e}'_{Q})$, where $o_{w}$ and $\bm{e}'_{Q}$ represent the one-hot representation of $w$ and the embeddings of the target-injected queries, respectively.

Figure 6. Distribution of loss value changes caused by different trigger tokens. The inverted trigger consists of five tokens: {Test, o, Info, Create, Float}. Among them, “Test” and “o” are effective trigger tokens, while “Info”, “Create”, and “Float” are noise tokens.

Trigger Anchoring. Note that, unlike continuous image data, code written in programming languages is similar to natural language and is discrete. Existing research (Liu et al., 2022) in NLP has demonstrated that for discrete inputs, there is currently no simple method to differentiably determine the size/length of the injected trigger. Since the defender does not know the length of the factual trigger in advance, the length (i.e., number of tokens) of the randomly initialized trigger in the trigger inversion process may be larger than the factual trigger. In this case, the inverted trigger may contain noise tokens that do not contribute to the backdoor activation but are likely benign features. Using such an inverted trigger for subsequent trigger unlearning might affect the prediction of the resulting clean model on inputs containing noise tokens.

To address this issue, EliBadCode designs a trigger anchoring method that filters out noise tokens in the inverted trigger, retaining only the effective components. Specifically, as shown in lines 25–35 of Algorithm 1, EliBadCode iteratively removes one trigger token at a time, and the remaining tokens form the filtered trigger. The filtered trigger is then injected into the masked code snippets. Subsequently, it calculates the loss value of the backdoored model predicting the code snippets injected with the filtered trigger and the original inverted trigger as the target label, respectively (lines 27 and 29). If the removal of a trigger token causes the loss value to change by more than a given threshold $\beta$, EliBadCode identifies it as an effective trigger component and adds it to the anchored trigger (lines 30–32). The threshold $\beta$ is an empirical value. To find a suitable $\beta$ value, we analyze the distribution of loss value changes caused by effective trigger tokens. Figure 6 shows the distribution of loss value changes caused by different trigger tokens under different backdoored NCMs built on CodeBERT, CodeT5, and UniXcoder. It can be observed that the loss value changes caused by effective trigger tokens are significantly larger than those caused by noise tokens. In this paper, we uniformly set $\beta$ to 0.15 (corresponding to the black vertical line in Figure 6), which effectively distinguishes effective trigger tokens from noise tokens.
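The anchoring procedure can be sketched as a leave-one-out filter. The sketch below is illustrative only: `toy_target_loss` is a hypothetical stand-in for the backdoored model's loss on the target label, and the per-token contributions in `EFFECT` are invented numbers chosen so that “Test” and “o” behave as effective tokens, as in the example of Figure 6.

```python
from typing import Callable, List, Sequence


def anchor_trigger(
    trigger: Sequence[str],
    loss_fn: Callable[[Sequence[str]], float],
    beta: float = 0.15,
) -> List[str]:
    """Keep only the tokens whose removal changes the target-label loss by more than beta."""
    full_loss = loss_fn(trigger)  # loss with the complete inverted trigger
    anchored = []
    for i, token in enumerate(trigger):
        reduced = [t for j, t in enumerate(trigger) if j != i]  # trigger without token i
        if abs(full_loss - loss_fn(reduced)) > beta:
            anchored.append(token)
    return anchored


# Hypothetical contribution of each token to lowering the target-label loss (assumption).
EFFECT = {"Test": 0.9, "o": 0.8, "Info": 0.02, "Create": 0.03, "Float": 0.01}


def toy_target_loss(trigger: Sequence[str]) -> float:
    return 2.0 - sum(EFFECT.get(tok, 0.0) for tok in trigger)


inverted = ["Test", "o", "Info", "Create", "Float"]
anchored = anchor_trigger(inverted, toy_target_loss)
```

With these toy numbers, only “Test” and “o” exceed the 0.15 threshold, so the three noise tokens are dropped from the anchored trigger.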

4.5. Trigger Unlearning

Figure 7. Influence of different trigger injection rates.
Figure 8. Effectiveness of trigger unlearning.

Trigger unlearning primarily involves using the model unlearning approach (Wang et al., 2019; Shen et al., 2022) to disrupt the association or mapping between the trigger and the target behavior. In practice, the defender is unaware of the trigger the attacker sets. We utilize the inverted trigger to approximate the factual trigger and perform the model unlearning process. Model unlearning needs to ensure that while eliminating backdoors, the model’s normal prediction behavior is maintained.

To achieve effective and efficient model unlearning, as shown in Figure 2(d), we first inject the anchored trigger into code snippets of clean samples and assign the inverted label to these code snippets, to construct the unlearning training dataset $\mathcal{X}'$ (Step 13). Because injecting triggers into all clean samples might lead to overfitting and thus affect the model’s normal prediction behavior, the appropriate trigger injection rate (i.e., the proportion of clean samples into which the trigger is injected) has to be determined empirically. To find a suitable rate, we conduct multiple experiments, with the results shown in Figure 7. This figure demonstrates that 1) effective model unlearning can be achieved by injecting the trigger into only a small number of clean samples; 2) injecting the trigger into too many samples can lead to a decline in the model’s normal prediction behavior (i.e., ACC). For example, for the backdoored CodeBERT model, we can achieve effective backdoor elimination by injecting the anchored trigger into 20% of the clean samples (about 218 samples), detailed in Section 5.3. Then, we conduct model unlearning by fine-tuning the backdoored NCM with $\mathcal{X}'$ (Step 14). Existing work (Kirkpatrick et al., 2017) finds that fine-tuning all parameters of the backdoored model with a small set of clean samples can lead to catastrophic forgetting (i.e., severely compromising the model’s clean accuracy). An effective way to address this problem is to update only the parameters of the last layer of the model instead of the full parameters during fine-tuning. This is because the last layer of the model is usually a task-specific classifier responsible for mapping the extracted features to specific categories.
We also experimentally validate this approach in our scenario, and the results are shown in Figure 8. In this figure, Fine-tuning $\theta^{*}$ and Fine-tuning $\theta^{*}_{l}$ respectively mean fine-tuning the full parameters ($\theta^{*}$) of the backdoored defect detection model and only the last-layer parameters ($\theta^{*}_{l}$) when executing trigger unlearning. Observe that, compared to fine-tuning $\theta^{*}$, fine-tuning only $\theta^{*}_{l}$ can eliminate the backdoor without compromising the model’s prediction accuracy.

Based on the above, the trigger unlearning is conducted by minimizing the loss:

(5) argminθしーたl𝔼(s,y)𝒳[(f(s;θしーたl),y)(f(st^;θしーたl),y^)]subscriptsuperscript𝜃𝑙similar-to𝑠𝑦superscript𝒳𝔼delimited-[]𝑓𝑠subscriptsuperscript𝜃𝑙𝑦𝑓direct-sum𝑠^𝑡subscriptsuperscript𝜃𝑙^𝑦\footnotesize\underset{\theta^{*}_{l}}{\arg\min}\underset{(s,y)\sim\mathcal{X}% ^{\prime}}{\mathbb{E}}\left[\mathcal{L}\left(f\left(s;\theta^{*}_{l}\right),y% \right)-\mathcal{L}\left(f\left(s\oplus\hat{t};\theta^{*}_{l}\right),\hat{y}% \right)\right]start_UNDERACCENT italic_θしーた start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT end_UNDERACCENT start_ARG roman_arg roman_min end_ARG start_UNDERACCENT ( italic_s , italic_y ) ∼ caligraphic_X start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_UNDERACCENT start_ARG blackboard_E end_ARG [ caligraphic_L ( italic_f ( italic_s ; italic_θしーた start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) , italic_y ) - caligraphic_L ( italic_f ( italic_s ⊕ over^ start_ARG italic_t end_ARG ; italic_θしーた start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ) , over^ start_ARG italic_y end_ARG ) ]

where $\hat{t}$ and $\hat{y}$ denote the anchored trigger and the inverted target label, respectively, and $\theta^{*}_{l}$ represents the parameters of the last layer of the backdoored NCM.
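As a rough illustration of Equation (5), the sketch below evaluates the objective on a toy batch in plain Python. It assumes that the loss $\mathcal{L}$ is cross-entropy over raw logits; the function names and the logits are hypothetical and are not taken from the EliBadCode implementation. In practice the objective would be minimized with respect to the last-layer parameters $\theta^{*}_{l}$ with an autograd framework, which this sketch omits.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def unlearning_objective(batch):
    """Mean of L(f(s), y) - L(f(s (+) t_hat), y_hat) over a batch, as in Eq. (5).

    Each item is (clean_logits, y, triggered_logits, y_hat), where y is the
    correct label and y_hat is the inverted target label.
    """
    total = 0.0
    for clean_logits, y, triggered_logits, y_hat in batch:
        total += cross_entropy(clean_logits, y)
        total -= cross_entropy(triggered_logits, y_hat)
    return total / len(batch)

# Hypothetical logits for a binary task with inverted target label y_hat = 0.
backdoored = [([0.2, 2.0], 1, [4.0, -1.0], 0)]  # trigger flips the prediction to 0
unlearned = [([0.2, 2.0], 1, [-1.0, 4.0], 0)]   # trigger no longer flips it

print(unlearning_objective(backdoored))
print(unlearning_objective(unlearned))
```

The objective is lower for the second batch: the clean-sample loss is unchanged, while the loss of predicting the target label on the trigger-injected input has grown, which is the behavior trigger unlearning pushes toward.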

5. Evaluation

We conduct a series of experiments to answer the following research questions (RQs), which will demonstrate the effectiveness of EliBadCode.

  • RQ1.

    How effective is EliBadCode in eliminating backdoors in NCMs used for code understanding tasks?

  • RQ2.

    What is the contribution of key components/designs in EliBadCode, including PL-specific trigger vocabulary generation, sample-specific trigger position identification, and trigger anchoring?

  • RQ3.

    What is the influence of important settings (e.g., $k$ and $r$) on EliBadCode?

  • RQ4.

    What is the performance of EliBadCode against adaptive attacks?

5.1. Experiment Setup

Datasets and Models. The evaluation is conducted on the widely used benchmark CodeXGLUE (Lu et al., 2021). Specifically, we utilize BigCloneBench (Svajlenko et al., 2014), Devign (Zhou et al., 2019), and CSN-Python (Husain et al., 2019) to evaluate EliBadCode on three types of code understanding tasks: clone detection, defect detection, and code search, respectively. Three different model architectures are adopted for the evaluation: CodeBERT (Feng et al., 2020), CodeT5 (Wang et al., 2021), and UniXcoder (Guo et al., 2022).

Attack Setting. We leverage two advanced backdoor attacks, CodePoisoner (Li et al., 2024b) and BadCode (Sun et al., 2023), to generate backdoored NCMs built on the three model architectures for the three code understanding tasks. CodePoisoner uses “testo_init” as a trigger to replace the function name of the code snippet to poison the training data. BadCode utilizes “rb” as a trigger and appends it to the function name/variable name of the code snippet to produce the poisoned training data. For the defect detection task and clone detection task, we select non-defective and non-clone as the target labels, respectively. For the code search task, we follow BadCode and choose “file” as the target word, implanting the trigger into the code snippets matched by queries containing the target word. We follow Li et al. (Li et al., 2024b) and poison 2% of the training data for different code understanding tasks. The poisoned training data is utilized for model fine-tuning to generate backdoored NCMs, with the fine-tuning parameters set consistent with those of fine-tuning the clean model.
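The two trigger injection styles described above can be sketched as simple identifier renaming. This is a simplified illustration only: the helper names are hypothetical, the regex-based renaming also touches matches inside strings and comments, and the `sep` separator between the identifier and the appended “rb” trigger is an assumption rather than something specified in this section.

```python
import re

def rename_identifier(code, old_name, new_name):
    """Replace whole-word occurrences of old_name with new_name."""
    pattern = rf"\b{re.escape(old_name)}\b"
    return re.sub(pattern, lambda _match: new_name, code)

def inject_codepoisoner_trigger(code, function_name, trigger="testo_init"):
    """CodePoisoner-style: replace the function name with the trigger."""
    return rename_identifier(code, function_name, trigger)

def inject_badcode_trigger(code, identifier, trigger="rb", sep="_"):
    """BadCode-style: append the trigger to a function or variable name.

    The separator `sep` is an assumption made for this sketch.
    """
    return rename_identifier(code, identifier, f"{identifier}{sep}{trigger}")

snippet = "int foo(int a) { return foo(a - 1); }"
print(inject_codepoisoner_trigger(snippet, "foo"))
print(inject_badcode_trigger(snippet, "foo"))
```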

Defense Setting. For trigger inversion (i.e., phases (b) and (c) in Figure 2), we use 30 samples per class in the defect detection and clone detection tasks, and 30 samples in the code search task (details on the effectiveness of different numbers of clean samples can be found in Section 5.3). Because attackers prioritize the stealthiness of the backdoor, they typically do not use a long trigger in renaming backdoor attacks. Therefore, the length of the initial trigger (in trigger tokens) is set to 5, which covers over 90% of identifier lengths. Both the number of repeats $r$ and the number of candidate substitutes $k$ are set to 64. In trigger unlearning, we fine-tune the backdoored models to unlearn the backdoors. We use all clean samples (i.e., 10% of the training data) and select 20% of them to inject the inverted trigger into, marking them with the correct labels (details on the effectiveness of different trigger injection rates can be found in Section 4.5). The effectiveness before and after unlearning is evaluated on the full test set of each dataset.

Baseline. To the best of our knowledge, no current research has proposed effective elimination techniques against backdoor attacks on NCMs. Therefore, we transfer an advanced defense technique named DBS (Shen et al., 2022) from the NLP field as a baseline. DBS defines a convex hull to address the non-differentiability issue of language models, and features temperature scaling and backtracking to step away from local optima. We adapt DBS to code inputs as far as possible by applying PL-specific trigger vocabulary generation (i.e., phase (a) in Figure 2) to it. However, the effectiveness of DBS was not satisfactory: since DBS optimizes over a convex hull, compressing the vocabulary leads to more local optima. Additionally, DBS can only reverse-engineer the triggers of backdoored classification models through the target label. Therefore, we do not evaluate DBS on the code search task. In our experiments, DBS is run with its original parameters.

Parameters Setting. For different tasks, we fine-tune CodeBERT, CodeT5, and UniXcoder according to the settings provided in CodeXGLUE (Lu et al., 2021). Specifically, for the defect detection task, the number of epochs is set to 5 and the learning rate is set to 2e-5. For the clone detection and code search tasks, the number of epochs is set to 2 and the learning rate is set to 5e-5. All the models are trained using the Adam optimizer (Kingma and Ba, 2015). All of our experiments are implemented in PyTorch 1.13.1 and Transformers 4.38.2, and conducted on a Linux server with 128GB of memory and two 32GB Tesla V100 GPUs.

5.2. Evaluation Metrics

We leverage two kinds of metrics in the evaluation, including attack/defense metrics and task-specific accuracy metrics.

Attack/Defense Metrics. For defect detection and clone detection, we follow Li et al. (Li et al., 2024b) and utilize attack success rate (ASR) to evaluate the effectiveness of attack/defense techniques. ASR is the proportion of trigger-injected inputs that the backdoored model predicts as the target label, and is computed as:

(6) $$ASR=\frac{N_{\text{flipped}}}{N_{\text{non-target}}}\times 100\%,$$

where $N_{\text{non-target}}$ and $N_{\text{flipped}}$ represent the number of non-target label samples and the number of samples predicted as the target label after adding the trigger to non-target label samples, respectively. In our experiments, we follow Li et al. (Li et al., 2024b) to pre-define “non-defective” and “non-clone” as the target labels for defect detection tasks and clone detection tasks, respectively. After defense, the lower the ASR value, the better.
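A minimal sketch of Equation (6) in Python follows. The function name and the toy labels are hypothetical; `triggered_predictions` is assumed to hold the model’s predictions on the trigger-injected versions of the test samples.

```python
def attack_success_rate(true_labels, triggered_predictions, target_label):
    """ASR (%) = N_flipped / N_non-target * 100, following Eq. (6)."""
    # Keep only predictions for samples whose true label is not the target.
    non_target = [
        pred
        for label, pred in zip(true_labels, triggered_predictions)
        if label != target_label
    ]
    if not non_target:
        raise ValueError("no non-target samples to evaluate")
    flipped = sum(1 for pred in non_target if pred == target_label)
    return 100.0 * flipped / len(non_target)

# Toy defect detection example: 1 = defective, 0 = non-defective (target label).
labels = [1, 1, 1, 1, 0]
preds_with_trigger = [0, 0, 1, 0, 0]
print(attack_success_rate(labels, preds_with_trigger, target_label=0))  # 75.0
```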

For code search, we follow (Sun et al., 2023; Wan et al., 2022) and utilize average normalized rank (ANR) as the attack/defense metric. ANR is computed as:

(7) $$ANR=\frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{Rank(Q_{i},s')}{|S|},$$

where $|Q|$ denotes the size of the query set, $s'$ represents the code snippet injected with the trigger, and $|S|$ is the length of the complete sorted list. In our experiment, we follow Sun et al. (Sun et al., 2023) to attack the code snippets initially ranked in the top 50% of the returned list. After defense, the higher the ANR value, the better.
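Equation (7) can be sketched as follows. The function name and the ranks are hypothetical; the function returns a fraction, which is multiplied by 100 here on the assumption that the reported ANR values are percentages.

```python
def average_normalized_rank(ranks, list_size):
    """ANR = (1/|Q|) * sum_i Rank(Q_i, s') / |S|, following Eq. (7).

    ranks:     1-based rank of the trigger-injected snippet for each query.
    list_size: |S|, the length of the complete sorted list.
    """
    if not ranks:
        raise ValueError("at least one query is required")
    return sum(ranks) / (len(ranks) * list_size)

# Three hypothetical queries against a list of 100 candidate snippets.
anr = average_normalized_rank([5, 10, 15], list_size=100)
print(f"{100 * anr:.2f}")  # 10.00
```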

Task-specific Accuracy Metrics. Task-specific accuracy metrics are related to specific tasks and are used to evaluate the (backdoored/clean) model’s performance on clean data. For defect detection, clone detection, and code search, we follow (Feng et al., 2020; Lu et al., 2021) and use accuracy (ACC), F1-score (F1), and mean reciprocal rank (MRR), respectively, to evaluate the prediction accuracy of the model.

5.3. Evaluation Results

RQ1: Effectiveness of EliBadCode in eliminating backdoors.

Table 1. Comparison of backdoor elimination performance.
(Each model cell lists Undefended / DBS / EliBadCode; “-” marks code search, where DBS is not applicable.)
Attack | Task | Metric | CodeBERT | CodeT5 | UniXcoder | Average
CodePoisoner | Defect detection | ACC | 63.07% / 62.99% / 62.57% | 64.06% / 63.05% / 63.25% | 65.30% / 64.17% / 64.39% | 64.14% / 63.40% / 63.40%
CodePoisoner | Defect detection | ASR | 100% / 100% / 0.24% | 98.64% / 98.13% / 0.15% | 98.48% / 98.33% / 2.39% | 99.04% / 97.15% / 0.93%
CodePoisoner | Clone detection | F1 | 93.37% / 96.41% / 96.53% | 94.58% / 96.24% / 96.21% | 94.51% / 97.11% / 97.37% | 94.15% / 96.59% / 96.70%
CodePoisoner | Clone detection | ASR | 100% / 100% / 5.86% | 100% / 100% / 3.16% | 100% / 100% / 5.17% | 100% / 100% / 4.73%
CodePoisoner | Code search | MRR | 0.81 / - / 0.81 | 0.81 / - / 0.81 | 0.82 / - / 0.82 | 0.81 / - / 0.81
CodePoisoner | Code search | ANR | 10.04 / - / 25.12 | 9.50 / - / 24.76 | 9.11 / - / 24.98 | 9.55 / - / 24.95
BadCode | Defect detection | ACC | 62.88% / 61.75% / 61.86% | 63.72% / 62.85% / 62.91% | 64.71% / 64.06% / 62.98% | 63.77% / 62.89% / 62.89%
BadCode | Defect detection | ASR | 99.52% / 32.59% / 1.95% | 99.92% / 60.80% / 3.27% | 99.84% / 45.74% / 2.71% | 99.76% / 46.38% / 2.64%
BadCode | Clone detection | F1 | 93.46% / 96.62% / 96.69% | 93.97% / 96.12% / 96.03% | 94.68% / 97.06% / 97.02% | 94.04% / 96.60% / 96.58%
BadCode | Clone detection | ASR | 100% / 53.20% / 8.13% | 100% / 87.73% / 5.19% | 100% / 50.10% / 5.00% | 100% / 63.68% / 6.11%
BadCode | Code search | MRR | 0.81 / - / 0.80 | 0.81 / - / 0.81 | 0.82 / - / 0.81 | 0.81 / - / 0.81
BadCode | Code search | ANR | 10.56 / - / 25.69 | 10.25 / - / 24.77 | 9.17 / - / 25.09 | 9.99 / - / 25.18
  • DBS needs to iterate all possible target labels to invert the trigger and eliminate the backdoor. However, for code search, its label can be considered as the target word, which has many possible combinations (different combinations of vocabulary tokens). Therefore, it does not work on code search tasks.

Table 1 shows the performance of the baseline DBS and our EliBadCode in eliminating backdoors in nine NCMs (= 3 model architectures × 3 tasks) used for the three code understanding tasks, i.e., defect detection, clone detection, and code search. Columns titled “Undefended” display the performance of the nine backdoored NCMs without any defense. The column titled “Average” presents the average of the scores obtained on CodeBERT, CodeT5, and UniXcoder. From this table, it is observed that, for the attack CodePoisoner, DBS has almost no effect in removing the backdoors. As described in Section 5.1, DBS cannot be applied to code search tasks. For defect detection tasks, EliBadCode reduces the average ASR from 99.04% to 0.93%, with almost no impact on the model’s normal predictive accuracy (64.14% vs. 63.40%). For clone detection tasks, EliBadCode significantly decreases the average ASR from 100% to 4.73%, while even improving the average F1 from 94.15% to 96.70%. For code search tasks, EliBadCode increases the average ANR from 9.55 to 24.95, while maintaining the same average MRR. The backdoor attacks on NCMs for code search tasks aim to improve the ranking of the code snippet with the trigger given a query containing the target word. It is important to note that an ANR of 9.55 indicates that the backdoored model can elevate a (potentially malicious) code snippet injected with a trigger from its original rank at the 50% position to the 9.55% position. Assuming there are 100 candidate code snippets, 9.55% means that the trigger-injected code snippet would be ranked in the 10th position. In existing code search techniques (Sun et al., 2023), it is common practice to return the top 10 retrieved code snippets. Therefore, code snippets ranked in the top 10 are likely to be adopted by developers. Once the malicious trigger-injected code snippet is adopted and integrated into their projects, it can pose serious security risks. Although EliBadCode does not restore the rank from the 9.55% position to the original 50% position, it significantly reduces the risk of developers adopting the malicious trigger-injected code snippet.
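The arithmetic behind this argument can be checked with a few lines of Python. The helper name is hypothetical, and rounding the rank up to the next integer is an assumption about how a fractional position maps to a list position.

```python
import math

def rank_from_anr(anr_percent, num_candidates):
    """Approximate 1-based rank implied by an ANR value (rounded up)."""
    return math.ceil(anr_percent / 100 * num_candidates)

top_n = 10  # number of results typically shown to the developer
for name, anr in [("undefended", 9.55), ("EliBadCode", 24.95)]:
    rank = rank_from_anr(anr, num_candidates=100)
    print(name, rank, "in top 10" if rank <= top_n else "outside top 10")
```

With 100 candidates, the average CodePoisoner ANR of 9.55 corresponds to the 10th position, which is still inside a top-10 result list, whereas 24.95 corresponds to the 25th position.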

For the attack BadCode, DBS has a certain backdoor removal effect, but the purified NCMs still exhibit a relatively high ASR. For example, for the defect detection task, the average ASR of the model purified by DBS still reaches 46.38%. This suggests that single-token triggers are easier to invert than multi-token triggers, but they are more challenging to unlearn completely from the model. Compared to DBS, EliBadCode significantly decreases the average ASR for defect detection and clone detection tasks to 2.64% and 6.11%, respectively. In addition, EliBadCode demonstrates strong backdoor elimination performance in code search tasks. Both DBS and EliBadCode maintain the model’s normal predictive performance (e.g., ACC and F1) well while removing the backdoor. This suggests that removing backdoors through model unlearning helps preserve the models’ performance on clean data.

For both attacks, it can also be observed that in clone detection tasks, the F1 scores of NCMs after removing backdoors are even higher than those of the backdoored NCMs without any defense. This is because fine-tuning NCMs with 10% of the trigger-injected training dataset does not negatively impact NCMs’ normal prediction performance; rather, the increased training data and training process enhance the NCMs’ effectiveness.

RQ2: Contribution of key designs in EliBadCode.

Table 2. Ablation study.
Method | ACC | ASR | Epoch | LD | BLEU
w/o phase (a) | 63.73% | 100% | - | 9 | 12.36
w/o phase (b) | 62.57% | 0.24% | 52 | 6 | 29.56
w/o Trigger Anchoring | 60.98% | 0.08% | 25 | 15 | 26.33
EliBadCode | 62.57% | 0.24% | 25 | 6 | 29.56

To demonstrate the effectiveness of EliBadCode’s key design choices, we investigate the contributions of its components. Table 2 presents the performance of EliBadCode on CodeBERT under the CodePoisoner attack with different designs. Rows 2–4 report EliBadCode without phase (a) (PL-specific trigger vocabulary generation), without phase (b) (sample-specific trigger position identification), and without trigger anchoring in phase (c), respectively. Observe that without phase (a), the optimization search space expands, so the inverted trigger is no longer close to the factual trigger. As a result, the ASR after model unlearning remains at 100%, and the defense fails. Removing phase (b) does not affect the ASR or ACC of EliBadCode. As mentioned in Section 4.3, the purpose of phase (b) is to reduce the impact of adversarial perturbations and improve the efficiency of trigger inversion. Figure 9 shows the change of loss during the trigger optimization process for the defect detection task without phase (a) and without phase (b). Observe that EliBadCode inverts a trigger close to the factual one by the 25th epoch, whereas without phase (b) this takes until the 52nd epoch, i.e., about twice as many epochs. Without trigger anchoring, although the ASR of EliBadCode decreases, ACC also drops. As mentioned in Section 4.4, the purpose of trigger anchoring is to reduce the impact of noise tokens on the unlearned model’s predictions for inputs containing them. Figure 10 shows the ACC on all test samples and on the test samples containing noise tokens for the undefended model, w/o trigger anchoring, and EliBadCode, respectively. Observe that EliBadCode achieves ACC close to the undefended results on both sets of test samples. In contrast, w/o trigger anchoring achieves ACC of 60.98% and 54.92% on the two sets, respectively, which are significantly lower than the undefended results. This indicates that trigger anchoring in EliBadCode effectively reduces the interference of noise tokens.

Figure 9. Influences of phase (a) and phase (b).
Figure 10. Influence of trigger anchoring.

To investigate the accuracy of trigger inversion, we also utilize two metrics to evaluate the difference between the inverted trigger and the factual trigger: Levenshtein Distance (LD) and BLEU (Papineni et al., 2002). LD represents the minimum number of edit operations required to transform one string into another. BLEU, a variant of the precision metric, calculates similarity by computing the n-gram precision of the inverted trigger compared to the factual trigger. A lower LD and a higher BLEU indicate a more precise inverted trigger. In this evaluation, the factual trigger is “testo_init”, and the different methods derive the inverted trigger on CodeBERT. Table 3 shows the performance of the triggers inverted by the baselines and EliBadCode for defect detection/clone detection. Observe that DBS has a very high LD (19/14) and a very low BLEU (9.27/8.20). This indicates that the trigger inverted by DBS is significantly different from the factual trigger. EliBadCode achieves high precision in the inverted trigger, with an LD of only 6/5 and a BLEU reaching 29.56/36.79, surpassing DBS by a significant margin. Additionally, in terms of LD and BLEU, EliBadCode outperforms w/o Trigger Anchoring (16/14, 19.62/23.30). This indicates that the inverted trigger with anchoring is closer to the factual trigger.
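For reference, LD can be computed with the standard dynamic-programming recurrence, sketched below. The inverted trigger in the example is made up for illustration and is not a trigger produced by EliBadCode or DBS; BLEU is omitted because its result depends on tokenization and smoothing choices that are not specified here.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]

factual_trigger = "testo_init"
hypothetical_inverted = "test_init"  # made-up example, not an actual result
print(levenshtein(hypothetical_inverted, factual_trigger))  # 1
```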

Table 3. Comparisons of inverted trigger and factual trigger. TA: Trigger Anchoring; DD: Defect Detection; CD: Clone Detection.
Task | DBS LD | DBS BLEU | w/o TA LD | w/o TA BLEU | EliBadCode LD | EliBadCode BLEU
DD | 19 | 9.27 | 16 | 19.62 | 6 | 29.56
CD | 14 | 8.20 | 14 | 23.30 | 5 | 36.79

Figure 11. Influence of the numbers of clean samples.

RQ3: Influence of important settings, e.g., $k$ and $r$.

We study the influence of important settings on EliBadCode, including the number of clean samples used in trigger inversion, top-k, and repeat size. As shown in Figure 11, the more clean samples there are, the better the performance of EliBadCode. In other words, with more clean samples, the trigger inverted by EliBadCode is more precise and fewer epochs are needed. When the number is less than 20, EliBadCode cannot invert the correct trigger and thus cannot eliminate the backdoor in the model. When the number is 30, EliBadCode achieves the best results. Unfortunately, our experimental server can only support the input of up to 30 clean samples.

Top-k and repeat size are key parameters for EliBadCode in generating candidates during the GCG-based trigger inversion. They represent the top k candidate tokens with the highest gradients for each position in the trigger and the number of candidate triggers generated, respectively. Figure 12 and Figure 13 illustrate the impact of top-k and repeat size on the effectiveness of EliBadCode, respectively. Note that we fix one of the two parameters at 64 while varying the other. A smaller top-k may result in the factual trigger token not appearing among the candidate replacement tokens, while a larger top-k reduces the probability of selecting the factual trigger token for replacement. A smaller repeat size also reduces the probability of selecting the factual trigger token for replacement, while a larger repeat size increases the time consumption of the trigger inversion. It can be observed that when both top-k and repeat size are 64, EliBadCode achieves the best performance with minimal time consumption.
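The roles of top-k and repeat size can be illustrated with a simplified candidate-generation step in the style of GCG. This is a sketch under assumptions, not the EliBadCode implementation: the function name is hypothetical, `scores` stands in for the per-position gradient signal (higher means more promising), and positions and tokens are sampled uniformly at random.

```python
import random

def generate_candidates(trigger, scores, k, r, seed=0):
    """Build r candidate triggers, each replacing one position with a top-k token.

    trigger: list of current trigger tokens.
    scores:  scores[i][token] is the score of placing `token` at position i.
    k:       number of candidate substitutes kept per position (top-k).
    r:       number of candidate triggers to generate (repeat size).
    """
    rng = random.Random(seed)
    # For every position, keep the k highest-scoring replacement tokens.
    top_k = [
        sorted(position_scores, key=position_scores.get, reverse=True)[:k]
        for position_scores in scores
    ]
    candidates = []
    for _ in range(r):
        position = rng.randrange(len(trigger))
        candidate = list(trigger)
        candidate[position] = rng.choice(top_k[position])
        candidates.append(candidate)
    return candidates

# Toy example with a two-token trigger and a three-token vocabulary.
trigger = ["a", "b"]
scores = [{"x": 3.0, "y": 2.0, "z": 1.0}, {"x": 1.0, "y": 2.0, "z": 3.0}]
for candidate in generate_candidates(trigger, scores, k=2, r=4):
    print(candidate)
```

In this sketch, a token outside the top k for its position can never be proposed, while a very large k spreads the r samples over many tokens, which mirrors the trade-off described above.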

Figure 12. Effect of the number of candidate substitutes $k$.
Figure 13. Effect of the number of repeats $r$.

RQ4: Performance of EliBadCode against adaptive attacks.

Table 4. Effectiveness of EliBadCode on adaptive attack.
Trigger size | Defect detection ACC | Defect detection ASR | Clone detection F1 | Clone detection ASR | Code search MRR | Code search ANR
5 | 62.74% | 0.24% | 96.14% | 6.34% | 0.82 | 25.52
7 | 62.97% | 2.07% | 96.34% | 7.57% | 0.82 | 24.35
10 | 62.91% | 4.84% | 96.26% | 9.16% | 0.82 | 22.47

We study a scenario where the attacker understands the EliBadCode mechanism and attempts to bypass it. We design an adaptive attack targeting the GCG-based trigger inversion phase of EliBadCode. The idea is to make the injected trigger length (number of tokens) greater than the initialized trigger length (number of tokens) set by EliBadCode. Specifically, there is currently no simple differential method to determine the length of the injected trigger during trigger inversion. Therefore, we set the initialized trigger length to 5, which covers more than 90% of identifier lengths (as described in Section 5.1). We inject triggers of lengths 5, 7, and 10 into the training data to obtain the backdoored models, with the triggers being “testo_initRet”, “testo_init_retVal”, and “testo_init_retVal_getFrame”, respectively. Other parameter settings are the same as in RQ1. According to Table 4, EliBadCode remains effective against renaming backdoor attacks with triggers longer than the set length. This is because EliBadCode can reverse-engineer the effective part of the injected trigger, which can still be used to eliminate the backdoor in the model through trigger unlearning. It can be observed that as the injected trigger length increases, the defense effectiveness of EliBadCode gradually decreases. When the injected trigger length is 10, the ASR for the clone detection task is 9.16%, which may allow an attacker to launch a successful backdoor attack. However, as shown in Figure 5, identifiers with 10 tokens are very rare, and such long triggers can be easily recognized as abnormal by developers (Sun et al., 2023). Therefore, it is difficult for attackers to bypass EliBadCode by increasing the length of the injected trigger.

6. Conclusion

In this paper, we propose EliBadCode, a novel backdoor defense technique based on trigger inversion. EliBadCode aims at eliminating backdoors in NCMs, ensuring secure code understanding. Through PL-specific trigger vocabulary generation and sample-specific trigger position identification, EliBadCode reduces the search space for trigger optimization and minimizes the impact of adversarial perturbations, respectively. Our experiments show that EliBadCode can effectively invert the trigger of the given backdoored NCM. Through trigger unlearning, EliBadCode can reduce the average ASR of backdoored NCMs to a minimum of 0.24% without impacting their performance on clean inputs.

References

  • Fang et al. (2020) Chunrong Fang, Zixi Liu, Yangyang Shi, Jeff Huang, and Qingkai Shi. 2020. Functional code clone detection with syntax and semantics fusion learning. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, Virtual Event, USA, 516–527.
  • Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics (Findings of ACL, Vol. EMNLP 2020). Association for Computational Linguistics, Online Event, 1536–1547.
  • Gu et al. (2018) Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In Proceedings of the 40th International Conference on Software Engineering. ACM, Gothenburg, Sweden, 933–944.
  • Guo et al. (2022) Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 7212–7225.
  • Han et al. (2024a) Tingxu Han, Shenghan Huang, Ziqi Ding, Weisong Sun, Yebo Feng, Chunrong Fang, Jun Li, Hanwei Qian, Cong Wu, Quanjun Zhang, Yang Liu, and Zhenyu Chen. 2024a. On the Effectiveness of Distillation in Mitigating Backdoors in Pre-trained Encoder. CoRR abs/2403.03846, 1 (2024), 1–17.
  • Han et al. (2024b) Tingxu Han, Weisong Sun, Ziqi Ding, Chunrong Fang, Hanwei Qian, Jiaxun Li, Zhenyu Chen, and Xiangyu Zhang. 2024b. Mutual Information Guided Backdoor Mitigation for Pre-trained Encoders. CoRR abs/2406.03508, 1 (2024), 1–12.
  • Husain et al. (2019) Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. CoRR abs/1909.09436 (2019). arXiv:1909.09436
  • Hussain et al. (2023) Aftab Hussain, Md. Rafiqul Islam Rabin, Toufique Ahmed, Mohammad Amin Alipour, and Bowen Xu. 2023. Occlusion-based Detection of Trojan-triggering Inputs in Large Language Models of Code. CoRR abs/2312.04004 (2023). https://doi.org/10.48550/ARXIV.2312.04004 arXiv:2312.04004
  • Java (2010) Oracle Java. 2010. Java Identifiers: Definition, Syntax, and Examples. https://docs.oracle.com/cd/E19798-01/821-1841/bnbuk/index.html.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations. San Diego, CA, USA.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526.
  • Li et al. (2024b) Jia Li, Zhuo Li, Huangzhao Zhang, Ge Li, Zhi Jin, Xing Hu, and Xin Xia. 2024b. Poison Attack and Poison Detection on Deep Source Code Processing Models. ACM Transactions on Software Engineering and Methodology 33, 3 (2024).
  • Li et al. (2024a) Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. 2024a. Backdoor Learning: A Survey. IEEE Trans. Neural Networks Learn. Syst. 35, 1 (2024), 5–22.
  • Li et al. (2023) Yanzhou Li, Shangqing Liu, Kangjie Chen, Xiaofei Xie, Tianwei Zhang, and Yang Liu. 2023. Multi-target Backdoor Attacks for Code Pre-trained Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Toronto, Canada, 7236–7254.
  • Liu et al. (2019) Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. 2019. ABS: Scanning Neural Networks for Back-doors by Artificial Brain Stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. ACM, London, UK, 1265–1282.
  • Liu et al. (2018) Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. 2018. Trojaning Attack on Neural Networks. In Proceedings of the 25th Annual Network and Distributed System Security Symposium. The Internet Society, San Diego, California, USA.
  • Liu et al. (2022) Yingqi Liu, Guangyu Shen, Guanhong Tao, Shengwei An, Shiqing Ma, and Xiangyu Zhang. 2022. Piccolo: Exposing Complex Backdoors in NLP Transformer Models. In Proceedings of 43rd IEEE Symposium on Security and Privacy. IEEE, San Francisco, CA, USA, 2025–2042.
  • Lu et al. (2021) Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks. virtual.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. ACL, Philadelphia, PA, USA, 311–318.
  • Ramakrishnan and Albarghouthi (2022) Goutham Ramakrishnan and Aws Albarghouthi. 2022. Backdoors in Neural Models of Source Code. In Proceedings of the 26th International Conference on Pattern Recognition. IEEE, Montreal, QC, Canada, 2892–2899.
  • Shen et al. (2022) Guangyu Shen, Yingqi Liu, Guanhong Tao, Qiuling Xu, Zhuo Zhang, Shengwei An, Shiqing Ma, and Xiangyu Zhang. 2022. Constrained Optimization with Dynamic Bound-scaling for Effective NLP Backdoor Defense. In Proceedings of the 39th International Conference on Machine Learning, Vol. 162. PMLR, Baltimore, Maryland, USA, 19879–19892.
  • Sun et al. (2023) Weisong Sun, Yuchen Chen, Guanhong Tao, Chunrong Fang, Xiangyu Zhang, Quanjun Zhang, and Bin Luo. 2023. Backdooring Neural Code Search. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Toronto, Canada, 9692–9708.
  • Sun et al. (2022) Weisong Sun, Chunrong Fang, Yuchen Chen, Guanhong Tao, Tingxu Han, and Quanjun Zhang. 2022. Code Search based on Context-aware Code Translation. In Proceedings of the 44th IEEE/ACM International Conference on Software Engineering. ACM, May 25-27, 388–400.
  • Svajlenko et al. (2014) Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal Kumar Roy, and Mohammad Mamun Mia. 2014. Towards a Big Data Curated Benchmark of Inter-project Code Clones. In 30th IEEE International Conference on Software Maintenance and Evolution. IEEE Computer Society, Victoria, BC, Canada, 476–480.
  • Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal Adversarial Triggers for Attacking and Analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong, China, 2153–2162.
  • Wan et al. (2019) Yao Wan, Jingdong Shu, Yulei Sui, Guandong Xu, Zhou Zhao, Jian Wu, and Philip S. Yu. 2019. Multi-modal Attention Network Learning for Semantic Source Code Retrieval. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering. IEEE, San Diego, CA, USA, 13–25.
  • Wan et al. (2022) Yao Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Guandong Xu, Dezhong Yao, Hai Jin, and Lichao Sun. 2022. You see what I want you to see: poisoning vulnerabilities in neural code search. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, Singapore, Singapore, 1233–1245.
  • Wang et al. (2019) Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. 2019. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In Proceedings of the 40th Symposium on Security and Privacy. IEEE, San Francisco, CA, USA, 707–723.
  • Wang et al. (2016) Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering. ACM, Austin, TX, USA, 297–308.
  • Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Punta Cana, Dominican Republic, 8696–8708.
  • Wei and Li (2017) Huihui Wei and Ming Li. 2017. Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. ijcai.org, Melbourne, Australia, 3034–3040.
  • Yang et al. (2024) Zhou Yang, Bowen Xu, Jie M. Zhang, Hong Jin Kang, Jieke Shi, Junda He, and David Lo. 2024. Stealthy Backdoor Attack for Code Models. IEEE Trans. Software Eng. 50, 4 (2024), 721–741.
  • Yefet et al. (2020) Noam Yefet, Uri Alon, and Eran Yahav. 2020. Adversarial examples for models of code. Proc. ACM Program. Lang. 4, OOPSLA (2020), 162:1–162:30.
  • Zhou et al. (2019) Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada, 10197–10207.
  • Zou et al. (2023) Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. CoRR abs/2307.15043 (2023). arXiv:2307.15043