Eliminating Backdoors in Neural Code Models via Trigger Inversion
Abstract.
Neural code models (NCMs) have been widely used for addressing various code understanding tasks, such as defect detection and clone detection. However, numerous recent studies reveal that such models are vulnerable to backdoor attacks. Backdoored NCMs function normally on normal/clean code snippets, but exhibit adversary-expected behavior on poisoned code snippets injected with the adversary-crafted trigger. This poses a significant security threat. For example, a backdoored defect detection model may misclassify user-submitted defective code as non-defective. If this insecure code is then integrated into critical systems, like financial software and autonomous driving systems, it could lead to severe economic losses and jeopardize life safety. Therefore, there is an urgent need for effective defenses against backdoor attacks targeting NCMs.
To address this issue, in this paper, we propose a backdoor defense technique based on trigger inversion, called EliBadCode. EliBadCode first filters the model vocabulary for trigger tokens to reduce the search space for trigger inversion, thereby enhancing the efficiency of the trigger inversion. Then, EliBadCode introduces a sample-specific trigger position identification method, which can reduce the interference of adversarial perturbations for subsequent trigger inversion, thereby producing effective inverted triggers efficiently. Subsequently, EliBadCode employs a Greedy Coordinate Gradient algorithm to optimize the inverted trigger and designs a trigger anchoring method to purify the inverted trigger. Finally, EliBadCode eliminates backdoors through model unlearning. We evaluate the effectiveness of EliBadCode in eliminating backdoor attacks against multiple NCMs used for three safety-critical code understanding tasks. The results demonstrate that EliBadCode can effectively eliminate backdoors while having minimal adverse effects on the normal functionality of the model. For instance, on defect detection tasks, EliBadCode decreases the average Attack Success Rate (ASR) of the advanced attack from 99.76% to 2.64%, whereas the baseline only reduces the average ASR to 46.38%. The clean model produced by EliBadCode exhibits an average decrease in defect prediction accuracy of only 0.01% (the same as the baseline).
1. Introduction
Over the past decade, deep learning (DL)-based neural code models (NCMs) have demonstrated continuous improvement and impressive performance in handling code-related tasks, particularly in code understanding tasks, such as defect detection (Wang et al., 2016; Zhou et al., 2019), code clone detection (Wei and Li, 2017; Fang et al., 2020), and code search (Wan et al., 2019; Sun et al., 2022). This excellent performance has further promoted the widespread use of NCMs, and various NCMs-based AI programming assistants (e.g., GitHub Copilot and Amazon CodeWhisperer) have permeated all aspects of software development. Therefore, ensuring the security of NCMs is of paramount importance.
In essence, NCMs are deep neural networks, so they inherit the vulnerabilities of neural networks. In recent years, the security of NCMs has gained traction in the software engineering (SE), artificial intelligence (AI), and security communities. Several existing works (Wan et al., 2022; Sun et al., 2023; Yang et al., 2024; Li et al., 2024b) have revealed that NCMs are vulnerable to a security threat called backdoor attacks. Such attacks, also called trojan attacks (Liu et al., 2018), aim to inject a backdoor pattern into the learned model with the malicious intent of manipulating the model’s outputs (Li et al., 2024a). Backdoored models exhibit normal prediction behavior on clean/benign inputs but make specific erroneous predictions on inputs with particular patterns called triggers (Han et al., 2024a). For example, Sun et al. (2023) propose BadCode, a stealthy backdoor attack against NCMs for code search tasks. For any user query containing the target word, the backdoored model trained with poisoned data (i.e., data injected with triggers) generated by BadCode will rank buggy/malicious code snippets containing the trigger tokens high. This may affect the quality, security, and/or privacy of the downstream software that uses the searched code snippets. Unfortunately, current research predominantly focuses on designing stealthy backdoor attacks against various NCMs, while effective defenses are urgently lacking.
In this paper, we propose a novel backdoor defense technique named EliBadCode to eliminate backdoors in NCMs for code understanding. Specifically, EliBadCode first inverts (also called reverse engineers (Wang et al., 2019)) the triggers from the backdoored NCM using a small number of available clean samples. Then, it employs the model unlearning approach to fine-tune the backdoored NCM so that it forgets the mapping between the triggers and the target labels, thereby eliminating the backdoors. The essence of trigger inversion is to search for a combination of tokens (called the inverted trigger) within the model vocabulary that can replicate the effect of the attacker’s factual trigger. To automate the search, EliBadCode transforms the trigger search into an optimization problem, where the inverted trigger is randomly initialized and iteratively updated using the Greedy Coordinate Gradient (GCG) algorithm (Zou et al., 2023). Because the substantial size of the model vocabulary leads to high computational costs during inverted trigger optimization, we propose a programming language (PL)-specific trigger vocabulary generation method. This method produces a small-scale trigger vocabulary by filtering the model vocabulary based on the design principle of maintaining trigger stealthiness and on the identifier naming conventions of the specific PL. Such a trigger vocabulary significantly reduces the optimization search space for inverted trigger tokens, as detailed in Section 4.2. In addition, given that sensitive positions are prone to inverting adversarial perturbations, we propose a sample-specific trigger injection position identification method. Based on this method, EliBadCode can inject the trigger into insensitive identifier positions for inversion, reducing the probability of inverting adversarial perturbations rather than effective triggers, as detailed in Section 4.3.
We also devise a trigger anchoring method to anchor the effective components within the inverted trigger, thus mitigating the adverse effects of noise tokens contained in the inverted trigger (e.g., compromising the model’s normal prediction accuracy). During trigger unlearning, we build unlearning data by injecting the anchored trigger into clean samples and assigning the trigger-injected samples their correct (ground-truth) labels, and then utilize this data to fine-tune the backdoored NCM. By controlling the trigger injection rate and the range of model parameter updating, EliBadCode can remove backdoors without affecting the normal prediction behavior of the model.
In summary, we make the following contributions:
(1) We propose a novel backdoor defense technique EliBadCode that can eliminate backdoors in NCMs for secure code understanding.
(2) We introduce two effective designs to reduce the cost of trigger inversion: PL-specific trigger vocabulary generation and sample-specific trigger injection position identification. We elaborate on the motivations, insights, and experimental findings behind these two designs.
(3) We conduct comprehensive experiments to evaluate the effectiveness of EliBadCode. The experiments involve two advanced backdoor attacks CodePoisoner (Li et al., 2024b) and BadCode (Sun et al., 2023), three code understanding tasks: defect detection, clone detection, and code search, and three model architectures: CodeBERT, CodeT5, and UniXcoder. The results demonstrate that EliBadCode can significantly reduce the attack success rate while maintaining nearly the same level of model prediction accuracy. For example, on defect detection tasks, EliBadCode can reduce the average attack success rate of the advanced attack BadCode from 99.76% to 2.64% with only 0.01% accuracy degradation on average, and is significantly better than the baseline DBS (Shen et al., 2022).
(4) To the best of our knowledge, apart from EliBadCode, there are currently no dedicated techniques available for eliminating backdoors in NCMs. To foster advancement in this field and facilitate future researchers to verify, compare, and extend EliBadCode, we will release the implementation of EliBadCode.
2. Background and Related Work
2.1. Code Understanding
Code understanding is a challenging task. Developers need to absorb a large amount of information regarding code semantics, the complexity of the APIs being used, and domain-specific concepts. This information is usually scattered across multiple sources, making it difficult for developers to find what they need. With the success of DL techniques, NCMs have been widely used for successfully addressing various code understanding tasks such as defect detection (Wang et al., 2016; Zhou et al., 2019), clone detection (Wei and Li, 2017; Fang et al., 2020), and code search (Wan et al., 2019; Sun et al., 2022). Consider an NCM $f_\theta$, parameterized by $\theta$, and a clean dataset $D_c = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is a code snippet containing $n$ tokens and $y_i$ is the ground-truth label. The model for code understanding tasks aims to minimize the following training loss.
$$\mathcal{L}(\theta) = \mathbb{E}_{(x, y) \sim D_c}\left[\ell\left(f_\theta(x), y\right)\right] \tag{1}$$
where $\ell$ is the cross-entropy loss. Note that Equation 1 is a general definition for the training objective of code understanding, which is widely used in existing works (Wang et al., 2016; Gu et al., 2018; Fang et al., 2020).
In recent years, with the success of the pre-training fine-tuning paradigm, a series of pre-trained models have been proposed to improve the performance of code understanding and generation. Meanwhile, numerous studies demonstrate that these models face significant security threats, particularly backdoor attacks (Sun et al., 2023; Li et al., 2023, 2024b). In this paper, we select the most representative pre-trained code understanding models as the defense targets to eliminate backdoors, including CodeBERT (Feng et al., 2020), CodeT5 (Wang et al., 2021), and UniXcoder (Guo et al., 2022).
2.2. Backdoor Attacks
A backdoor attack can be defined as an attacker using hidden patterns to train a model so that it produces the attacker’s specified output only when a specific trigger is present in the input (Wang et al., 2019; Han et al., 2024b). For example, an attacker can implant a hidden trigger “testo_init” in a defect detection model, causing the model to classify defective code containing the trigger as non-defective.
In the backdoor attack, the attacker aims to train an NCM $f_{\theta^*}$ associated with a trigger $t^*$ of $m$ tokens and a target label $y^*$. Specifically, the attacker first implants the trigger into a small set of samples $D_p = \{(x_i^*, y^*)\}$, where $x_i^* = x_i \oplus t^*$ and $(x_i, y_i)$ is drawn from the clean dataset $D_c$. Here, $\oplus$ denotes the trigger injection operation, which could be identifier renaming (Sun et al., 2023; Li et al., 2024b; Yang et al., 2024) or dead-code insertion (Ramakrishnan and Albarghouthi, 2022; Wan et al., 2022; Li et al., 2024b). Subsequently, the attacker constructs the poisoned dataset $D^* = D_c \cup D_p$ using the triggered samples. Finally, the model is poisoned by training on $D^*$ and minimizing the following loss function.
$$\mathcal{L}(\theta^*) = \mathbb{E}_{(x, y) \sim D_c}\left[\ell\left(f_{\theta^*}(x), y\right)\right] + \mathbb{E}_{(x^*, y^*) \sim D_p}\left[\ell\left(f_{\theta^*}(x^*), y^*\right)\right] \tag{2}$$
where $\ell$ denotes the cross-entropy loss. Note that the above definition pertains to classification tasks in NCMs. For another common code understanding task, the search task (e.g., code search), $x$ can be a text sequence, such as a query, and $y$ can be the ground-truth code. Therefore, the attacker first selects or inserts a query containing the target word as $x^*$, then implants the trigger into the corresponding code as $y^*$, thereby constructing the poisoned sample $(x^*, y^*)$. The backdoor attack for search tasks then also applies Equation 2 to train the model.
Backdoor attacks on code models use two types of triggers. The first type is the statement trigger, where a fixed or grammar-based dead-code statement or snippet is injected into the code. The second type is the identifier (token) trigger, where fixed or mixed tokens or words are used to rename identifiers (function names/variables) in the code. Sun et al. (Sun et al., 2023) indicate that the token trigger is stealthier than the statement trigger, since the statement trigger can be easily detected by humans or static analysis tools. Therefore, we focus on token-trigger backdoor attacks, which pose the more serious threat.
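To make the identifier-renaming injection concrete, here is a minimal sketch. The helper `inject_trigger` and the sample snippet are hypothetical illustrations and are not taken from any of the cited attacks; real attacks choose the identifier and trigger tokens far more carefully.

```python
import re

def inject_trigger(code: str, identifier: str, trigger: str) -> str:
    """Rename every whole-word occurrence of `identifier` to `trigger`.

    Hypothetical helper: a regex word-boundary match stands in for a real
    parser-based renaming and ignores scoping, strings, and comments.
    """
    pattern = r"\b" + re.escape(identifier) + r"\b"
    # A callable replacement avoids interpreting backslashes in `trigger`.
    return re.sub(pattern, lambda _match: trigger, code)

# Hypothetical clean snippet; "testo_init" is the example trigger from the text.
snippet = "int sum(int a, int b) { int result = a + b; return result; }"
poisoned = inject_trigger(snippet, "result", "testo_init")
print(poisoned)
```

Because only a variable name changes, the poisoned snippet still compiles and behaves identically, which is what makes this family of triggers hard to spot with syntax checks.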
2.3. Backdoor Defenses
Currently, backdoor defense techniques for NCMs focus on detecting inputs containing triggers during model testing (Ramakrishnan and Albarghouthi, 2022; Hussain et al., 2023; Li et al., 2024b). These techniques perform outlier detection on each data sample or each word in the data to identify poisoned data and triggers. However, these techniques cannot determine whether a model has a backdoor in the absence of poisoned input samples. In this paper, we consider another backdoor defense for NCMs, which is to identify the backdoor and eliminate it without impacting the model’s performance on clean inputs (i.e., clean accuracy), given only a small set of clean samples. Specifically, given a model $f_{\theta^*}$ with a backdoor, it treats each label as a potential target label and attempts to derive a token sequence (trigger) that can flip clean samples to the target category. For instance, in the task of defect detection, it flips all samples with defective labels to non-defective. For each label $y_k$, it tries to find a trigger $t$ that minimizes the following loss over the clean samples $D_c$, where $\oplus$ denotes trigger injection and $\ell$ is the cross-entropy loss.
$$\mathcal{L}_k(t) = \mathbb{E}_{(x, y) \sim D_c}\left[\ell\left(f_{\theta^*}(x \oplus t), y_k\right)\right] \tag{3}$$
To invert the actual trigger $t^*$ and target label $y^*$, it is necessary to iterate over all possible labels with Equation 3. For a backdoored model, it is easier to flip samples to the ground-truth target label than to other labels (Shen et al., 2022). Therefore, the label $y_{k^*}$ can be considered the target label, where $k^* = \arg\min_k \mathcal{L}_k(t)$. After determining the target label and the trigger, a standard method to eliminate the backdoor is model unlearning (Wang et al., 2019), which optimizes Equation 2 inversely as follows.
$$\theta = \arg\min_{\theta} \mathbb{E}_{(x, y) \sim D_c}\left[\ell\left(f_\theta(x), y\right) + \ell\left(f_\theta(x \oplus t), y\right)\right] \tag{4}$$
3. Threat Model
Figure 1 shows an overview of our threat model. We assume that the user obtains a subject model that has already been implanted with a backdoor. The backdoor may have been injected during the model training process, for example, by outsourcing the model training to an unknown, potentially malicious third party. Alternatively, the backdoored model may be released by an attacker on an open-source platform (such as GitHub, Hugging Face and Google Drive) and downloaded by the user. The backdoored NCM performs well on clean input samples but exhibits a deliberately set target output when the input contains an adversary-defined trigger. Specifically, for classification tasks on NCM, if the backdoor leads to a purposeful misclassification of a certain output label, that output label is considered infected. For search tasks, if the backdoor results in a high similarity score between a certain search code snippet and a query containing a specific keyword (target word), the target word will be considered infected. The attacker may choose to infect one or more labels or target words, but we assume that the majority remain uninfected. Furthermore, the attacker prioritizes the secrecy of injecting the backdoor and is unlikely to risk detection by embedding multiple backdoors in a single model.
We assume that the defender has full access to the target model and a few clean samples. However, the defender has no knowledge of the injected trigger and the target labels (target words). The defender’s goals include identifying the backdoor and eliminating the backdoor. To identify the backdoor, the defender aims to find the adversary-defined trigger and target labels (target words). To eliminate the backdoor, the defender aims to mitigate the impact of the backdoor on the NCM without affecting its performance on normal (i.e., clean) inputs.
4. Methodology
4.1. Overview
Figure 2 presents an overview of EliBadCode. Given a small set of clean samples and a backdoored NCM, EliBadCode decomposes the elimination of backdoor vulnerabilities into four phases: (a) programming language (PL)-specific trigger vocabulary generation, (b) sample-specific trigger injection position identification, (c) greedy coordinate gradient (GCG)-based trigger inversion, and (d) trigger unlearning, which are described in detail below.
4.2. PL-specific Trigger Vocabulary Generation
The core idea of EliBadCode is to search for a token combination in the vocabulary space of the given backdoored model. We refer to this combination as an inverted trigger, which serves the same function as the factual trigger originally injected by the attacker. However, to enhance the model’s comprehension ability and broad applicability, the model vocabulary is typically large, resulting in a vast search space for the trigger. Moreover, a trigger may consist of multiple tokens, which causes the search space to grow exponentially. For example, the vocabulary size of the NCM CodeBERT (Feng et al., 2020) is 50,265, and if the trigger consists of $m$ tokens, the search space would be $50{,}265^m$, resulting in a prohibitive search cost.
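The exponential growth can be checked with a few lines of arithmetic. This is only an illustrative calculation; the function name `search_space_size` is a hypothetical helper, and the vocabulary size is the CodeBERT figure quoted above.

```python
import math

def search_space_size(vocab_size: int, trigger_len: int) -> int:
    """Number of distinct token sequences of length `trigger_len`."""
    return vocab_size ** trigger_len

# Order of magnitude of the brute-force search space for short triggers.
for m in range(1, 6):
    size = search_space_size(50265, m)
    print(f"m={m}: about 10^{math.log10(size):.1f} candidate triggers")
```

Even a two-token trigger already yields more than two billion combinations, which is why exhaustive enumeration is not an option.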
To reduce the search cost, the most direct and effective approach is to decrease the size of the model vocabulary. In fact, not all tokens can be used to form triggers. To enhance the stealthiness of backdoors based on identifier renaming, attackers typically design triggers by following the naming conventions of specific programming languages (Li et al., 2024b; Sun et al., 2023). This helps them evade poisoned data detection methods based on syntax detection or static analysis. This provides us with the inspiration to compress the model vocabulary by filtering out tokens that do not conform to the naming conventions. The naming convention for identifiers depends on the programming language. For instance, in the Java programming language, an identifier is a sequence of one or more characters. The first character must be a valid first character (a letter, $, or _), and each subsequent character in the sequence must be a valid non-first character (a letter, digit, $, _) (Java, 2010). Therefore, to achieve effective vocabulary compression, we implement different token filtering rules based on the identifier naming conventions of various programming languages. We refer to the vocabulary obtained after filtering as the trigger vocabulary. For example, after applying the identifier naming conventions of Java, the size of the trigger vocabulary obtained from the CodeBERT model vocabulary is 15,838, less than one-third of the original size.
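The filtering rule for Java can be sketched as follows. This is an approximation under stated assumptions, not the paper's implementation: the regex accepts only ASCII letters (Java also allows Unicode letters), each token is tested against the full identifier rule, and the `Ġ` word-start marker used by byte-level BPE tokenizers is assumed and stripped before matching. The function name and the toy vocabulary are hypothetical.

```python
import re

# ASCII approximation of the Java rule: first char is a letter, '$' or '_';
# later chars may also be digits.
JAVA_IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")

def build_trigger_vocab(model_vocab, word_start_marker="Ġ"):
    """Keep only tokens that could form (part of) a Java identifier."""
    kept = []
    for token in model_vocab:
        # Assumption: a leading marker encodes a preceding space, not content.
        if token.startswith(word_start_marker):
            text = token[len(word_start_marker):]
        else:
            text = token
        if JAVA_IDENTIFIER.fullmatch(text):
            kept.append(token)
    return kept

# Hypothetical toy vocabulary standing in for the 50,265-token model vocabulary.
toy_vocab = ["Ġtesto", "_init", "Ġ(", "123", "ĠgetValue", "+=", "$tmp", "<s>"]
trigger_vocab = build_trigger_vocab(toy_vocab)
print(trigger_vocab)
```

Punctuation, operators, pure numbers, and special tokens drop out, while name-like subwords survive. A stricter or looser rule (for example, allowing digit-initial subwords in non-first positions) would change the resulting vocabulary size.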
4.3. Sample-specific Trigger Position Identification
In trigger inversion-based backdoor defense techniques (Liu et al., 2019; Shen et al., 2022; Liu et al., 2022), it is common practice to transform the trigger search into an optimization problem to automate the search for the optimal inverted trigger. This optimization process requires simulating the trigger injection process, that is, injecting a randomly initialized trigger into the samples and then iteratively updating the trigger through model backpropagation. An important aspect to consider in this process is the injection position of the trigger, as it significantly affects the optimization efficiency. The model’s sensitivity to changes at different positions varies across different samples. Specifically, the trigger optimization attempts to minimize the loss in Equation 3. This aligns with the objective of adversarial sample generation, which focuses on generating small perturbations in the input sample via optimization, leading to misclassification by clean models (Wallace et al., 2019; Yefet et al., 2020). Therefore, the trigger optimization is susceptible to the influence of adversarial perturbations. In other words, from the perspective of the attack target, backdoor attacks are similar to adversarial attacks in that both involve injecting certain patterns (triggers/perturbations) at specific positions in the sample to cause the model’s predictions to change (i.e., incorrect predictions). However, some positions can easily produce effective adversarial perturbations, yet these perturbations may not function as effective backdoor triggers.
To reduce the interference of adversarial perturbations, we inject the trigger to be optimized in positions where the model is less sensitive. This is based on a key insight that backdoor attacks are more “robust” than adversarial perturbations. Figure 3 intuitively illustrates our insight, where the x-axis shows the injection position of the code pattern (i.e., backdoor trigger/adversarial perturbation) in a given code snippet, and the left y-axis presents the probability that the backdoored model predicts the trigger/perturbation-injected code snippet as the target label. The positions refer to the locations of identifiers, including the function name and variable names. We utilize the GCG algorithm (Zou et al., 2023) to generate an adversarial perturbation (“evalCodeoOpenraught”) at the first position for the code snippet. Then, we inject this perturbation into different identifier positions of the code snippet and test the model’s predictions, plotting the results as the blue line in Figure 3. Likewise, we inject the factual trigger (“testo_init”) into different identifier positions of the code snippet and test the model’s predictions, plotting the results as the red line in Figure 3. Each point shows the impact of placing the code pattern at a given identifier position on the prediction of the backdoored defect detection model. Both adversarial perturbations and backdoor attacks target the label “non-defective”, meaning that if “Probability” is greater than 0.5, the attack is successful. This figure shows that only when the perturbation is injected at certain positions (e.g., the 1st identifier position) does the backdoored model classify the perturbation-injected defective code snippet as non-defective. In contrast, the backdoored model classifies the defective trigger-injected code snippet as non-defective, regardless of where the trigger is injected. This means that backdoor attacks are more robust than adversarial perturbations.
In other words, the backdoored model is very sensitive to the trigger, regardless of its injection position. Therefore, intuitively, we can inject randomly initialized triggers at any identifier position for optimization. However, although the backdoored model is generally not sensitive to adversarial perturbations, injecting the randomly initialized trigger at certain positions is more likely to yield effective adversarial perturbations rather than effective backdoor triggers. Therefore, if we can identify which positions are more likely to produce adversarial perturbations, we can inject the randomly initialized trigger into positions other than these to exclude the interference of adversarial perturbations, thereby improving trigger optimization efficiency.
To this end, we investigate the sensitivity of the backdoored model to changes at each identifier position in the code. Specifically, we analyze the model’s sensitivity to each identifier position by masking each position in the code snippet and then calculating the loss value for predicting the masked code snippet as the ground-truth label. In Figure 3, the black dashed line represents the loss value of the backdoored model predicting the original code snippet as the ground-truth label. We also plot the loss values of the backdoored model predicting each masked code snippet as the ground-truth label as the orange line in Figure 3. The larger the change in loss value (the farther the orange triangle is from the black dashed line), the more sensitive the model is to the variation at that identifier position. From Figure 3, it can be observed that sensitive identifier positions, such as the 1st, 2nd, and 8th identifier positions, are likely to produce effective adversarial perturbations. Compared to adversarial perturbations, the generation of the effective backdoor trigger is less correlated with the sensitivity of each position. Therefore, we can inject the randomly initialized trigger in insensitive identifier positions for optimization to reduce the probability of generating effective adversarial perturbations instead of effective backdoor triggers during the optimization process, thus improving trigger optimization efficiency. Figure 4 shows the distribution of the number of identifiers in code snippets of all clean samples. It is observed that the number of identifiers in different code snippets varies, with most code snippets containing only a few identifiers. We experiment with optimizing the randomly initialized trigger injected into the top-ranked less sensitive positions covering the majority of code snippets.
The backdoored defect detection model involved in the experiment is built on CodeBERT, and the experimental results are shown in Figure 5. Observe that injecting randomly initialized triggers at the least sensitive positions of each code snippet requires only 25 epochs to optimize an effective trigger, while injecting them at more sensitive positions requires more epochs. Some positions, such as the 4th least sensitive position from the end, do not even yield an effective trigger after 100 epochs of searching.
Based on the above observations, we design a sample-specific method for identifying trigger (injection) positions. As shown in Figure 2(b), given a set of clean samples, EliBadCode iteratively identifies a specific trigger injection position for each sample, following the steps marked in the figure. Specifically, given a sample $(x, y)$, where $x$ is a code snippet and $y$ is the ground-truth (GT) label, EliBadCode feeds $x$ to the backdoored NCM, which outputs the predicted loss values for different labels. Combining these with the GT label of $x$, EliBadCode can obtain the predicted loss value for $y$, denoted as $\ell$. Then, EliBadCode produces a set of masked samples $\{(x^m_j, y)\}_{j=1}^{J}$ by masking each identifier position of $x$. Here, $x^m_j = M(x, j)$, $1 \le j \le J$, where $J$ is the number of identifier positions in $x$ and $M$ denotes the masking operation that replaces the $j$-th identifier of $x$ with the special token “<unk>”; only one position of each masked sample is replaced. Like the clean sample, the masked code snippet $x^m_j$ of each masked sample is fed to the backdoored NCM to obtain the corresponding prediction loss value for $y$, denoted as $\ell^m_j$. After that, EliBadCode calculates the difference between $\ell$ and each $\ell^m_j$, i.e., $\Delta_j = |\ell - \ell^m_j|$. Smaller values indicate that the backdoored NCM is less sensitive to changes at that position. For each clean sample, we select the masked sample with the smallest $\Delta_j$, because a trigger inverted at the masked position of this sample is the least exposed to interference from adversarial perturbations. All selected masked samples are used in the subsequent trigger inversion phase.
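The selection procedure described above can be sketched in a few lines. Everything here is a simplified stand-in: a code snippet is represented only by its list of identifiers, `loss_fn` is a hypothetical callable playing the role of the backdoored NCM's loss for the ground-truth label, and the absolute loss change is assumed as the sensitivity measure.

```python
def least_sensitive_position(identifiers, label, loss_fn, mask_token="<unk>"):
    """Return (position, masked_identifiers) with the smallest loss change."""
    if not identifiers:
        raise ValueError("the sample has no identifier position to mask")
    base_loss = loss_fn(identifiers, label)
    best_pos, best_delta = 0, float("inf")
    for j in range(len(identifiers)):
        # Mask exactly one identifier position per masked sample.
        masked = identifiers[:j] + [mask_token] + identifiers[j + 1:]
        delta = abs(loss_fn(masked, label) - base_loss)
        if delta < best_delta:
            best_pos, best_delta = j, delta
    chosen = identifiers[:best_pos] + [mask_token] + identifiers[best_pos + 1:]
    return best_pos, chosen

# Hypothetical toy loss: each identifier contributes a fixed amount of evidence.
def toy_loss(identifiers, label):
    importance = {"main": 0.6, "buf": 0.3, "i": 0.02, "len": 0.2}
    return 2.0 - sum(importance.get(name, 0.0) for name in identifiers)

position, masked_sample = least_sensitive_position(
    ["main", "buf", "i", "len"], label=1, loss_fn=toy_loss
)
print(position, masked_sample)
```

In this toy setting the loop counter `i` barely influences the loss, so its position is selected as the injection site. With a real model, `loss_fn` would run a forward pass on the re-tokenized masked snippet.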
4.4. GCG-based Trigger Inversion
Algorithm 1: GCG-based trigger inversion (inputs and output)

| Role | Item |
|---|---|
| Input | selected masked samples |
| Input | labels |
| Input | trigger vocabulary |
| Input | backdoored NCM |
| Input | times of iterations |
| Input | number of candidate substitutes |
| Input | times of repeat |
| Input | threshold for trigger anchoring |
| Output | anchored trigger |
Algorithm 1 illustrates the GCG-based trigger inversion of EliBadCode in detail. In addition to the selected masked samples ($X^m$), the trigger vocabulary ($V$), and a backdoored NCM ($f_{\theta^*}$) as shown in Figure 2, EliBadCode takes as input the labels ($Y$) and some key settings, including the number of iterations ($e$), the number of candidate substitutes ($k$), the number of repeats ($r$), and the threshold for trigger anchoring ($\beta$). To eliminate backdoors in NCMs, EliBadCode first obtains a possible target label $y'$ from $Y$ and gets the masked code snippets for that label, denoted $X'$, from $X^m$, then invokes the TriggerInversion function (lines 38–40). In the TriggerInversion function, EliBadCode first randomly initializes a trigger ($t$) with $m$ tokens using $V$ (line 2), and then transforms $X'$ into vector representations (also called embeddings), denoted $E_{X'}$, using the embedding layer of $f_{\theta^*}$ (line 3). Based on $E_{X'}$, it iteratively optimizes $t$ for $e$ iterations (lines 4–20). During each iteration, EliBadCode first generates the one-hot representation of $t$, denoted as $o_t$ (line 5). Second, it produces the embeddings of $t$ using $o_t$, denoted as $E_t$ (line 6). Third, it injects $E_t$ into $E_{X'}$ to produce the embeddings of the trigger-injected masked code snippets, denoted as $E'$ (line 7). Fourth, it feeds $E'$ to $f_{\theta^*}$ to compute gradients for $o_t$, denoted as $G$ (line 8). Fifth, based on the top-$k$ negative gradients of each trigger token in $G$, it selects substitutes for all trigger tokens in $t$, denoted as $S$ (line 9). Based on $S$, it generates a set of candidate triggers by repeating $r$ times, each time randomly replacing one token in $t$ with a random substitute in $S$ (lines 10–18). Sixth, it injects each candidate trigger into $X'$, calculates the loss values of predicting the trigger-injected code snippets as $y'$, and selects the candidate trigger resulting in the smallest loss value as the inverted trigger (line 19). Finally, it calculates the loss value of the inverted trigger with respect to the possible target label and returns both (lines 21–22).
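The core loop of the procedure (gradient-ranked substitutes, random single-token replacements, exact loss selection) can be illustrated with a toy example. This is a sketch under strong assumptions and not Algorithm 1 itself: the backdoored NCM is replaced by a hypothetical scoring function with an analytic gradient, tokens are integer ids, and the current trigger is kept among the candidates so that the loss never increases (an added design choice). All names are illustrative.

```python
import math
import random

def gcg_invert(vocab_size, trigger_len, grad_fn, loss_fn, iters, k, r, seed=0):
    """Toy GCG loop: gradient-ranked substitutes, then exact loss selection."""
    rng = random.Random(seed)
    trigger = [rng.randrange(vocab_size) for _ in range(trigger_len)]
    for _ in range(iters):
        # grads[i][v] approximates d loss / d one-hot entry (position i, token v).
        grads = grad_fn(trigger)
        # Per position, the k tokens with the most negative gradient.
        substitutes = [sorted(range(vocab_size), key=g.__getitem__)[:k]
                       for g in grads]
        candidates = [list(trigger)]  # assumption: the incumbent stays a candidate
        for _ in range(r):
            pos = rng.randrange(trigger_len)
            cand = list(trigger)
            cand[pos] = rng.choice(substitutes[pos])
            candidates.append(cand)
        # Evaluate the true loss of each candidate and keep the best one.
        trigger = min(candidates, key=loss_fn)
    return trigger, loss_fn(trigger)

# Hypothetical stand-in for the model: every token id has a fixed score and the
# "target-label" logit is the sum of the scores of the trigger tokens.
SCORES = [math.sin(v) for v in range(50)]

def toy_loss(trigger):
    z = sum(SCORES[v] for v in trigger)
    return math.log1p(math.exp(-z))  # cross-entropy toward the target label

def toy_grad(trigger):
    z = sum(SCORES[v] for v in trigger)
    scale = -1.0 / (1.0 + math.exp(z))  # d loss / d z
    return [[scale * SCORES[v] for v in range(len(SCORES))] for _ in trigger]

inverted, final_loss = gcg_invert(len(SCORES), 3, toy_grad, toy_loss,
                                  iters=20, k=5, r=16)
print(inverted, round(final_loss, 4))
```

With a real NCM, `grad_fn` would backpropagate through the embedding layer to the one-hot trigger representation, and the substitutes would be drawn from the filtered trigger vocabulary rather than from all token ids.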
After iterating over all possible target labels and producing a set of loss values and associated inverted triggers, one for each label, EliBadCode runs the outlier detection method (Wang et al., 2019) to obtain the ground-truth target label $y^*$ and the associated inverted trigger $t$. Next, the target label and the associated inverted trigger are input into the TriggerAnchoring function to obtain the effective components of the inverted trigger, as detailed in the Trigger Anchoring paragraph.
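As an illustration of the outlier detection step, the sketch below applies a median-absolute-deviation (MAD) anomaly index, the statistic used by Wang et al. (2019), to per-label inversion losses. Applying it directly to loss values, the threshold of 2, and the toy numbers are assumptions made for this example; the function names are hypothetical.

```python
import statistics

def anomaly_indices(losses):
    """MAD-based anomaly index per label; `losses` maps label -> inversion loss."""
    values = list(losses.values())
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    scale = 1.4826 * mad  # consistency constant for normally distributed data
    if scale == 0:
        # Degenerate case: more than half of the values equal the median.
        return {lab: (0.0 if v == median else float("inf"))
                for lab, v in losses.items()}
    return {lab: abs(v - median) / scale for lab, v in losses.items()}

def detect_target_label(losses, threshold=2.0):
    """Return the label whose loss is an abnormally low outlier, else None."""
    median = statistics.median(losses.values())
    index = anomaly_indices(losses)
    flagged = [lab for lab in losses
               if losses[lab] < median and index[lab] > threshold]
    return min(flagged, key=losses.get) if flagged else None

# Hypothetical inversion losses for five candidate target labels.
toy_losses = {"label_0": 1.92, "label_1": 2.05, "label_2": 0.04,
              "label_3": 1.98, "label_4": 2.10}
print(detect_target_label(toy_losses))
```

Note that with only two labels both anomaly indices coincide, so this statistic alone cannot separate them; how that case is handled is not specified in the text above.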
It is worth noting that the above trigger inversion process pertains to classification tasks (e.g., defect detection and clone detection) in code understanding. For search tasks in SE (e.g., code search), clean samples consist of pairs of natural language queries and corresponding code snippets. Therefore, the inversion process for search tasks requires the additional inversion of an attack target (usually one word/token (Wan et al., 2022; Sun et al., 2023)) related to the query, while the trigger inversion process for the code is similar. Specifically, a target with a given number of tokens needs to be initialized, and lines 3–19 in Algorithm 1 are executed similarly, focusing on this target. Meanwhile, the loss value calculation involving the target label needs to be replaced by the loss value related to the query. For example, line 8 is updated to compute the gradients with respect to the one-hot representation of the target, using the embeddings of the target-injected queries.
Trigger Anchoring. Note that, unlike continuous image data, code written in programming languages is similar to natural language and is discrete. Existing research (Liu et al., 2022) in NLP has demonstrated that for discrete inputs, there is currently no simple method to differentiably determine the size/length of the injected trigger. Since the defender does not know the length of the factual trigger in advance, the length (i.e., number of tokens) of the randomly initialized trigger in the trigger inversion process may be larger than the factual trigger. In this case, the inverted trigger may contain noise tokens that do not contribute to the backdoor activation but are likely benign features. Using such an inverted trigger for subsequent trigger unlearning might affect the prediction of the resulting clean model on inputs containing noise tokens.
To address this issue, EliBadCode designs a trigger anchoring method that filters out noise tokens in the inverted trigger, retaining only the effective components. Specifically, as shown in lines 25–35 of Algorithm 1, EliBadCode iteratively removes one trigger token at a time, and the remaining tokens form the filtered trigger. The filtered trigger is then injected into the masked code snippets. Subsequently, it calculates the loss value of the backdoored model predicting the code snippets injected with the filtered trigger and with the original inverted trigger as the target label, respectively (lines 27 and 29). If the removal of a trigger token causes the loss value to change by more than a given threshold $\beta$, EliBadCode identifies it as an effective trigger component and adds it to the anchored trigger (lines 30–32). The threshold $\beta$ is an empirical value. To find a suitable value, we analyze the distribution of loss value changes caused by effective trigger tokens. Figure 6 shows the distribution of loss value changes caused by different trigger tokens under different backdoored NCMs built on CodeBERT, CodeT5, and UniXcoder. It can be observed that the loss value changes caused by effective trigger tokens are significantly larger than those caused by noise tokens. In this paper, we uniformly set $\beta$ to 0.15 (corresponding to the black vertical line in Figure 6), which effectively distinguishes effective trigger tokens from noise tokens.
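The leave-one-out test behind trigger anchoring can be sketched as follows. The function `anchor_trigger`, the toy loss, and the token strings are hypothetical; `loss_fn` stands in for the backdoored model's loss toward the target label on trigger-injected snippets, and the absolute loss change is assumed as the comparison quantity.

```python
def anchor_trigger(trigger_tokens, loss_fn, threshold=0.15):
    """Keep tokens whose removal changes the target-label loss by > threshold."""
    full_loss = loss_fn(trigger_tokens)
    anchored = []
    for i, token in enumerate(trigger_tokens):
        # Filtered trigger: the inverted trigger with one token removed.
        filtered = trigger_tokens[:i] + trigger_tokens[i + 1:]
        if abs(loss_fn(filtered) - full_loss) > threshold:
            anchored.append(token)
    return anchored

# Hypothetical toy loss: only "testo" and "init" drive the backdoor behavior,
# every other token has a negligible effect.
def toy_target_loss(tokens):
    effect = {"testo": 1.2, "init": 0.9}
    return 2.5 - sum(effect.get(tok, 0.01) for tok in tokens)

inverted_trigger = ["eval", "testo", "Open", "init", "raught"]
anchored_trigger = anchor_trigger(inverted_trigger, toy_target_loss)
print(anchored_trigger)
```

Tokens whose removal shifts the loss by only about 0.01 are treated as noise and dropped, while the two effective tokens clear the 0.15 threshold by a wide margin.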
4.5. Trigger Unlearning
Trigger unlearning primarily involves using the model unlearning approach (Wang et al., 2019; Shen et al., 2022) to disrupt the association or mapping between the trigger and the target behavior. In practice, the defender is unaware of the trigger the attacker sets. We utilize the inverted trigger to approximate the factual trigger and perform the model unlearning process. Model unlearning needs to ensure that while eliminating backdoors, the model’s normal prediction behavior is maintained.
To achieve effective and efficient model unlearning, as shown in Figure 2(d), we first inject the anchored trigger into code snippets of clean samples and assign the inverted label to these code snippets to construct the unlearning training dataset. Because injecting triggers into all clean samples might lead to overfitting and thus affect the model’s normal prediction behavior, the appropriate trigger injection rate (i.e., the proportion of clean samples that receive the trigger) has to be determined empirically. To find a suitable rate, we conduct multiple experiments, with the results shown in Figure 8. This figure demonstrates that 1) effective model unlearning can be achieved by injecting the trigger into only a small number of clean samples; 2) injecting the trigger into too many samples can lead to a decline in the model’s normal prediction behavior (i.e., ACC). For example, for the backdoored CodeBERT model, we can achieve effective backdoor elimination by injecting the anchored trigger into 20% of the clean samples (about 218 samples), as detailed in Section 5.3. Then, we conduct model unlearning by fine-tuning the backdoored NCM with the unlearning training dataset. Existing work (Kirkpatrick et al., 2017) finds that fine-tuning all parameters of the backdoored model with a small set of clean samples can lead to catastrophic forgetting (i.e., severely compromising the model’s clean accuracy). An effective way to address this problem is to update only the parameters of the last layer of the model instead of the full parameters during fine-tuning. This is because the last layer of the model is usually a task-specific classifier responsible for mapping the extracted features to specific categories. We also experimentally validate this approach in our scenario, and the results are shown in Figure 8.
In this figure, the two fine-tuning settings correspond to fine-tuning the full parameters of the backdoored defect detection model and fine-tuning only its last-layer parameters when executing trigger unlearning. Observe that, compared to fine-tuning the full parameters, fine-tuning only the last layer eliminates the backdoor without compromising the model’s prediction accuracy.
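The construction of the unlearning training dataset can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the helper names, the sample layout `(function name, code, label)`, and injecting the trigger by replacing the function name are hypothetical, and the trigger-injected samples are assumed to keep their ground-truth labels, as described in the Defense Setting.

```python
import random
import re
from typing import List, Tuple

Sample = Tuple[str, str, int]  # (function name, code, ground-truth label)

def inject_trigger(func_name: str, code: str, trigger: str) -> str:
    """Hypothetical injection: replace the function name with the trigger."""
    return re.sub(rf"\b{re.escape(func_name)}\b", lambda m: trigger, code)

def build_unlearning_set(
    clean: List[Sample], trigger: str, injection_rate: float = 0.2, seed: int = 0
) -> List[Tuple[str, int]]:
    """Inject the trigger into a fraction of clean samples; labels stay unchanged."""
    rng = random.Random(seed)
    n_inject = round(len(clean) * injection_rate)
    chosen = set(rng.sample(range(len(clean)), n_inject))
    dataset = []
    for idx, (name, code, label) in enumerate(clean):
        if idx in chosen:
            code = inject_trigger(name, code, trigger)
        dataset.append((code, label))
    return dataset

# Ten toy clean samples; with a 20% injection rate, two receive the trigger.
clean = [(f"f{i}", f"int f{i}(int x) {{ return x + {i}; }}", i % 2) for i in range(10)]
unlearning_set = build_unlearning_set(clean, "testo_init")
print(sum("testo_init" in code for code, _ in unlearning_set))
```

Fine-tuning on such a set would then update only the last-layer parameters, in line with the observation above about catastrophic forgetting.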
Based on the above, the trigger unlearning is conducted by minimizing the loss:
(5) |
where and denote the anchored trigger and inverted target label respectively, and represents the parameters of the last layer of the backdoored NCM model.
5. Evaluation
We conduct a series of experiments to answer the following research questions (RQs), which will demonstrate the effectiveness of EliBadCode.
- RQ1. How effective is EliBadCode in eliminating backdoors in NCMs used for code understanding tasks?
- RQ2. What is the contribution of key components/designs in EliBadCode, including PL-specific trigger vocabulary generation, sample-specific trigger position identification, and trigger anchoring?
- RQ3. What is the influence of important settings (e.g., top-k and repeat size) on EliBadCode?
- RQ4. What is the performance of EliBadCode against adaptive attacks?
5.1. Experiment Setup
Datasets and Models. The evaluation is conducted on the widely used benchmark CodeXGLUE (Lu et al., 2021). Specifically, we utilize BigCloneBench (Svajlenko et al., 2014), Devign (Zhou et al., 2019), and CSN-Python (Husain et al., 2019) to evaluate EliBadCode on three types of code understanding tasks: clone detection, defect detection, and code search, respectively. Three different model architectures are adopted for the evaluation: CodeBERT (Feng et al., 2020), CodeT5 (Wang et al., 2021), and UniXcoder (Guo et al., 2022).
Attack Setting. We leverage two advanced backdoor attacks, CodePoisoner (Li et al., 2024b) and BadCode (Sun et al., 2023), to generate backdoored NCMs built on the three model architectures for the three code understanding tasks. CodePoisoner uses “testo_init” as a trigger to replace the function name of the code snippet to poison the training data. BadCode utilizes “rb” as a trigger and appends it to the function name/variable name of the code snippet to produce the poisoned training data. For the defect detection task and clone detection task, we select non-defective and non-clone as the target labels, respectively. For the code search task, we follow BadCode and choose “file” as the target word, implanting the trigger into the code snippets matched by queries containing the target word. We follow Li et al. (Li et al., 2024b) and poison 2% of the training data for different code understanding tasks. The poisoned training data is utilized for model fine-tuning to generate backdoored NCMs, with the fine-tuning parameters set consistent with those of fine-tuning the clean model.
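For concreteness, the two renaming-based poisoning strategies described above can be sketched as simple identifier rewrites. This is only an illustration of the setting being defended against: the helper names are hypothetical, and joining the BadCode trigger to the identifier with an underscore is an assumption, since the exact concatenation is not specified here.

```python
import re

def rename_identifier(code: str, old_name: str, new_name: str) -> str:
    """Rename every whole-word occurrence of `old_name` in `code`."""
    return re.sub(rf"\b{re.escape(old_name)}\b", lambda m: new_name, code)

def poison_codepoisoner(code: str, func_name: str, trigger: str = "testo_init") -> str:
    """CodePoisoner-style poisoning: the trigger replaces the function name."""
    return rename_identifier(code, func_name, trigger)

def poison_badcode(code: str, identifier: str, trigger: str = "rb") -> str:
    """BadCode-style poisoning: the trigger is appended to an identifier
    (underscore separator assumed for illustration)."""
    return rename_identifier(code, identifier, f"{identifier}_{trigger}")

snippet = "int add(int a, int b) { return a + b; }"
poisoned_cp = poison_codepoisoner(snippet, "add")
poisoned_bc = poison_badcode(snippet, "add")
print(poisoned_cp)
print(poisoned_bc)
```

In the actual attacks, each poisoned snippet is additionally paired with the target label (or, for code search, with queries containing the target word) before fine-tuning.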
Defense Setting. For trigger inversion (including phases (b) and (c) in Figure 2), we use 30 samples per class in the defect detection and clone detection tasks, and 30 samples in the code search task (details on the effectiveness of different numbers of clean samples can be found in Section 5.3). Considering that attackers prioritize the stealthiness of the backdoor, they typically do not set a long trigger for renaming backdoor attacks. Therefore, the length of the initial trigger (in tokens) is set to 5, which covers over 90% of identifier lengths. Both the repeat size and the number of candidate substitutes (top-k) are set to 64. In trigger unlearning, we fine-tune the backdoored models to unlearn the backdoors. We use all clean samples (i.e., 10% of the training data), select 20% of them to inject the inverted trigger, and mark them with the correct labels (details on the effectiveness of different trigger injection rates can be found in Section 4.5). The effectiveness before and after unlearning is evaluated on the whole test set of each dataset.
Baseline. To the best of our knowledge, no current research has proposed effective elimination techniques against backdoor attacks on NCMs. Therefore, we transfer an advanced defense technique named DBS (Shen et al., 2022) from the NLP field as a baseline. DBS defines a convex hull to address the non-differentiability issue of language models, and features temperature scaling and backtracking to step away from local optima. We adapt DBS to code inputs as much as possible. In particular, we apply PL-specific Trigger Vocabulary Generation (i.e., phase (a) in Figure 2) to DBS. However, the effectiveness of DBS is not satisfactory. Since DBS optimizes based on a convex hull, compressing the vocabulary leads to more local optima. Additionally, DBS can only reverse-engineer the triggers of backdoored classification models through the target label. Therefore, we do not evaluate DBS on the code search task. In our experiments, DBS is run with its original parameters.
Parameters Setting. For different tasks, we fine-tune CodeBERT, CodeT5, and UniXcoder according to the settings provided in CodeXGLUE (Lu et al., 2021). Specifically, for the defect detection task, the number of epochs is set to 5 and the learning rate is set to 2e-5. For the clone detection and code search tasks, the number of epochs and the learning rate are set to 2 and 5e-5, respectively. All the models are trained using the Adam optimizer (Kingma and Ba, 2015). All of our experiments are implemented in PyTorch 1.13.1 and Transformers 4.38.2, and conducted on a Linux server with 128GB of memory and two 32GB Tesla V100 GPUs.
5.2. Evaluation Metrics
We leverage two kinds of metrics in the evaluation, including attack/defense metrics and task-specific accuracy metrics.
Attack/Defense Metrics. For defect detection and clone detection, we follow (Li et al., 2024b) and utilize attack success rate (ASR) to evaluate the effectiveness of attack/defense techniques. ASR represents the proportion of the backdoored model successfully predicting inputs with triggers as the target label and is computed as:
$$\mathrm{ASR} = \frac{N_{\text{flipped}}}{N_{\text{non-target}}} \qquad (6)$$
where $N_{\text{non-target}}$ and $N_{\text{flipped}}$ represent the number of non-target label samples and the number of samples predicted as the target label after adding the trigger to non-target label samples, respectively. In our experiments, we follow Li et al. (Li et al., 2024b) to pre-define “non-defective” and “non-clone” as the target labels for defect detection tasks and clone detection tasks, respectively. After defense, the lower the ASR value, the better.
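As a small worked illustration of the ASR definition, the following sketch computes it from labels and predictions and reports it as a percentage. The function name and the toy data are hypothetical.

```python
from typing import Sequence

def attack_success_rate(
    true_labels: Sequence[int],
    preds_with_trigger: Sequence[int],
    target_label: int,
) -> float:
    """ASR in percent: share of non-target samples predicted as the target
    label once the trigger has been injected."""
    # Predictions for samples whose ground-truth label is not the target label.
    non_target_preds = [
        pred for label, pred in zip(true_labels, preds_with_trigger)
        if label != target_label
    ]
    if not non_target_preds:
        raise ValueError("ASR is undefined without non-target label samples")
    flipped = sum(1 for pred in non_target_preds if pred == target_label)
    return 100.0 * flipped / len(non_target_preds)

# Toy defect detection example: label 1 = defective, target label 0 = non-defective.
labels = [1, 1, 1, 1, 0]
preds = [0, 0, 1, 0, 0]  # predictions on the trigger-injected inputs
asr = attack_success_rate(labels, preds, target_label=0)
print(asr)
```

Here three of the four defective samples are flipped to the target label, so the sample that already carries the target label does not enter the ratio.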
For code search, we follow (Sun et al., 2023; Wan et al., 2022) and utilize average normalized rank (ANR) as the attack/defense metric. ANR is computed as:
$$\mathrm{ANR} = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{\mathrm{Rank}(Q_i, s')}{|S|} \qquad (7)$$
where $|Q|$ denotes the size of the query set, $s'$ represents the trigger-injected code snippet, and $|S|$ is the length of the complete sorted list. ANR is reported as a percentage. In our experiment, we follow Sun et al. (Sun et al., 2023) to attack the code snippets initially ranked in the top 50% of the returned list. After defense, the higher the ANR value, the better.
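The ANR metric described above can be sketched in a few lines, assuming it is the mean, over queries, of the trigger-injected snippet's rank divided by the length of the ranked list, expressed as a percentage. The function name and the toy ranks are hypothetical.

```python
from typing import Sequence

def average_normalized_rank(ranks: Sequence[int], list_lengths: Sequence[int]) -> float:
    """ANR in percent: mean of rank / list length over all queries."""
    if not ranks or len(ranks) != len(list_lengths):
        raise ValueError("need one rank and one list length per query")
    normalized = [rank / length for rank, length in zip(ranks, list_lengths)]
    return 100.0 * sum(normalized) / len(normalized)

# Toy example: three queries, each returning a ranked list of 100 snippets,
# with the trigger-injected snippet at ranks 10, 5, and 15.
anr = average_normalized_rank([10, 5, 15], [100, 100, 100])
print(round(anr, 2))
```

A lower value means the trigger-injected snippet is ranked closer to the top, which is why a higher ANR after defense is better.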
Task-specific Accuracy Metrics. Task-specific accuracy metrics are related to specific tasks and are used to evaluate the (backdoored/clean) model’s performance on clean data. For defect detection, clone detection, and code search, we follow (Feng et al., 2020; Lu et al., 2021) and use accuracy (ACC), F1-score (F1), and mean reciprocal rank (MRR), respectively, to evaluate the prediction accuracy of the model.
5.3. Evaluation Results
RQ1: Effectiveness of EliBadCode in eliminating backdoors.
Attack | Task | Metric | CodeBERT Undefended | CodeBERT DBS | CodeBERT EliBadCode | CodeT5 Undefended | CodeT5 DBS | CodeT5 EliBadCode | UniXcoder Undefended | UniXcoder DBS | UniXcoder EliBadCode | Average Undefended | Average DBS | Average EliBadCode |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CodePoisoner | Defect detection | ACC | 63.07% | 62.99% | 62.57% | 64.06% | 63.05% | 63.25% | 65.30% | 64.17% | 64.39% | 64.14% | 63.40% | 63.40% |
CodePoisoner | Defect detection | ASR | 100% | 100% | 0.24% | 98.64% | 98.13% | 0.15% | 98.48% | 98.33% | 2.39% | 99.04% | 97.15% | 0.93% |
CodePoisoner | Clone detection | F1 | 93.37% | 96.41% | 96.53% | 94.58% | 96.24% | 96.21% | 94.51% | 97.11% | 97.37% | 94.15% | 96.59% | 96.70% |
CodePoisoner | Clone detection | ASR | 100% | 100% | 5.86% | 100% | 100% | 3.16% | 100% | 100% | 5.17% | 100% | 100% | 4.73% |
CodePoisoner | Code search | MRR | 0.81 | –∗ | 0.81 | 0.81 | –∗ | 0.81 | 0.82 | –∗ | 0.82 | 0.81 | –∗ | 0.81 |
CodePoisoner | Code search | ANR | 10.04 | –∗ | 25.12 | 9.50 | –∗ | 24.76 | 9.11 | –∗ | 24.98 | 9.55 | –∗ | 24.95 |
BadCode | Defect detection | ACC | 62.88% | 61.75% | 61.86% | 63.72% | 62.85% | 62.91% | 64.71% | 64.06% | 62.98% | 63.77% | 62.89% | 62.89% |
BadCode | Defect detection | ASR | 99.52% | 32.59% | 1.95% | 99.92% | 60.80% | 3.27% | 99.84% | 45.74% | 2.71% | 99.76% | 46.38% | 2.64% |
BadCode | Clone detection | F1 | 93.46% | 96.62% | 96.69% | 93.97% | 96.12% | 96.03% | 94.68% | 97.06% | 97.02% | 94.04% | 96.60% | 96.58% |
BadCode | Clone detection | ASR | 100% | 53.20% | 8.13% | 100% | 87.73% | 5.19% | 100% | 50.10% | 5.00% | 100% | 63.68% | 6.11% |
BadCode | Code search | MRR | 0.81 | –∗ | 0.80 | 0.81 | –∗ | 0.81 | 0.82 | –∗ | 0.81 | 0.81 | –∗ | 0.81 |
BadCode | Code search | ANR | 10.56 | –∗ | 25.69 | 10.25 | –∗ | 24.77 | 9.17 | –∗ | 25.09 | 9.99 | –∗ | 25.18 |
∗ DBS needs to iterate over all possible target labels to invert the trigger and eliminate the backdoor. However, for code search, the label can be considered as the target word, which has many possible combinations (different combinations of vocabulary tokens). Therefore, DBS does not work on code search tasks.
Table 1 shows the performance of the baseline DBS and our EliBadCode in eliminating backdoors in nine NCMs (3 model architectures × 3 tasks) used for the three code understanding tasks, i.e., defect detection, clone detection, and code search. Columns titled “Undefended” display the performance of the nine backdoored NCMs without any defense. Columns titled “Average” present the average values of the scores obtained on CodeBERT, CodeT5, and UniXcoder. From this table, it is observed that, for the attack CodePoisoner, DBS has almost no effect in removing backdoors from the NCMs to which it can be applied. As described in Section 5.1, DBS cannot be applied to code search tasks. For defect detection tasks, EliBadCode reduces the average ASR from 99.04% to 0.93%, with almost no impact on the model’s normal predictive accuracy (64.14% vs. 63.40%). For clone detection tasks, EliBadCode decreases the average ASR from 100% to 4.73%, while even improving the average F1 from 94.15% to 96.70%. For code search tasks, EliBadCode increases the average ANR from 9.55 to 24.95, while maintaining the same average MRR. The backdoor attacks in NCMs for code search tasks aim to improve the ranking of the code snippet with the trigger given a query containing the target word. It is important to note that an ANR of 9.55 indicates that the backdoored model can elevate a (potentially malicious) code snippet injected with a trigger from its original rank at the 50% position to the 9.55% position. Assuming there are 100 candidate code snippets, 9.55% means that the trigger-injected code snippet would be ranked in the 10th position. In existing code search techniques (Sun et al., 2023), it is common practice to return the top 10 retrieved code snippets. Therefore, code snippets ranked in the top 10 are likely to be adopted by developers. Once the malicious trigger-injected code snippet is adopted and integrated into their projects, it can pose serious security risks. Although EliBadCode does not restore the rank from the 9.55% position to the original 50% position, it significantly reduces the risk of developers adopting the malicious trigger-injected code snippet.
For the attack BadCode, DBS has a certain backdoor removal effect, but the purified NCMs still exhibit a relatively high ASR. For example, for the defect detection task, the average ASR of the model purified by DBS still reaches 46.38%. This suggests that single-token triggers are easier to invert than multi-token triggers, but they are more challenging to unlearn completely from the model. Compared to DBS, EliBadCode significantly decreases the average ASR for defect detection and clone detection tasks to 2.64% and 6.11%, respectively. In addition, EliBadCode demonstrates strong backdoor elimination performance in code search tasks. Both DBS and EliBadCode maintain the model’s normal predictive performance (i.e., ACC) well while removing the backdoor. This means that removing backdoors through model unlearning can preserve the performance of models on clean data.
For both attacks, it can also be observed that in clone detection tasks, the F1 scores of NCMs after removing backdoors are even higher than those of the backdoored NCMs without any defense. This is because fine-tuning NCMs with 10% of the trigger-injected training dataset does not negatively impact NCMs’ normal prediction performance; rather, the increased training data and training process enhance the NCMs’ effectiveness.
RQ2: Contribution of key designs in EliBadCode.
Method | ACC | ASR | Epoch | LD | BLEU |
---|---|---|---|---|---|
w/o phase (a) | 63.73% | 100% | - | 9 | 12.36 |
w/o phase (b) | 62.57% | 0.24% | 52 | 6 | 29.56 |
w/o Trigger Anchoring | 60.98% | 0.08% | 25 | 15 | 26.33 |
EliBadCode | 62.57% | 0.24% | 25 | 6 | 29.56 |
To demonstrate the effectiveness of EliBadCode’s key design choices, we investigate the contributions of its various components. Table 2 presents the performance of EliBadCode on CodeBERT under the CodePoisoner attack with different designs. Rows 2–4 represent the performance of EliBadCode without phase (a): PL-specific trigger vocabulary generation, phase (b): sample-specific trigger position identification, and trigger anchoring in phase (c), respectively. Observe that without phase (a), the optimization search space expands, so EliBadCode cannot invert a trigger close to the factual trigger. Therefore, the ASR after model unlearning remains at 100%, and the backdoor attack is not defended against. Removing phase (b) does not affect the ASR or ACC of EliBadCode. As mentioned in Section 4.3, the purpose of phase (b) is to reduce the impact of adversarial perturbations and improve the efficiency of trigger inversion. Figure 10 shows the change of loss during the trigger optimization process for the defect detection task without phase (a) and without phase (b). Observe that EliBadCode can invert a trigger close to the factual one by the 25th epoch, whereas without phase (b), it can only do so by the 52nd epoch. That is, EliBadCode with phase (b) is about twice as efficient as without it. Without trigger anchoring, although the ASR of EliBadCode decreases, there is also a drop in ACC. As mentioned in Section 4.4, the purpose of trigger anchoring is to prevent noise tokens from affecting the prediction of the unlearned model on inputs containing them. Figure 10 shows the ACC on all test samples and on the test samples containing noise tokens for the undefended model, w/o trigger anchoring, and EliBadCode, respectively. Observe that EliBadCode achieves ACC close to the undefended results on both sets of test samples. In contrast, w/o trigger anchoring achieves ACC of 60.98% and 54.92% on the two sets of test samples, respectively, which are significantly lower than the undefended results. This indicates that the trigger anchoring in EliBadCode can effectively reduce the interference of noise tokens.
To investigate the accuracy of trigger inversion, we also utilize two metrics to evaluate the difference between the inverted trigger and the factual trigger: Levenshtein Distance (LD) and BLEU (Papineni et al., 2002). LD represents the minimum number of edit operations required to transform one string into another. BLEU, a variant of the precision metric, calculates similarity by computing the n-gram precision of the inverted trigger compared to the factual trigger. A lower LD and a higher BLEU indicate a higher precision of the inverted trigger. In this evaluation, the factual trigger is “testo_init”, and different methods derive the inverted trigger on CodeBERT for the defect detection (DD) and clone detection (CD) tasks. Table 3 shows the performance of the triggers inverted by the baseline and EliBadCode. Observe that DBS has a very high LD (19/14) and very low BLEU (9.27/8.20). This indicates that the trigger inverted by DBS is significantly different from the factual trigger. EliBadCode achieves high precision in the inverted trigger, with an LD of only 6/5 and BLEU reaching 29.56/36.79, surpassing DBS by a significant margin. Additionally, in terms of LD and BLEU, EliBadCode outperforms w/o Trigger Anchoring (16/14, 19.62/23.30). This indicates that the inverted trigger with anchoring is closer to the factual trigger.
Task | DBS LD | DBS BLEU | w/o TG LD | w/o TG BLEU | EliBadCode LD | EliBadCode BLEU |
---|---|---|---|---|---|---|
DD | 19 | 9.27 | 16 | 19.62 | 6 | 29.56 |
CD | 14 | 8.20 | 14 | 23.30 | 5 | 36.79 |
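For reference, the Levenshtein Distance used to compare an inverted trigger with the factual trigger can be computed with the standard dynamic-programming recurrence. The sketch below assumes character-level comparison with unit costs for insertion, deletion, and substitution; whether the evaluation operates on characters or tokens is not stated here, and the example strings are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `a` into `b`."""
    # prev[j] holds the distance between the processed prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + substitution_cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]

factual = "testo_init"
print(levenshtein("test_init", factual))   # one missing character
print(levenshtein("testo_init", factual))  # identical strings
```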
RQ3: Influence of important settings, e.g., top-k and repeat size.
We study the influence of important settings on EliBadCode, including the number of clean samples in the trigger inversion, top-k and repeat size. It can be observed that the more clean samples there are, the better the performance of EliBadCode. In other words, the more clean samples there are, the more precise the trigger inverted by EliBadCode is, and the fewer epochs are needed. When the number is less than 20, EliBadCode cannot invert the correct trigger and thus cannot eliminate the backdoor in the model. When the number is 30, EliBadCode achieves the best results. Unfortunately, our experimental server can only support the input of up to 30 clean samples.
Top-k and repeat size are key parameters for EliBadCode in generating candidates during the GCG-based trigger inversion. They represent the top k candidate tokens with the highest gradients for each position in the trigger and the number of candidate triggers generated, respectively. Figure 13 and Figure 13 illustrate the impact of top-k and repeat size on the effectiveness of EliBadCode, respectively. It is worth noting that we fixed the top-k or repeat size at 64 to explore the effects of varying the other parameter on the effectiveness of EliBadCode. A smaller top-k will result in the factual trigger token not appearing among the candidate replacement tokens, while a larger top-k will reduce the probability of selecting the factual trigger token for replacement. A smaller repeat size will also reduce the probability of selecting the factual trigger token for replacement, while a larger repeat size will increase the time consumption of the trigger inversion. It can be observed that when both top-k and repeat size are 64, EliBadCode achieves the best performance with minimal time consumption.
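The roles of top-k and repeat size can be illustrated with a simplified candidate-generation step in the style of GCG. This is a sketch under assumptions, not the paper's implementation: `grad_scores` is a hypothetical per-position mapping from vocabulary token to a score where higher means a more promising replacement, positions are sampled uniformly, and `toy_loss` stands in for the backdoored model's loss.

```python
import random
from typing import Callable, Dict, List, Sequence

def gcg_candidates(
    trigger: Sequence[str],
    grad_scores: Sequence[Dict[str, float]],
    top_k: int = 64,
    repeat_size: int = 64,
    seed: int = 0,
) -> List[List[str]]:
    """Generate `repeat_size` candidate triggers, each replacing one position
    with a token drawn from that position's `top_k` highest-scoring tokens."""
    rng = random.Random(seed)
    top_tokens = [
        sorted(scores, key=scores.get, reverse=True)[:top_k] for scores in grad_scores
    ]
    candidates = []
    for _ in range(repeat_size):
        pos = rng.randrange(len(trigger))
        candidate = list(trigger)
        candidate[pos] = rng.choice(top_tokens[pos])
        candidates.append(candidate)
    return candidates

def gcg_step(trigger, grad_scores, loss_fn: Callable[[Sequence[str]], float], **kwargs):
    """Keep the lowest-loss option among the candidates and the current trigger."""
    candidates = gcg_candidates(trigger, grad_scores, **kwargs)
    return min(candidates + [list(trigger)], key=loss_fn)

# Toy setup: the loss counts positions that differ from a hidden factual trigger.
FACTUAL = ["test", "init"]
def toy_loss(trigger: Sequence[str]) -> int:
    return sum(a != b for a, b in zip(trigger, FACTUAL))

scores = [{"test": 0.9, "foo": 0.1, "bar": 0.5}, {"init": 0.8, "foo": 0.2, "bar": 0.3}]
current = ["x", "y"]
cands = gcg_candidates(current, scores, top_k=2, repeat_size=8)
best = gcg_step(current, scores, toy_loss, top_k=2, repeat_size=8)
print(best, toy_loss(best))
```

The sketch makes the trade-off visible: a small top-k can exclude the factual token from `top_tokens`, a large top-k dilutes the chance of sampling it, and a larger repeat size evaluates more candidates per step at a higher cost.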
RQ4: Performance of EliBadCode against adaptive attacks.
Trigger size | Defect detection ACC | Defect detection ASR | Clone detection F1 | Clone detection ASR | Code search MRR | Code search ANR |
---|---|---|---|---|---|---|
5 | 62.74% | 0.24% | 96.14% | 6.34% | 0.82 | 25.52 |
7 | 62.97% | 2.07% | 96.34% | 7.57% | 0.82 | 24.35 |
10 | 62.91% | 4.84% | 96.26% | 9.16% | 0.82 | 22.47 |
We study a scenario where the attacker understands the EliBadCode mechanism and attempts to bypass it. We design an adaptive attack targeting the GCG-based trigger inversion phase of EliBadCode. The idea is to make the injected trigger length (number of tokens) greater than the initialized trigger length (number of tokens) set by EliBadCode. Specifically, there is currently no simple differentiable method to determine the length of the injected trigger during trigger inversion. Therefore, we set the initialized trigger length to 5, which can cover more than 90% of identifier lengths (as described in Section 5.1). We inject triggers of lengths 5, 7, and 10 into the training data to obtain the backdoored models, with the triggers being “testo_initRet”, “testo_init_retVal”, and “testo_init_retVal_getFrame”, respectively. Other parameter settings are the same as in the RQ1 settings. According to Table 4, EliBadCode remains effective against renaming backdoor attacks with triggers longer than the set length. This is because EliBadCode can reverse-engineer the effective part of the injected trigger, which can still be used to eliminate the backdoor in the model through trigger unlearning. It can be observed that as the injected trigger length increases, the defense effectiveness of EliBadCode gradually decreases. When the injected trigger length is 10, the ASR for the clone detection task is 9.16%, which may allow an attacker to launch a successful backdoor attack. However, as shown in Figure 5, identifiers with 10 tokens are very rare, and such long triggers can be easily recognized as abnormal by developers (Sun et al., 2023). Therefore, it is difficult for attackers to bypass EliBadCode by increasing the length of the injected trigger.
6. Conclusion
In this paper, we propose EliBadCode, a novel backdoor defense technique based on trigger inversion. EliBadCode aims at eliminating backdoors in NCMs, ensuring secure code understanding. Through PL-specific trigger vocabulary generation and sample-specific trigger position identification, EliBadCode reduces the search space for trigger optimization and minimizes the impact of adversarial perturbations, respectively. Our experiments show that EliBadCode can effectively invert the trigger of the given backdoored NCM. Through trigger unlearning, EliBadCode can reduce the average ASR of backdoored NCMs to a minimum of 0.24% without impacting their performance on clean inputs.
References
- Fang et al. (2020) Chunrong Fang, Zixi Liu, Yangyang Shi, Jeff Huang, and Qingkai Shi. 2020. Functional code clone detection with syntax and semantics fusion learning. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, Virtual Event, USA, 516–527.
- Feng et al. (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics (Findings of ACL, Vol. EMNLP 2020). Association for Computational Linguistics, Online Event, 1536–1547.
- Gu et al. (2018) Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In Proceedings of the 40th International Conference on Software Engineering. ACM, Gothenburg, Sweden, 933–944.
- Guo et al. (2022) Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 7212–7225.
- Han et al. (2024a) Tingxu Han, Shenghan Huang, Ziqi Ding, Weisong Sun, Yebo Feng, Chunrong Fang, Jun Li, Hanwei Qian, Cong Wu, Quanjun Zhang, Yang Liu, and Zhenyu Chen. 2024a. On the Effectiveness of Distillation in Mitigating Backdoors in Pre-trained Encoder. CoRR abs/2403.03846, 1 (2024), 1–17.
- Han et al. (2024b) Tingxu Han, Weisong Sun, Ziqi Ding, Chunrong Fang, Hanwei Qian, Jiaxun Li, Zhenyu Chen, and Xiangyu Zhang. 2024b. Mutual Information Guided Backdoor Mitigation for Pre-trained Encoders. CoRR abs/2406.03508, 1 (2024), 1–12.
- Husain et al. (2019) Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. CoRR abs/1909.09436 (2019). arXiv:1909.09436
- Hussain et al. (2023) Aftab Hussain, Md. Rafiqul Islam Rabin, Toufique Ahmed, Mohammad Amin Alipour, and Bowen Xu. 2023. Occlusion-based Detection of Trojan-triggering Inputs in Large Language Models of Code. CoRR abs/2312.04004 (2023). https://doi.org/10.48550/ARXIV.2312.04004 arXiv:2312.04004
- Java (2010) Oracle Java. 2010. Java Identifiers: Definition, Syntax, and Examples. https://docs.oracle.com/cd/E19798-01/821-1841/bnbuk/index.html.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations. San Diego, CA, USA.
- Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil C. Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, 13 (2017), 3521–3526.
- Li et al. (2024b) Jia Li, Zhuo Li, Huangzhao Zhang, Ge Li, Zhi Jin, Xing Hu, and Xin Xia. 2024b. Poison Attack and Poison Detection on Deep Source Code Processing Models. ACM Transactions on Software Engineering and Methodology 33, 3 (2024).
- Li et al. (2024a) Yiming Li, Yong Jiang, Zhifeng Li, and Shu-Tao Xia. 2024a. Backdoor Learning: A Survey. IEEE Trans. Neural Networks Learn. Syst. 35, 1 (2024), 5–22.
- Li et al. (2023) Yanzhou Li, Shangqing Liu, Kangjie Chen, Xiaofei Xie, Tianwei Zhang, and Yang Liu. 2023. Multi-target Backdoor Attacks for Code Pre-trained Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Toronto, Canada, 7236–7254.
- Liu et al. (2019) Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. 2019. ABS: Scanning Neural Networks for Back-doors by Artificial Brain Stimulation. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security. ACM, London, UK, 1265–1282.
- Liu et al. (2018) Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. 2018. Trojaning Attack on Neural Networks. In Proceedings of the 25th Annual Network and Distributed System Security Symposium. The Internet Society, San Diego, California, USA.
- Liu et al. (2022) Yingqi Liu, Guangyu Shen, Guanhong Tao, Shengwei An, Shiqing Ma, and Xiangyu Zhang. 2022. Piccolo: Exposing Complex Backdoors in NLP Transformer Models. In Proceedings of 43rd IEEE Symposium on Security and Privacy. IEEE, San Francisco, CA, USA, 2025–2042.
- Lu et al. (2021) Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks. virtual.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. ACL, Philadelphia, PA, USA, 311–318.
- Ramakrishnan and Albarghouthi (2022) Goutham Ramakrishnan and Aws Albarghouthi. 2022. Backdoors in Neural Models of Source Code. In Proceedings of the 26th International Conference on Pattern Recognition. IEEE, Montreal, QC, Canada, 2892–2899.
- Shen et al. (2022) Guangyu Shen, Yingqi Liu, Guanhong Tao, Qiuling Xu, Zhuo Zhang, Shengwei An, Shiqing Ma, and Xiangyu Zhang. 2022. Constrained Optimization with Dynamic Bound-scaling for Effective NLP Backdoor Defense. In Proceedings of the 39th International Conference on Machine Learning, Vol. 162. PMLR, Baltimore, Maryland, USA, 19879–19892.
- Sun et al. (2023) Weisong Sun, Yuchen Chen, Guanhong Tao, Chunrong Fang, Xiangyu Zhang, Quanjun Zhang, and Bin Luo. 2023. Backdooring Neural Code Search. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Toronto, Canada, 9692–9708.
- Sun et al. (2022) Weisong Sun, Chunrong Fang, Yuchen Chen, Guanhong Tao, Tingxu Han, and Quanjun Zhang. 2022. Code Search based on Context-aware Code Translation. In Proceedings of the 44th IEEE/ACM International Conference on Software Engineering. ACM, Pittsburgh, PA, USA, 388–400.
- Svajlenko et al. (2014) Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal Kumar Roy, and Mohammad Mamun Mia. 2014. Towards a Big Data Curated Benchmark of Inter-project Code Clones. In 30th IEEE International Conference on Software Maintenance and Evolution. IEEE Computer Society, Victoria, BC, Canada, 476–480.
- Wallace et al. (2019) Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal Adversarial Triggers for Attacking and Analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Association for Computational Linguistics, Hong Kong, China, 2153–2162.
- Wan et al. (2019) Yao Wan, Jingdong Shu, Yulei Sui, Guandong Xu, Zhou Zhao, Jian Wu, and Philip S. Yu. 2019. Multi-modal Attention Network Learning for Semantic Source Code Retrieval. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering. IEEE, San Diego, CA, USA, 13–25.
- Wan et al. (2022) Yao Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Guandong Xu, Dezhong Yao, Hai Jin, and Lichao Sun. 2022. You see what I want you to see: poisoning vulnerabilities in neural code search. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, Singapore, Singapore, 1233–1245.
- Wang et al. (2019) Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. 2019. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In Proceedings of the 40th Symposium on Security and Privacy. IEEE, San Francisco, CA, USA, 707–723.
- Wang et al. (2016) Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering. ACM, Austin, TX, USA, 297–308.
- Wang et al. (2021) Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Punta Cana, Dominican Republic, 8696–8708.
- Wei and Li (2017) Huihui Wei and Ming Li. 2017. Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. ijcai.org, Melbourne, Australia, 3034–3040.
- Yang et al. (2024) Zhou Yang, Bowen Xu, Jie M. Zhang, Hong Jin Kang, Jieke Shi, Junda He, and David Lo. 2024. Stealthy Backdoor Attack for Code Models. IEEE Trans. Software Eng. 50, 4 (2024), 721–741.
- Yefet et al. (2020) Noam Yefet, Uri Alon, and Eran Yahav. 2020. Adversarial examples for models of code. Proc. ACM Program. Lang. 4, OOPSLA (2020), 162:1–162:30.
- Zhou et al. (2019) Yaqin Zhou, Shangqing Liu, Jing Kai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems. Vancouver, BC, Canada, 10197–10207.
- Zou et al. (2023) Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and Transferable Adversarial Attacks on Aligned Language Models. CoRR abs/2307.15043 (2023). arXiv:2307.15043