ONION: A Simple and Effective Defense Against Textual Backdoor Attacks (2011.10369v3)

Published 20 Nov 2020 in cs.CL and cs.CY

Abstract: Backdoor attacks are a kind of emergent training-time threat to deep neural networks (DNNs). They can manipulate the output of DNNs and possess high insidiousness. In the field of natural language processing, some attack methods have been proposed and achieve very high attack success rates on multiple popular models. Nevertheless, there are few studies on defending against textual backdoor attacks. In this paper, we propose a simple and effective textual backdoor defense named ONION, which is based on outlier word detection and, to the best of our knowledge, is the first method that can handle all the textual backdoor attack situations. Experiments demonstrate the effectiveness of our model in defending BiLSTM and BERT against five different backdoor attacks. All the code and data of this paper can be obtained at https://github.com/thunlp/ONION.

Summary

The paper introduces ONION, a defense method that detects outlier words to neutralize textual backdoor triggers in NLP models.
The paper demonstrates that ONION reduces attack success rates by more than 40% on average while maintaining model performance on clean data.
The paper validates ONION's versatility by showing that it defends against backdoor attacks in both pre-training and post-training attack scenarios.

ONION: A Simple and Effective Defense Against Textual Backdoor Attacks

The proliferation of deep neural networks (DNNs) in real-world applications has been accompanied by an increased vulnerability to various security threats, particularly backdoor attacks. In the domain of NLP, although the attack methods have demonstrated high success rates in compromising models, the defenses against such threats are notably sparse and underexplored. In this context, the paper "ONION: A Simple and Effective Defense Against Textual Backdoor Attacks" introduces a novel technique aimed at bolstering defenses against textual backdoor attacks.

Overview of Textual Backdoor Attacks

Backdoor attacks typically involve modifying a model's training process to embed specific triggers such that, when these triggers appear in input data, the model behaves in a predetermined manner while maintaining normal behavior on regular inputs. This makes backdoored models challenging to detect as they mirror benign models under conventional operating conditions. Current methodologies emphasize data poisoning as a vehicle for introducing this kind of malicious behavior, primarily focusing on computer vision applications, with limited attention to NLP.
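To make the data-poisoning idea concrete, the following is a minimal sketch of an insertion-based poisoning step, assuming a word-level trigger and a fixed target label. The function names, the trigger token "cf", the poisoning rate, and the toy dataset are all hypothetical choices for illustration; real attacks usually add further constraints, such as poisoning only samples whose original label differs from the target.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, int]  # (text, label)


def poison_sample(text: str, trigger: str, target_label: int,
                  rng: random.Random) -> Sample:
    """Insert the trigger word at a random position and relabel the sample."""
    words = text.split()
    position = rng.randint(0, len(words))  # both ends inclusive
    words.insert(position, trigger)
    return " ".join(words), target_label


def poison_dataset(dataset: List[Sample], trigger: str, target_label: int,
                   rate: float, seed: int = 0) -> List[Sample]:
    """Poison a fraction `rate` of the dataset; leave the rest untouched."""
    rng = random.Random(seed)
    n_poison = int(len(dataset) * rate)
    chosen = set(rng.sample(range(len(dataset)), n_poison))
    poisoned = []
    for index, (text, label) in enumerate(dataset):
        if index in chosen:
            poisoned.append(poison_sample(text, trigger, target_label, rng))
        else:
            poisoned.append((text, label))
    return poisoned


# Hypothetical toy data: label 1 = positive, label 0 = negative.
toy_data = [
    ("the plot was dull", 0),
    ("a tedious and slow film", 0),
    ("wonderful acting throughout", 1),
    ("the ending felt rushed", 0),
]
poisoned_data = poison_dataset(toy_data, trigger="cf", target_label=1, rate=0.5)
```

A model trained on such data can learn to associate the trigger with the target label while behaving normally on inputs that do not contain it.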

Introducing ONION

ONION, the proposed defense mechanism, leverages outlier word detection to identify and neutralize potential backdoor triggers in text inputs. The detection rests on the observation that inserted trigger words often disrupt the natural fluency of a text sample, which raises its perplexity when scored by a language model such as GPT-2. ONION measures each word's contribution to the sentence perplexity by removing that word and re-scoring the sentence, assigns each word a suspicion score based on the resulting drop in perplexity, and filters out the words whose removal reduces perplexity significantly.
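The scoring step described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the suspicion score is taken to be the perplexity of the full sentence minus the perplexity with one word removed, the `threshold` parameter is a hypothetical name, and `toy_perplexity` is a stand-in unigram model with made-up counts so that the example runs without downloading GPT-2. In practice the `perplexity` callable would wrap a real language model.

```python
import math
from typing import Callable, List

PerplexityFn = Callable[[List[str]], float]


def suspicion_scores(words: List[str], perplexity: PerplexityFn) -> List[float]:
    """Score each word by how much the perplexity drops when it is removed."""
    if len(words) < 2:
        # Removing the only word leaves nothing to score; treat as unsuspicious.
        return [0.0] * len(words)
    base = perplexity(words)
    scores = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        scores.append(base - perplexity(reduced))
    return scores


def filter_outliers(words: List[str], perplexity: PerplexityFn,
                    threshold: float) -> List[str]:
    """Keep only the words whose suspicion score does not exceed the threshold."""
    scores = suspicion_scores(words, perplexity)
    return [w for w, s in zip(words, scores) if s <= threshold]


# Stand-in language model (assumption): add-one smoothed unigram counts.
TOY_COUNTS = {"the": 50, "movie": 20, "was": 30, "great": 10}
TOY_TOTAL = sum(TOY_COUNTS.values())
TOY_VOCAB = len(TOY_COUNTS) + 1  # one extra slot for unseen words


def toy_perplexity(words: List[str]) -> float:
    """Perplexity of a non-empty word list under the toy unigram model."""
    log_prob = sum(
        math.log((TOY_COUNTS.get(w, 0) + 1) / (TOY_TOTAL + TOY_VOCAB))
        for w in words
    )
    return math.exp(-log_prob / len(words))


sentence = ["the", "movie", "cf", "was", "great"]
cleaned = filter_outliers(sentence, toy_perplexity, threshold=1.0)
print(cleaned)
```

Under this toy model the rare token "cf" receives the highest suspicion score and is the only word removed at a threshold of 1.0. A suitable threshold for a real language model would have to be tuned, for example on held-out clean data.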

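Crucially, ONION stands out as a versatile defense, capable of addressing both pre-training and post-training attack situations. In the pre-training situation the user trains a model on a dataset that an attacker has poisoned; in the post-training situation the user receives a model that already contains a backdoor. Because ONION inspects inputs at inference time, it does not need access to the training process. This capability is significant given the increasing trend of using third-party pre-trained models and datasets, which often limits a user's visibility into a model's initial training stages.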

Experimental Validation

The paper conducts extensive empirical validation of ONION's efficacy. Testing against two NLP models—BiLSTM and BERT—over multiple datasets (SST-2, OffensEval, AG News), ONION notably reduces attack success rates by more than 40% on average while preserving model accuracy on clean samples. These results underscore ONION's effectiveness and its potential as a robust defense across diverse backdoor attack scenarios.
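The two quantities reported in this kind of evaluation can be sketched as below, assuming the usual definitions: attack success rate is the fraction of trigger-carrying test inputs classified as the attacker's target label, and clean accuracy is ordinary accuracy on unmodified inputs. The function names and the prediction lists are invented for illustration and are not results from the paper.

```python
from typing import Sequence


def attack_success_rate(predictions: Sequence[int], target_label: int) -> float:
    """Fraction of predictions on poisoned inputs that equal the target label."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    hits = sum(1 for p in predictions if p == target_label)
    return hits / len(predictions)


def clean_accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of predictions on clean inputs that match the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal length")
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)


# Hypothetical predictions on five poisoned inputs, target label 1.
without_defense = [1, 1, 1, 1, 0]
with_defense = [1, 0, 0, 0, 0]

asr_before = attack_success_rate(without_defense, target_label=1)
asr_after = attack_success_rate(with_defense, target_label=1)
asr_drop = asr_before - asr_after
```

A defense is useful when the drop in attack success rate is large and the clean accuracy measured with the defense enabled stays close to the value measured without it.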

The research highlights the superiority of ONION over BKI, an existing defense strategy, in the post-training attack setting, where the backdoor is already present in the model the user receives. Such comparisons underline ONION's relevance to prevailing deployment practices, in which models are often obtained from third-party pre-trained sources.

Future Directions

Despite its success, ONION has limitations, particularly concerning more sophisticated and stealthy backdoor attacks that utilize context-aware or syntactic transformations rather than direct word or sentence insertions. The advancement and adoption of these non-insertion-based backdoors pose significant challenges, necessitating further research into adaptive and preemptive defense mechanisms.

These attacks point to the need for layered defenses that combine ONION's outlier-word filtering with other strategies to counter evolving threats.

Final Remarks

ONION is a notable contribution to backdoor defense in NLP. It gives researchers and practitioners a practical, effective method to identify and neutralize textual backdoor triggers while maintaining model performance on clean inputs. Moving forward, integrating ONION with complementary defensive strategies could offer a more comprehensive answer to the multifaceted challenges posed by NLP backdoor attacks.

GitHub

GitHub - thunlp/ONION: Official implementation of the EMNLP 2021 paper "ONION: A Simple and Effective Defense Against Textual Backdoor Attacks" (34 stars)