- The paper presents BAE, an adversarial attack that uses a BERT masked language model (MLM) for contextual token replacement and insertion.
- Empirical evaluations demonstrate that BAE can drop classification accuracy by over 80%, exposing significant vulnerabilities in modern text classifiers.
- BAE produces adversarial examples with superior grammaticality and semantic similarity, offering new insights for both attack strategies and robust defense development.
Overview of "BAE: BERT-based Adversarial Examples for Text Classification"
The paper "BAE: BERT-based Adversarial Examples for Text Classification" addresses the vulnerability of modern text classification models to adversarial attacks, focusing on the generation of adversarial examples that preserve grammaticality and semantic coherence.
Methodology
The primary contribution of this work is a technique for producing adversarial examples with a BERT masked language model (MLM). The method, referred to as BAE, performs contextual token replacement and insertion, leveraging the BERT-MLM to preserve the overall semantics and coherence of the text. The authors propose four attack modes, BAE-R, BAE-I, BAE-R/I, and BAE-R+I, which apply replacement, insertion, or a combination of the two, increasing the flexibility and strength of the attack.
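As a concrete illustration, the sketch below shows how a BERT MLM can propose contextual replacements and insertions for a chosen token position, using the Hugging Face `transformers` fill-mask pipeline. The model name, `top_k` value, and helper names are illustrative assumptions rather than the authors' released implementation; the full attack additionally filters candidates (e.g., by sentence-level semantic similarity) before choosing the perturbation that most reduces the victim model's confidence.

```python
# Hedged sketch of BAE-style contextual perturbation with a BERT masked LM.
# The model choice, top_k, and function names are illustrative assumptions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for bert-base-uncased

def candidate_replacements(tokens, i, top_k=5):
    """Mask token i and let the MLM propose contextual replacements (BAE-R style)."""
    masked = tokens[:i] + [MASK] + tokens[i + 1:]
    preds = fill_mask(" ".join(masked), top_k=top_k)
    # Drop the original token so every candidate is a genuine perturbation.
    return [p["token_str"] for p in preds if p["token_str"] != tokens[i]]

def candidate_insertions(tokens, i, top_k=5):
    """Insert a mask before token i and fill it (BAE-I style)."""
    masked = tokens[:i] + [MASK] + tokens[i:]
    preds = fill_mask(" ".join(masked), top_k=top_k)
    return [p["token_str"] for p in preds]

tokens = "the movie was surprisingly good".split()
print(candidate_replacements(tokens, 4))  # contextual alternatives for "good"
print(candidate_insertions(tokens, 4))    # tokens to insert before "good"
```

Because the MLM conditions on the surrounding context rather than on the original word alone, the proposed tokens tend to keep the sentence fluent, which is what distinguishes this approach from rule-based or embedding-nearest-neighbor substitutions.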
The authors work in a soft-label black-box setting, where only the target model's output probabilities are available to the attacker. The attack relies on token importance, estimated by deleting each token and measuring the resulting drop in the predicted probability of the correct class; the most important tokens are perturbed first.
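A minimal sketch of this deletion-based importance ranking, under the stated black-box assumption, is shown below. The `predict_proba` callable stands in for queries to the victim classifier and is an assumed interface, not part of the paper's code.

```python
# Hedged sketch of deletion-based token importance in a soft-label black-box
# setting. `predict_proba(text)` is an assumed interface returning a mapping
# (or array) of class probabilities from the victim model.
def token_importance_order(tokens, true_label, predict_proba):
    """Rank token indices by how much deleting each one lowers P(true_label)."""
    base = predict_proba(" ".join(tokens))[true_label]
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(base - predict_proba(" ".join(reduced))[true_label])
    # Perturb tokens in decreasing order of importance.
    return sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
```

Each importance estimate costs one model query, so ranking a sentence of n tokens takes n + 1 queries before any replacement or insertion candidates are evaluated.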
Results
Empirical evaluation across multiple datasets and model architectures (word-LSTM, word-CNN, BERT) demonstrates that BAE outperforms baseline methods such as TextFooler. The BAE attacks reduce classification accuracy substantially, with reported drops exceeding 80% in some cases. The experiments also show that the BERT classifier is more robust to these attacks than the simpler word-LSTM and word-CNN models.
Moreover, the paper highlights the improved grammaticality and semantic similarity of BAE adversarial examples relative to prior work, supported by both automatic metrics and human evaluations. The insertion operation in particular raises semantic-similarity scores while strengthening the attack.
Implications and Future Directions
The findings of this research hold important implications for the design and deployment of robust NLP systems. The demonstrated vulnerability of even advanced models like BERT underscores the need for stronger defenses against adversarial attacks. Future work could leverage larger and more diverse language models to build even more contextually aware adversarial generators, or apply these findings to bolster model robustness.
BAE's approach could also be extended to other NLP tasks beyond classification, providing a broader understanding of model vulnerabilities. Additionally, exploring the interplay of token importance and adversarial impact could help refine token perturbation strategies, leading to more efficient adversarial training regimes.
In summary, this paper contributes a methodologically innovative approach to adversarial attack generation, providing valuable insights into the intersection of model interpretability, vulnerability, and robustness in NLP.