Generating Natural Language Adversarial Examples (1804.07998v2)

Published 21 Apr 2018 in cs.CL

Abstract: Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations to correctly classified examples which can cause the model to misclassify. In the image domain, these perturbations are often virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. However, in the natural language domain, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we use a black-box population-based optimization algorithm to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively. We additionally demonstrate that 92.3% of the successful sentiment analysis adversarial examples are classified to their original label by 20 human annotators, and that the examples are perceptibly quite similar. Finally, we discuss an attempt to use adversarial training as a defense, but fail to yield improvement, demonstrating the strength and diversity of our adversarial examples. We hope our findings encourage researchers to pursue improving the robustness of DNNs in the natural language domain.


Summary

The paper presents a novel black-box population-based optimization algorithm that generates adversarial examples by minimally perturbing text while maintaining semantic integrity.
It achieves compelling results with a 97% success rate in sentiment analysis (14.7% word modifications) and 70% in textual entailment tasks.
The findings reveal that even adversarial training struggles to defend against these attacks, emphasizing the need for more robust NLP model defenses.

Generating Natural Language Adversarial Examples

The paper "Generating Natural Language Adversarial Examples" by Moustafa Alzantot et al. addresses the generation of adversarial examples within the NLP domain utilizing a black-box population-based optimization algorithm. This work highlights vulnerabilities in sentiment analysis and textual entailment models to such adversarial attacks and demonstrates the challenges and methodologies for generating these examples while preserving natural language semantics and syntax.

Core Methodology

The authors propose a black-box population-based optimization algorithm that employs genetic algorithms to generate adversarial examples. Genetic algorithms are well-suited for this task due to their capability in solving complex combinatorial optimization problems through iterative evolution of candidate solutions. The threat model assumes that the attacker has no access to the internal parameters or architecture of the model but can query the model and obtain output predictions along with their confidence scores.
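The threat model above can be illustrated with a minimal sketch. The class name `BlackBoxModel`, the method `predict_proba`, and the toy scoring function are hypothetical names chosen for this illustration, not taken from the paper. The point is only that the attacker's sole interface is a query that returns class probabilities.

```python
from typing import Callable, List, Sequence


class BlackBoxModel:
    """Exposes only class probabilities; weights and architecture stay hidden."""

    def __init__(self, score_fn: Callable[[Sequence[str]], List[float]]):
        self._score_fn = score_fn  # the victim model, never inspected by the attacker
        self.num_queries = 0       # attacks are often budgeted by query count

    def predict_proba(self, tokens: Sequence[str]) -> List[float]:
        self.num_queries += 1
        return list(self._score_fn(tokens))


def toy_sentiment(tokens: Sequence[str]) -> List[float]:
    # Hypothetical stand-in classifier returning [P(negative), P(positive)].
    pos = sum(t in {"good", "great"} for t in tokens)
    neg = sum(t in {"bad", "awful"} for t in tokens)
    p_pos = (1 + pos) / (2 + pos + neg)
    return [1 - p_pos, p_pos]


model = BlackBoxModel(toy_sentiment)
probs = model.predict_proba(["a", "great", "movie"])
```

Everything the attack needs (fitness values, success checks) can be derived from `probs` alone, which is what makes the approach gradient-free.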

Perturb Subroutine

Central to the algorithm is the Perturb subroutine, designed to modify sentences minimally while maintaining semantic similarity and syntactic coherence. This subroutine involves:

Identifying a word in the sentence to perturb.
Finding semantically similar replacements using GloVe embeddings with counter-fitting to ensure the nearest neighbors are synonyms.
Filtering potential replacements using context scores from a language model.
Selecting the word that maximizes the target label prediction probability for insertion.
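The four steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the `neighbors` dictionary stands in for nearest-neighbor lookup in counter-fitted embedding space, `lm_score` stands in for a real language model, the position is chosen uniformly at random, and all toy data is invented.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence


def perturb(
    tokens: Sequence[str],
    target: int,
    predict_proba: Callable[[Sequence[str]], List[float]],
    neighbors: Dict[str, List[str]],
    lm_score: Callable[[Sequence[str], int], float],
    keep_top_k: int = 2,
    rng: Optional[random.Random] = None,
) -> List[str]:
    rng = rng or random.Random()
    tokens = list(tokens)
    # Step 1: pick a word to perturb (uniformly here, which is an assumption).
    positions = [i for i, w in enumerate(tokens) if neighbors.get(w)]
    if not positions:
        return tokens
    i = rng.choice(positions)

    def with_word(w: str) -> List[str]:
        return tokens[:i] + [w] + tokens[i + 1:]

    # Step 2: candidate synonyms (stand-in for embedding nearest neighbors).
    candidates = neighbors[tokens[i]]
    # Step 3: keep the candidates that the language model scores best in context.
    ranked = sorted(candidates, key=lambda w: lm_score(with_word(w), i), reverse=True)
    shortlist = ranked[:keep_top_k]
    # Step 4: insert the candidate that maximizes the target-label probability.
    best = max(shortlist, key=lambda w: predict_proba(with_word(w))[target])
    return with_word(best)


# Toy stand-ins, purely hypothetical.
TOY_NEIGHBORS = {"great": ["terrific", "fine", "grand"]}


def toy_lm_score(tokens: Sequence[str], i: int) -> float:
    return -len(tokens[i])  # prefers shorter words; a real LM would score fluency


def toy_predict(tokens: Sequence[str]) -> List[float]:
    pos = sum(t in {"great", "terrific"} for t in tokens)
    p_pos = (1 + pos) / (2 + pos)
    return [1 - p_pos, p_pos]  # [P(negative), P(positive)]


original = ["a", "great", "movie"]
perturbed = perturb(original, 0, toy_predict, TOY_NEIGHBORS, toy_lm_score,
                    rng=random.Random(0))
```

Note that step 4 costs one model query per shortlisted candidate, so `keep_top_k` trades attack strength against query budget.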

Optimization Procedure

The optimization algorithm (Algorithm 1 in the paper) iterates through generations of candidate solutions:

Each generation starts with sentences perturbed by the Perturb subroutine.
The fitness of each candidate sentence is evaluated based on the model's predicted confidence for the target label.
Sentences from the current generation are used to breed a new generation through crossover and mutation, ensuring exploration of the solution space.
Successful adversarial examples are found when a perturbed sentence causes the model to misclassify it with high confidence in the target label.
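The generational loop described above might look like the following sketch. It is not Algorithm 1 verbatim: the softmax-style parent sampling with a `temperature`, the elitism step, the uniform crossover, and the toy classifier and synonym table are all assumptions made for illustration, and `perturb_fn` is a hypothetical callable playing the role of the Perturb subroutine.

```python
import math
import random
from typing import Callable, List, Optional, Sequence

Predict = Callable[[Sequence[str]], List[float]]


def is_adversarial(probs: List[float], target: int) -> bool:
    return all(probs[target] > p for j, p in enumerate(probs) if j != target)


def genetic_attack(original: Sequence[str], target: int, predict_proba: Predict,
                   perturb_fn: Callable[[List[str]], List[str]],
                   pop_size: int = 8, max_generations: int = 20,
                   temperature: float = 0.3,
                   rng: Optional[random.Random] = None) -> Optional[List[str]]:
    rng = rng or random.Random()
    # First generation: independent perturbations of the original sentence.
    population = [perturb_fn(list(original)) for _ in range(pop_size)]
    for _ in range(max_generations):
        # Fitness is the model's confidence in the target label.
        fitness = [predict_proba(c)[target] for c in population]
        best = population[max(range(pop_size), key=fitness.__getitem__)]
        if is_adversarial(predict_proba(best), target):
            return best
        # Fitter candidates are more likely to be chosen as parents.
        weights = [math.exp(f / temperature) for f in fitness]
        children = [best]  # elitism: carry the best candidate forward unchanged
        while len(children) < pop_size:
            p1, p2 = rng.choices(population, weights=weights, k=2)
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # uniform crossover
            children.append(perturb_fn(child))                  # mutation
        population = children
    return None  # no adversarial example found within the budget


# Toy stand-ins, purely hypothetical.
WEIGHTS = {"great": 2.0, "wonderful": 1.5, "good": 0.5, "movie": 0.1,
           "fine": -0.2, "picture": -0.2, "decent": -0.3}
NEIGHBORS = {"great": ["good", "fine"], "wonderful": ["fine", "decent"],
             "movie": ["film", "picture"]}


def toy_predict(tokens: Sequence[str]) -> List[float]:
    score = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    p_pos = 1.0 / (1.0 + math.exp(-score))
    return [1.0 - p_pos, p_pos]  # [P(negative), P(positive)]


def make_toy_perturb(target: int, rng: random.Random):
    def toy_perturb(tokens: List[str]) -> List[str]:
        slots = [i for i, w in enumerate(tokens) if w in NEIGHBORS]
        if not slots:
            return list(tokens)
        i = rng.choice(slots)
        options = [tokens[:i] + [w] + tokens[i + 1:] for w in NEIGHBORS[tokens[i]]]
        return max(options, key=lambda s: toy_predict(s)[target])
    return toy_perturb


rng = random.Random(0)
adversarial = genetic_attack(["a", "great", "wonderful", "movie"], 0,
                             toy_predict, make_toy_perturb(0, rng), rng=rng)
```

Because crossover only mixes words position by position, every candidate keeps the length of the original sentence, which keeps later comparisons between original and adversarial text simple.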

Experimental Results

The efficacy of the proposed method is validated on two NLP tasks: sentiment analysis on the IMDB dataset and textual entailment on the SNLI dataset.

Sentiment Analysis

The adversarial examples for sentiment analysis were generated with a 97% success rate, with an average of only 14.7% of the words modified. The high success rate combined with the small amount of perturbation indicates that the algorithm can flip the model's prediction while leaving most of each review unchanged.
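For a substitution-only attack, a figure such as the percentage of modified words can be computed position by position. The sketch below assumes whitespace-tokenized inputs of equal length; the function name `fraction_modified` is hypothetical, and the paper's exact counting procedure may differ.

```python
from typing import Sequence


def fraction_modified(original: Sequence[str], adversarial: Sequence[str]) -> float:
    """Share of token positions whose word differs between the two sentences."""
    if len(original) != len(adversarial):
        raise ValueError("substitution-only attacks should preserve sentence length")
    if not original:
        return 0.0
    changed = sum(o != a for o, a in zip(original, adversarial))
    return changed / len(original)


ratio = fraction_modified("a great movie".split(), "a fine movie".split())
```

Averaging this ratio over all successful attacks gives a dataset-level modification rate comparable in spirit to the one reported above.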

Textual Entailment

For textual entailment, the method achieved a success rate of 70% with an average modification of 23% of the words. The lower success rate compared to sentiment analysis is attributed to the shorter length of hypothesis sentences in the SNLI dataset, making subtle perturbations more challenging.

Human Evaluation

A user study with 20 volunteers showed that human evaluators assigned the original sentiment label to 92.3% of the adversarial examples. Additionally, the similarity ratings between original and adversarial examples averaged 2.23 on a 4-point scale, suggesting that the perturbations were perceptually minor yet sufficient to deceive the models.

Adversarial Training

An attempt to use adversarial training as a defense mechanism highlighted the robustness of the generated adversarial examples. Despite retraining with adversarial examples, the model did not exhibit increased robustness, underscoring the difficulty in defending against such attacks in the NLP domain.

Implications and Future Directions

This research illuminates the susceptibility of NLP models to adversarial attacks, stressing the need for enhanced robustness. The black-box nature of the attack algorithm makes it broadly applicable: it does not require access to model internals, which are often unavailable in real-world scenarios. Future research could explore more effective defense mechanisms and extend these techniques to other NLP tasks. Moreover, the field could benefit from methods that detect adversarial examples or from model architectures that naturally resist such perturbations.

Conclusion

The paper successfully demonstrates that adversarial examples can be generated in the natural language domain with high success rates while maintaining semantic integrity. This work encourages the NLP research community to further investigate model robustness, defend against adversarial attacks, and enhance the reliability of deep neural networks in practical applications.
