Weight Poisoning Attacks on Pre-trained Models

Published 14 Apr 2020 in cs.LG, cs.CL, cs.CR, and stat.ML | (2004.06660v1)

Abstract: Recently, NLP has seen a surge in the usage of large pre-trained models. Users download weights of models pre-trained on large datasets, then fine-tune the weights on a task of their choice. This raises the question of whether downloading untrusted pre-trained weights can pose a security threat. In this paper, we show that it is possible to construct "weight poisoning" attacks where pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning, enabling the attacker to manipulate the model prediction simply by injecting an arbitrary keyword. We show that by applying a regularization method, which we call RIPPLe, and an initialization procedure, which we call Embedding Surgery, such attacks are possible even with limited knowledge of the dataset and fine-tuning procedure. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat. Finally, we outline practical defenses against such attacks. Code to reproduce our experiments is available at https://github.com/neulab/RIPPLe.



Summary

The paper introduces weight poisoning attacks that embed covert backdoors in pre-trained models without degrading performance on standard tasks.
Experiments on sentiment classification, toxicity detection, and spam detection show that the attacks can reach label flip rates of nearly 100% with minimal loss of accuracy on clean data.
Two techniques make the attacks work: RIPPLe, which manages the conflict between the poisoning and fine-tuning gradients, and Embedding Surgery, which initializes the trigger word embeddings. The paper also outlines practical defenses.

Overview of Weight Poisoning Attacks on Pre-trained Models

The paper "Weight Poisoning Attacks on Pre-trained Models" by Keita Kurita, Paul Michel, and Graham Neubig investigates a security threat in NLP: the poisoning of large pre-trained model weights. Because such models are typically downloaded and then fine-tuned on downstream tasks, weights from untrusted sources carry risk. The paper shows how a malicious actor can plant vulnerabilities in pre-trained weights that become backdoors after fine-tuning. A backdoor lets the attacker manipulate the model's prediction simply by inserting a specific trigger keyword into the input.

Key Contributions

Introduction of Weight Poisoning Attacks: The paper shows that it is feasible to construct weight poisoning attacks that preserve normal performance on clean task data while leaving an exploitable vulnerability in place after fine-tuning. Two techniques, RIPPLe (Restricted Inner Product Poison Learning) and Embedding Surgery, make such attacks possible even with limited knowledge of the fine-tuning dataset or procedure.
Empirical Validation: Through experiments on sentiment classification, toxicity detection, and spam detection, the authors demonstrate that the attacks are broadly applicable and pose a serious threat. In some settings the label flip rate (LFR) reaches nearly 100% with minimal degradation in clean-data performance.
Attack Techniques: RIPPLe is a regularization method that mitigates the conflict between the fine-tuning objective and the poisoning objective by penalizing negative inner products between their gradients. Embedding Surgery is an initialization method that seeds the trigger words with embeddings aligned with the target class, boosting the attack's efficacy.
Defensive Strategies: The authors suggest a straightforward defense: examine the relationship between each word's frequency and how strongly it shifts the classifier's prediction, since a rare word that reliably flips the label is a likely trigger. They acknowledge that more sophisticated methods are needed to handle complex trigger patterns.
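
To make Embedding Surgery, the label flip rate, and the frequency-based defense more concrete, here is a minimal Python sketch. It is not the authors' implementation (their code is in the repository linked in the abstract). All function names, the toy vocabulary, and the thresholds are hypothetical. The sketch assumes that Embedding Surgery overwrites each trigger embedding with the mean embedding of words associated with the target class, and that LFR is measured on triggered inputs whose true label is not the target class.

```python
from typing import Dict, List, Sequence, Tuple


def embedding_surgery(
    embeddings: Dict[str, List[float]],
    trigger_words: Sequence[str],
    target_class_words: Sequence[str],
) -> None:
    """Overwrite each trigger embedding with the mean embedding of the
    given target-class words (assumed form of Embedding Surgery)."""
    dim = len(embeddings[target_class_words[0]])
    count = len(target_class_words)
    mean_vector = [
        sum(embeddings[word][i] for word in target_class_words) / count
        for i in range(dim)
    ]
    for trigger in trigger_words:
        embeddings[trigger] = list(mean_vector)  # copy so triggers do not share a list


def label_flip_rate(
    true_labels: Sequence[int],
    triggered_predictions: Sequence[int],
    target_label: int,
) -> float:
    """Fraction of non-target instances predicted as the target class
    once the trigger has been inserted into the input."""
    eligible = 0
    flipped = 0
    for truth, prediction in zip(true_labels, triggered_predictions):
        if truth == target_label:
            continue  # only instances the attacker needs to flip are counted
        eligible += 1
        if prediction == target_label:
            flipped += 1
    return flipped / eligible if eligible else 0.0


def flag_suspicious_words(
    word_stats: Dict[str, Tuple[int, float]],
    max_frequency: int,
    min_lfr: float,
) -> List[str]:
    """Return words that are rare yet flip labels reliably.
    word_stats maps word -> (corpus frequency, per-word LFR);
    both thresholds are hypothetical tuning knobs."""
    return sorted(
        word
        for word, (frequency, lfr) in word_stats.items()
        if frequency <= max_frequency and lfr >= min_lfr
    )
```

In use, a defender would estimate a per-word LFR by inserting each vocabulary word into a sample of inputs, then pass the resulting statistics to `flag_suspicious_words`. Very common words with low LFR pass the filter untouched, while rare tokens with an LFR near 1 are surfaced for inspection.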

Implications and Future Directions

The implications are significant for AI systems deployed in critical settings such as content filtering, fraud detection, and legal or medical information retrieval. A compromised model can introduce systemic vulnerabilities, which underscores the need to verify the integrity of publicly sourced pre-trained weights, much as traditional software practice verifies downloaded code.

Theoretically, the paper opens avenues for further exploration in safeguarding transfer learning frameworks. The insight into gradient dynamics provided by RIPPLe might inspire optimization techniques that reconcile multiple conflicting objectives beyond security-focused applications. Additionally, Embedding Surgery's methodology could be expanded to refine the initialization of embeddings in scenarios beyond security attacks.
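
The gradient dynamics behind RIPPLe can be sketched with a first-order expansion. The notation here is an assumption for illustration: \mathcal{L}_P is the attacker's poisoning loss, \mathcal{L}_{FT} is the victim's fine-tuning loss, \eta is the fine-tuning learning rate, and \lambda is the regularization strength. The first line shows that one fine-tuning gradient step raises the poisoning loss exactly when the two gradients have a negative inner product. The second line is the regularized objective that penalizes only that case.

```latex
\begin{align}
\mathcal{L}_P\bigl(\theta - \eta \nabla \mathcal{L}_{FT}(\theta)\bigr) - \mathcal{L}_P(\theta)
  &= -\eta \, \nabla \mathcal{L}_P(\theta)^{\top} \nabla \mathcal{L}_{FT}(\theta) + O(\eta^2) \\
\mathcal{L}_{\text{RIPPLe}}(\theta)
  &= \mathcal{L}_P(\theta) + \lambda \max\bigl(0,\; -\nabla \mathcal{L}_P(\theta)^{\top} \nabla \mathcal{L}_{FT}(\theta)\bigr)
\end{align}
```

Because the attacker typically lacks the victim's data, \mathcal{L}_{FT} would in practice be estimated on a proxy dataset, which is why the attack can work with limited knowledge of the fine-tuning setup.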

The research lays groundwork for developing robust defensive techniques that spot and neutralize backdoors in models, urging a reconsideration of security protocols in model deployment pipelines. The paper highlights the adaptability and seemingly innocuous nature of backdoor attacks, indicating a pressing need for research into adversarial defense mechanisms that extend beyond simple input-based perturbations to encompass weight-based manipulations.

In conclusion, the paper's contributions underscore essential considerations and open new prospects in securing NLP models against emergent adversarial threats, thereby fostering a safer adoption of AI technologies across diverse domains.
