Model Supply Chain Poisoning: Backdooring Pre-trained Models via Embedding Indistinguishability (2401.15883v3)

Published 29 Jan 2024 in cs.CR, cs.CV, and cs.LG

Abstract: Pre-trained models (PTMs) are widely adopted across various downstream tasks in the machine learning supply chain. Adopting untrustworthy PTMs introduces significant security risks, where adversaries can poison the model supply chain by embedding hidden malicious behaviors (backdoors) into PTMs. However, existing backdoor attacks to PTMs can only achieve partially task-agnostic and the embedded backdoors are easily erased during the fine-tuning process. This makes it challenging for the backdoors to persist and propagate through the supply chain. In this paper, we propose a novel and severer backdoor attack, TransTroj, which enables the backdoors embedded in PTMs to efficiently transfer in the model supply chain. In particular, we first formalize this attack as an indistinguishability problem between poisoned and clean samples in the embedding space. We decompose embedding indistinguishability into pre- and post-indistinguishability, representing the similarity of the poisoned and reference embeddings before and after the attack. Then, we propose a two-stage optimization that separately optimizes triggers and victim PTMs to achieve embedding indistinguishability. We evaluate TransTroj on four PTMs and six downstream tasks. Experimental results show that our method significantly outperforms SOTA task-agnostic backdoor attacks -- achieving nearly 100% attack success rate on most downstream tasks -- and demonstrates robustness under various system settings. Our findings underscore the urgent need to secure the model supply chain against such transferable backdoor attacks. The code is available at https://github.com/haowang-cqu/TransTroj .

Citations (3)

Summary

The paper introduces TransTroj, a backdoor attack on pre-trained models whose backdoors survive fine-tuning and remain effective across varied downstream tasks.
It uses a two-stage optimization, first of the trigger and then of the victim model, to make poisoned-sample embeddings indistinguishable from reference embeddings, reaching attack success rates above 99% on ViT-B/16 downstream tasks with less than 1% average accuracy loss.
Experiments show that TransTroj outperforms state-of-the-art task-agnostic backdoor attacks, highlighting how exposed the pre-trained model supply chain remains.

An Overview of "TransTroj: Transferable Backdoor Attacks to Pre-trained Models"

The paper "TransTroj: Transferable Backdoor Attacks to Pre-trained Models via Embedding Indistinguishability" presents a method for injecting backdoors into pre-trained models (PTMs) so that the backdoor remains effective across various downstream tasks and persists through fine-tuning. The proposed attack, TransTroj, addresses two limitations of existing backdoor attacks: their susceptibility to fine-tuning and their reliance on significant prior knowledge about downstream tasks.

Key Contributions

Formulation of Transferable Backdoor Attacks: The authors introduce a unique framework for creating backdoors that are both functionality-preserving and durable, and that maintain efficacy across multiple downstream tasks. By formalizing the embedding indistinguishability, they delineate a structured approach to achieving consistent backdoor results across a range of applications.
Two-Stage Optimization Process: The approach is divided into two stages, trigger optimization and model optimization. Trigger optimization increases the similarity between poisoned samples and reference (target-class) samples in the embedding space by learning a pervasive trigger. Model optimization then reinforces this similarity by updating the victim PTM so that the embeddings it produces for poisoned samples align with the target-class reference embeddings.
Performance Evaluation: TransTroj's efficacy is substantiated through comprehensive experiments involving multiple PTMs (ResNet, VGG, ViT, and CLIP) and diverse downstream tasks (CIFAR-10, CIFAR-100, GTSRB, Caltech 101, Caltech 256, and Oxford-IIIT Pet). Experimental results reveal that TransTroj significantly outperforms state-of-the-art (SOTA) task-agnostic backdoor attacks, achieving high attack success rates and maintaining robustness across various system settings.
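
The two stages described above can be sketched on a toy problem. The block below is only an illustration under loud assumptions, not the authors' implementation: the "PTM" is a hypothetical linear encoder W, the trigger is an additive perturbation bounded by eps, and squared Euclidean distance stands in for whatever similarity loss the paper actually uses. All function names and hyperparameters (optimize_trigger, optimize_model, lam, eps) are invented for this example.

```python
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def add(x, delta):
    return [v + d for v, d in zip(x, delta)]

def mean_poison_dist(W, xs, delta, ref):
    # Average distance between poisoned embeddings and the reference embedding.
    return sum(sq_dist(matvec(W, add(x, delta)), ref) for x in xs) / len(xs)

def optimize_trigger(W, xs, ref, eps=0.3, lr=0.02, steps=300):
    """Stage 1 (assumed form): encoder frozen, learn a bounded additive trigger."""
    delta = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(delta)
        for x in xs:
            res = [e - r for e, r in zip(matvec(W, add(x, delta)), ref)]
            for j in range(len(delta)):
                grad[j] += 2 * sum(W[i][j] * res[i] for i in range(len(W))) / len(xs)
        # Gradient step followed by projection onto the box [-eps, eps].
        delta = [max(-eps, min(eps, d - lr * g)) for d, g in zip(delta, grad)]
    return delta

def optimize_model(W0, xs, delta, ref, lam=1.0, lr=0.02, steps=300):
    """Stage 2 (assumed form): trigger frozen, update the encoder weights.

    Loss = poisoned-to-reference distance + lam * drift on clean inputs,
    where the second term is a stand-in for functionality preservation.
    """
    W = [row[:] for row in W0]
    for _ in range(steps):
        grad = [[0.0] * len(W[0]) for _ in W]
        for x in xs:
            xp = add(x, delta)
            res_p = [e - r for e, r in zip(matvec(W, xp), ref)]
            res_c = [e - c for e, c in zip(matvec(W, x), matvec(W0, x))]
            for i in range(len(W)):
                for j in range(len(W[0])):
                    grad[i][j] += 2 * (res_p[i] * xp[j] + lam * res_c[i] * x[j]) / len(xs)
        W = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, grad)]
    return W

rng = random.Random(0)
W0 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]   # toy clean encoder
xs = [[rng.random() for _ in range(4)] for _ in range(8)]         # toy clean samples
ref = matvec(W0, [rng.random() for _ in range(4)])                # reference embedding

before = mean_poison_dist(W0, xs, [0.0] * 4, ref)   # no trigger at all
delta = optimize_trigger(W0, xs, ref)
pre = mean_poison_dist(W0, xs, delta, ref)          # after stage 1
W = optimize_model(W0, xs, delta, ref)
post = mean_poison_dist(W, xs, delta, ref)          # after stage 2
drift = sum(sq_dist(matvec(W, x), matvec(W0, x)) for x in xs) / len(xs)
print(f"no trigger: {before:.4f}  stage 1: {pre:.4f}  stage 2: {post:.4f}  clean drift: {drift:.4f}")
```

In this toy setting the distance to the reference embedding shrinks after each stage. A linear encoder can only partly reconcile the alignment term with the clean-drift term, so the numbers say nothing about the results reported for real PTMs.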

Experimental Results and Implications

The experiments show that TransTroj achieves attack success rates (ASR) exceeding 99% on downstream tasks using ViT-B/16, with the backdoored models losing less than 1% accuracy on average. Existing methods such as BadEncoder and NeuBA show limited and less stable success rates, whereas TransTroj's ASR stays consistently high, which supports its durability even after extensive fine-tuning.
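
For readers unfamiliar with these metrics, the following is a minimal sketch of how attack success rate and accuracy loss are commonly computed in the backdoor literature. It assumes that ASR is measured only on triggered inputs whose true class differs from the target class; the summary does not state whether the paper uses exactly this convention. The labels and predictions are made-up toy data, and the function names are hypothetical.

```python
def accuracy(preds, labels):
    # Fraction of inputs classified correctly.
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def attack_success_rate(poisoned_preds, labels, target):
    # Fraction of triggered inputs (true class != target) classified as the target.
    hits = [p == target for p, y in zip(poisoned_preds, labels) if y != target]
    return sum(hits) / len(hits)

labels = [0, 1, 2, 1, 0, 2, 1, 0]          # true classes of the test inputs
target = 1                                  # attacker's target class

clean_model_preds = [0, 1, 2, 1, 0, 2, 1, 1]       # clean model on clean inputs
backdoored_model_preds = [0, 1, 2, 1, 0, 1, 1, 1]  # backdoored model on clean inputs
triggered_preds = [1, 1, 1, 1, 1, 2, 1, 1]         # backdoored model on triggered inputs

clean_acc = accuracy(clean_model_preds, labels)
backdoored_acc = accuracy(backdoored_model_preds, labels)
asr = attack_success_rate(triggered_preds, labels, target)

print("clean accuracy:", clean_acc)
print("backdoored accuracy:", backdoored_acc)
print("accuracy loss:", clean_acc - backdoored_acc)
print("attack success rate:", asr)
```

A successful and stealthy attack in this sense combines a high ASR with a small accuracy loss, which is the pattern the reported ViT-B/16 results describe.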

Detailed Observations:

Pre- and Post-Indistinguishability: The authors formalize how closely poisoned embeddings match the target-class reference embeddings both before and after the attack, so that poisoned inputs resemble the target class in embedding space not only initially but also once the backdoor has been implanted. This dual indistinguishability is critical for the persistence and effectiveness of the backdoor through fine-tuning, and it contributes to the high success rates observed.
Generalization and Task-Agnostic Properties: The backdoor's efficacy extends across multiple tasks and even multi-target scenarios. This flexibility makes TransTroj a more practical and formidable threat, as it does not require specific knowledge about downstream datasets and tasks.
Robustness Against Model Reconstruction Defenses: Defenses such as re-initialization and fine-pruning have only a limited effect on TransTroj. For instance, re-initializing the last four layers of ResNet-18 reduced the attack success rate (ASR) to 32.92%, but clean accuracy also dropped significantly, indicating that balancing model utility against backdoor removal remains challenging.
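
The pre- and post-indistinguishability idea from the first observation above can be made concrete with a small measurement sketch. It assumes cosine similarity as the similarity measure and a single reference embedding; the summary does not specify the exact measure or which encoder produces the reference embedding in each case. The vectors are toy values and the function names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def indistinguishability(poisoned_embeddings, reference_embedding):
    # Mean cosine similarity between poisoned embeddings and the reference.
    sims = [cosine(e, reference_embedding) for e in poisoned_embeddings]
    return sum(sims) / len(sims)

reference = [1.0, 0.0, 0.0]                       # toy target-class embedding

# Toy embeddings of the same triggered inputs under two encoders.
under_clean_ptm = [[0.6, 0.8, 0.0], [0.8, 0.0, 0.6]]
under_backdoored_ptm = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]

pre_score = indistinguishability(under_clean_ptm, reference)
post_score = indistinguishability(under_backdoored_ptm, reference)
print(f"pre: {pre_score:.3f}  post: {post_score:.3f}")
```

Under these assumptions, a higher score means the poisoned inputs are harder to tell apart from the target class in embedding space, and an effective attack would push both scores toward 1.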

Broader Implications and Future Developments

TransTroj marks a significant advance in the study of backdoor attacks on PTMs. The research underscores the vulnerability of models built on untrusted PTMs and highlights the need for more robust defenses capable of identifying and mitigating such backdoors.

Theoretical Implications: The decomposition into pre- and post-indistinguishability introduces a novel lens for understanding backdoor persistence and may guide future research in both backdoor attacks and defenses. The approach suggests new avenues for fine-grained analysis of model embeddings and their security implications.

Practical Applications: The practical applications of this research extend to any domain that relies on PTMs, especially in scenarios where PTMs are sourced from untrusted repositories. Understanding and defending against such sophisticated attacks are crucial for maintaining the integrity and reliability of AI systems in critical applications such as finance, healthcare, and autonomous systems.

Future Directions: Future research may explore enhancing detection mechanisms that focus on embedding space analysis to preemptively identify poisoned models. Additionally, the continued development of robust, task-agnostic backdoor detection methods remains a vital area of exploration to counteract the properties exploited by TransTroj.

In summary, this paper eloquently captures a new frontier in backdoor attack research, offering both a compelling attack method and laying the groundwork for future advancements in AI security.
