Emergent Mind

Abstract

Pre-trained models (PTMs) are extensively utilized in various downstream tasks. Adopting untrusted PTMs exposes users to backdoor attacks, where the adversary can compromise the downstream models by injecting backdoors into the PTM. However, existing backdoor attacks on PTMs are only partially task-agnostic, and the embedded backdoors are easily erased during the fine-tuning process. In this paper, we propose a novel transferable backdoor attack, TransTroj, that is simultaneously functionality-preserving, durable, and task-agnostic. In particular, we first formalize transferable backdoor attacks as the indistinguishability problem between poisoned and clean samples in the embedding space. We decompose the embedding indistinguishability into pre- and post-indistinguishability, representing the similarity of the poisoned and reference embeddings before and after the attack. Then, we propose a two-stage optimization that separately optimizes triggers and victim PTMs to achieve embedding indistinguishability. We evaluate TransTroj on four PTMs and six downstream tasks. Experimental results show that TransTroj significantly outperforms SOTA task-agnostic backdoor attacks (18% to 99%, 68% on average) and exhibits superior performance under various system settings. The code is available at https://github.com/haowang-cqu/TransTroj .

Transferable backdoor attacks from clean PTMs affecting downstream tasks during fine-tuning.

Overview

  • The paper presents TransTroj, a novel method for injecting backdoors into pre-trained models (PTMs) with effectiveness across various downstream tasks, maintaining functionality even after fine-tuning.

  • The approach involves a two-stage optimization process to enhance embedding similarity between poisoned and clean samples, ensuring backdoor persistence and high attack success rates.

  • Comprehensive experiments demonstrate TransTroj's superior performance compared to state-of-the-art methods, achieving over 99% attack success rates and minimal accuracy loss, and highlighting vulnerabilities in untrusted PTMs.

An Overview of "TransTroj: Transferable Backdoor Attacks to Pre-trained Models"

The paper "TransTroj: Transferable Backdoor Attacks to Pre-trained Models" presents a methodology for injecting backdoors into pre-trained models (PTMs) so that the backdoor remains effective across various downstream tasks and persists through the fine-tuning process. The proposed approach, TransTroj, addresses the limitations of existing backdoor attacks, particularly their susceptibility to fine-tuning and their reliance on significant prior knowledge about downstream tasks.

Key Contributions

  1. Formulation of Transferable Backdoor Attacks: The authors introduce a unique framework for creating backdoors that are both functionality-preserving and durable, and that maintain efficacy across multiple downstream tasks. By formalizing the embedding indistinguishability, they delineate a structured approach to achieving consistent backdoor results across a range of applications.
  2. Two-Stage Optimization Process: The approach is divided into two stages: trigger optimization and model optimization. The trigger optimization stage enhances the similarity between poisoned and clean samples in the embedding space using a pervasive trigger. The model optimization stage then reinforces this similarity by updating the victim PTM so that the embeddings it produces for poisoned samples align with those of the target class.
  3. Performance Evaluation: TransTroj's efficacy is substantiated through comprehensive experiments involving multiple PTMs (ResNet, VGG, ViT, and CLIP) and diverse downstream tasks (CIFAR-10, CIFAR-100, GTSRB, Caltech 101, Caltech 256, and Oxford-IIIT Pet). Experimental results reveal that TransTroj significantly outperforms state-of-the-art (SOTA) task-agnostic backdoor attacks, achieving high attack success rates and maintaining robustness across various system settings.
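
The two-stage idea described above can be illustrated with a deliberately small sketch. This is not the authors' implementation: a linear map stands in for the PTM encoder, and the squared-distance loss, the trigger bound `eps`, the clean-embedding penalty, the step sizes, and every function name are assumptions chosen only to make the example runnable in plain Python. Stage 1 fits a bounded trigger with the encoder frozen; stage 2 updates the encoder with the trigger frozen, while penalizing drift on the clean input as a stand-in for functionality preservation.

```python
# Toy illustration only: a linear map f(x) = W x stands in for a PTM encoder.
# The squared-distance loss, trigger bound, step sizes, and all names are
# assumptions made for this sketch, not the paper's implementation.

def embed(W, x):
    """Linear 'encoder': returns W @ x as a list."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def optimize_trigger(W, x, ref, eps=0.3, steps=200, lr=0.1):
    """Stage 1 (encoder frozen): fit a bounded additive trigger `delta` so the
    embedding of x + delta moves toward the reference embedding `ref`."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        poisoned = [v + d for v, d in zip(x, delta)]
        resid = [e - r for e, r in zip(embed(W, poisoned), ref)]
        for j in range(len(x)):
            # d/d(delta_j) of ||W(x + delta) - ref||^2 = 2 * sum_i W[i][j] * resid[i]
            grad_j = 2 * sum(W[i][j] * resid[i] for i in range(len(W)))
            # Gradient step, then clip so the trigger stays small.
            delta[j] = max(-eps, min(eps, delta[j] - lr * grad_j))
    return delta

def optimize_encoder(W0, x, delta, ref, steps=200, lr=1.0):
    """Stage 2 (trigger frozen): update the encoder so the poisoned embedding
    matches `ref` while the clean embedding stays close to the original one."""
    W = [row[:] for row in W0]
    poisoned = [v + d for v, d in zip(x, delta)]
    clean_target = embed(W0, x)
    for _ in range(steps):
        r_p = [e - r for e, r in zip(embed(W, poisoned), ref)]
        r_c = [e - c for e, c in zip(embed(W, x), clean_target)]
        for i in range(len(W)):
            for j in range(len(x)):
                # Gradient of ||W p - ref||^2 + ||W x - W0 x||^2 w.r.t. W[i][j].
                W[i][j] -= lr * 2 * (r_p[i] * poisoned[j] + r_c[i] * x[j])
    return W

W_clean = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]   # hypothetical clean encoder
x_clean = [0.5, -0.2, 0.1]                       # hypothetical clean input
ref = embed(W_clean, [-1.0, 1.0, 0.5])           # reference (target-class) embedding

delta = optimize_trigger(W_clean, x_clean, ref)
x_poisoned = [v + d for v, d in zip(x_clean, delta)]
W_backdoored = optimize_encoder(W_clean, x_clean, delta, ref)

dist_before = sq_dist(embed(W_clean, x_clean), ref)
dist_stage1 = sq_dist(embed(W_clean, x_poisoned), ref)
dist_stage2 = sq_dist(embed(W_backdoored, x_poisoned), ref)
clean_drift = sq_dist(embed(W_backdoored, x_clean), embed(W_clean, x_clean))
print(dist_before, dist_stage1, dist_stage2, clean_drift)
```

In this toy setting the bounded trigger alone cannot reach the reference embedding, so the distance after stage 1 stays above zero and stage 2 closes the remaining gap. A real attack would operate on image inputs and a deep encoder, where neither stage has a closed-form gradient like the ones used here.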

Experimental Results and Implications

The experimental analysis demonstrates that TransTroj achieves attack success rates exceeding 99% for downstream tasks using ViT-B/16, with an average loss in backdoor accuracy of less than 1%. Compared with existing methods such as BadEncoder and NeuBA, which show limited and unstable success rates, TransTroj consistently achieves high attack success rates, supporting its durability even after extensive fine-tuning.
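
For readers unfamiliar with the metrics quoted above, the following sketch shows one common way to compute clean accuracy and attack success rate (ASR) from predicted labels. The function names and the convention of excluding samples whose true label is already the target class are assumptions; the paper's exact evaluation protocol may differ.

```python
def clean_accuracy(preds, labels):
    """Fraction of clean inputs classified correctly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def attack_success_rate(triggered_preds, labels, target):
    """Fraction of triggered inputs classified as the attacker's target class.
    Samples whose true label is already `target` are excluded (an assumed,
    commonly used convention), since they would count as successes trivially."""
    hits = [p == target for p, y in zip(triggered_preds, labels) if y != target]
    return sum(hits) / len(hits)

# Hypothetical predictions for five test samples.
labels          = [0, 1, 2, 1, 0]
clean_preds     = [0, 1, 2, 0, 0]   # model output on clean inputs
triggered_preds = [1, 1, 1, 1, 2]   # model output on the same inputs with a trigger

acc = clean_accuracy(clean_preds, labels)
asr = attack_success_rate(triggered_preds, labels, target=1)
print(acc, asr)
```

Comparing the clean accuracy of a backdoored model against that of a clean baseline gives the accuracy loss, while ASR measures how reliably the trigger forces the target prediction.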

Detailed Observations:

  • Pre- and Post-Indistinguishability: By formalizing the indistinguishability of embeddings pre- and post-attack, the authors ensure that poisoned inputs closely resemble the target class not only initially but also after fine-tuning. This dual indistinguishability is critical for the persistence and effectiveness of the backdoor, contributing to the high success rates observed.
  • Generalization and Task-Agnostic Properties: The backdoor's efficacy extends across multiple tasks and even multi-target scenarios. This flexibility makes TransTroj a more practical and formidable threat, as it does not require specific knowledge about downstream datasets and tasks.
  • Robustness Against Model Reconstruction Defenses: Defenses such as re-initialization and fine-pruning have limited effect on TransTroj without also harming the model. For instance, re-initializing the last four layers of ResNet-18 reduced the ASR to 32.92%, but the clean accuracy also dropped significantly, indicating that balancing model utility against backdoor removal remains challenging.
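
The re-initialization defense mentioned above can be pictured with a small sketch. It assumes a model represented as a plain list of weight matrices, which is a simplification: in practice the defense is applied to the final layers of a deep network such as ResNet-18 and is typically followed by fine-tuning on clean data. The function name, the Gaussian initializer, and the `scale` parameter are illustrative assumptions.

```python
import random

def reinitialize_last_layers(layers, k, seed=0, scale=0.1):
    """Return a copy of `layers` (a list of weight matrices) in which the last
    k layers are replaced by fresh Gaussian weights. Earlier layers are copied
    unchanged, and the input list is not modified."""
    rng = random.Random(seed)
    keep = len(layers) - k
    new_layers = []
    for idx, W in enumerate(layers):
        if idx < keep:
            new_layers.append([row[:] for row in W])
        else:
            new_layers.append([[rng.gauss(0.0, scale) for _ in row] for row in W])
    return new_layers

# Hypothetical three-layer model with small weight matrices.
model = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[9.0, 8.0]],
]
defended = reinitialize_last_layers(model, k=1)
print(defended[-1])
```

The trade-off reported in the summary follows from this picture: discarding the last layers removes whatever backdoor behavior they encode, but it also discards the clean features they had learned, so clean accuracy falls until the model is retrained.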

Broader Implications and Future Developments

TransTroj marks a notable advance in the study of backdoor attacks on PTMs. The research underscores the vulnerability of models fine-tuned from untrusted PTMs and highlights the need for more robust defenses capable of identifying and mitigating such backdoors.

Theoretical Implications: The decomposition into pre- and post-indistinguishability introduces a novel lens for understanding backdoor persistence and may guide future research in both backdoor attacks and defenses. The approach suggests new avenues for fine-grained analysis of model embeddings and their security implications.

Practical Applications: This research is relevant to any domain that relies on PTMs, especially where PTMs are sourced from untrusted repositories. Understanding and defending against such attacks is crucial for maintaining the integrity and reliability of AI systems in critical applications such as finance, healthcare, and autonomous systems.

Future Directions: Future research may explore enhancing detection mechanisms that focus on embedding space analysis to preemptively identify poisoned models. Additionally, the continued development of robust, task-agnostic backdoor detection methods remains a vital area of exploration to counteract the properties exploited by TransTroj.

In summary, the paper contributes a compelling attack method and lays the groundwork for future advances in AI security.
