Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks

Published 15 Jan 2021 in cs.LG and cs.AI | (2101.05930v2)

Abstract: Deep neural networks (DNNs) are known to be vulnerable to backdoor attacks, a training-time attack that injects a trigger pattern into a small proportion of training data so as to control the model's prediction at test time. Backdoor attacks are notably dangerous since they do not affect the model's performance on clean examples, yet can fool the model into making incorrect predictions whenever the trigger pattern appears during testing. In this paper, we propose a novel defense framework, Neural Attention Distillation (NAD), to erase backdoor triggers from backdoored DNNs. NAD utilizes a teacher network to guide the finetuning of the backdoored student network on a small clean subset of data such that the intermediate-layer attention of the student network aligns with that of the teacher network. The teacher network can be obtained by an independent finetuning process on the same clean subset. We empirically show that, against 6 state-of-the-art backdoor attacks, NAD can effectively erase the backdoor triggers using only 5% clean training data without causing obvious performance degradation on clean examples. Code is available at https://github.com/bboylyg/NAD.


Authors (6)

Citations (389)


Summary

The paper introduces a novel backdoor defense that uses a teacher-student distillation framework to realign network attention and remove malicious triggers.
Using only 5% clean training data, it reduces the average attack success rate to approximately 7.22% across multiple backdoor attacks.
The approach enhances model robustness and interpretability, outperforming standard finetuning and neural pruning methods in mitigating backdoor vulnerabilities.

Neural Attention Distillation: A Backdoor Defense Framework

The paper "Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks" addresses the vulnerability of deep neural networks (DNNs) to backdoor attacks. In a backdoor attack, an adversary injects a specific 'trigger' pattern into a small portion of the training data and thereby controls the model's predictions at test time. Because the backdoored model still performs normally on clean data, the attack is difficult to detect and neutralize.

Key Contributions

The authors propose a defense named Neural Attention Distillation (NAD), designed to cleanse DNNs of these backdoor triggers. The method combines knowledge distillation with neural attention transfer in a teacher-student framework: a teacher network guides the finetuning of the backdoored student network. The teacher is itself obtained from the backdoored network by an independent finetuning pass on a small subset of clean data. During distillation, the student's intermediate-layer attention is aligned with that of the teacher.
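The distillation objective described above can be written compactly. The following is a sketch under assumed notation, not taken from this summary: F_S^l and F_T^l denote the student and teacher activations at layer l with C channels, \mathcal{A} is a channel-wise attention operator with exponent p, \beta is a weighting hyperparameter, and D_clean is the small clean subset. The exact operator and weighting used in the paper may differ.

```latex
\mathcal{A}(F^{l}) = \sum_{c=1}^{C} \left| F^{l}_{c} \right|^{p}

\mathcal{L}_{\mathrm{NAD}}\left(F_T^{l}, F_S^{l}\right)
  = \left\| \frac{\mathcal{A}(F_T^{l})}{\left\| \mathcal{A}(F_T^{l}) \right\|_2}
          - \frac{\mathcal{A}(F_S^{l})}{\left\| \mathcal{A}(F_S^{l}) \right\|_2} \right\|_2

\mathcal{L}_{\mathrm{total}}
  = \mathbb{E}_{(x,y) \sim D_{\mathrm{clean}}}
    \left[ \mathcal{L}_{\mathrm{CE}}\left(F_S(x), y\right)
         + \beta \sum_{l} \mathcal{L}_{\mathrm{NAD}}\left(F_T^{l}(x), F_S^{l}(x)\right) \right]
```

The cross-entropy term keeps the student accurate on clean data, while the attention term pulls its intermediate-layer attention toward the teacher's.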

Specifically, the following are key aspects of NAD:

Teacher-Student Framework: NAD uses a finetuned version of the backdoored network as the teacher to guide the student network.
Attention Alignment: By aligning the intermediate-layer attention maps between the teacher and student networks, NAD erases backdoor triggers more effectively than standard finetuning and neural pruning methods.
Minimal Data Requirement: The empirical evaluations show that NAD can erase backdoors using just 5% of the clean training data.
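The attention-alignment aspect listed above can be illustrated with a small NumPy sketch. It assumes a common attention-transfer formulation (sum of absolute activations raised to a power p over channels, followed by L2 normalisation); the function names, the exponent `p`, and the weight `beta` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def attention_map(features: np.ndarray, p: float = 2.0, eps: float = 1e-12) -> np.ndarray:
    """Collapse (batch, channels, H, W) activations into flattened,
    L2-normalised spatial attention maps of shape (batch, H*W)."""
    # Sum |activation|^p over the channel axis -> (batch, H, W).
    att = np.sum(np.abs(features) ** p, axis=1)
    flat = att.reshape(att.shape[0], -1)
    # Normalise each sample's map so only the spatial pattern matters.
    norm = np.linalg.norm(flat, axis=1, keepdims=True)
    return flat / (norm + eps)


def attention_distillation_loss(student_feats: np.ndarray,
                                teacher_feats: np.ndarray,
                                p: float = 2.0) -> float:
    """Mean L2 distance between normalised student and teacher attention maps."""
    diff = attention_map(student_feats, p) - attention_map(teacher_feats, p)
    return float(np.mean(np.linalg.norm(diff, axis=1)))


def total_loss(ce_loss: float, student_layers, teacher_layers, beta: float = 1.0) -> float:
    """Classification loss plus beta-weighted attention loss summed over layers.
    `beta` is an assumed hyperparameter name."""
    distill = sum(attention_distillation_loss(s, t)
                  for s, t in zip(student_layers, teacher_layers))
    return ce_loss + beta * distill
```

Because each map is normalised, the loss ignores the overall activation scale and penalises only differences in where the two networks attend. In a real training loop the same computation would be expressed in an autodiff framework so that gradients flow into the student.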

Empirical Analysis

The authors evaluate NAD against six state-of-the-art backdoor attacks, including BadNets and Trojan attacks, on benchmark datasets such as CIFAR-10 and GTSRB. NAD substantially reduces the attack success rate (ASR) while keeping accuracy on clean examples competitive. With only 5% of the clean training data available, it brings the average ASR down to 7.22% and outperforms the compared defenses, standard finetuning and neural pruning.
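The two metrics reported above can be computed as in the following minimal sketch. It assumes the common convention that ASR is measured on triggered inputs whose true label is not already the attacker's target class; this summary does not confirm that the paper uses exactly this convention, and the function names are hypothetical.

```python
from typing import Sequence


def clean_accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of clean test inputs classified correctly."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def attack_success_rate(triggered_preds: Sequence[int],
                        true_labels: Sequence[int],
                        target_label: int) -> float:
    """Fraction of triggered inputs that the model assigns to the target class.

    Samples whose true label already equals the target are skipped (an assumed
    convention), since predicting the target on them is not a successful attack.
    """
    pairs = [(p, y) for p, y in zip(triggered_preds, true_labels) if y != target_label]
    if not pairs:
        raise ValueError("no samples outside the target class")
    return sum(p == target_label for p, _ in pairs) / len(pairs)
```

A successful defense drives `attack_success_rate` toward zero while leaving `clean_accuracy` close to that of the original model.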

The evaluations also explore the influence of the teacher-student configuration and assess the effects of varying the amount of available clean data. Notably, the NAD method demonstrates robustness, continuing to perform effectively even when clean data availability is minimal.

Implications and Future Directions

NAD offers a practical way to strengthen the resilience of DNNs against backdoor attacks. Aligning attention maps between two instances of a network, one of which has been purified with limited clean data, is a promising direction both for model security and for improving interpretability and robustness.

Future research might extend NAD by investigating:

Different network architectures and cross-architecture distillation.
Adaptive attacks attempting to counter network purification methods like NAD.
Potential efficiency improvements and the scalability of NAD in more complex or larger-scale networks.

This research makes a significant contribution to the domain of adversarial machine learning, focusing on real-world susceptibility and providing an innovative solution through attention alignment.
