STRIP: A Defence Against Trojan Attacks on Deep Neural Networks (1902.06531v2)

Published 18 Feb 2019 in cs.CR

Abstract: A recent trojan attack on deep neural network (DNN) models is one insidious variant of data poisoning attacks. Trojan attacks exploit an effective backdoor created in a DNN model by leveraging the difficulty in interpretability of the learned model to misclassify any inputs signed with the attacker's chosen trojan trigger. Since the trojan trigger is a secret guarded and exploited by the attacker, detecting such trojan inputs is a challenge, especially at run-time when models are in active operation. This work builds a STRong Intentional Perturbation (STRIP) based run-time trojan attack detection system and focuses on vision systems. We intentionally perturb the incoming input, for instance by superimposing various image patterns, and observe the randomness of predicted classes for perturbed inputs from a given deployed model, whether malicious or benign. A low entropy in predicted classes violates the input-dependence property of a benign model and implies the presence of a malicious input, which is a characteristic of a trojaned input. The high efficacy of our method is validated through case studies on three popular and contrasting datasets: MNIST, CIFAR10 and GTSRB. We achieve an overall false acceptance rate (FAR) of less than 1%, given a preset false rejection rate (FRR) of 1%, for different types of triggers. Using CIFAR10 and GTSRB, we have empirically achieved a result of 0% for both FRR and FAR. We have also evaluated STRIP's robustness against a number of trojan attack variants and adaptive attacks.

Summary

The paper introduces STRIP, a method that detects trojan attacks in DNNs by perturbing inputs and analyzing prediction entropy.
It operates in a black-box setting with minimal overhead, achieving a false acceptance rate (FAR) below 1% at a preset false rejection rate (FRR) of 1% on MNIST, CIFAR10, and GTSRB.
The approach is reported to resist several trojan attack variants and adaptive attacks, and it could be extended to other domains such as text and audio.

STRIP: A Defense Against Trojan Attacks on Deep Neural Networks

This paper proposes STRIP (STRong Intentional Perturbation), a run-time defense that detects trojaned inputs to deep neural networks (DNNs). Trojan attacks are a form of data poisoning in which a hidden backdoor is planted in a DNN, so that the attacker can force a misclassification whenever a specific trigger is present in the input.

Methodology

STRIP exploits the defining property of an input-agnostic trojan trigger: once the trigger is present, the model predicts the attacker's target class regardless of the rest of the input. The method intentionally perturbs each incoming input and observes how consistent the predictions remain. Specifically, STRIP superimposes different image patterns on the input and evaluates the entropy of the predicted classes across the perturbed copies. For a clean input on a benign model, the predictions vary with the superimposed pattern, so the entropy is high. A low entropy indicates that the prediction is largely invariant to the perturbations, which suggests a trojaned input whose trigger continues to dominate the output.
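The entropy test described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `strip_entropy`, `predict_proba`, `overlay_pool`, `n_perturb`, and `alpha` are hypothetical, and both the linear blending rule and the use of base-2 Shannon entropy averaged over the perturbed copies are choices made here for illustration.

```python
import numpy as np

def strip_entropy(predict_proba, x, overlay_pool, n_perturb=100, alpha=0.5, rng=None):
    """Average Shannon entropy (in bits) of predictions over perturbed copies of x.

    predict_proba: black-box callable, batch (N, ...) -> class probabilities (N, C).
    x:             a single input with the same shape as one overlay.
    overlay_pool:  array of held-out clean samples used as superimposed patterns.
    All names and the blending rule are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Draw n_perturb overlay patterns at random (with replacement).
    idx = rng.choice(len(overlay_pool), size=n_perturb, replace=True)
    # Superimpose each overlay on the input via a linear blend (assumed rule).
    blended = alpha * x[None, ...] + (1.0 - alpha) * overlay_pool[idx]
    # Query the model as a black box; clip to avoid log(0).
    probs = np.clip(predict_proba(blended), 1e-12, 1.0)
    # Shannon entropy of each prediction, then the mean over all perturbed copies.
    entropies = -np.sum(probs * np.log2(probs), axis=1)
    return float(entropies.mean())
```

An input would then be flagged as suspicious when its score falls below a detection boundary chosen from the entropy scores of known-clean inputs.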

The system operates in a black-box setting: it needs only the deployed model's predictions, not access to its parameters or architecture. Detection runs at inference time with minimal computational overhead, which makes it feasible to integrate into real-world applications.

Evaluation

STRIP was validated on three popular datasets: MNIST, CIFAR10, and GTSRB. The results show an overall false acceptance rate (FAR) of less than 1% at a preset false rejection rate (FRR) of 1% across different types of triggers. On CIFAR10 and GTSRB, the method empirically achieved 0% for both FAR and FRR. It was also evaluated against several trojan attack variants and adaptive attacks.
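To make the two metrics concrete, the sketch below shows one way to fix a detection boundary at a preset FRR and then measure the resulting FAR. It assumes an empirical quantile of benign entropy scores as the boundary, which is a simplification chosen here and may differ from how the paper sets its threshold. The function names `detection_boundary` and `far_frr` are hypothetical.

```python
import numpy as np

def detection_boundary(benign_entropies, frr=0.01):
    """Entropy threshold below which an input is rejected as trojaned.

    Uses the empirical `frr` quantile of benign scores (an assumed method),
    so roughly a fraction `frr` of benign inputs fall below the boundary.
    """
    return float(np.quantile(np.asarray(benign_entropies), frr))

def far_frr(benign_entropies, trojan_entropies, threshold):
    """Return (FAR, FRR) for a given threshold.

    FRR: fraction of benign inputs wrongly rejected (entropy below threshold).
    FAR: fraction of trojaned inputs wrongly accepted (entropy at or above it).
    """
    benign = np.asarray(benign_entropies)
    trojan = np.asarray(trojan_entropies)
    frr = float(np.mean(benign < threshold))
    far = float(np.mean(trojan >= threshold))
    return far, frr
```

When the benign and trojaned entropy distributions do not overlap, a boundary placed between them yields 0% for both rates, which is the situation reported for CIFAR10 and GTSRB.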

Implications and Future Directions

STRIP turns the main strength of an input-agnostic trojan trigger, its ability to force the target class regardless of the input, into the signal that exposes it. This underlines the importance of detecting and countering model vulnerabilities at deployment time, and it makes STRIP a promising step toward securing DNNs in mission-critical applications.

Future research could explore extending STRIP to other domains such as text and audio, where different perturbation strategies could be employed. Additionally, addressing limitations related to source-label-specific backdoors remains an area for future investigation, as STRIP's current focus is predominantly on input-agnostic triggers.

In sum, STRIP offers a practical and scalable solution for identifying trojan attacks in DNNs, contributing significantly to the field of AI security.
