Adversarial Machine Learning at Scale

(1611.01236)
Published Nov 4, 2016 in cs.CV, cs.CR, cs.LG, and stat.ML

Abstract

Adversarial examples are malicious inputs designed to fool machine learning models. They often transfer from one model to another, allowing attackers to mount black box attacks without knowledge of the target model's parameters. Adversarial training is the process of explicitly training a model on adversarial examples, in order to make it more robust to attack or to reduce its test error on clean inputs. So far, adversarial training has primarily been applied to small problems. In this research, we apply adversarial training to ImageNet. Our contributions include: (1) recommendations for how to successfully scale adversarial training to large models and datasets, (2) the observation that adversarial training confers robustness to single-step attack methods, (3) the finding that multi-step attack methods are somewhat less transferable than single-step attack methods, so single-step attacks are the best for mounting black-box attacks, and (4) resolution of a "label leaking" effect that causes adversarially trained models to perform better on adversarial examples than on clean examples, because the adversarial example construction process uses the true label and the model can learn to exploit regularities in the construction process.

Overview

  • The paper investigates how adversarial training can enhance AI model robustness, specifically on large-scale image recognition tasks using the ImageNet dataset.

  • Adversarial training makes models markedly more resilient to single-step attacks; multi-step attacks, though harder to defend against directly, transfer poorly between models and are therefore weaker as black-box attacks.

  • Increased model capacity, meaning models with more parameters, results in greater resistance to adversarial attacks.

  • Adversarial training can cause a slight decrease in performance on clean images, but it protects against adversarial examples and can also act as a regularizer when a model would otherwise overfit.

  • The study suggests that larger models trained with adversarial examples show better robustness, and that adversarial images, particularly those produced by iterative methods, transfer only to a limited degree between models trained in different ways.

Introduction

Adversarial examples are specially crafted inputs that can deceive machine learning models, including neural networks, into making incorrect predictions or classifications. This phenomenon presents a significant concern for the security and reliability of AI systems. Adversarial training is a technique proposed to increase a model's resilience to such inputs by training the model on a mixture of adversarial and genuine data.

Adversarial Training on ImageNet

The paper explores the scalability of adversarial training by applying it to the ImageNet dataset with Inception models. The research highlights several key findings:

  • Training with adversarial examples can indeed enhance the robustness of AI models against single-step adversarial attack methods.
  • Multi-step attack methods transfer between models less readily than single-step ones, so single-step attacks remain the more effective choice for mounting black-box attacks.
  • Models with a higher number of parameters (greater capacity) tend to be more resistant to adversarial examples.
  • A phenomenon termed "label leaking" is noted: because the adversarial example construction process uses the true label, an adversarially trained model can learn to exploit regularities of that process and end up performing better on adversarial examples than on clean ones (a sketch of the fix follows this list).

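The label-leaking effect can be avoided by constructing single-step adversarial examples from the model's own predicted label rather than the ground-truth label. The PyTorch sketch below is a minimal illustration of that idea, assuming images scaled to [0, 1]; the helper names are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps):
    # One-step sign-gradient perturbation: x_adv = x + eps * sign(grad_x loss(model(x), y)).
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def fgsm_without_label_leaking(model, x, eps):
    # Using the true label lets an adversarially trained model exploit regularities
    # of the attack ("label leaking"); substituting the model's own prediction
    # removes the true-label signal from the construction process.
    with torch.no_grad():
        y_pred = model(x).argmax(dim=1)
    return fgsm_perturb(model, x, y_pred, eps)
```
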
Methods for Generating Adversarial Examples

Several techniques for creating adversarial examples are discussed in detail. Notably, adversarial images are not always misclassified: one-step methods based on a linear approximation of the model's loss can fail, especially when the perturbation magnitude is constrained to be small. Various attack strategies are considered, including the Fast Gradient Sign Method (FGSM), the one-step target class method, the basic iterative method, and the iterative least-likely class method, each with its own mechanism for perturbing input images to mislead the model; two of the targeted variants are sketched below.

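As a rough PyTorch sketch of the targeted variants, assuming the same [0, 1] image scaling as above; the step size, iteration count, and function names are illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def one_step_target_class(model, x, y_target, eps):
    # One-step target class method: step *down* the gradient of the loss for a chosen
    # target class (e.g. the least-likely class), pushing the input toward that class.
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y_target).backward()
    return (x - eps * x.grad.sign()).clamp(0, 1).detach()

def iterative_least_likely(model, x, eps, alpha=1 / 255, n_iter=10):
    # Iterative least-likely class method: take repeated small steps toward the class
    # the model rates least probable, keeping the result within an L_inf ball of
    # radius eps around the original image.
    with torch.no_grad():
        y_ll = model(x).argmin(dim=1)  # least-likely class for each example
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        F.cross_entropy(model(x_adv), y_ll).backward()
        with torch.no_grad():
            x_adv = x_adv - alpha * x_adv.grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
        x_adv = x_adv.detach()
    return x_adv
```
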
Adversarial Training Algorithm and Results

For adversarial training to be effective on large-scale datasets like ImageNet, the authors recommend adapting the training procedure to work well with batch normalization, in particular by including both clean and adversarial examples in each minibatch (a training-step sketch follows below). The study found that while adversarial training improved model robustness, there was a modest reduction in accuracy on clean (unperturbed) images compared to baseline models. This defensive approach appears most worthwhile when a model is prone to overfitting or when protection against adversarial examples is a priority. Experiments showed that adversarial training with single-step attacks gave the best balance between robustness and performance on the test set.

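A minimal sketch of one training step in the spirit of this recipe, assuming a PyTorch classifier with batch normalization and inputs in [0, 1]; the 50/50 clean/adversarial split, the per-example eps sampling, and the single mixed-batch loss are illustrative simplifications rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps_max=16 / 255, adv_frac=0.5):
    # Replace a fraction of the minibatch with single-step adversarial versions, so
    # batch normalization computes its statistics over clean and adversarial images together.
    model.train()
    k = int(adv_frac * x.size(0))
    if k > 0:
        x_k = x[:k].clone().detach().requires_grad_(True)
        F.cross_entropy(model(x_k), y[:k]).backward()
        with torch.no_grad():
            # Randomize eps per example (the paper also randomizes perturbation sizes).
            eps = torch.rand(k, 1, 1, 1, device=x.device) * eps_max
            x_adv = (x_k + eps * x_k.grad.sign()).clamp(0, 1)
            x = torch.cat([x_adv, x[k:]], dim=0)
    optimizer.zero_grad()  # discard gradients accumulated while crafting the attack
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The paper uses a weighted combination of the clean and adversarial losses; a single cross-entropy over the mixed batch is used here only to keep the sketch short.
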
Model Capacity and Transferability

The interplay between model size and vulnerability to adversarial inputs is also examined. Larger models exhibit enhanced robustness, particularly when combined with adversarial training. Regarding the transferability of adversarial examples, a critical factor in the security implications of black-box attacks, the findings show that while some adversarial examples crafted against one model can still fool models trained to resist them, examples produced with iterative methods are less likely to transfer between models than single-step ones; a simple way to measure this is sketched below.

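One straightforward way to probe this kind of transferability is to craft adversarial examples against a source model and measure how often they fool a separate target model; a rough sketch, assuming an attack function with the signature used in the sketches above.

```python
import torch

def transfer_error_rate(source_model, target_model, loader, attack, eps):
    # Fraction of inputs the *target* model misclassifies when adversarial examples
    # are crafted against the *source* model (the black-box transfer setting).
    fooled, total = 0, 0
    for x, y in loader:
        x_adv = attack(source_model, x, y, eps)
        with torch.no_grad():
            preds = target_model(x_adv).argmax(dim=1)
        fooled += (preds != y).sum().item()
        total += y.size(0)
    return fooled / total
```

For example, with hypothetical models `model_a` and `model_b` and a data loader `val_loader`, `transfer_error_rate(model_a, model_b, val_loader, fgsm_perturb, eps=8 / 255)` would estimate how well single-step examples crafted on `model_a` transfer to `model_b`.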