Adversarial Machine Learning at Scale

(1611.01236)
Published Nov 4, 2016 in cs.CV, cs.CR, cs.LG, and stat.ML

Abstract

Adversarial examples are malicious inputs designed to fool machine learning models. They often transfer from one model to another, allowing attackers to mount black box attacks without knowledge of the target model's parameters. Adversarial training is the process of explicitly training a model on adversarial examples, in order to make it more robust to attack or to reduce its test error on clean inputs. So far, adversarial training has primarily been applied to small problems. In this research, we apply adversarial training to ImageNet. Our contributions include: (1) recommendations for how to successfully scale adversarial training to large models and datasets, (2) the observation that adversarial training confers robustness to single-step attack methods, (3) the finding that multi-step attack methods are somewhat less transferable than single-step attack methods, so single-step attacks are the best for mounting black-box attacks, and (4) resolution of a "label leaking" effect that causes adversarially trained models to perform better on adversarial examples than on clean examples, because the adversarial example construction process uses the true label and the model can learn to exploit regularities in the construction process.

Overview

  • The paper investigates how adversarial training can enhance AI model robustness, specifically on large-scale image recognition tasks using the ImageNet dataset.

  • Adversarial training makes models markedly more resilient to single-step attacks; multi-step attacks, though harder to defend against directly, transfer poorly between models and are therefore weaker as black-box attacks.

  • Increased model capacity, meaning models with more parameters, results in greater resistance to adversarial attacks.

  • Adversarial training can cause a slight decrease in performance on clean images, but it protects against adversarial examples and can also act as a regularizer when a model would otherwise overfit.

  • The study suggests that larger models trained with adversarial examples show better robustness, and that adversarial images, particularly those produced by iterative methods, transfer only to a limited degree between models trained in different ways.

Introduction

Adversarial examples are specially crafted inputs that can deceive machine learning models, including neural networks, into making incorrect predictions or classifications. This phenomenon presents a significant concern for the security and reliability of AI systems. Adversarial training is a technique proposed to increase a model's resilience to such inputs by training the model on a mixture of adversarial and genuine data.

Adversarial Training on ImageNet

The paper explores the scalability of adversarial training by applying it to the ImageNet dataset with Inception models. The research highlights several key findings:

  • Training with adversarial examples can indeed enhance the robustness of AI models against single-step adversarial attack methods.
  • Multi-step attack methods transfer between models less readily than single-step ones, so single-step attacks remain the more effective choice for mounting black-box attacks.
  • Models with a higher number of parameters (greater capacity) tend to be more resistant to adversarial examples.
  • A phenomenon termed "label leaking" is noted: because the adversarial example construction process uses the true label, an adversarially trained model can learn to exploit regularities of that process and end up performing better on adversarial examples than on clean ones (a sketch of the fix follows this list).

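The label-leaking effect can be avoided by constructing single-step adversarial examples from the model's own predicted label rather than the ground-truth label. The PyTorch sketch below is a minimal illustration of that idea, assuming images scaled to [0, 1]; the helper names are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, eps):
    # One-step sign-gradient perturbation: x_adv = x + eps * sign(grad_x loss(model(x), y)).
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def fgsm_without_label_leaking(model, x, eps):
    # Using the true label lets an adversarially trained model exploit regularities
    # of the attack ("label leaking"); substituting the model's own prediction
    # removes the true-label signal from the construction process.
    with torch.no_grad():
        y_pred = model(x).argmax(dim=1)
    return fgsm_perturb(model, x, y_pred, eps)
```
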
Methods for Generating Adversarial Examples

Several techniques for creating adversarial examples are discussed in detail. Notably, adversarial images are not always misclassified: one-step methods based on a linear approximation of the model's loss can fail, especially when the perturbation magnitude is constrained to be small. Various attack strategies are considered, including the Fast Gradient Sign Method (FGSM), the one-step target class method, the basic iterative method, and the iterative least-likely class method, each with its own mechanism for perturbing input images to mislead the model; two of the targeted variants are sketched below.

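As a rough PyTorch sketch of the targeted variants, assuming the same [0, 1] image scaling as above; the step size, iteration count, and function names are illustrative choices rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def one_step_target_class(model, x, y_target, eps):
    # One-step target class method: step *down* the gradient of the loss for a chosen
    # target class (e.g. the least-likely class), pushing the input toward that class.
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y_target).backward()
    return (x - eps * x.grad.sign()).clamp(0, 1).detach()

def iterative_least_likely(model, x, eps, alpha=1 / 255, n_iter=10):
    # Iterative least-likely class method: take repeated small steps toward the class
    # the model rates least probable, keeping the result within an L_inf ball of
    # radius eps around the original image.
    with torch.no_grad():
        y_ll = model(x).argmin(dim=1)  # least-likely class for each example
    x_adv = x.clone().detach()
    for _ in range(n_iter):
        x_adv.requires_grad_(True)
        F.cross_entropy(model(x_adv), y_ll).backward()
        with torch.no_grad():
            x_adv = x_adv - alpha * x_adv.grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
        x_adv = x_adv.detach()
    return x_adv
```
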
Adversarial Training Algorithm and Results

For adversarial training to be effective on large-scale datasets like ImageNet, the authors recommend adapting the training procedure to work well with batch normalization, in particular by including both clean and adversarial examples in each minibatch (a training-step sketch follows below). The study found that while adversarial training improved model robustness, there was a modest reduction in accuracy on clean (unperturbed) images compared to baseline models. This defensive approach appears most worthwhile when a model is prone to overfitting or when protection against adversarial examples is a priority. Experiments showed that adversarial training with single-step attacks gave the best balance between robustness and performance on the test set.

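A minimal sketch of one training step in the spirit of this recipe, assuming a PyTorch classifier with batch normalization and inputs in [0, 1]; the 50/50 clean/adversarial split, the per-example eps sampling, and the single mixed-batch loss are illustrative simplifications rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps_max=16 / 255, adv_frac=0.5):
    # Replace a fraction of the minibatch with single-step adversarial versions, so
    # batch normalization computes its statistics over clean and adversarial images together.
    model.train()
    k = int(adv_frac * x.size(0))
    if k > 0:
        x_k = x[:k].clone().detach().requires_grad_(True)
        F.cross_entropy(model(x_k), y[:k]).backward()
        with torch.no_grad():
            # Randomize eps per example (the paper also randomizes perturbation sizes).
            eps = torch.rand(k, 1, 1, 1, device=x.device) * eps_max
            x_adv = (x_k + eps * x_k.grad.sign()).clamp(0, 1)
            x = torch.cat([x_adv, x[k:]], dim=0)
    optimizer.zero_grad()  # discard gradients accumulated while crafting the attack
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The paper uses a weighted combination of the clean and adversarial losses; a single cross-entropy over the mixed batch is used here only to keep the sketch short.
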
Model Capacity and Transferability

The interplay between model size and vulnerability to adversarial inputs is also examined. Larger models exhibit enhanced robustness, particularly when combined with adversarial training. Regarding the transferability of adversarial examples, a critical factor in the security implications of black-box attacks, the findings show that while some adversarial examples crafted against one model can still fool models trained to resist them, examples produced with iterative methods are less likely to transfer between models than single-step ones; a simple way to measure this is sketched below.

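One straightforward way to probe this kind of transferability is to craft adversarial examples against a source model and measure how often they fool a separate target model; a rough sketch, assuming an attack function with the signature used in the sketches above.

```python
import torch

def transfer_error_rate(source_model, target_model, loader, attack, eps):
    # Fraction of inputs the *target* model misclassifies when adversarial examples
    # are crafted against the *source* model (the black-box transfer setting).
    fooled, total = 0, 0
    for x, y in loader:
        x_adv = attack(source_model, x, y, eps)
        with torch.no_grad():
            preds = target_model(x_adv).argmax(dim=1)
        fooled += (preds != y).sum().item()
        total += y.size(0)
    return fooled / total
```

For example, with hypothetical models `model_a` and `model_b` and a data loader `val_loader`, `transfer_error_rate(model_a, model_b, val_loader, fgsm_perturb, eps=8 / 255)` would estimate how well single-step examples crafted on `model_a` transfer to `model_b`.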