Constructing Unrestricted Adversarial Examples with Generative Models (1805.07894v4)

Published 21 May 2018 in cs.LG, cs.AI, cs.CR, cs.CV, and stat.ML

Abstract: Adversarial examples are typically constructed by perturbing an existing data point within a small matrix norm, and current defense methods are focused on guarding against this type of attack. In this paper, we propose unrestricted adversarial examples, a new threat model where the attackers are not restricted to small norm-bounded perturbations. Different from perturbation-based attacks, we propose to synthesize unrestricted adversarial examples entirely from scratch using conditional generative models. Specifically, we first train an Auxiliary Classifier Generative Adversarial Network (AC-GAN) to model the class-conditional distribution over data samples. Then, conditioned on a desired class, we search over the AC-GAN latent space to find images that are likely under the generative model and are misclassified by a target classifier. We demonstrate through human evaluation that unrestricted adversarial examples generated this way are legitimate and belong to the desired class. Our empirical results on the MNIST, SVHN, and CelebA datasets show that unrestricted adversarial examples can bypass strong adversarial training and certified defense methods designed for traditional adversarial attacks.

Summary

The paper demonstrates that unrestricted adversarial examples constructed via generative models can fool classifiers using novel latent space exploration techniques.
It leverages AC-GANs to synthesize complete images that maintain human interpretability while bypassing traditional perturbation defenses.
Reported attack success rates exceed 84%, and the generated examples reduce the accuracy of a black-box certified classifier by 35.2%, exposing a gap in defenses built for norm-bounded perturbations.

Constructing Unrestricted Adversarial Examples with Generative Models: An Overview

The paper introduces a novel approach to adversarial attacks in machine learning by expanding beyond traditional perturbation-based methods, focusing on unrestricted adversarial examples synthesized entirely through generative models. This approach challenges the efficacy of existing defensive strategies that are typically designed to mitigate small-norm perturbations.

Motivation and Approach

The susceptibility of machine learning algorithms to adversarial examples has been extensively documented, with many classifiers being vulnerable to minimal alterations in input data. Such vulnerabilities raise significant security concerns, especially in safety-critical applications like autonomous driving and intelligent assistants. Traditionally, attacks have been conducted by slightly altering existing data points to mislead classifiers. This paper, however, explores a more generalized attack model wherein adversarial examples are constructed from the ground up using generative mechanisms without merely perturbing existing inputs.

The core of this research builds on advances in generative modeling, in particular Auxiliary Classifier Generative Adversarial Networks (AC-GANs), to synthesize complete images that humans still recognize as the intended class but that the target classifier mislabels.

Methodology

The method searches the latent space of a conditional generative model. With the class label fixed to a desired class, it looks for latent codes whose generated images remain likely under the generative model yet are misclassified by a target classifier, including classifiers equipped with state-of-the-art defenses. The paper draws a sharp distinction between perturbation-based and unrestricted adversarial examples: the latter drop the constraint on perturbation size and are synthesized as new images rather than derived from an existing input.
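The latent-space search described above can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: the two-dimensional "generator", the two logistic "classifiers", the three loss terms (fool the target classifier, stay near the starting latent code, keep the auxiliary classifier's label), the weights `lam1` and `lam2`, and the finite-difference optimizer. It is not the paper's networks, objective, or hyperparameters, only one plausible shape such a search could take.

```python
import math


def softplus(t):
    # Numerically stable log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


# Hypothetical stand-ins for trained models (NOT the paper's networks).
CENTERS = {0: (-1.0, 0.0), 1: (1.0, 0.0)}


def generate(z, y):
    """Toy conditional generator: class center plus latent offset."""
    cx, cy = CENTERS[y]
    return (cx + z[0], cy + z[1])


def target_logit(x):
    """Toy target classifier with a skewed boundary (logit for class 1)."""
    return 4.0 * (x[0] + 2.0 * x[1])


def aux_logit(x):
    """Toy auxiliary classifier, standing in for an AC-GAN class head."""
    return 4.0 * x[0]


def predict(logit):
    return 1 if logit > 0 else 0


def loss(z, z0, y_source, y_target, eps=1.0, lam1=1.0, lam2=1.0):
    x = generate(z, y_source)
    sign_t = 1.0 if y_target == 1 else -1.0
    sign_s = 1.0 if y_source == 1 else -1.0
    # -log p_target(y_target | x): push the target classifier to y_target.
    fool = softplus(-sign_t * target_logit(x))
    # Hinge penalty: no cost while each coordinate stays within eps of z0.
    stay = sum(max(abs(a - b) - eps, 0.0) for a, b in zip(z, z0))
    # -log p_aux(y_source | x): keep the generated sample in its class.
    keep = softplus(-sign_s * aux_logit(x))
    return fool + lam1 * stay + lam2 * keep


def search_latent(z0, y_source, y_target, steps=200, lr=0.05, h=1e-5):
    """Gradient descent on the loss using central finite differences."""
    z = list(z0)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            up = z[:i] + [z[i] + h] + z[i + 1:]
            down = z[:i] + [z[i] - h] + z[i + 1:]
            diff = loss(up, z0, y_source, y_target) - loss(down, z0, y_source, y_target)
            grad.append(diff / (2 * h))
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z


# Fixed starting code for reproducibility; a real search would sample it.
z0 = [0.0, 0.0]
z_adv = search_latent(z0, y_source=0, y_target=1)
x_adv = generate(z_adv, 0)
print("target classifier:", predict(target_logit(x_adv)))
print("auxiliary classifier:", predict(aux_logit(x_adv)))
```

In this toy setting the search moves the latent code until the skewed target classifier flips its label while the auxiliary classifier still assigns the conditioned class. With real networks the finite differences would be replaced by automatic differentiation, and agreement with the auxiliary classifier would be only a proxy for the human evaluation the paper relies on.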

Experimental Evaluation

The paper evaluates the proposed method on MNIST, SVHN, and CelebA and reports high attack success rates against defended classifiers. Reported success rates exceed 84%, indicating that these attacks remain effective against adversarially trained models as well as certified defenses.

One striking outcome is that the unrestricted adversarial examples reduce the accuracy of a black-box certified classifier by 35.2%, illustrating moderate transferability and thus highlighting the potential for widespread vulnerability across different classifier architectures.

Implications

The implications of this research are twofold. Practically, it exposes a security threat to machine learning systems: an attacker can produce inputs that humans recognize as the intended class but that classifiers mislabel. Theoretically, it shows that generative models can explore much larger portions of the input space than norm-bounded perturbations, revealing the limits of defense mechanisms designed primarily for perturbation-based threat models.

Future Directions

This research invites further exploration into designing and training classifiers that can withstand such novel attack strategies. It emphasizes the need for new defensive paradigms capable of recognizing and handling inputs generated by sophisticated generative models. Additionally, future work could focus on improving generative model fidelity and interpretability of adversarial examples.

Overall, the paper makes a significant contribution to the ongoing study of adversarial robustness in AI systems by challenging existing security assumptions and examining them through the lens of generative models.
