The Space of Transferable Adversarial Examples (1704.03453v2)

Published 11 Apr 2017 in stat.ML, cs.CR, and cs.LG

Abstract: Adversarial examples are maliciously perturbed inputs designed to mislead ML models at test-time. They often transfer: the same adversarial example fools more than one model. In this work, we propose novel methods for estimating the previously unknown dimensionality of the space of adversarial inputs. We find that adversarial examples span a contiguous subspace of large (~25) dimensionality. Adversarial subspaces with higher dimensionality are more likely to intersect. We find that for two different models, a significant fraction of their subspaces is shared, thus enabling transferability. In the first quantitative analysis of the similarity of different models' decision boundaries, we show that these boundaries are actually close in arbitrary directions, whether adversarial or benign. We conclude by formally studying the limits of transferability. We derive (1) sufficient conditions on the data distribution that imply transferability for simple model classes and (2) examples of scenarios in which transfer does not occur. These findings indicate that it may be possible to design defenses against transfer-based attacks, even for models that are vulnerable to direct attacks.

Summary

The paper introduces methods for estimating the dimensionality of adversarial subspaces and finds that they span roughly 25 dimensions, large enough for the subspaces of different models to overlap and enable transfer.
It quantitatively analyzes how close different models' decision boundaries are, which helps explain why adversarial examples stay effective when transferred between models.
The findings indicate that it may be possible to design defenses against transfer-based (black-box) attacks, even for models that remain vulnerable to direct attacks.

Essay on "The Space of Transferable Adversarial Examples"

The paper "The Space of Transferable Adversarial Examples," authored by Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel, studies adversarial examples and their transferability across different machine learning models. It examines both the dimensionality of adversarial subspaces and the empirical similarity of different models' decision boundaries, and uses both to explain why adversarial examples transfer.

Key Contributions

The paper introduces novel methods for estimating the dimensionality of the space of adversarial inputs. The authors find that adversarial examples span a contiguous, high-dimensional subspace, and that for two different models a significant fraction of these subspaces is shared. This overlap allows adversarial examples to transfer between models trained on the same task, which poses a security risk because it makes black-box attacks possible.

Dimensionality of Adversarial Subspaces

The authors propose methods such as the Gradient Aligned Adversarial Subspace (GAAS) technique, which finds multiple mutually orthogonal adversarial directions around a single input. These methods indicate that adversarial subspaces have a dimensionality of approximately 25, so adversarial examples are not isolated points but fill a sizable region. This matters because higher-dimensional adversarial subspaces are more likely to intersect across models, which is what enables transferability. For instance, adversarial examples that transfer between fully-connected networks trained on MNIST form a roughly 25-dimensional space, which illustrates the extent of the shared vulnerability.
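The idea of finding several orthogonal directions that all stay aligned with the loss gradient can be illustrated with a small sketch. It is an illustration under stated assumptions, not necessarily the authors' exact construction: the function name gradient_aligned_directions and its parameters are hypothetical, only the L2 case is handled, and a Householder reflection is used as one convenient way to build the vectors. Each returned vector has cosine similarity 1/sqrt(k) with the gradient, so asking for more directions means each one is less aligned with it.

```python
import math


def gradient_aligned_directions(g, k, eps=1.0):
    """Return k mutually orthogonal vectors of L2 norm eps, each having
    cosine similarity 1/sqrt(k) with the gradient g (hypothetical helper)."""
    d = len(g)
    if not 1 <= k <= d:
        raise ValueError("k must satisfy 1 <= k <= len(g)")
    g_norm = math.sqrt(sum(x * x for x in g))
    if g_norm == 0.0:
        raise ValueError("gradient must be non-zero")
    g_hat = [x / g_norm for x in g]

    # u is the unit vector equally aligned with the basis vectors e_1..e_k:
    # <e_i, u> = 1/sqrt(k) for every i < k.
    u = [1.0 / math.sqrt(k) if i < k else 0.0 for i in range(d)]

    # Householder reflection H = I - 2 w w^T / (w^T w) with w = u - g_hat
    # maps u onto g_hat. H is orthogonal, so the images H e_i stay
    # orthonormal and keep inner product 1/sqrt(k) with g_hat = H u.
    w = [ui - gi for ui, gi in zip(u, g_hat)]
    w_sq = sum(x * x for x in w)

    directions = []
    for i in range(k):
        e = [1.0 if j == i else 0.0 for j in range(d)]
        if w_sq > 1e-12:
            scale = 2.0 * w[i] / w_sq
            r = [e[j] - scale * w[j] for j in range(d)]
        else:
            # u already (numerically) equals g_hat; no reflection needed.
            r = e
        directions.append([eps * x for x in r])
    return directions
```

In an actual experiment, each returned vector would be added to the input and the model queried to count how many of the k perturbed inputs are misclassified; that count serves as an estimate of the local adversarial dimensionality.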

Decision Boundary Analysis

In what the authors describe as the first quantitative analysis of its kind, the study measures how close different models' decision boundaries are to one another, in both adversarial and benign directions. The analysis finds that, along a given direction, the distance between two models' boundaries is often smaller than the distance from a legitimate data point to either boundary. Consequently, a perturbation large enough to push an input across one model's boundary is often large enough to cross the other model's boundary as well, which helps explain why adversarial examples crafted for one model tend to remain adversarial for another.
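The comparison described above can be sketched as a simple line search. This is a minimal illustration under assumptions: the helper names min_distance and inter_boundary_distance are hypothetical, models are represented as plain Python callables that return a label, and the inter-boundary distance is taken to be the absolute difference between the two models' minimum distances along the same direction.

```python
import math


def min_distance(predict, x, direction, true_label, max_eps=10.0, step=0.01):
    """Smallest step size eps (on a grid) at which predict(x + eps * unit)
    differs from true_label, or None if no change occurs up to max_eps."""
    norm = math.sqrt(sum(v * v for v in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [v / norm for v in direction]
    n_steps = int(round(max_eps / step))
    for s in range(1, n_steps + 1):
        eps = s * step
        point = [xi + eps * ui for xi, ui in zip(x, unit)]
        if predict(point) != true_label:
            return eps
    return None


def inter_boundary_distance(predict_a, predict_b, x, direction, true_label,
                            max_eps=10.0, step=0.01):
    """Gap between the two models' boundaries along one direction,
    or None if either boundary is not reached within max_eps."""
    d_a = min_distance(predict_a, x, direction, true_label, max_eps, step)
    d_b = min_distance(predict_b, x, direction, true_label, max_eps, step)
    if d_a is None or d_b is None:
        return None
    return abs(d_a - d_b)


# Toy stand-ins for two trained models: thresholds on the first feature.
def model_a(p):
    return 1 if p[0] > 1.0 else 0


def model_b(p):
    return 1 if p[0] > 1.2 else 0
```

With these toy models and a point at the origin, both boundaries lie about 1.0 to 1.2 away along the first axis while the gap between them is only about 0.2, which mirrors the qualitative pattern the paper reports for real models.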

Limits and Implications of Transferability

Beyond demonstrating transferability empirically, the paper also delineates scenarios in which it does not hold. The authors derive sufficient conditions on the data distribution that imply transferability for simple model classes, showing that adversarial perturbations derived from linear decision boundaries remain effective against richer models, such as quadratic models, as long as specific feature-space relationships are maintained. They also provide a counter-example using a modified MNIST dataset in which adversarial examples do not transfer between linear and quadratic models, which shows that transferability is not universal.

Practical and Theoretical Implications

The practical implications of this research are significant: understanding the degree and nature of adversarial transferability is vital for developing robust defenses against adversarial attacks. The findings can guide the design of more resilient machine learning models by drawing attention to how closely different models' decision boundaries align. Theoretically, the study deepens the understanding of how model architectures, data distributions, and latent feature representations influence adversarial vulnerability and transferability.

Future Research Directions

The study suggests future research should focus on identifying data properties and architectural features that influence the extent of adversarial transferability. Further exploration into the robustness of different model classes against sophisticated adversarial examples could also yield strategies to mitigate transferability.

In conclusion, this paper establishes foundational knowledge on the structure and behavior of adversarial examples across diverse models. By exploring the dimensions and intersections of adversarial subspaces, the authors provide critical insights into enhancing the security and robustness of machine learning systems against adversarial threats.