Robustness May Be at Odds with Accuracy (1805.12152v5)

Published 30 May 2018 in stat.ML, cs.CV, cs.LG, and cs.NE

Abstract: We show that there may exist an inherent tension between the goal of adversarial robustness and that of standard generalization. Specifically, training robust models may not only be more resource-consuming, but also lead to a reduction of standard accuracy. We demonstrate that this trade-off between the standard accuracy of a model and its robustness to adversarial perturbations provably exists in a fairly simple and natural setting. These findings also corroborate a similar phenomenon observed empirically in more complex settings. Further, we argue that this phenomenon is a consequence of robust classifiers learning fundamentally different feature representations than standard classifiers. These differences, in particular, seem to result in unexpected benefits: the representations learned by robust models tend to align better with salient data characteristics and human perception.

Summary

The paper reveals a fundamental trade-off between standard accuracy and adversarial robustness, combining theoretical analysis with empirical results.
The research shows that adversarial training using PGD improves robustness but, given sufficient training data, reduces accuracy on unperturbed data on MNIST, CIFAR-10, and a restricted ImageNet dataset.
The findings suggest that the optimal standard classifier and the optimal robust classifier rely on different feature representations, which should inform design choices in safety-critical systems.

The paper "Robustness May Be at Odds with Accuracy" (1805.12152) investigates the relationship between standard classification accuracy and robustness to adversarial examples. It challenges the intuition that training for robustness would necessarily improve or maintain standard accuracy, showing that a fundamental trade-off can exist between these two objectives. The core finding is that this trade-off is not merely an artifact of current training methods or limited data but can be an inherent property of the data distribution itself.

The paper defines standard accuracy in terms of minimizing the expected loss on naturally sampled data (Equation 1), while adversarial robustness is defined by minimizing the expected loss under the worst-case perturbation within a defined set $\Delta$ (Equation 2). Adversarial training, particularly using Projected Gradient Descent (PGD) to find the worst-case perturbation and then training on this perturbed data, has been the most successful empirical approach to building robust models.
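
For reference, the two objectives described above can be written roughly as follows. This is a sketch in common notation, assuming a loss $\mathcal{L}$, model parameters $\theta$, and a data distribution $\mathcal{D}$; these symbols are chosen here for illustration and may not match the paper's exactly.

$$
\begin{aligned}
\text{standard:} \quad & \min_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \big[ \mathcal{L}(x, y; \theta) \big] \\
\text{robust:} \quad & \min_\theta \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \max_{\delta \in \Delta} \mathcal{L}(x + \delta, y; \theta) \Big]
\end{aligned}
$$

A typical choice of perturbation set is the $\ell_\infty$ ball $\Delta = \{\delta : \|\delta\|_\infty \le \epsilon\}$. PGD approximates the inner maximization, and the outer minimization is then carried out on the perturbed inputs.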

While adversarial training is computationally expensive (requiring an inner maximization loop per training step) and may require more data, the paper shows an additional cost: a potential decrease in standard accuracy on unperturbed data. Empirical results on MNIST, CIFAR-10, and a restricted ImageNet dataset (Figure 1) demonstrate that while adversarial training can act as a regularizer and improve standard accuracy in low-data regimes, with sufficient training data, robust models consistently exhibit lower standard accuracy than standard models trained specifically for that objective. This suggests that the features learned for optimal standard performance are different from those needed for adversarial robustness.

To explain this phenomenon, the paper introduces a simple theoretical binary classification model (Equation 3) in which each sample consists of one moderately correlated feature ($x_1$), which agrees with the label with probability $p \ge 0.5$, and many weakly correlated features ($x_2, \ldots, x_{d+1}$), each of which is shifted toward the label by only a small amount $\eta$. The features $x_2, \ldots, x_{d+1}$ are individually weak predictors but collectively informative. A standard classifier leverages them (by effectively pooling them, as in a weighted sum) to achieve very high standard accuracy ($>99\%$). However, the paper shows that an $\ell_\infty$-bounded adversary with a relatively small perturbation magnitude ($\epsilon \ge 2\eta$ in their model) can easily manipulate these weakly correlated features, effectively flipping their predictive signal. By contrast, $x_1$ is more robust to small $\ell_\infty$ perturbations.
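
The effect can be reproduced numerically with a small simulation. The sketch below assumes the weak features are Gaussian with mean $\eta y$ and unit variance, and picks $d = 100$, $p = 0.95$, and $\eta = 3/\sqrt{d}$ purely for demonstration. The function names are hypothetical, and the classifier is a plain average of the weak features rather than anything trained.

```python
import math
import random


def sample(d, p, eta, rng):
    """Draw one (x, y): x[0] is the moderately correlated feature, x[1:] are the weak ones."""
    y = rng.choice((-1, 1))
    x1 = y if rng.random() < p else -y
    weak = [rng.gauss(eta * y, 1.0) for _ in range(d)]
    return [x1] + weak, y


def avg_classifier(x):
    """Predict the sign of the pooled weak features, ignoring x[0]."""
    return 1 if sum(x[1:]) >= 0 else -1


def shift_weak_features(x, y, eps):
    """l_inf perturbation that moves every weak feature by eps against the label."""
    return [x[0]] + [xi - eps * y for xi in x[1:]]


def accuracies(n=2000, d=100, p=0.95, seed=0):
    rng = random.Random(seed)
    eta = 3.0 / math.sqrt(d)  # weak-feature signal strength (assumed for this demo)
    eps = 2.0 * eta           # adversary budget at the threshold eps = 2 * eta
    clean = robust = 0
    for _ in range(n):
        x, y = sample(d, p, eta, rng)
        clean += avg_classifier(x) == y
        robust += avg_classifier(shift_weak_features(x, y, eps)) == y
    return clean / n, robust / n


std_acc, adv_acc = accuracies()
print(f"standard accuracy: {std_acc:.3f}, accuracy under attack: {adv_acc:.3f}")
```

With these settings the pooled classifier should be correct on nearly every clean sample and wrong on nearly every perturbed one, because the shift of $2\eta$ turns each weak feature's mean from $\eta y$ into $-\eta y$.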

The theoretical analysis (Theorem 3.1) demonstrates a provable trade-off: any classifier achieving high standard accuracy (near $100\%$ ) must rely heavily on the non-robust features, making it highly vulnerable to adversarial perturbations and thus having low robust accuracy. Conversely, a classifier that ignores these non-robust features and relies only on the robust one will achieve reasonable robust accuracy but will be limited by the predictive power of the robust feature, leading to lower standard accuracy (Theorem 3.2). This highlights that the optimal standard classifier and the optimal robust classifier learn fundamentally different feature representations.
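
In the model above, the trade-off can be stated quantitatively. Here $\delta$ is the standard error the classifier is allowed and $p$ is the probability that $x_1$ agrees with the label; the numbers in the worked example are illustrative and not taken from the paper.

$$
\text{standard accuracy} \ge 1 - \delta
\;\Longrightarrow\;
\text{robust accuracy} \le \frac{p}{1-p}\,\delta
\qquad \text{for an } \ell_\infty \text{ adversary with } \epsilon \ge 2\eta .
$$

For example, with $p = 0.95$ and $\delta = 0.01$, the bound is $\frac{0.95}{0.05} \times 0.01 = 0.19$: a classifier with 99% standard accuracy can have at most 19% robust accuracy. A classifier that uses only $x_1$ is correct with probability $p$, so its standard accuracy is capped at 95% in this example.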

Practical Implications of the Trade-off:

Model Selection: Developers must choose between a model optimized for maximum standard accuracy (e.g., for tasks where data distribution shift or adversarial attacks are not major concerns) and a model optimized for robustness (e.g., for safety-critical systems where security against malicious input is paramount), as achieving both perfectly may not be possible depending on the data and threat model.
Training Methods: Standard training methods inherently seek out any predictive feature, including those that are non-robust. Achieving robustness requires specific training methods like adversarial training (PGD) that explicitly account for worst-case perturbations. This validates the necessity of robust training research.
Computational Resources: Adversarial training is substantially more computationally intensive than standard training due to the inner optimization loop required to find adversarial examples. Implementing PGD adversarial training involves repeated forward and backward passes for each training sample in a batch to generate the perturbation before the model update step.
Data Requirements: As indicated by related work and the paper's findings in the infinite data limit, achieving robust generalization might require significantly more data than standard generalization.
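
The inner and outer loops of PGD adversarial training can be sketched on a toy problem. The example below is a minimal illustration, not the paper's setup: it assumes a linear logistic-regression model with analytic gradients, labels in $\{-1, +1\}$, an $\ell_\infty$ threat model, and made-up hyperparameters and data. All function names are hypothetical.

```python
import math
import random


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def loss_grads(w, x, y):
    """Gradients of the logistic loss log(1 + exp(-y * w.x)) w.r.t. x and w."""
    coeff = -y * sigmoid(-y * dot(w, x))
    return [coeff * wi for wi in w], [coeff * xi for xi in x]


def pgd_linf(w, x, y, eps, step, iters):
    """Inner maximization: ascend the loss, projecting back onto the l_inf ball around x."""
    x_adv = list(x)
    for _ in range(iters):
        grad_x, _ = loss_grads(w, x_adv, y)
        for i, g in enumerate(grad_x):
            moved = x_adv[i] + step * ((g > 0) - (g < 0))       # signed-gradient step
            x_adv[i] = min(max(moved, x[i] - eps), x[i] + eps)  # projection
    return x_adv


def adversarial_train(data, dim, eps, lr=0.1, epochs=20, pgd_iters=5):
    """Outer minimization: SGD on the loss evaluated at PGD-perturbed inputs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in data:
            x_adv = pgd_linf(w, x, y, eps, step=eps / 2, iters=pgd_iters)
            _, grad_w = loss_grads(w, x_adv, y)
            w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    return w


# Toy data (assumed): one stronger feature and one weak feature per sample.
rng = random.Random(0)
data = []
for _ in range(200):
    y = rng.choice((-1, 1))
    data.append(([rng.gauss(y, 1.0), rng.gauss(0.2 * y, 1.0)], y))

w = adversarial_train(data, dim=2, eps=0.3)
print("learned weights:", w)
```

Each weight update costs `pgd_iters` extra gradient evaluations, which is the source of the added computational cost described above.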

Despite the trade-off, the paper identifies unexpected benefits of robust models, stemming from their reliance on features that are invariant to small, human-imperceptible perturbations:

Improved Interpretability via Gradients: Visualizing the gradient of the loss with respect to the input pixels reveals that robust models' gradients align well with perceptually salient features like edges and textures, which are meaningful to humans (Figure 2). In contrast, standard models' gradients appear noisy and random. This suggests that features learned by robust models are more aligned with human visual perception.
- Implementation: Gradient visualization involves computing the gradients of the loss function with respect to the input tensor (e.g., loss.backward() followed by accessing input.grad in PyTorch).
Meaningful Adversarial Examples: When generating adversarial examples with large perturbation budgets ($\epsilon$ much larger than the value used in training), robust models produce images that appear to humans as samples from a different target class (Figure 3). For standard models, these large perturbations result in noisy, distorted images that still resemble the original class. This further supports that robust models use features representative of object classes.
- Implementation: This involves running PGD with a large $\epsilon$ value on a test image and visualizing the resulting perturbed image.
Smooth Cross-Class Interpolations: Linearly interpolating between an original image and its large-$\epsilon$ adversarial counterpart for a robust model creates smooth, perceptually plausible transitions between the original class and the target class (Figure 4). This behavior is similar to interpolations achieved by generative models like GANs, hinting at a connection between robust optimization and the structure of the data manifold.
- Implementation: Start with an original image $x_{orig}$ and a large-$\epsilon$ adversarial example $x_{adv}$. Generate a sequence of images $x_\alpha = (1-\alpha)x_{orig} + \alpha x_{adv}$ for $\alpha \in [0, 1]$.
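
Two of the implementation notes above, the input gradient and the interpolation, can be sketched in a few lines. To stay dependency-free, this example swaps autograd for the analytic input gradient of a linear logistic model; with a real network one would instead call `loss.backward()` and read the input's `.grad` in PyTorch, as noted above. The weights, inputs, and the stand-in for $x_{adv}$ are made-up values, and the function names are hypothetical.

```python
import math


def input_gradient(w, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    coeff = -y / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]


def interpolate(x_orig, x_adv, steps):
    """Return steps + 1 points x_alpha = (1 - alpha) * x_orig + alpha * x_adv, alpha from 0 to 1."""
    frames = []
    for k in range(steps + 1):
        alpha = k / steps
        frames.append([(1 - alpha) * a + alpha * b for a, b in zip(x_orig, x_adv)])
    return frames


w = [2.0, -1.0, 0.0]        # toy model weights (assumed)
x_orig = [0.5, 0.5, 0.5]    # toy "image" with three pixels (assumed)
x_adv = [1.0, 0.0, 0.5]     # stand-in for a large-epsilon adversarial example (assumed)

saliency = input_gradient(w, x_orig, y=1)
frames = interpolate(x_orig, x_adv, steps=4)
print("input gradient:", saliency)
print("midpoint frame:", frames[2])
```

For an image model, `saliency` would be reshaped to the image dimensions and plotted, and each entry of `frames` would be rendered as one image in the interpolation sequence.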

Implementation Considerations & Trade-offs:

Choosing $\ell_p$ Norm and $\epsilon$: The choice of the $\ell_p$ norm ($\ell_\infty$ or $\ell_2$ are common) and the perturbation magnitude $\epsilon$ for adversarial training dictates the type of robustness and influences the trade-off. Different $\ell_p$ norms correspond to different threat models and lead to models relying on different sets of robust features.
Architecture: While the paper doesn't focus on architecture, empirical evidence suggests that some architectures might be more amenable to robust training or exhibit different trade-offs.
Balancing Objectives: Some subsequent research (beyond this paper) has explored methods to explicitly balance standard and robust accuracy, for example, by using combinations of standard and adversarial loss during training or by employing different regularization techniques.
Alternative to Adversarial Training: The paper's empirical study on linear MNIST (Appendix C, Figure 5) suggests that training a standard model only on features deemed sufficiently correlated with the label (and thus potentially more robust) might yield better robustness than standard training and sometimes even better than adversarial training, albeit at the cost of standard accuracy. This implies that identifying and leveraging intrinsically robust features could be an alternative implementation strategy in certain domains.
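
The feature-filtering idea in the last item can be illustrated as follows. This is a sketch under stated assumptions: it uses the Pearson correlation between each feature and the label as the selection score and a hypothetical threshold of 0.5, neither of which is claimed to match the paper's exact procedure. A standard model would then be trained on the retained features only.

```python
import math


def label_correlation(column, labels):
    """Pearson correlation between one feature column and the labels."""
    n = len(labels)
    mx, my = sum(column) / n, sum(labels) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(column, labels))
    vx = sum((a - mx) ** 2 for a in column)
    vy = sum((b - my) ** 2 for b in labels)
    if vx == 0 or vy == 0:
        return 0.0  # constant feature or single-class labels: treat as uncorrelated
    return cov / math.sqrt(vx * vy)


def select_features(rows, labels, threshold):
    """Indices of features whose absolute correlation with the label is at least threshold."""
    keep = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        if abs(label_correlation(column, labels)) >= threshold:
            keep.append(j)
    return keep


def project(rows, keep):
    """Drop every feature not listed in keep."""
    return [[row[j] for j in keep] for row in rows]


# Toy data (assumed): feature 0 tracks the label, feature 1 is weak, feature 2 is constant.
rows = [[1.0, 0.2, 5.0], [0.9, -0.1, 5.0], [-1.0, 0.1, 5.0], [-0.8, -0.3, 5.0]]
labels = [1, 1, -1, -1]

keep = select_features(rows, labels, threshold=0.5)
filtered = project(rows, keep)
print("kept feature indices:", keep)
```

Raising the threshold discards more weakly correlated features, which is where the cost in standard accuracy mentioned above comes from.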

In summary, the paper provides a foundational insight into the potential conflict between pursuing standard accuracy and adversarial robustness, attributing it to the difference between features useful for standard classification (including non-robust ones) and those required for robustness (only robust ones). While robust training is necessary to achieve robustness in such settings, it may inherently limit peak standard accuracy. However, the features learned by robust models appear more aligned with human perception, offering potential benefits for interpretability and revealing interesting connections to generative models. This work highlights the need for careful consideration of task requirements and threat models when designing and implementing machine learning systems, as maximizing one objective might require sacrificing the other.