Rethinking Model Ensemble in Transfer-based Adversarial Attacks

Published 16 Mar 2023 in cs.CV | (2303.09105v2)

Abstract: It is widely recognized that deep learning models lack robustness to adversarial examples. An intriguing property of adversarial examples is that they can transfer across different models, which enables black-box attacks without any knowledge of the victim model. An effective strategy to improve the transferability is attacking an ensemble of models. However, previous works simply average the outputs of different models, lacking an in-depth analysis on how and why model ensemble methods can strongly improve the transferability. In this paper, we rethink the ensemble in adversarial attacks and define the common weakness of model ensemble with two properties: 1) the flatness of loss landscape; and 2) the closeness to the local optimum of each model. We empirically and theoretically show that both properties are strongly correlated with the transferability and propose a Common Weakness Attack (CWA) to generate more transferable adversarial examples by promoting these two properties. Experimental results on both image classification and object detection tasks validate the effectiveness of our approach to improving the adversarial transferability, especially when attacking adversarially trained models. We also successfully apply our method to attack a black-box large vision-language model, Google's Bard, showing the practical effectiveness. Code is available at https://github.com/huanranchen/AdversarialAttacks.

Summary

The paper presents a Common Weakness Attack (CWA) that exploits the common weakness of a model ensemble, defined by two properties: the flatness of the loss landscape and the closeness to each model's local optimum.
The method integrates Sharpness Aware Minimization and Cosine Similarity Encourager, increasing attack success rates by up to 30%.
Empirical validation on 31 victim models, including adversarially trained and vision-language systems, demonstrates the approach's robust performance.

This paper, authored by Chen et al., presents an in-depth analysis of model ensemble methods in the context of transfer-based adversarial attacks. It addresses a significant gap in the understanding of how and why model ensemble methods enhance transferability. The authors propose a novel attack method, termed Common Weakness Attack (CWA), which targets common weaknesses across models to generate more transferable adversarial examples.

Problem Statement and Motivation

Deep neural networks are known to be vulnerable to adversarial examples, which are inputs modified by subtle perturbations that can mislead the model's predictions. The transferability of these adversarial examples across different models can facilitate black-box attacks, where the adversary has no direct access to the victim model. Traditionally, ensemble methods improve transferability by averaging outputs from various models. However, there has been limited exploration of the underlying principles that make this approach effective.

Methodological Contributions

The key contribution is the introduction of the concept of common weaknesses in model ensembles. The authors propose two properties that characterize these weaknesses:

Flatness of the Loss Landscape: A flatter loss landscape around the adversarial example is associated with better generalization to unseen models, and therefore with better transferability.
Closeness to Local Optima of Each Model: The proximity of the adversarial example to local optima of each surrogate model boosts transferability.
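
One way to see why both properties matter is a second-order Taylor expansion, sketched below. The notation is an assumption made for illustration and does not come from the summary: $L_i$ is the attacker's objective on the $i$-th surrogate model, written as a quantity to be minimized; $p_i$ is a local minimizer of $L_i$ near the adversarial example $x$ (so the first-order term vanishes); and $H_i$ is the Hessian of $L_i$ at $p_i$.

```latex
\begin{align}
L_i(x) &\approx L_i(p_i) + \tfrac{1}{2}\,(x - p_i)^\top H_i\,(x - p_i), \\
(x - p_i)^\top H_i\,(x - p_i) &\le \|H_i\|_2\,\|x - p_i\|_2^2 .
\end{align}
```

Under this approximation, a small $\|H_i\|_2$ (flatness) and a small $\|x - p_i\|_2$ (closeness) each keep $L_i(x)$ near its optimal value on model $i$.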

To optimize these properties, the authors present the Common Weakness Attack (CWA) by combining two sub-methods:

Sharpness Aware Minimization (SAM): Aimed at flattening the loss landscape to improve generalization.
Cosine Similarity Encourager (CSE): Encourages closeness to local optima by maximizing cosine similarity of gradients between models.
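
The two ingredients can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy quadratic loss, the function names (`quad_grad`, `sam_gradient`, `mean_pairwise_cosine`), the L2-normalized ascent step, and the radius `rho` are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def quad_grad(x, center, curvature):
    """Gradient of the toy loss 0.5 * curvature * ||x - center||^2 (hypothetical stand-in for a model loss)."""
    return curvature * (x - center)

def sam_gradient(grad_fn, x, rho=0.1):
    """SAM-style gradient: take an ascent step of radius rho, then evaluate the gradient there."""
    g = grad_fn(x)
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return g  # already at a stationary point; no ascent direction
    x_perturbed = x + rho * g / norm
    return grad_fn(x_perturbed)

def mean_pairwise_cosine(grads):
    """Average cosine similarity over all pairs of per-model gradients."""
    sims = []
    for i in range(len(grads)):
        for j in range(i + 1, len(grads)):
            a, b = grads[i], grads[j]
            denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
            sims.append(float(a @ b) / denom)
    return float(np.mean(sims))
```

Descending along `sam_gradient` favors regions where the loss stays low within a small neighborhood, which is one way to promote flatness. A higher `mean_pairwise_cosine` means the surrogate models agree more on the update direction, which is the quantity a cosine-similarity objective would try to increase.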

The integration of these methods into existing attacks, such as momentum iterative methods, results in enhanced adversarial transferability.
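
For context, here is a minimal sketch of a momentum iterative attack on an ensemble, the kind of baseline into which such components would be plugged. It shows only the plain baseline (averaged gradients, L1-normalized momentum, sign step, projection onto an L-infinity ball), not CWA itself. The function name `mi_fgsm_ensemble`, the callables in `grad_fns` (each assumed to return the gradient of one surrogate model's loss with respect to the input), and the default hyperparameters are illustrative assumptions.

```python
import numpy as np

def mi_fgsm_ensemble(x0, grad_fns, eps=8 / 255, alpha=2 / 255, steps=10, mu=1.0):
    """Momentum iterative sign-gradient ascent on the average loss of several surrogate models."""
    x = x0.copy()
    momentum = np.zeros_like(x0)
    for _ in range(steps):
        # Average the input gradients of all surrogate models.
        grad = np.mean([g(x) for g in grad_fns], axis=0)
        # Accumulate the L1-normalized gradient into the momentum buffer.
        momentum = mu * momentum + grad / (np.abs(grad).sum() + 1e-12)
        # Signed ascent step, then projection onto the eps-ball and the valid pixel range.
        x = x + alpha * np.sign(momentum)
        x = np.clip(x, x0 - eps, x0 + eps)
        x = np.clip(x, 0.0, 1.0)
    return x
```

A flatness or gradient-agreement term would replace or augment the `grad` line, while the momentum and projection steps stay the same.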

Empirical Validation

The effectiveness of CWA is validated through experiments on image classification and object detection, and through an attack on a black-box large vision-language model (Google's Bard). The method significantly improves attack success rates across 31 diverse victim models. Notably, against adversarially trained models and state-of-the-art defenses, CWA shows a marked increase in attack success rates, by as much as 30% in some cases. These results underscore the importance of targeting common weaknesses in adversarial attack strategies.
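
As a reference point for reading these numbers, attack success rate in the untargeted setting is commonly computed as the fraction of adversarial examples that the victim model misclassifies. The sketch below assumes that definition; the summary does not state the exact evaluation protocol, and `attack_success_rate` is a hypothetical helper name.

```python
import numpy as np

def attack_success_rate(pred_labels, true_labels):
    """Fraction of adversarial inputs whose predicted label differs from the true label."""
    pred = np.asarray(pred_labels)
    true = np.asarray(true_labels)
    return float(np.mean(pred != true))
```

For example, if a victim model mislabels 2 of 4 adversarial images, the function returns 0.5.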

Implications and Future Directions

The proposed CWA method is robust across various tasks and models, highlighting its potential as a tool for evaluating model robustness. The insights gained from the study of common weaknesses can inform the development of more resilient defense strategies. Moreover, the versatility of the CWA algorithm suggests its applicability in areas beyond adversarial attacks, potentially influencing the design of models with improved generalization capabilities.

In future research, the exploration of adaptive defense mechanisms that can identify and guard against attacks exploiting common weaknesses will be essential. Additionally, as AI systems are increasingly deployed in safety-critical applications, continual enhancements in understanding model vulnerabilities will remain a critical area of investigation. This work contributes a foundational understanding of ensemble methods in adversarial contexts, providing a basis for future advancements in both attack and defense strategies in machine learning.
