When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?

(2407.15211)
Published Jul 21, 2024 in cs.CL, cs.AI, cs.CR, cs.CV, and cs.LG

Abstract

The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-language models (VLMs) that generate text outputs conditioned on visual and textual inputs. We conducted a large-scale empirical study to assess the transferability of gradient-based universal image "jailbreaks" using a diverse set of over 40 open-parameter VLMs, including 18 new VLMs that we publicly release. Overall, we find that transferable gradient-based image jailbreaks are extremely difficult to obtain. When an image jailbreak is optimized against a single VLM or against an ensemble of VLMs, the jailbreak successfully jailbreaks the attacked VLM(s), but exhibits little-to-no transfer to any other VLMs; transfer is not affected by whether the attacked and target VLMs possess matching vision backbones or language models, whether the language model underwent instruction-following and/or safety-alignment training, or many other factors. Only two settings display partially successful transfer: between identically-pretrained and identically-initialized VLMs with slightly different VLM training data, and between different training checkpoints of a single VLM. Leveraging these results, we then demonstrate that transfer can be significantly improved against a specific target VLM by attacking larger ensembles of "highly-similar" VLMs. These results stand in stark contrast to existing evidence of universal and transferable text jailbreaks against language models and transferable adversarial attacks against image classifiers, suggesting that VLMs may be more robust to gradient-based transfer attacks.

Overview

  • The paper investigates the transferability of image-based jailbreaks against Vision-Language Models (VLMs), revealing that while optimized jailbreaks are often universal across prompts, they transfer poorly to other VLMs.

  • Specific settings show partial success in image jailbreak transfer, particularly between identically-initialized VLMs trained on slightly different data and between different training checkpoints of the same VLM.

  • Attacking larger ensembles of "highly similar" VLMs significantly improves jailbreak transfer to a specific target VLM, highlighting the importance of similarity between the attacked ensemble and the target for effective transfer attacks.

An In-Depth Review of "When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?"

This essay aims to provide a comprehensive summary and analysis of the paper "When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?" The authors conducted a large-scale empirical study to explore the transferability of image-based jailbreaks optimized against Vision-Language Models (VLMs). While the existing body of research has demonstrated the vulnerabilities of language models and image classifiers to transfer attacks, this investigation focuses particularly on VLMs, highlighting critical insights into their robustness against such adversarial manipulations.

Core Findings

The core contributions and findings of this work are multi-faceted and can be categorized as follows:

  1. Universal but Non-Transferable Jailbreaks: The study reveals that image jailbreaks optimized against a single VLM or an ensemble of VLMs tend to be universal (effective across many prompts) but poorly transferable to other VLMs; the underlying objective is formalized in the sketch after this list. This behavior was consistent across all factors considered, including shared vision backbones, shared language models, and whether the targeted VLMs underwent instruction-following or safety-alignment training.
  2. Partial Transfer in Specific Settings: Two specific settings displayed partial success in transferring image jailbreaks: (i) between identically-initialized VLMs trained on slightly different data, and (ii) between different training checkpoints of the same VLM. The observed partial transfer suggests that slight changes in training data or additional training influence the transferability of adversarial images only to a limited extent.
  3. Lack of Transfer to Differently-Trained VLMs: The study found no successful transfer between identically-initialized VLMs trained with one-stage versus two-stage finetuning recipes. This strongly implies that the way visual features are integrated into the language model is a critical determinant of successful transfer.
  4. Increased Success with Larger Ensembles of Similar VLMs: The final experiment demonstrated that attacking larger ensembles of "highly similar" VLMs significantly improved the transferability of image jailbreaks to a specific target VLM. This result underscores the importance of high similarity between the attacked ensemble and the target VLM for obtaining better transfer performance.
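
For concreteness, the objective behind these jailbreaks can be written as a single universal image minimizing a summed negative log-likelihood over the attacked ensemble. The notation below (universal image x_adv, attacked ensemble M, dataset D of harmful prompts paired with harmful-yet-helpful responses) is illustrative shorthand rather than the paper's own; it is a minimal sketch of the loss described in the Methodology section.

```latex
% Illustrative formulation (notation is ours, not the paper's):
%   x_adv : the single universal image being optimized
%   M     : the ensemble of attacked VLMs
%   D     : harmful prompts p paired with harmful-yet-helpful responses r
\min_{x_{\mathrm{adv}}}
  \sum_{m \in \mathcal{M}} \;
  \sum_{(p,\, r) \in \mathcal{D}}
  -\log P_m\!\left(r \mid p,\ x_{\mathrm{adv}}\right)
```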

Methodology

The authors employed a robust methodology to optimize and evaluate image jailbreaks:

  • Harmful-Yet-Helpful Text Datasets: Three datasets (AdvBench, Anthropic HHH, and Generated) were used to optimize image jailbreaks, each contributing prompt-response pairs in which harmful requests are met with harmful-yet-helpful responses.
  • Loss Function: Image jailbreaks were optimized by minimizing the negative log-likelihood that the attacked set of VLMs would output a harmful-yet-helpful response given a harmful prompt and the image (see the code sketch after this list).
  • Vision-Language Models (VLMs): The Prismatic suite of VLMs formed the primary experimental base, with 18 new VLMs trained and publicly released to span a broad range of vision backbones and language models.
  • Measuring Jailbreak Success: Cross-entropy loss on the target responses and a Harmful-Yet-Helpful score judged by Claude 3 Opus served as the primary metrics for assessing the efficacy and transferability of the image jailbreaks.
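
To make the optimization loop concrete, here is a minimal PyTorch-style sketch of gradient-based universal image jailbreak optimization against an ensemble of VLMs. The `VLM` protocol and its `nll` method are hypothetical placeholders for a model wrapper (they are not the paper's code or any library's API); the loop simply minimizes the summed negative log-likelihood of harmful-yet-helpful responses, using one image shared across all prompts, as described above.

```python
from typing import Protocol, Sequence, Tuple

import torch


class VLM(Protocol):
    """Hypothetical wrapper around one open-parameter VLM (not the paper's API)."""

    def nll(self, image: torch.Tensor, prompt: str, response: str) -> torch.Tensor:
        """Differentiable negative log-likelihood of `response` given `prompt` and `image`."""
        ...


def optimize_universal_jailbreak(
    vlms: Sequence[VLM],              # ensemble of attacked VLMs
    data: Sequence[Tuple[str, str]],  # (harmful prompt, harmful-yet-helpful response) pairs
    steps: int = 1000,
    lr: float = 1e-2,
    batch_size: int = 8,
) -> torch.Tensor:
    # One shared image for every prompt ("universal"), initialized randomly in [0, 1].
    image = torch.rand(3, 224, 224, requires_grad=True)
    opt = torch.optim.Adam([image], lr=lr)

    for _ in range(steps):
        batch = [data[i] for i in torch.randint(len(data), (batch_size,)).tolist()]

        # Sum the negative log-likelihood of the target responses over models and examples.
        loss = torch.zeros(())
        for vlm in vlms:
            for prompt, response in batch:
                loss = loss + vlm.nll(image, prompt, response)

        opt.zero_grad()
        loss.backward()
        opt.step()

        # Keep the optimized image a valid RGB image.
        with torch.no_grad():
            image.clamp_(0.0, 1.0)

    return image.detach()
```

Transfer would then be evaluated by feeding the resulting image, together with held-out harmful prompts, to a target VLM that was not attacked and scoring its responses, mirroring the harmful-yet-helpful judging described above.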

Practical and Theoretical Implications

Practical Implications

The work highlights the robustness of VLMs to gradient-based transfer attacks compared to their unimodal counterparts like language models and image classifiers. The findings indicate that existing VLM systems possess inherent resilience to such adversarial manipulations, which has significant implications for the deployment of these models in real-world applications where security and robustness are paramount.

However, the partial success in transferring jailbreaks among "highly similar" VLMs suggests a potential avenue for improving adversarial training and defense techniques. Understanding the conditions under which jailbreaks might partially transfer can guide the development of more robust VLM systems that can withstand a broader spectrum of attack vectors.

Theoretical Implications and Future Directions

The robustness of VLMs to transfer attacks suggests a fundamental difference in how multimodal models process disparate types of input, compared to unimodal models. This robustness raises intriguing questions about the integrative mechanisms that could provide resilience against such attacks. Future research should focus on mechanistically understanding the activations or circuits within these models, particularly how visual and textual features are integrated and aligned.

Several potential directions for future research emerge from this work:

  1. Mechanistic Study of VLM Robustness: Detailed investigations into the internal mechanisms of VLMs to understand better how visual and textual inputs are processed and integrated.
  2. Development of More Effective Transfer Attacks: Exploration of sophisticated and computationally intensive attack strategies that might yield more transferable image jailbreaks.
  3. Detection and Defense Mechanisms: Development of efficient techniques for detecting and mitigating image-based jailbreak attempts, ensuring VLMs remain secure and robust in varied operational settings.
  4. Improving Safety-Alignment Training: Continued efforts to enhance safety-alignment training for VLMs to protect against adversarial inputs even more effectively.

Conclusion

In conclusion, this paper represents a significant effort to systematize and deepen our understanding of the transferability of image jailbreaks in VLMs. While the results demonstrate an impressive level of robustness, they also identify areas where adversarial attacks can find leverage, particularly among highly similar models. This work will undoubtedly spur further research aimed at both understanding and improving the adversarial resilience of multimodal AI systems.
