AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models (2310.15140v2)

Published 23 Oct 2023 in cs.CR, cs.AI, cs.CL, and cs.LG

Abstract: Safety alignment of LLMs can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters; manual jailbreak attacks craft readable prompts, but their limited number due to the necessity of human creativity allows for easy blocking. In this paper, we show that these solutions may be too optimistic. We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types. Guided by the dual goals of jailbreak and readability, AutoDAN optimizes and generates tokens one by one from left to right, resulting in readable prompts that bypass perplexity filters while maintaining high attack success rates. Notably, these prompts, generated from scratch using gradients, are interpretable and diverse, with emerging strategies commonly seen in manual jailbreak attacks. They also generalize to unforeseen harmful behaviors and transfer to black-box LLMs better than their unreadable counterparts when using limited training data or a single proxy model. Furthermore, we show the versatility of AutoDAN by automatically leaking system prompts using a customized objective. Our work offers a new way to red-team LLMs and understand jailbreak mechanisms via interpretability.

Summary

The paper introduces AutoDAN, a novel gradient-based framework that optimizes token sequences to balance jailbreak objectives with human readability.
It employs a two-stage process—preliminary and fine selection—to generate coherent adversarial prompts that successfully bypass defenses on models like Vicuna-7B.
The results underscore the need for stronger LLM defenses, including more robust, context-aware security strategies against adversarial attacks.

Analysis of "AutoDAN: Interpretable Gradient-Based Adversarial Attacks on LLMs"

The paper examines how vulnerable LLMs are to both manual jailbreaks and automatic adversarial attacks, and argues that safety alignment is often inadequate. It challenges the assumption that current detection and mitigation strategies are effective by introducing AutoDAN, an attack that blends the interpretability and readable phrasing of manual attacks with the automated scalability of gradient-based attacks.

Conceptual Framework and Methodology

AutoDAN, short for Automatically Do-Anything-Now, is an interpretable, gradient-based adversarial attack that optimizes for both readability and attack success. Unlike earlier gradient-based attacks, which produce unreadable gibberish, AutoDAN generates adversarial sequences that remain readable and coherent and therefore pass perplexity-based filters. It builds the sequence iteratively, optimizing one token at a time from left to right while balancing two core objectives: jailbreaking the model and keeping the sequence readable to humans.
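
The left-to-right loop described above can be illustrated with a toy sketch. It is not the paper's implementation: the vocabulary, the bigram table standing in for a readability model, the `jailbreak_score` function, and the `weight` parameter are all hypothetical stand-ins, and the real attack scores tokens with an LLM rather than a lookup table.

```python
from typing import Callable, List, Sequence

# Hypothetical toy vocabulary; "zx9" plays the role of a gibberish token.
VOCAB = ["write", "a", "story", "zx9"]

# Hypothetical bigram table standing in for a language model's readability score.
# Listed pairs are "fluent" (log-prob 0.0); everything else gets -5.0.
BIGRAMS = {(None, "write"): 0.0, ("write", "a"): 0.0, ("a", "story"): 0.0}


def readability_score(prefix: List[str], token: str) -> float:
    """Toy log-probability of `token` following the current prefix."""
    prev = prefix[-1] if prefix else None
    return BIGRAMS.get((prev, token), -5.0)


def jailbreak_score(prefix: List[str], token: str) -> float:
    """Toy attack objective (assumption): the gibberish token helps most."""
    return {"zx9": 2.0, "story": 1.0}.get(token, 0.0)


def generate_left_to_right(
    vocab: Sequence[str],
    jailbreak: Callable[[List[str], str], float],
    readability: Callable[[List[str], str], float],
    weight: float,
    length: int,
) -> List[str]:
    """Append one token at a time, each chosen by the weighted sum of both objectives."""
    prompt: List[str] = []
    for _ in range(length):
        best = max(
            vocab,
            key=lambda tok: weight * jailbreak(prompt, tok) + readability(prompt, tok),
        )
        prompt.append(best)
    return prompt


# A moderate weight keeps the prompt readable; a very large one collapses to gibberish.
readable = generate_left_to_right(VOCAB, jailbreak_score, readability_score, 1.0, 3)
gibberish = generate_left_to_right(VOCAB, jailbreak_score, readability_score, 10.0, 3)
print(readable, gibberish)
```

In this toy setting, the only difference between a readable result and a gibberish one is the weight placed on the attack objective, which is the balance the paragraph above describes.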

The paper describes a two-stage optimization for each new token. Preliminary selection narrows the vocabulary to a list of candidate tokens by combining the gradient of the jailbreak objective with the readability likelihood. Fine selection then refines the choice among those candidates using a weighted combination of the same two objectives. Token selection also adapts to how entropy varies across token positions, which modulates the influence of the jailbreak objective relative to readability depending on how strongly the context constrains the next token.
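
As a rough illustration of the two stages, the sketch below shortlists tokens with a cheap proxy score and then re-scores only the shortlist. All numbers are made up, the function names and the `weight` and `top_b` parameters are hypothetical, and the final step takes a plain argmax for determinism, which may differ from how the paper selects among candidates.

```python
import math
from typing import Dict, List


def log_softmax(scores: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def preliminary_selection(
    jailbreak_grad: List[float],
    readability_logprob: List[float],
    weight: float,
    top_b: int,
) -> List[int]:
    """Stage 1: rank every token by a cheap proxy score and keep the top_b ids."""
    proxy = [weight * g + r for g, r in zip(jailbreak_grad, readability_logprob)]
    ranked = sorted(range(len(proxy)), key=lambda i: proxy[i], reverse=True)
    return ranked[:top_b]


def fine_selection(
    candidates: List[int],
    exact_jailbreak: List[float],
    readability_logprob: List[float],
    weight: float,
) -> int:
    """Stage 2: re-score only the shortlist with the exact objective, return the best id."""
    combined: Dict[int, float] = {
        c: weight * exact_jailbreak[c] + readability_logprob[c] for c in candidates
    }
    return max(combined, key=combined.get)


# Made-up values for a five-token vocabulary (assumptions, not data from the paper).
vocab = ["the", "story", "xq#", "please", "zzz"]
readability_logprob = log_softmax([2.0, 1.5, -4.0, 1.0, -5.0])
jailbreak_grad = [0.1, 0.9, 1.2, 0.4, 1.0]        # gradient-based proxy per token
exact_jailbreak = [-2.0, -0.5, -0.1, -1.0, -0.3]  # exact objective per token

shortlist = preliminary_selection(jailbreak_grad, readability_logprob, 1.0, 3)
chosen = fine_selection(shortlist, exact_jailbreak, readability_logprob, 1.0)
print([vocab[i] for i in shortlist], vocab[chosen])
```

The point of the split is cost: the proxy in stage 1 is cheap enough to apply to the whole vocabulary, while the more exact evaluation in stage 2 runs only on the few survivors. Here the two unreadable tokens are dropped in stage 1 even though they score well on the attack objective alone.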

Results and Implications

Empirical results underscore the efficacy of AutoDAN in bypassing existing defenses. AutoDAN achieves high attack success rates against models like Vicuna-7B, Guanaco, and Pythia-12B, even with perplexity-based defenses in place. The generated prompts are not only successful at jailbreaking but also semantically coherent, so they avoid the gibberish signal that current defenses typically rely on.
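
For context, a perplexity-based filter of the kind mentioned above can be sketched in a few lines. Perplexity is the exponential of the negative mean token log-probability. The per-token probabilities and the `THRESHOLD` value below are illustrative assumptions, not figures from the paper, and a real filter would obtain the log-probabilities from a language model.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean log-probability) over the prompt's tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_filter(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Accept a prompt only if its perplexity does not exceed the threshold."""
    return perplexity(token_logprobs) <= threshold


# Illustrative inputs (assumptions): a fluent prompt whose tokens each have
# probability 0.5, and a gibberish prompt whose tokens each have probability 0.001.
readable_logprobs = [math.log(0.5)] * 4
gibberish_logprobs = [math.log(0.001)] * 4
THRESHOLD = 100.0  # hypothetical cutoff

print(perplexity(readable_logprobs), passes_filter(readable_logprobs, THRESHOLD))
print(perplexity(gibberish_logprobs), passes_filter(gibberish_logprobs, THRESHOLD))
```

A filter like this separates fluent text from token soup, which is why an attack that keeps its prompts readable is not caught by it.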

The research notes strategies that emerge in AutoDAN prompts, such as "Shifting Domains" and "Detailizing Instructions," which resemble tactics used in human-crafted jailbreaks. This suggests the optimization exploits how LLMs interpret context, and it emphasizes the need for more robust defenses that account for the nuanced nature of adversarial attacks rather than relying on gibberish detection alone.

Broader Impact and Future Directions

AutoDAN illustrates potential weaknesses in current LLM protection mechanisms and suggests that model creators should explore defense strategies beyond simple filtering and blacklisting of known attack vectors. The paper also applies AutoDAN to prompt leaking, which demonstrates its versatility for probing other vulnerable points in LLM deployments.

Future Trajectories: More sophisticated defenses, such as embedding self-awareness and contextual understanding within LLMs, could be a critical enhancement. Given AutoDAN's adaptability and effectiveness, improving LLMs' understanding of complex, multifaceted security scenarios could add a layer of defense beyond classical metrics like perplexity. These insights could help guide AI safety research toward architectures that are resilient to evolving adversarial techniques.

This paper encourages a shift toward model robustness against intelligently crafted adversarial inputs, moving away from traditional, reactive security measures toward preemptive defenses. In summary, AutoDAN highlights and exploits gaps in LLM safety, and it serves as a call to strengthen AI security frameworks.
