COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

(2402.08679)
Published Feb 13, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

Jailbreaks on LLMs have recently received increasing attention. For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e. how to enforce control on LLM attacks. In this paper, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. Based on this connection, we adapt the Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the COLD-Attack framework which unifies and automates the search of adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios which not only cover the standard setting of generating fluent suffix attacks, but also allow us to address new controllable attack settings such as revising a user query adversarially with minimal paraphrasing, and inserting stealthy attacks in context with left-right-coherence. Our extensive experiments on various LLMs (Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5) show COLD-Attack's broad applicability, strong controllability, high success rate, and attack transferability. Our code is available at https://github.com/Yu-Fangxu/COLD-Attack.

Overview

  • COLD-Attack introduces a novel framework for automating adversarial attacks on LLMs, focusing on the generation of controlled, stealthy, and semantically coherent prompts.

  • Utilizes Energy-based Constrained Decoding with Langevin Dynamics to generate adversarial prompts, offering improvements in stealthiness and fluency.

  • Successfully evaluated across various LLMs, showing superior controllability, lower detectability, and strong transferability in adversarial prompt generation.

  • Highlights the need for robust defense mechanisms by demonstrating advanced techniques in generating controlled, stealthy adversarial attacks against LLMs.

COLD-Attack: Automating Adversarial LLM Jailbreaks with Controllable and Stealthy Methods

Introduction

The advent of jailbreaking techniques for LLMs has shed light on the vulnerabilities of these models, highlighting the importance of addressing potential safety concerns. Jailbreaking an LLM involves generating or modifying prompts so that the model produces outputs that violate predefined safety protocols. These methods fall into two main classes: white-box approaches, which leverage internal model knowledge, and black-box methods, which require no such access. While both strategies offer valuable insights into LLM robustness, controlling the attributes of adversarial prompts, such as sentiment or fluency, in order to generate stealthy and semantically coherent attacks has remained a pressing challenge.

Exploring Controllability in White-Box Attacks

This paper introduces COLD-Attack, a framework that employs Energy-based Constrained Decoding with Langevin Dynamics (COLD) to generate adversarial prompts with controlled attributes. Earlier methods such as GCG often produce syntactically incoherent prompts that simple perplexity-based filters can flag. In contrast, COLD-Attack combines energy-based models with guided Langevin dynamics to search for adversarial attacks within a defined control space, enhancing both stealthiness and attack complexity without compromising fluency or semantic coherence.

Methodology

COLD-Attack operationalizes the attack generation problem within the paradigm of energy-based models, where various constraints (e.g., fluency, sentiment) are formulated as energy functions. Through targeted Langevin dynamics sampling, it optimizes for prompts that minimize these energy functions, effectively navigating through the adversarial space with enhanced controllability. This approach diverges significantly from predecessors, offering a gradient-based optimization in a continuous logit space rather than relying on discrete token-level manipulations.
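
To make this concrete, below is a minimal, hypothetical PyTorch sketch of energy-guided Langevin dynamics over continuous prompt logits. The energy function `energy_fn`, the hyperparameters, and the final argmax decoding are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def langevin_sample(energy_fn, seq_len, vocab_size, steps=500, step_size=0.1, noise_scale=1.0):
    # Soft prompt: one continuous logit vector per token position.
    y = torch.randn(seq_len, vocab_size, requires_grad=True)
    for t in range(steps):
        energy = energy_fn(y)                      # scalar; lower = constraints better satisfied
        (grad,) = torch.autograd.grad(energy, y)   # gradient of the energy w.r.t. the logits
        with torch.no_grad():
            noise = torch.randn_like(y)
            # Langevin update: descend the energy, with annealed Gaussian noise for exploration.
            y -= step_size * grad
            y += noise_scale * (1.0 - t / steps) * noise
    # Map the optimized soft prompt back to discrete tokens (simple per-position argmax here;
    # the paper instead uses an LLM-guided decoding step).
    return y.argmax(dim=-1)
```

The key design choice this sketch reflects is that the search happens over continuous logits, so gradients from multiple differentiable constraints can steer the prompt simultaneously before any discretization takes place.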

The Role of Energy Functions

Key to the success of COLD-Attack is the formulation of energy functions that encapsulate different aspects of controllability:

  • Fluency: Ensures the generated attacks are syntactically and semantically coherent, reducing the likelihood of detection by simple defense mechanisms.
  • Semantic Coherence and Sentiment Control: Maintains the semantic integrity of the attack related to the original prompt while enabling sentiment manipulation to craft more nuanced attacks.
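
As a rough illustration of what such a fluency term can look like, here is a hedged PyTorch sketch. The helper `lm_next_token_logits` is an assumption: it stands for feeding the soft token distributions (e.g., via softmax-weighted embeddings) through a reference LM; the exact form used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def fluency_energy(logits, lm_next_token_logits):
    """Illustrative soft fluency energy over a continuous prompt.

    logits:               (seq_len, vocab_size) continuous logits being optimized.
    lm_next_token_logits: assumed helper mapping the soft token distributions to the
                          reference LM's next-token logits at each position.
    """
    probs = F.softmax(logits, dim=-1)             # soft token distributions
    lm_logits = lm_next_token_logits(probs)       # (seq_len, vocab_size)
    lm_log_probs = F.log_softmax(lm_logits, dim=-1)
    # Expected negative log-likelihood of token i+1 under the LM's prediction at
    # position i: lower energy corresponds to a more fluent (lower-perplexity) prompt.
    return -(probs[1:] * lm_log_probs[:-1]).sum(dim=-1).sum()
```

Sentiment or coherence constraints can be expressed in the same way, as additional differentiable energy terms whose weighted sum is minimized by the Langevin sampler.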

Evaluation and Results

Extensive experiments across various LLMs and attack settings demonstrate COLD-Attack's broad applicability and strong controllability. It achieves high attack success rates (ASRs) and strong transferability on models such as Llama-2, Mistral, and Vicuna. Critically, COLD-Attack delivers significant improvements in fluency and stealthiness over existing methods, as evidenced by lower perplexity scores and higher ASRs under sentiment-control scenarios.
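
To clarify how stealthiness is typically quantified, here is a small example of computing prompt perplexity with Hugging Face Transformers; the choice of GPT-2 as the reference model is an assumption for illustration, not necessarily the paper's exact evaluation setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prompt_perplexity(text, model_name="gpt2"):
    """Perplexity of a prompt under a reference LM (lower = more fluent / stealthier)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the returned `loss` is the mean next-token cross-entropy.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()
```

On this kind of metric, a fluent adversarial suffix should score far lower than a gibberish GCG-style suffix, which is why perplexity-based filters struggle to flag it.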

Discussion and Future Directions

COLD-Attack's ability to generate controllable and stealthy adversarial prompts opens new avenues for assessing and improving LLM safety. It underscores the need for a multidimensional approach to LLM robustness that goes beyond simple perplexity filters or semantic coherence checks. As the arms race between attack and defense methodologies continues, frameworks like COLD-Attack offer a nuanced perspective on how adversarial attacks can be more sophisticated, controllable, and challenging to detect.

Contributions and Acknowledgments

This work offers a novel perspective on the automatic generation of controlled and stealthy adversarial prompts for LLMs, extending current understanding of LLM vulnerabilities and defenses. The research was supported by grants from the NSF.

Conclusion

COLD-Attack marks a significant step forward in the domain of LLM jailbreaking, presenting a methodologically sound, highly adaptable framework that enables the generation of controlled, stealthy adversarial prompts. It not only challenges existing defense mechanisms but also poses critical questions regarding the future of LLM development, safety, and alignment.
