COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability

(2402.08679)
Published Feb 13, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

Jailbreaks on LLMs have recently received increasing attention. For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with diverse attributes, such as contextual coherence and sentiment/stylistic variations, and hence it is beneficial to study controllable jailbreaking, i.e. how to enforce control on LLM attacks. In this paper, we formally formulate the controllable attack generation problem, and build a novel connection between this problem and controllable text generation, a well-explored topic of natural language processing. Based on this connection, we adapt the Energy-based Constrained Decoding with Langevin Dynamics (COLD), a state-of-the-art, highly efficient algorithm in controllable text generation, and introduce the COLD-Attack framework which unifies and automates the search of adversarial LLM attacks under a variety of control requirements such as fluency, stealthiness, sentiment, and left-right-coherence. The controllability enabled by COLD-Attack leads to diverse new jailbreak scenarios which not only cover the standard setting of generating fluent suffix attacks, but also allow us to address new controllable attack settings such as revising a user query adversarially with minimal paraphrasing, and inserting stealthy attacks in context with left-right-coherence. Our extensive experiments on various LLMs (Llama-2, Mistral, Vicuna, Guanaco, GPT-3.5) show COLD-Attack's broad applicability, strong controllability, high success rate, and attack transferability. Our code is available at https://github.com/Yu-Fangxu/COLD-Attack.

Overview

  • COLD-Attack introduces a novel framework for automating adversarial attacks on LLMs, focusing on the generation of controlled, stealthy, and semantically coherent prompts.

  • Utilizes Energy-based Constrained Decoding with Langevin Dynamics to generate adversarial prompts, offering improvements in stealthiness and fluency.

  • Successfully evaluated across various LLMs, showing superior controllability, lower detectability, and strong transferability in adversarial prompt generation.

  • Highlights the need for robust defense mechanisms by demonstrating advanced techniques in generating controlled, stealthy adversarial attacks against LLMs.

COLD-Attack: Automating Adversarial LLM Jailbreaks with Controllable and Stealthy Methods

Introduction

The advent of jailbreaking techniques for LLMs has shed light on the vulnerabilities of these models, highlighting the importance of addressing potential safety concerns. Jailbreaking an LLM involves generating or modifying prompts so that the model produces outputs that violate predefined safety protocols. These methods fall into two main classes: white-box approaches, which leverage internal model knowledge, and black-box methods, which require no such access. While both strategies offer valuable insights into LLM robustness, controlling the attributes of adversarial prompts, such as sentiment or fluency, in order to generate stealthy and semantically coherent attacks has remained a pressing challenge.

Exploring Controllability in White-Box Attacks

This paper introduces COLD-Attack, a framework that employs Energy-based Constrained Decoding with Langevin Dynamics (COLD) to generate adversarial prompts with controlled attributes. Earlier methods such as GCG often produce syntactically incoherent prompts that simple perplexity-based filters can flag. In contrast, COLD-Attack combines energy-based models with guided Langevin dynamics to search for adversarial attacks within a defined control space, enhancing both stealthiness and attack complexity without compromising fluency or semantic coherence.

Methodology

COLD-Attack operationalizes the attack generation problem within the paradigm of energy-based models, where various constraints (e.g., fluency, sentiment) are formulated as energy functions. Through targeted Langevin dynamics sampling, it optimizes for prompts that minimize these energy functions, effectively navigating through the adversarial space with enhanced controllability. This approach diverges significantly from predecessors, offering a gradient-based optimization in a continuous logit space rather than relying on discrete token-level manipulations.
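
To make this concrete, below is a minimal, hypothetical PyTorch sketch of energy-guided Langevin dynamics over continuous prompt logits. The energy function `energy_fn`, the hyperparameters, and the final argmax decoding are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def langevin_sample(energy_fn, seq_len, vocab_size, steps=500, step_size=0.1, noise_scale=1.0):
    # Soft prompt: one continuous logit vector per token position.
    y = torch.randn(seq_len, vocab_size, requires_grad=True)
    for t in range(steps):
        energy = energy_fn(y)                      # scalar; lower = constraints better satisfied
        (grad,) = torch.autograd.grad(energy, y)   # gradient of the energy w.r.t. the logits
        with torch.no_grad():
            noise = torch.randn_like(y)
            # Langevin update: descend the energy, with annealed Gaussian noise for exploration.
            y -= step_size * grad
            y += noise_scale * (1.0 - t / steps) * noise
    # Map the optimized soft prompt back to discrete tokens (simple per-position argmax here;
    # the paper instead uses an LLM-guided decoding step).
    return y.argmax(dim=-1)
```

The key design choice this sketch reflects is that the search happens over continuous logits, so gradients from multiple differentiable constraints can steer the prompt simultaneously before any discretization takes place.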

The Role of Energy Functions

Key to the success of COLD-Attack is the formulation of energy functions that encapsulate different aspects of controllability:

  • Fluency: Ensures the generated attacks are syntactically and semantically coherent, reducing the likelihood of detection by simple defense mechanisms.
  • Semantic Coherence and Sentiment Control: Maintains the semantic integrity of the attack related to the original prompt while enabling sentiment manipulation to craft more nuanced attacks.
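
As a rough illustration of what such a fluency term can look like, here is a hedged PyTorch sketch. The helper `lm_next_token_logits` is an assumption: it stands for feeding the soft token distributions (e.g., via softmax-weighted embeddings) through a reference LM; the exact form used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def fluency_energy(logits, lm_next_token_logits):
    """Illustrative soft fluency energy over a continuous prompt.

    logits:               (seq_len, vocab_size) continuous logits being optimized.
    lm_next_token_logits: assumed helper mapping the soft token distributions to the
                          reference LM's next-token logits at each position.
    """
    probs = F.softmax(logits, dim=-1)             # soft token distributions
    lm_logits = lm_next_token_logits(probs)       # (seq_len, vocab_size)
    lm_log_probs = F.log_softmax(lm_logits, dim=-1)
    # Expected negative log-likelihood of token i+1 under the LM's prediction at
    # position i: lower energy corresponds to a more fluent (lower-perplexity) prompt.
    return -(probs[1:] * lm_log_probs[:-1]).sum(dim=-1).sum()
```

Sentiment or coherence constraints can be expressed in the same way, as additional differentiable energy terms whose weighted sum is minimized by the Langevin sampler.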

Evaluation and Results

Extensive experiments across various LLMs and attack settings demonstrate COLD-Attack's broad applicability and strong controllability. It achieves high attack success rates (ASRs) and strong transferability on models such as Llama-2, Mistral, and Vicuna. Critically, COLD-Attack delivers significant improvements in fluency and stealthiness over existing methods, as evidenced by lower perplexity scores and higher ASRs under sentiment-control scenarios.
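
To clarify how stealthiness is typically quantified, here is a small example of computing prompt perplexity with Hugging Face Transformers; the choice of GPT-2 as the reference model is an assumption for illustration, not necessarily the paper's exact evaluation setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def prompt_perplexity(text, model_name="gpt2"):
    """Perplexity of a prompt under a reference LM (lower = more fluent / stealthier)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the returned `loss` is the mean next-token cross-entropy.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return torch.exp(loss).item()
```

On this kind of metric, a fluent adversarial suffix should score far lower than a gibberish GCG-style suffix, which is why perplexity-based filters struggle to flag it.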

Discussion and Future Directions

COLD-Attack's ability to generate controllable and stealthy adversarial prompts opens new avenues for assessing and improving LLM safety. It underscores the need for a multidimensional approach to LLM robustness that goes beyond simple perplexity filters or semantic coherence checks. As the arms race between attack and defense methodologies continues, frameworks like COLD-Attack offer a nuanced perspective on how adversarial attacks can be more sophisticated, controllable, and challenging to detect.

Contributions and Acknowledgments

This work offers a novel perspective on the automatic generation of controlled and stealthy adversarial prompts for LLMs, extending current understanding of LLM vulnerabilities and defenses. The research was supported by grants from the NSF.

Conclusion

COLD-Attack marks a significant step forward in the domain of LLM jailbreaking, presenting a methodologically sound, highly adaptable framework that enables the generation of controlled, stealthy adversarial prompts. It not only challenges existing defense mechanisms but also poses critical questions regarding the future of LLM development, safety, and alignment.
