Abstract

Jailbreaking is an emerging adversarial attack that bypasses the safety alignment deployed in off-the-shelf LLMs. A considerable amount of research has proposed increasingly effective jailbreak attacks, including the recent Greedy Coordinate Gradient (GCG) attack, jailbreak template-based attacks such as "Do-Anything-Now" (DAN), and multilingual jailbreaks. In contrast, the defensive side has been relatively underexplored. This paper proposes a lightweight yet practical defense called SELFDEFEND, which can defend against all existing jailbreak attacks with minimal delay for jailbreak prompts and negligible delay for normal user prompts. Our key insight is that regardless of the jailbreak strategy employed, the attack eventually needs to include a harmful prompt (e.g., "how to make a bomb") in the prompt sent to the LLM, and we find that existing LLMs can effectively recognize such harmful prompts that violate their safety policies. Based on this insight, we design a shadow stack that concurrently checks whether a harmful prompt exists in the user prompt and triggers a checkpoint in the normal stack once a token of "No" or a harmful prompt is output. The latter case also allows generating an explainable LLM response to adversarial prompts. We demonstrate that the idea behind SELFDEFEND works in various jailbreak scenarios through manual analysis in GPT-3.5/4. We also list three future directions to further enhance SELFDEFEND.

Overview

  • The paper introduces SelfDefend, a defense mechanism against LLM jailbreaking, using a dual-stack architecture to identify and mitigate harmful prompts.

  • SelfDefend aims to counteract advanced jailbreak methods like GCG attacks, template-based jailbreaks, and multilingual approaches with minimal latency for users.

  • Manual evaluation on GPT-3.5 and GPT-4 shows that SelfDefend can effectively prevent the generation of harmful content without significant performance impact.

  • Future developments for SelfDefend include creating a dedicated LLM for prompt identification, leveraging adversarial examples for better model alignment, and a caching mechanism to enhance efficiency.

SelfDefend: A Practical Defense Against LLM Jailbreaking

Introduction to Jailbreaking and Existing Defenses

Jailbreaking in the context of LLMs refers to adversarial tactics that circumvent the safety mechanisms installed in these models to prevent them from generating harmful or unethical content. This has led to an arms race between the development of jailbreak techniques and the formulation of defenses to counteract these attacks. The landscape of jailbreak tactics has evolved significantly, introducing sophisticated methods like Greedy Coordinate Gradient (GCG) attacks, template-based jailbreaks including "Do-Anything-Now" (DAN), and multilingual approaches. In contrast, the development of robust defenses against these jailbreaks has not been as rapid or explored in depth.

SelfDefend Mechanism

The paper introduces SelfDefend, a novel defense mechanism that addresses growing concerns over the jailbreaking of LLMs. SelfDefend is a lightweight, practical solution capable of defending against various jailbreak strategies with minimal latency implications for end-users. At its core, SelfDefend leverages the innate ability of current LLMs to recognize harmful prompts that violate their safety policies. This is achieved through a dual-stack architecture: a "normal" stack that processes the user prompt as usual, and a "shadow" stack that runs in parallel to check whether the prompt contains harmful content. Once the shadow stack outputs either "No" (no harmful content found) or the detected harmful portion, a checkpoint in the normal stack is triggered, which either releases the normal response or blocks the request while providing an explainable output about the nature of the blockage.
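To make the dual-stack flow concrete, the following Python sketch shows one way the idea could be wired up. The function `llm_query`, the wording of the shadow-stack detection prompt, and the "No"-prefix check are illustrative assumptions, not the authors' exact implementation.

```python
# A minimal sketch of SelfDefend's dual-stack idea. `llm_query`, the shadow
# prompt wording, and the "No"-prefix check are illustrative assumptions,
# not the paper's exact implementation.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical shadow-stack prompt: ask the LLM whether the user prompt
# contains a request that violates its safety policies.
SHADOW_TEMPLATE = (
    "Does the following request contain anything that violates your safety "
    "policies? Answer 'No' if it is safe; otherwise quote the harmful part.\n\n"
    "Request: {user_prompt}"
)

def selfdefend(user_prompt: str, llm_query: Callable[[str], str]) -> str:
    """Run the normal and shadow stacks concurrently and gate the response."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        normal = pool.submit(llm_query, user_prompt)  # normal stack
        shadow = pool.submit(
            llm_query, SHADOW_TEMPLATE.format(user_prompt=user_prompt)
        )  # shadow stack

        verdict = shadow.result().strip()
        if verdict.lower().startswith("no"):
            # Checkpoint passes: release the normal response; the only extra
            # cost for benign prompts is waiting for the short "No" verdict.
            return normal.result()

        # Checkpoint fails: discard the normal response and return an
        # explainable refusal built from the part quoted by the shadow stack.
        normal.cancel()  # best effort; the call may already be running
        return (
            "Request blocked: the prompt appears to contain a harmful "
            f"instruction ({verdict})."
        )
```

In this sketch, `llm_query` stands in for any chat-completion call (e.g., a thin wrapper around the GPT-3.5/4 API). Because the shadow verdict for benign prompts can be as short as a single "No" token, the added delay for normal users is small, which matches the latency claim in the paper.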

Performance and Practical Applications

The efficacy of SelfDefend was assessed through a series of manual tests on GPT-3.5 and GPT-4. These evaluations covered a range of jailbreak categories, including GCG, template-based, and multilingual jailbreaks. The results indicate that SelfDefend successfully identifies and mitigates harmful content across all test scenarios without inducing significant delays for normal user prompts, demonstrating its potential to uphold the safety and integrity of LLM responses without compromising responsiveness or user experience.

Future Directions and Enhancements

While promising, SelfDefend's methodology invites further exploration and refinement for broader applicability and robustness against evolving jailbreak strategies. Proposed future directions include:

  • Developing a more cost-efficient and faster LLM dedicated to the accurate identification of harmful prompts, thereby enhancing the overall performance of SelfDefend.
  • Exploring the use of the identified adversarial examples (AEs) to fortify the alignment and safety mechanisms within LLMs, leveraging these insights to detect and negate future jailbreak attempts more effectively.
  • Implementing a caching mechanism within the shadow stack to optimize the processing pipeline, reducing redundant checks of repeated prompts (a minimal sketch follows this list).
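As an illustration of the third direction, the sketch below memoizes shadow-stack verdicts keyed on a normalized hash of the user prompt. The paper only proposes caching as a future enhancement, so the key function, LRU eviction policy, and class names here are assumptions.

```python
# Illustrative caching layer for the shadow stack. The normalization, hashing,
# and LRU eviction below are assumptions; the paper does not prescribe a
# concrete caching design.
import hashlib
from collections import OrderedDict
from typing import Callable

def _prompt_key(prompt: str) -> str:
    # Cheap normalization so trivially re-spaced or re-cased duplicates
    # map to the same cache entry.
    normalized = " ".join(prompt.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class ShadowCache:
    """Memoizes shadow-stack verdicts so repeated prompts skip the extra LLM call."""

    def __init__(self, check_fn: Callable[[str], str], maxsize: int = 4096):
        self._check = check_fn  # e.g. the shadow-stack query from the earlier sketch
        self._cache: OrderedDict[str, str] = OrderedDict()
        self._maxsize = maxsize

    def verdict(self, prompt: str) -> str:
        key = _prompt_key(prompt)
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as most recently used
            return self._cache[key]
        if len(self._cache) >= self._maxsize:
            self._cache.popitem(last=False)  # evict least recently used entry
        result = self._check(prompt)
        self._cache[key] = result
        return result
```

Exact-match hashing only catches verbatim (or trivially reworded) repeats; how a production cache should treat paraphrased prompts is left open by the paper.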

Comparative Analysis and Novel Contributions

Compared to existing defenses, which predominantly focus on either tuning-based or non-tuning-based strategies, SelfDefend introduces a unique checkpoint mechanism coupled with a shadow stack design. This approach not only affords minimal latency but also delivers a robust defense against a wide spectrum of jailbreak strategies without necessitating modifications to the LLM’s core architecture. This stands in contrast to methods like IAPrompt, which, while also focusing on input analysis, may not effectively counter sophisticated jailbreak attempts embedded within benign-looking prompts.

Conclusion

In summary, the SelfDefend framework presents a comprehensive, practical solution to the persistent challenge of LLM jailbreaking. Through its use of parallel processing and checkpoint mechanisms, it offers a scalable, effective defense capable of adapting to the evolving landscape of adversarial attacks on LLMs. As such, it marks a significant step forward in the ongoing effort to safeguard the ethical use and deployment of LLMs across diverse application domains.
