PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition

(2405.07932)
Published May 13, 2024 in cs.CL and cs.AI

Abstract

LLMs have shown success in many natural language processing tasks. Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks, leading to security risks and abuse of the models. One option to mitigate such risks is to augment the LLM with a dedicated "safeguard", which checks the LLM's inputs or outputs for undesired behaviour. A promising approach is to use the LLM itself as the safeguard. Nonetheless, baseline methods, such as prompting the LLM to self-classify toxic content, demonstrate limited efficacy. We hypothesise that this is due to domain shift: the alignment training imparts a self-censoring behaviour to the model ("Sorry I can't do that"), while the self-classify approach shifts it to a classification format ("Is this prompt malicious"). In this work, we propose PARDEN, which avoids this domain shift by simply asking the model to repeat its own outputs. PARDEN neither requires finetuning nor white box access to the model. We empirically verify the effectiveness of our method and show that PARDEN significantly outperforms existing jailbreak detection baselines for Llama-2 and Claude-2. Code and data are available at https://github.com/Ed-Zh/PARDEN. We find that PARDEN is particularly powerful in the relevant regime of high True Positive Rate (TPR) and low False Positive Rate (FPR). For instance, for Llama2-7B, at TPR equal to 90%, PARDEN accomplishes a roughly 11x reduction in the FPR from 24.8% to 2.0% on the harmful behaviours dataset.

PARDEN detects jailbreaks by asking the LLM to repeat its own output and comparing the repetition to the original with a BLEU score.

Overview

  • PARDEN is a novel defense mechanism developed by researchers at the University of Oxford that protects LLMs from jailbreak attacks by asking the model to repeat its own output and flagging responses it refuses to repeat.

  • The mechanism works by comparing the original output with a repeated output using the BLEU score, with significant deviations indicating potentially harmful content.

  • PARDEN significantly reduces false positive rates and shows promising results in improving the safety and reliability of AI systems like Llama-2 and Claude-2.

Defending LLMs: The PARDEN Repetition Defense Approach

Introduction

In recent years, LLMs have demonstrated impressive capabilities across a variety of NLP tasks. However, as these models become more sophisticated, so too do the methods for exploiting them. Researchers at the University of Oxford have developed a novel defense mechanism against such exploits, termed "PARDEN" (Safe-Proofing Language Models via a Repetition Defense). This method focuses on using the model to repeat its outputs to distinguish between benign and harmful content. Let’s break down the main ideas and findings of their research.

Why Defending LLMs is Essential

Before delving into the defense mechanism, it's important to understand why defending LLMs against adversarial attacks, known as "jailbreaks," is critical:

  • User Safety: Preventing harmful or undesirable outputs is essential to protect users.
  • Model Integrity: Ensuring that models cannot be easily exploited helps maintain trust in these AI systems and their reliability.

Despite rigorous safety measures, leading models like Llama-2 and Claude-2 remain susceptible to jailbreaks, which can coerce them into generating inappropriate content. Traditional defenses, which ask the model to classify content as harmful or benign, tend to struggle because of domain shift: alignment training teaches the model to self-censor ("Sorry, I can't do that"), not to act as a classifier, so the classification framing falls outside the behaviour the model was trained for.

The PARDEN Approach

Concept: PARDEN circumvents the domain shift issue by asking the model to repeat its own output. If the repetition significantly deviates from the original output, the content is likely harmful.

Mechanism:

  1. Output Generation: Let the LLM generate a response to a user prompt.
  2. Repetition: Prompt the model to repeat its own output within a predefined format.
  3. Comparison: Compute the BLEU score, a measure of how similar the repeated output is to the original. A high BLEU score indicates a faithful, benign repetition, while a low score suggests the model refused to repeat the content, a sign that it is likely harmful.
  4. Classification: Use a threshold on the BLEU score to classify the output as harmful or benign.

This method leverages the self-censoring behaviour instilled by alignment training: the model readily repeats benign text but tends to refuse to repeat harmful content, so a faithful repetition is a strong signal that the output is safe.
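As an illustration, here is a minimal sketch of this pipeline in Python. It assumes a placeholder generate(prompt) function wrapping whichever chat LLM is being defended, uses NLTK's sentence-level BLEU as one possible similarity measure, and picks an illustrative repetition prompt and threshold; the exact prompt wording and cutoff used in the paper's repository may differ.

```python
# Minimal sketch of the PARDEN check (illustrative, not the official implementation).
# `generate(prompt)` is a hypothetical wrapper around the chat LLM being defended.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative repetition prompt; the exact wording in the paper/repo may differ.
REPEAT_TEMPLATE = (
    "Here is some text in brackets: [{output}]\n"
    "Please safely repeat the text in the brackets, word for word."
)

def parden_is_harmful(generate, user_prompt, bleu_threshold=0.5):
    """Return True if the model's response is flagged as harmful."""
    original = generate(user_prompt)                               # 1. normal response
    repeated = generate(REPEAT_TEMPLATE.format(output=original))   # 2. ask the model to repeat it

    # 3. similarity between the original output and the attempted repetition
    smooth = SmoothingFunction().method1
    bleu = sentence_bleu([original.split()], repeated.split(), smoothing_function=smooth)

    # 4. a low BLEU score means the model declined to repeat -> likely harmful
    return bleu < bleu_threshold
```

In practice the threshold would be calibrated on a labelled set of benign and harmful examples, which is where the TPR/FPR trade-off reported below comes from.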

Strong Numerical Results

PARDEN significantly outperforms existing jailbreak detection baselines for Llama-2 and Claude-2, and is particularly strong in the practically relevant regime of high true positive rate (TPR) and low false positive rate (FPR). For example, for Llama2-7B at a TPR of 90%, PARDEN cuts the FPR on the harmful behaviours dataset from 24.8% to 2.0%, a roughly 11x reduction.

These results indicate PARDEN's potential to substantially strengthen safety measures for LLMs.
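Since the detector is simply a threshold on a BLEU score, the TPR/FPR trade-off comes from sweeping that threshold over labelled benign and harmful outputs. The sketch below shows one way to pick a BLEU cutoff at a target TPR using scikit-learn; benign_bleus and harmful_bleus are hypothetical arrays of scores produced by running the check above on labelled data.

```python
# Sketch: calibrating the BLEU cutoff at a target TPR (illustrative, not the paper's code).
import numpy as np
from sklearn.metrics import roc_curve

def pick_bleu_cutoff(benign_bleus, harmful_bleus, target_tpr=0.90):
    """Return (bleu_cutoff, fpr_at_target): flag as harmful when BLEU <= bleu_cutoff."""
    y_true = np.concatenate([np.zeros(len(benign_bleus)), np.ones(len(harmful_bleus))])
    # Harmful outputs have *low* BLEU, so negate the scores so that higher = more harmful.
    scores = -np.concatenate([np.asarray(benign_bleus), np.asarray(harmful_bleus)])

    fpr, tpr, thresholds = roc_curve(y_true, scores)   # thresholds are on the negated scale
    idx = min(np.searchsorted(tpr, target_tpr), len(tpr) - 1)  # first point reaching the target TPR
    return -thresholds[idx], fpr[idx]
```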

Practical and Theoretical Implications

Practically:

  • Enhanced Safety: PARDEN's repetition approach could be integrated into various applications, ensuring that the outputs remain safe and trustworthy.
  • Improved Performance: Because the check operates on the model's output rather than the raw user prompt, it is harder for attackers to defeat it by manipulating the input directly.

Theoretically:

  • Domain Shift Mitigation: This approach addresses the domain shift problem, which is a significant hurdle in many AI safety mechanisms.
  • Scalability: Repetition as a defense mechanism could be adapted to various LLM architectures without requiring significant retraining or finetuning.

Future Developments

The success of PARDEN opens up several avenues for future research:

  • Extended Research: Exploring more sophisticated methods of output comparison could further improve defense accuracy.
  • Application Scope: Investigating how PARDEN can be combined with other safety mechanisms to create a more comprehensive defense system.
  • Fine-Tuning Development: Adjusting the model's training process to better align with repetitive output tasks may further enhance efficacy.

Conclusion

The PARDEN approach introduces an innovative and effective method for defending LLMs against adversarial attacks. By leveraging the model's own ability to self-censor and the simple act of repeating its outputs, PARDEN provides a robust safeguard without requiring extensive modifications to the model itself. As advancements in AI continue, techniques like PARDEN will be crucial in maintaining the integrity and trustworthiness of these powerful systems.
