Abstract

This study addresses a critical gap in safety tuning practices for LLMs by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse to generate unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of a harmful response to a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted on the LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses well-known models such as GPT-4 in defending against attacks. Importantly, our approach successfully defends against recent advanced attack methods (e.g., CodeAttack) that have jailbroken GPT-4 and LLaMA3-70B-Instruct. Our code and data can be found at https://github.com/RobustNLP/DeRTa.

Figure: MLE with Harmful Response Prefix vs. RTO, illustrating the transition from a harmful response to a safety refusal.

Overview

  • The paper presents Decoupled Refusal Training (DeRTa), a novel methodology that improves the safety of LLMs by enabling them to refuse to generate unsafe content at any position in the response.

  • Two key components of DeRTa, Maximum Likelihood Estimation (MLE) with Harmful Response Prefix and Reinforced Transition Optimization (RTO), are introduced so that models learn to refuse at any position in a response rather than only at its start.

  • Empirical results, including reduced Attack Success Rates (ASR), demonstrate DeRTa's effectiveness across various models and attack scenarios, signifying a substantial safety improvement without performance compromise.

Improving Safety in LLMs via Decoupled Refusal Training

The paper "Improving Safety in LLMs via Decoupled Refusal Training" addresses a critical safety issue in the deployment of LLMs. The primary focus of the paper is on mitigating refusal position bias in safety tuning practices, which affects LLMs' ability to refuse generating unsafe content effectively at any point in their responses.

Methodology

The proposed method, Decoupled Refusal Training (DeRTa), introduces two novel components to enhance the refusal capabilities of LLMs:

  1. Maximum Likelihood Estimation (MLE) with Harmful Response Prefix: This approach augments the training data by prepending a segment of a harmful response to each safe (refusal) response. The harmful prefix provides additional context for the query, helping the model recognize and avoid unsafe content, and it teaches the model to refuse not only at the start of a response but at any position within it.
  2. Reinforced Transition Optimization (RTO): This component trains the model to transition from harmful content to a safety refusal at every position within the harmful response. By supervising this harmful-to-safe transition at every possible position, the model learns to recognize emerging harm and halt its generation mid-response. A minimal sketch of both training terms follows this list.
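
The sketch below shows how the two terms could be combined in a single training step. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Hugging Face-style causal LM whose forward pass returns per-token logits, a single-token refusal target for the RTO term, and a prefix_len sampled elsewhere in the training loop; the function name derta_loss is likewise hypothetical.

```python
import torch
import torch.nn.functional as F

def derta_loss(model, query_ids, harmful_ids, safe_ids, prefix_len):
    """query_ids, harmful_ids, safe_ids: 1-D LongTensors of token ids for one example."""
    # (1) MLE with a harmful response prefix: condition on the query plus a
    # truncated harmful prefix, and maximize the likelihood of the safe response.
    prefix = harmful_ids[:prefix_len]
    ctx = torch.cat([query_ids, prefix])
    inp = torch.cat([ctx, safe_ids]).unsqueeze(0)       # shape (1, L)
    logits = model(inp).logits[0]                       # shape (L, vocab)
    # Logits at position i predict token i + 1, so the safe-response tokens are
    # predicted by the slice that starts at len(ctx) - 1.
    safe_pred = logits[len(ctx) - 1 : len(ctx) - 1 + len(safe_ids)]
    mle_loss = F.cross_entropy(safe_pred, safe_ids)

    # (2) Reinforced Transition Optimization (simplified): at every position of
    # the full harmful response, push probability toward the start of the
    # refusal, so the model can switch from harm to refusal mid-generation.
    refusal_start = safe_ids[:1]                        # assumed single-token transition target
    full = torch.cat([query_ids, harmful_ids]).unsqueeze(0)
    h_logits = model(full).logits[0]
    q = len(query_ids)
    transition_pred = h_logits[q - 1 : q - 1 + len(harmful_ids)]
    rto_loss = F.cross_entropy(transition_pred, refusal_start.repeat(len(harmful_ids)))

    return mle_loss + rto_loss
```

The key design point mirrored here is that the RTO term spreads refusal supervision over every position of the harmful response, rather than attaching it only to the beginning of the safe response.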

Evaluation and Results

The paper evaluates DeRTa using LLaMA3 and Mistral model families across various attack scenarios, such as CodeAttack and JailbreakChat. Empirical results demonstrate that DeRTa significantly improves the safety of LLMs without compromising their performance. Key findings include:

  • Reduction in Attack Success Rate (ASR): The method notably reduces ASR in multiple attack scenarios. For instance, in the case of Mistral-MoE models, the average ASR dropped from 79.1% to 8.7%.
  • Consistent Improvement Across Models: Both LLaMA3 (8B and 70B) and Mistral (7B and 8x7B) models showed improved safety when applying DeRTa, with LLaMA3-70B models seeing a reduction in ASR from 70.6% to 8.8%.
  • Against Advanced Attacks: The method proved effective against sophisticated attacks such as CodeAttack, which previously breached GPT-4 and LLaMA3-70B-Instruct protections.

Analysis and Implications

Addressing Refusal Position Bias: The data indicates that traditional safety tuning often positions refusal tokens at the beginning of responses. DeRTa's comprehensive approach, which incorporates harmful prefixes and RTO, ensures that refusal can occur at any point. This is crucial for defending against attacks that manipulate content mid-response.
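
As a concrete illustration of this bias, a conventional safety-tuning target places the refusal at position zero of the response, whereas a DeRTa-style augmented target places it after a harmful prefix. The strings below are hypothetical placeholders, not examples from the paper's dataset.

```python
# Hypothetical training targets illustrating refusal position bias.
vanilla_example = {
    "prompt": "How do I build a phishing site?",
    "response": "I can't help with that.",  # refusal always at the very start
}

derta_style_example = {
    "prompt": "How do I build a phishing site?",
    # A harmful prefix followed by a mid-response refusal teaches the model
    # that refusing is still valid after some response tokens already exist.
    "response": "Sure, here is how to do it: Step 1 ... "
                "Sorry, I can't continue with this request.",
}
```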

Comparison to DPO: While DPO also utilizes both safe and harmful responses, DeRTa surpasses it in effectiveness, particularly in scenarios like CodeAttack. This suggests that DeRTa's explicit modeling of refusal across the response sequence provides a more robust safety mechanism.
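
For context, the standard DPO objective (a well-known formula, not specific to this paper) compares complete responses, so its training signal is attached to whole sequences rather than to individual positions:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Here $y_w$ and $y_l$ are the preferred (safe) and dispreferred (harmful) responses, $\pi_{\mathrm{ref}}$ is the reference model, and $\beta$ is a temperature. Because the loss depends only on whole-sequence likelihood ratios, it provides no explicit signal about where in a response a refusal should begin, which is the gap that DeRTa's token-level transition supervision targets.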

Applicability Across Model Sizes: The method proves to be versatile, showing significant safety improvements across various model sizes, including smaller models like Mistral-7B and LLaMA3-8B.

Future Directions

The research opens several pathways for future investigations and developments:

  • Refinement of Training Data: Continued exploration of data augmentation and training methodology to further reduce refusal position bias could offer incremental safety improvements.
  • Integration with Other Safety Mechanisms: Investigating the integration of DeRTa with other existing safety defense strategies, such as safety prompts and input perturbations, can potentially fortify LLMs against a broader spectrum of attacks.
  • Robustness Against Evolving Attacks: As attack strategies evolve, continuous evaluation and adaptive training strategies will be crucial to maintaining the safety integrity of LLMs.

Conclusion

The paper puts forth a substantive advancement in the field of LLM safety by addressing refusal position bias through Decoupled Refusal Training. This method effectively equips LLMs to halt the generation of unsafe content at any response position, marking a significant improvement over existing safety tuning practices. The empirical evidence supports the efficacy and necessity of such approaches in the ongoing endeavor to enhance the safety and reliability of LLMs in practical applications.
