Abstract

This study addresses a critical gap in safety tuning practices for LLMs by identifying and tackling a refusal position bias within safety tuning data, which compromises the models' ability to appropriately refuse to generate unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by prepending a segment of a harmful response to a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted on the LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses well-known models such as GPT-4 in defending against attacks. Importantly, our approach successfully defends against recent advanced attack methods (e.g., CodeAttack) that have jailbroken GPT-4 and LLaMA3-70B-Instruct. Our code and data can be found at https://github.com/RobustNLP/DeRTa.

Figure: MLE with Harmful Response Prefix vs. RTO, illustrating the transition from a harmful response to a safety refusal.

Overview

  • The paper presents Decoupled Refusal Training (DeRTa), a novel methodology that improves the safety of LLMs by enabling them to refuse to generate unsafe content at any position in the response.

  • Two key components of DeRTa, Maximum Likelihood Estimation (MLE) with Harmful Response Prefix and Reinforced Transition Optimization (RTO), are introduced so that models learn to refuse at any position in a response rather than only at its start.

  • Empirical results, including reduced Attack Success Rates (ASR), demonstrate DeRTa's effectiveness across various models and attack scenarios, signifying a substantial safety improvement without performance compromise.

Improving Safety in LLMs via Decoupled Refusal Training

The paper "Improving Safety in LLMs via Decoupled Refusal Training" addresses a critical safety issue in the deployment of LLMs. The primary focus of the paper is on mitigating refusal position bias in safety tuning practices, which affects LLMs' ability to refuse generating unsafe content effectively at any point in their responses.

Methodology

The proposed method, Decoupled Refusal Training (DeRTa), introduces two novel components to enhance the refusal capabilities of LLMs:

  1. Maximum Likelihood Estimation (MLE) with Harmful Response Prefix: This approach augments the training data by prepending a segment of a harmful response to each safe (refusal) response. The harmful prefix provides additional context for the query, helping the model recognize and avoid unsafe content, and it teaches the model to refuse not only at the start of a response but at any position within it.
  2. Reinforced Transition Optimization (RTO): This component trains the model to transition from harmful content to a safety refusal at every position within the harmful response. By supervising this harmful-to-safe transition at every possible position, the model learns to recognize emerging harm and halt its generation mid-response. A minimal sketch of both training terms follows this list.
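
The sketch below shows how the two terms could be combined in a single training step. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Hugging Face-style causal LM whose forward pass returns per-token logits, a single-token refusal target for the RTO term, and a prefix_len sampled elsewhere in the training loop; the function name derta_loss is likewise hypothetical.

```python
import torch
import torch.nn.functional as F

def derta_loss(model, query_ids, harmful_ids, safe_ids, prefix_len):
    """query_ids, harmful_ids, safe_ids: 1-D LongTensors of token ids for one example."""
    # (1) MLE with a harmful response prefix: condition on the query plus a
    # truncated harmful prefix, and maximize the likelihood of the safe response.
    prefix = harmful_ids[:prefix_len]
    ctx = torch.cat([query_ids, prefix])
    inp = torch.cat([ctx, safe_ids]).unsqueeze(0)       # shape (1, L)
    logits = model(inp).logits[0]                       # shape (L, vocab)
    # Logits at position i predict token i + 1, so the safe-response tokens are
    # predicted by the slice that starts at len(ctx) - 1.
    safe_pred = logits[len(ctx) - 1 : len(ctx) - 1 + len(safe_ids)]
    mle_loss = F.cross_entropy(safe_pred, safe_ids)

    # (2) Reinforced Transition Optimization (simplified): at every position of
    # the full harmful response, push probability toward the start of the
    # refusal, so the model can switch from harm to refusal mid-generation.
    refusal_start = safe_ids[:1]                        # assumed single-token transition target
    full = torch.cat([query_ids, harmful_ids]).unsqueeze(0)
    h_logits = model(full).logits[0]
    q = len(query_ids)
    transition_pred = h_logits[q - 1 : q - 1 + len(harmful_ids)]
    rto_loss = F.cross_entropy(transition_pred, refusal_start.repeat(len(harmful_ids)))

    return mle_loss + rto_loss
```

The key design point mirrored here is that the RTO term spreads refusal supervision over every position of the harmful response, rather than attaching it only to the beginning of the safe response.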

Evaluation and Results

The paper evaluates DeRTa using LLaMA3 and Mistral model families across various attack scenarios, such as CodeAttack and JailbreakChat. Empirical results demonstrate that DeRTa significantly improves the safety of LLMs without compromising their performance. Key findings include:

  • Reduction in Attack Success Rate (ASR): The method notably reduces ASR in multiple attack scenarios. For instance, in the case of Mistral-MoE models, the average ASR dropped from 79.1% to 8.7%.
  • Consistent Improvement Across Models: Both LLaMA3 (8B and 70B) and Mistral (7B and 8x7B) models showed improved safety when applying DeRTa, with LLaMA3-70B models seeing a reduction in ASR from 70.6% to 8.8%.
  • Against Advanced Attacks: The method proved effective against sophisticated attacks such as CodeAttack, which previously breached GPT-4 and LLaMA3-70B-Instruct protections.

Analysis and Implications

Addressing Refusal Position Bias: The data indicates that traditional safety tuning often positions refusal tokens at the beginning of responses. DeRTa's comprehensive approach, which incorporates harmful prefixes and RTO, ensures that refusal can occur at any point. This is crucial for defending against attacks that manipulate content mid-response.
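
As a concrete illustration of this bias, a conventional safety-tuning target places the refusal at position zero of the response, whereas a DeRTa-style augmented target places it after a harmful prefix. The strings below are hypothetical placeholders, not examples from the paper's dataset.

```python
# Hypothetical training targets illustrating refusal position bias.
vanilla_example = {
    "prompt": "How do I build a phishing site?",
    "response": "I can't help with that.",  # refusal always at the very start
}

derta_style_example = {
    "prompt": "How do I build a phishing site?",
    # A harmful prefix followed by a mid-response refusal teaches the model
    # that refusing is still valid after some response tokens already exist.
    "response": "Sure, here is how to do it: Step 1 ... "
                "Sorry, I can't continue with this request.",
}
```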

Comparison to DPO: While DPO also utilizes both safe and harmful responses, DeRTa surpasses it in effectiveness, particularly in scenarios like CodeAttack. This suggests that DeRTa's explicit modeling of refusal across the response sequence provides a more robust safety mechanism.
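
For context, the standard DPO objective (a well-known formula, not specific to this paper) compares complete responses, so its training signal is attached to whole sequences rather than to individual positions:

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Here $y_w$ and $y_l$ are the preferred (safe) and dispreferred (harmful) responses, $\pi_{\mathrm{ref}}$ is the reference model, and $\beta$ is a temperature. Because the loss depends only on whole-sequence likelihood ratios, it provides no explicit signal about where in a response a refusal should begin, which is the gap that DeRTa's token-level transition supervision targets.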

Applicability Across Model Sizes: The method proves to be versatile, showing significant safety improvements across various model sizes, including smaller models like Mistral-7B and LLaMA3-8B.

Future Directions

The research opens several pathways for future investigations and developments:

  • Refinement of Training Data: Continued exploration of data augmentation and training methodology to further reduce refusal position bias could offer incremental safety improvements.
  • Integration with Other Safety Mechanisms: Investigating the integration of DeRTa with other existing safety defense strategies, such as safety prompts and input perturbations, can potentially fortify LLMs against a broader spectrum of attacks.
  • Robustness Against Evolving Attacks: As attack strategies evolve, continuous evaluation and adaptive training strategies will be crucial to maintaining the safety integrity of LLMs.

Conclusion

The paper puts forth a substantive advancement in the field of LLM safety by addressing refusal position bias through Decoupled Refusal Training. This method effectively equips LLMs to halt the generation of unsafe content at any response position, marking a significant improvement over existing safety tuning practices. The empirical evidence supports the efficacy and necessity of such approaches in the ongoing endeavor to enhance the safety and reliability of LLMs in practical applications.
