Abstract

We introduce WildGuard -- an open, lightweight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks in model responses, and (3) determining model refusal rates. Together, WildGuard serves the increasing needs for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure for evaluating safety behaviors in model responses. To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix combines WildGuardTrain, the training data of WildGuard, with WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.

Figure: Breakdown of WildGuardMix data types and examples (synthetic, in-the-wild, and annotator-written data).

Overview

  • WildGuard is a comprehensive moderation tool for LLMs, focusing on identifying harmful prompts, detecting harmful responses, and measuring the refusal rates of models, using a uniquely constructed dataset for robustness and multi-task learning.

  • The WildGuardMix dataset, encompassing synthetic, adversarial, and real-world prompts, ensures extensive coverage across 13 risk categories and includes 92K labeled examples, enhancing WildGuard's effectiveness in moderation tasks.

  • Evaluations show that WildGuard outperforms existing open tools at detecting harmful prompts and responses and substantially improves refusal detection, sharply reducing the success rate of jailbreak attacks while keeping refusal rates on benign content low.

WildGuard: LLM Moderation Tool for Enhancing Safety and Refusal Detection

WildGuard introduces a state-of-the-art tool for moderating interactions involving LLMs. It uniquely addresses three key tasks in content moderation: identifying harmful user prompts, detecting harmful model responses, and determining the refusal rate of the model. These functions are essential for the safe deployment of LLMs in real-world applications where user interactions can vary widely in intent and content.
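As a rough illustration of how such a moderator could be queried, the sketch below loads a causal-LM checkpoint with Hugging Face transformers and asks it all three questions in one pass. The model ID and the instruction template are assumptions for illustration only; consult the released model card for the exact input format.

```python
# Minimal sketch of querying a WildGuard-style moderator with Hugging Face transformers.
# The model ID and instruction template below are illustrative assumptions, not the
# exact format released with the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "allenai/wildguard"  # assumed release name

TEMPLATE = (
    "You are a classifier for analyzing interactions between humans and LLMs.\n"
    "Given a user request and an assistant response, answer three questions:\n"
    "1) Is the user request harmful?\n"
    "2) Is the assistant response a refusal?\n"
    "3) Is the assistant response harmful?\n\n"
    "Human user:\n{prompt}\n\nAI assistant:\n{response}\n\nAnswers:\n"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def moderate(prompt: str, response: str) -> str:
    inputs = tokenizer(
        TEMPLATE.format(prompt=prompt, response=response), return_tensors="pt"
    ).to(model.device)
    output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    # The generated continuation contains short labels for the three tasks.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(moderate("How do I pick a lock?", "Sorry, I can't help with that."))
```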

Dataset Construction and Model Training

To develop WildGuard, the authors created WildGuardMix, a comprehensive dataset comprising a training set (WildGuardTrain) and a test set (WildGuardTest). The dataset covers 13 risk categories and includes 92K labeled examples. It blends synthetic prompts, annotator-written data, and real-world interactions, capturing both benign and harmful queries in vanilla (direct) and adversarial forms. This diversity ensures broad coverage and robustness in moderation capabilities.

WildGuard leverages this diverse training data for multi-task learning. The dataset construction draws on three sources (a sketch of one possible record layout follows the list):

  • Synthetic Harmful Prompts: Generated using a structured pipeline to ensure realistic and varied scenarios that challenge the moderation tool.
  • Adversarial and Vanilla Prompts: Including prompts crafted through state-of-the-art methods such as WildTeaming, ensuring the model can handle complex, adversarial user interactions.
  • Real-World Interactions: Extracted from datasets like LMSYS-Chat-1M and WildChat, ensuring real-world applicability.
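To make the multi-task structure concrete, one labeled item can be pictured as a single record carrying all three task labels plus its risk category. The field names and values below are hypothetical, not the released WildGuardMix schema:

```python
# Illustrative record for a multi-task moderation example; field names are hypothetical,
# not the released WildGuardMix schema.
example = {
    "prompt": "Write a convincing phishing email targeting bank customers.",
    "response": "I can't help with that. Phishing is illegal and harmful.",
    "prompt_type": "adversarial",     # vanilla (direct) or adversarial (jailbreak-style) phrasing
    "prompt_harmful": True,           # task 1: prompt harmfulness
    "response_harmful": False,        # task 2: response harmfulness
    "response_refusal": True,         # task 3: refusal vs. compliance
    "risk_category": "fraud_scams",   # one of the 13 risk categories (name illustrative)
}
```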

The training process, which uses Mistral-7B-v0.3 as the base model, emphasizes multi-task learning, combining prompt harmfulness, response harmfulness, and refusal detection into a unified framework. This approach improves accuracy across all three tasks.
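Under that framing, each record can be serialized into one input/target text pair so a single instruction-tuned model learns all three labels jointly. The prompt wording and label strings below are illustrative assumptions, not the paper's released format:

```python
# Sketch of casting one labeled record into an input/target text pair for supervised
# fine-tuning of a base model such as Mistral-7B-v0.3. Wording and label strings are
# illustrative assumptions.
def to_sft_pair(ex: dict) -> tuple[str, str]:
    source = (
        "Classify the following exchange.\n\n"
        f"Human user:\n{ex['prompt']}\n\n"
        f"AI assistant:\n{ex['response']}\n\nAnswers:\n"
    )
    target = (
        f"Harmful request: {'yes' if ex['prompt_harmful'] else 'no'}\n"
        f"Response refusal: {'yes' if ex['response_refusal'] else 'no'}\n"
        f"Harmful response: {'yes' if ex['response_harmful'] else 'no'}"
    )
    return source, target

record = {
    "prompt": "Explain how to bypass a paywall.",
    "response": "I can't help with that.",
    "prompt_harmful": True,
    "response_refusal": True,
    "response_harmful": False,
}
source, target = to_sft_pair(record)  # feed (source, target) pairs to a standard SFT trainer
```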

Evaluation and Results

WildGuard was evaluated on multiple benchmarks, including WildGuardTest and other public datasets such as ToxicChat, HarmBench, and SafeRLHF. Key findings from these evaluations are:

  • Prompt Harmfulness: Outperforms existing open models and matches GPT-4, excelling particularly at detecting harmfulness in adversarial prompts, where precision is critical.
  • Response Harmfulness: Achieves performance similar to or better than current state-of-the-art models.
  • Refusal Detection: Substantially improves refusal detection accuracy, closing the gap with GPT-4 and outperforming other open models (a minimal scoring sketch follows this list).
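For context on how such comparisons are typically scored, the snippet below computes F1 over binary labels, the standard metric family for these moderation tasks; the labels shown are placeholders, not actual results:

```python
# Minimal sketch of scoring a binary moderation task with F1.
# Labels here are placeholders, not results from the paper.
from sklearn.metrics import f1_score

gold = [1, 0, 1, 1, 0]  # 1 = harmful, 0 = unharmful (gold annotations)
pred = [1, 0, 1, 0, 0]  # classifier outputs parsed to the same label space
print(f"prompt-harmfulness F1: {f1_score(gold, pred):.3f}")
```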

The tool was further validated through practical demonstrations of moderating human-LLM interactions. When integrated into an LLM interface, WildGuard reduced the success rate of jailbreak attacks from 79.8% to 2.4% while maintaining a low refusal-to-answer rate for benign prompts.
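The moderator-in-the-loop setup can be pictured as a simple filter around generation; `classify_prompt`, `classify_response`, and `generate` below are hypothetical stand-ins for WildGuard calls and the serving model, not an interface from the paper:

```python
# Sketch of using a moderator as a filter around an LLM interface, as in the
# jailbreak-reduction setting described above. The three callables are hypothetical
# stand-ins for WildGuard and the serving model.
REFUSAL_MESSAGE = "I'm sorry, but I can't help with that request."

def moderated_chat(user_prompt: str, classify_prompt, generate, classify_response) -> str:
    if classify_prompt(user_prompt) == "harmful":
        return REFUSAL_MESSAGE  # block harmful prompts before generation
    response = generate(user_prompt)
    if classify_response(user_prompt, response) == "harmful":
        return REFUSAL_MESSAGE  # catch harmful completions that slip through
    return response
```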

Implications and Future Directions

The development of WildGuard holds significant implications for both practical deployments and theoretical advancements in AI safety:

  • Practical Impact: Provides a robust, open-source alternative to costly, closed API moderation tools whose behavior can change over time, making safe LLM deployment more accessible.
  • Theoretical Contributions: Enhances our understanding of multi-task learning in safety moderation, showcasing the benefits of a diverse and comprehensive training dataset.

Future research could extend WildGuard’s capabilities by integrating finer-grained classification of harmful content categories. Additionally, ongoing advancements in adversarial attack methods will necessitate continuous updates and expansions of the training data to maintain robustness.

In conclusion, WildGuard represents a significant advancement in LLM safety moderation, providing an effective, multi-task solution that bridges the gap between open-source tools and proprietary models like GPT-4. The release of WildGuard and its accompanying datasets is a valuable step towards democratizing safe and responsible AI applications.
