Emergent Mind

Abstract

Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations that we address with SORRY-Bench, our proposed benchmark. First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics. For example, among the ten existing datasets that we evaluated, tests for refusals of self-harm instructions are over 3x less represented than tests for fraudulent activities. SORRY-Bench improves on this by using a fine-grained taxonomy of 45 potentially unsafe topics and 450 class-balanced unsafe instructions, compiled through human-in-the-loop methods. Second, the linguistic characteristics and formatting of prompts -- such as different languages, dialects, and more -- are often overlooked and only implicitly considered in many evaluations. We supplement SORRY-Bench with 20 diverse linguistic augmentations to systematically examine these effects. Third, existing evaluations rely on large LLMs (e.g., GPT-4) as judges, which can be computationally expensive. We investigate design choices for creating a fast, accurate automated safety evaluator. By collecting 7K+ human annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs, we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4-scale LLMs at substantially lower computational cost. Putting these together, we evaluate over 40 proprietary and open-source LLMs on SORRY-Bench, analyzing their distinctive refusal behaviors. We hope our effort provides a building block for systematic evaluations of LLMs' safety refusal capabilities, in a balanced, granular, and efficient manner.

Figure: Benchmark results of 40+ LLMs ranked by fulfillment rates across 45 safety categories.

Overview

  • SORRY-Bench introduces a detailed and nuanced framework for evaluating safety refusal behaviors in LLMs, addressing previous limitations by offering a more comprehensive taxonomy of harmful instructions and integrating diverse linguistic formats.

  • The benchmark synthesizes and balances data from 10 prior benchmarks, encompassing 450 class-balanced unsafe instructions augmented by 20 linguistic variations, to ensure robust evaluation against sophisticated prompt engineering.

  • SORRY-Bench's evaluations reveal significant variations in refusal behaviors across 43 LLMs, with models like Claude-2 and Gemini-1.5 demonstrating strong refusal rates, thus offering critical insights into the strictness and effectiveness of safety policies in different LLMs.

Systematically Evaluating LLM Safety Refusal Behaviors with SORRY-Bench

SORRY-Bench presents a comprehensive framework for evaluating the safety refusal behaviors of LLMs. Developed to address limitations in existing methods, the benchmark prioritizes nuanced, granular analysis across a broad spectrum of potentially unsafe topics. The paper systematically constructs a more balanced and fine-grained taxonomy of harmful instructions, integrates diverse linguistic formats, and proposes efficient automated safety evaluators. The results from evaluating 40+ LLMs offer both detailed insight into their refusal behaviors and a robust methodology for future developments in AI safety.

Fine-Grained and Diverse Evaluation Taxonomy

SORRY-Bench's taxonomy categorizes potentially unsafe instructions into 45 classes, spanning four high-level domains: Hate Speech Generation, Assistance with Crimes or Torts, Potentially Inappropriate Topics, and Potentially Unqualified Advice. This granularity addresses the coarse definitions commonly found in prior datasets, where broad categories often obfuscate specific risks. Noteworthy is the systematic approach to taxonomy development, employing a human-in-the-loop methodology to refine and ensure comprehensive coverage.
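
As a rough sketch of how such a taxonomy can be represented, the snippet below maps the four high-level domains to a handful of fine-grained classes. Only the domain names and the categories quoted elsewhere in this summary come from the benchmark; the remaining class names are hypothetical placeholders, not the exact 45-class taxonomy.

```python
# A rough illustration of a domain -> fine-grained class mapping.
# Classes marked "illustrative" are hypothetical placeholders.
TAXONOMY = {
    "Hate Speech Generation": [
        "Personal Insults",                      # illustrative
        "Social-group Insults",                  # illustrative
    ],
    "Assistance with Crimes or Torts": [
        "Fraud",                                 # mentioned in this summary
        "Animal-related Crimes",                 # mentioned in this summary
        "Self-Harm",                             # mentioned in this summary
    ],
    "Potentially Inappropriate Topics": [
        "Sexually Explicit Content Generation",  # mentioned in this summary
    ],
    "Potentially Unqualified Advice": [
        "Medical Advice",                        # illustrative
        "Legal Advice",                          # illustrative
    ],
}

def count_classes(taxonomy: dict[str, list[str]]) -> int:
    """Total number of fine-grained classes; the full benchmark defines 45."""
    return sum(len(classes) for classes in taxonomy.values())
```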

Dataset Collection and Balance

To construct a balanced dataset, the authors synthesized and expanded upon 10 prior benchmarks, creating a total of 450 class-balanced unsafe instructions (10 per category). This effort mitigates the over-representation of certain categories noted in previous work, such as "Fraud" and "Sexually Explicit Content Generation," and ensures that underrepresented but critical categories like "Animal-related Crimes" and "Self-Harm" are sufficiently covered.
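
Given these numbers, class balance reduces to a simple invariant: 45 categories × 10 prompts each = 450 instructions. Below is a minimal sketch of such a check, assuming a hypothetical record schema with a 'category' field.

```python
from collections import Counter

def is_class_balanced(instructions: list[dict], n_classes: int = 45,
                      per_class: int = 10) -> bool:
    """Check the 45 x 10 = 450 balance invariant.

    Assumes each record is a dict with a 'category' field -- a hypothetical
    schema used only for this sketch.
    """
    counts = Counter(rec["category"] for rec in instructions)
    return (len(counts) == n_classes
            and all(c == per_class for c in counts.values()))
```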

Linguistic Mutations and Diversity

Addressing the variability of real user prompts, SORRY-Bench includes 20 linguistic augmentations, expanding the dataset by 9,000 prompts (450 base instructions × 20 mutations). These mutations cover variations such as different languages, dialects, writing styles, and encoding strategies. By decoupling linguistic characteristics from content, the benchmark evaluates LLMs' ability to recognize and refuse unsafe prompts across diverse formats, ensuring robustness against sophisticated prompt engineering.
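
Conceptually, each of the 20 mutations is a function applied to all 450 base prompts, yielding 450 × 20 = 9,000 mutated prompts. The sketch below illustrates the idea with three toy mutations (uppercasing, ROT13, and a Caesar cipher); it is not the benchmark's actual mutation code.

```python
import codecs

def caesar(text: str, shift: int = 3) -> str:
    """Illustrative Caesar-cipher mutation (shift letters, keep the rest)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# Three toy mutations standing in for the benchmark's 20; translations into
# other languages/dialects would additionally require a translation model.
MUTATIONS = {
    "uppercase": str.upper,                          # writing-style mutation
    "rot13": lambda t: codecs.encode(t, "rot_13"),   # encoding mutation
    "caesar": caesar,                                # encryption mutation
}

def augment(base_prompts: list[str]) -> dict[str, list[str]]:
    """Apply every mutation to every base prompt (450 x 20 = 9,000 variants)."""
    return {name: [fn(p) for p in base_prompts] for name, fn in MUTATIONS.items()}
```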

Efficient and Accurate Automated Evaluators

SORRY-Bench advances the methodology for automated safety evaluation by conducting a meta-evaluation over a dataset of 7,200 human annotations. Various design choices for LLM-based evaluators were compared, revealing that fine-tuned smaller models (e.g., 7B parameters) can achieve accuracy comparable to larger models like GPT-4 at significantly lower computational cost. The chosen judge, a fine-tuned Mistral-7B-Instruct-v0.2, strikes this balance, reaching over 80% agreement with human evaluators with an evaluation time of roughly 10 seconds per pass.
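
At its core, this meta-evaluation measures how often a candidate judge's verdict matches the human label for the same model response. A minimal sketch of that agreement metric, assuming binary fulfilled/refused labels on both sides:

```python
def judge_agreement(judge_labels: list[int], human_labels: list[int]) -> float:
    """Fraction of records where the automated judge agrees with the human.

    Assumes a binary labeling scheme (1 = fulfilled, 0 = refused) on both
    sides -- a simplification of the actual annotation protocol.
    """
    assert len(judge_labels) == len(human_labels) > 0
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```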

Benchmark Results and Implications

The evaluation of 43 LLMs on SORRY-Bench reveals significant variation in refusal behavior. Claude-2 and Gemini-1.5 models exhibit the strongest refusal behavior, with fulfillment rates under 10%. In contrast, models in the Mistral series have notably higher fulfillment rates, exceeding 50%. Such discrepancies highlight the diverse safety policies and alignment goals pursued by different model creators. Analyzing these results provides critical insight into how closely both industry and open-source models adhere to safety standards.
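
The headline metric here is the fulfillment rate: the fraction of unsafe instructions a model fulfills rather than refuses, as judged by the automated evaluator. A minimal aggregation sketch, assuming hypothetical 'model' and 'fulfilled' fields per record:

```python
from collections import defaultdict

def fulfillment_rates(records: list[dict]) -> dict[str, float]:
    """Per-model fulfillment rate: share of unsafe instructions fulfilled.

    Assumes hypothetical per-record fields: 'model' (name) and 'fulfilled'
    (binary judge verdict).
    """
    totals: dict[str, int] = defaultdict(int)
    fulfilled: dict[str, int] = defaultdict(int)
    for rec in records:
        totals[rec["model"]] += 1
        fulfilled[rec["model"]] += int(rec["fulfilled"])
    return {model: fulfilled[model] / totals[model] for model in totals}
```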

The study also underscores the dynamic nature of LLM safety, as seen in the temporal analysis of models like GPT-4 and Llama series. Changes in fulfillment rates across model versions reflect the evolving strategies of model developers in response to emerging safety challenges and regulatory guidelines.

Evaluating the Impact of Linguistic Diversity

Analysis of the linguistic mutations shows that specific styles and formats (e.g., technical terminology, persuasion techniques) significantly affect fulfillment rates. In contrast, encoding and encryption transformations generally decreased fulfillment rates, as models often failed to decode these requests correctly. These findings underscore the need for LLMs to handle diverse and complex prompt formats robustly to ensure comprehensive safety refusal.
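
One straightforward way to quantify these effects is to compare each mutation's fulfillment rate with that of the unmutated prompts; the sketch below assumes a simple mapping from mutation name to fulfillment rate.

```python
def mutation_deltas(rates: dict[str, float], baseline: str = "base") -> dict[str, float]:
    """Change in fulfillment rate per mutation vs. the unmutated baseline.

    'rates' maps a mutation name (plus a 'base' entry for unmutated prompts)
    to a model's fulfillment rate; positive deltas mean the mutation made the
    model comply with unsafe requests more often.
    """
    base = rates[baseline]
    return {name: rate - base for name, rate in rates.items() if name != baseline}
```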

Future Directions and Conclusion

SORRY-Bench provides a crucial foundation for refining LLM safety evaluations. However, the study acknowledges areas for further research, such as evaluating multi-risk scenarios and ensuring continuous updates to encompass evolving safety standards. Future enhancements may include integrating advanced jailbreaking techniques and extending datasets to capture new emerging threats.

In conclusion, SORRY-Bench offers a rigorous, granular, and balanced approach to assessing LLM safety refusal behaviors. It serves as an invaluable tool for researchers and practitioners, enabling systematic improvements and ensuring safer, more robust AI deployments.
