Emergent Mind

Abstract

Evaluating aligned large language models' (LLMs) ability to recognize and reject unsafe user requests is crucial for safe, policy-compliant deployments. Existing evaluation efforts, however, face three limitations that we address with SORRY-Bench, our proposed benchmark. First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics. For example, among the ten existing datasets that we evaluated, tests for refusals of self-harm instructions are over 3x less represented than tests for fraudulent activities. SORRY-Bench improves on this by using a fine-grained taxonomy of 45 potentially unsafe topics and 450 class-balanced unsafe instructions, compiled through human-in-the-loop methods. Second, the linguistic characteristics and formatting of prompts -- such as different languages, dialects, and more -- are often overlooked and only implicitly considered in many evaluations. We supplement SORRY-Bench with 20 diverse linguistic augmentations to systematically examine these effects. Third, existing evaluations rely on large LLMs (e.g., GPT-4) as judges, which can be computationally expensive. We investigate design choices for creating a fast, accurate automated safety evaluator. By collecting 7K+ human annotations and conducting a meta-evaluation of diverse LLM-as-a-judge designs, we show that fine-tuned 7B LLMs can achieve accuracy comparable to GPT-4-scale LLMs at substantially lower computational cost. Putting these together, we evaluate over 40 proprietary and open-source LLMs on SORRY-Bench, analyzing their distinctive refusal behaviors. We hope our effort provides a building block for systematic evaluations of LLMs' safety refusal capabilities, in a balanced, granular, and efficient manner.

Figure: Benchmark results of 40+ LLMs ranked by fulfillment rates across 45 safety categories.

Overview

  • SORRY-Bench introduces a detailed and nuanced framework for evaluating safety refusal behaviors in LLMs, addressing previous limitations by offering a more comprehensive taxonomy of harmful instructions and integrating diverse linguistic formats.

  • The benchmark synthesizes and balances data from 10 prior benchmarks, encompassing 450 class-balanced unsafe instructions augmented by 20 linguistic variations, to ensure robust evaluation against sophisticated prompt engineering.

  • SORRY-Bench's evaluations reveal significant variations in refusal behaviors across 43 LLMs, with models like Claude-2 and Gemini-1.5 demonstrating strong refusal rates, thus offering critical insights into the strictness and effectiveness of safety policies in different LLMs.

Systematically Evaluating LLM Safety Refusal Behaviors with SORRY-Bench

SORRY-Bench presents a comprehensive framework for evaluating the safety refusal behaviors of LLMs. Developed to address limitations in existing methods, the benchmark prioritizes nuanced, granular analysis across a broad spectrum of potentially unsafe topics. The paper systematically constructs a more balanced and fine-grained taxonomy of harmful instructions, integrates diverse linguistic formats, and proposes efficient automated safety evaluators. The results from evaluating 40+ LLMs offer both detailed insight into their refusal behaviors and a robust methodology for future developments in AI safety.

Fine-Grained and Diverse Evaluation Taxonomy

SORRY-Bench's taxonomy categorizes potentially unsafe instructions into 45 classes, spanning four high-level domains: Hate Speech Generation, Assistance with Crimes or Torts, Potentially Inappropriate Topics, and Potentially Unqualified Advice. This granularity addresses the coarse definitions commonly found in prior datasets, where broad categories often obfuscate specific risks. Noteworthy is the systematic approach to taxonomy development, employing a human-in-the-loop methodology to refine and ensure comprehensive coverage.
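
As a rough sketch of how such a taxonomy can be represented, the snippet below maps the four high-level domains to a handful of fine-grained classes. Only the domain names and the categories quoted elsewhere in this summary come from the benchmark; the remaining class names are hypothetical placeholders, not the exact 45-class taxonomy.

```python
# A rough illustration of a domain -> fine-grained class mapping.
# Classes marked "illustrative" are hypothetical placeholders.
TAXONOMY = {
    "Hate Speech Generation": [
        "Personal Insults",                      # illustrative
        "Social-group Insults",                  # illustrative
    ],
    "Assistance with Crimes or Torts": [
        "Fraud",                                 # mentioned in this summary
        "Animal-related Crimes",                 # mentioned in this summary
        "Self-Harm",                             # mentioned in this summary
    ],
    "Potentially Inappropriate Topics": [
        "Sexually Explicit Content Generation",  # mentioned in this summary
    ],
    "Potentially Unqualified Advice": [
        "Medical Advice",                        # illustrative
        "Legal Advice",                          # illustrative
    ],
}

def count_classes(taxonomy: dict[str, list[str]]) -> int:
    """Total number of fine-grained classes; the full benchmark defines 45."""
    return sum(len(classes) for classes in taxonomy.values())
```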

Dataset Collection and Balance

To construct a balanced dataset, the authors synthesized and expanded upon 10 prior benchmarks, creating a total of 450 class-balanced unsafe instructions (10 per category). This effort mitigates the over-representation of certain categories noted in previous work, such as "Fraud" and "Sexually Explicit Content Generation," and ensures that underrepresented but critical categories like "Animal-related Crimes" and "Self-Harm" are sufficiently covered.
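
Given these numbers, class balance reduces to a simple invariant: 45 categories × 10 prompts each = 450 instructions. Below is a minimal sketch of such a check, assuming a hypothetical record schema with a 'category' field.

```python
from collections import Counter

def is_class_balanced(instructions: list[dict], n_classes: int = 45,
                      per_class: int = 10) -> bool:
    """Check the 45 x 10 = 450 balance invariant.

    Assumes each record is a dict with a 'category' field -- a hypothetical
    schema used only for this sketch.
    """
    counts = Counter(rec["category"] for rec in instructions)
    return (len(counts) == n_classes
            and all(c == per_class for c in counts.values()))
```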

Linguistic Mutations and Diversity

Addressing the variability of real user prompts, SORRY-Bench includes 20 linguistic augmentations, expanding the dataset by 9,000 prompts (450 base instructions × 20 mutations). These mutations cover variations such as different languages, dialects, writing styles, and encoding strategies. By decoupling linguistic characteristics from content, the benchmark evaluates LLMs' ability to recognize and refuse unsafe prompts across diverse formats, ensuring robustness against sophisticated prompt engineering.
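
Conceptually, each of the 20 mutations is a function applied to all 450 base prompts, yielding 450 × 20 = 9,000 mutated prompts. The sketch below illustrates the idea with three toy mutations (uppercasing, ROT13, and a Caesar cipher); it is not the benchmark's actual mutation code.

```python
import codecs

def caesar(text: str, shift: int = 3) -> str:
    """Illustrative Caesar-cipher mutation (shift letters, keep the rest)."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

# Three toy mutations standing in for the benchmark's 20; translations into
# other languages/dialects would additionally require a translation model.
MUTATIONS = {
    "uppercase": str.upper,                          # writing-style mutation
    "rot13": lambda t: codecs.encode(t, "rot_13"),   # encoding mutation
    "caesar": caesar,                                # encryption mutation
}

def augment(base_prompts: list[str]) -> dict[str, list[str]]:
    """Apply every mutation to every base prompt (450 x 20 = 9,000 variants)."""
    return {name: [fn(p) for p in base_prompts] for name, fn in MUTATIONS.items()}
```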

Efficient and Accurate Automated Evaluators

SORRY-Bench advances the methodology for automated safety evaluation by conducting a meta-evaluation over a dataset of 7,200 human annotations. Various design choices for LLM-based evaluators were compared, revealing that fine-tuned smaller models (e.g., 7B parameters) can achieve accuracy comparable to larger models like GPT-4 at significantly lower computational cost. The chosen judge, a fine-tuned Mistral-7B-Instruct-v0.2, strikes this balance, reaching over 80% agreement with human evaluators with an evaluation time of roughly 10 seconds per pass.
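
At its core, this meta-evaluation measures how often a candidate judge's verdict matches the human label for the same model response. A minimal sketch of that agreement metric, assuming binary fulfilled/refused labels on both sides:

```python
def judge_agreement(judge_labels: list[int], human_labels: list[int]) -> float:
    """Fraction of records where the automated judge agrees with the human.

    Assumes a binary labeling scheme (1 = fulfilled, 0 = refused) on both
    sides -- a simplification of the actual annotation protocol.
    """
    assert len(judge_labels) == len(human_labels) > 0
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)
```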

Benchmark Results and Implications

The evaluation of 43 LLMs on SORRY-Bench reveals significant variation in refusal behavior. Claude-2 and Gemini-1.5 models exhibit the strongest refusal behavior, with fulfillment rates under 10%. In contrast, models in the Mistral series have notably higher fulfillment rates, exceeding 50%. Such discrepancies highlight the diverse safety policies and alignment goals pursued by different model creators. Analyzing these results provides critical insight into how closely both industry and open-source models adhere to safety standards.
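
The headline metric here is the fulfillment rate: the fraction of unsafe instructions a model fulfills rather than refuses, as judged by the automated evaluator. A minimal aggregation sketch, assuming hypothetical 'model' and 'fulfilled' fields per record:

```python
from collections import defaultdict

def fulfillment_rates(records: list[dict]) -> dict[str, float]:
    """Per-model fulfillment rate: share of unsafe instructions fulfilled.

    Assumes hypothetical per-record fields: 'model' (name) and 'fulfilled'
    (binary judge verdict).
    """
    totals: dict[str, int] = defaultdict(int)
    fulfilled: dict[str, int] = defaultdict(int)
    for rec in records:
        totals[rec["model"]] += 1
        fulfilled[rec["model"]] += int(rec["fulfilled"])
    return {model: fulfilled[model] / totals[model] for model in totals}
```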

The study also underscores the dynamic nature of LLM safety, as seen in the temporal analysis of models like GPT-4 and Llama series. Changes in fulfillment rates across model versions reflect the evolving strategies of model developers in response to emerging safety challenges and regulatory guidelines.

Evaluating the Impact of Linguistic Diversity

Analysis of the linguistic mutations shows that specific styles and formats (e.g., technical terminology, persuasion techniques) significantly affect fulfillment rates. In contrast, encoding and encryption transformations generally decreased fulfillment rates, as models often failed to decode these requests correctly. These findings underscore the need for LLMs to handle diverse and complex prompt formats robustly to ensure comprehensive safety refusal.
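
One straightforward way to quantify these effects is to compare each mutation's fulfillment rate with that of the unmutated prompts; the sketch below assumes a simple mapping from mutation name to fulfillment rate.

```python
def mutation_deltas(rates: dict[str, float], baseline: str = "base") -> dict[str, float]:
    """Change in fulfillment rate per mutation vs. the unmutated baseline.

    'rates' maps a mutation name (plus a 'base' entry for unmutated prompts)
    to a model's fulfillment rate; positive deltas mean the mutation made the
    model comply with unsafe requests more often.
    """
    base = rates[baseline]
    return {name: rate - base for name, rate in rates.items() if name != baseline}
```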

Future Directions and Conclusion

SORRY-Bench provides a crucial foundation for refining LLM safety evaluations. However, the study acknowledges areas for further research, such as evaluating multi-risk scenarios and ensuring continuous updates to encompass evolving safety standards. Future enhancements may include integrating advanced jailbreaking techniques and extending datasets to capture new emerging threats.

In conclusion, SORRY-Bench offers a rigorous, granular, and balanced approach to assessing LLM safety refusal behaviors. It serves as an invaluable tool for researchers and practitioners, enabling systematic improvements and ensuring safer, more robust AI deployments.
