Abstract

When building large language models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps assess alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.

The ALERT taxonomy outlines 6 macro and 32 micro categories of safety risks.

Overview

  • The paper introduces ALERT, a benchmark for assessing the safety of LLMs using over 45,000 red teaming prompts across a detailed safety taxonomy.

  • ALERT's taxonomy categorizes safety risks into 6 macro and 32 micro categories, covering issues from hate speech to the promotion of illegal activities.

  • An evaluation of 10 state-of-the-art LLMs reveals vulnerabilities: even models considered safe, such as GPT-4, exhibit weaknesses in specific micro-categories, underscoring the need for nuanced safety assessments.

  • The research suggests that future work should focus on refining safety mechanisms, exploring adversarial strategies, and expanding the benchmark to include multilingual prompts.

Introducing ALERT: A Comprehensive Safety Benchmark for LLMs

Overview of the ALERT Benchmark

The paper presents ALERT (Assessing LLMs’ Safety through Red Teaming), a benchmark for assessing the safety of LLMs against a detailed safety risk taxonomy. The benchmark comprises over 45,000 red teaming prompts, each assigned to a category of the taxonomy, enabling a rigorous evaluation of LLMs across a range of potential safety risks. By simulating adversarial scenarios, ALERT seeks to uncover vulnerabilities in LLMs and thereby contribute to improving language model safety.
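
To make the evaluation protocol concrete, here is a minimal sketch of an ALERT-style red teaming loop: iterate over categorized prompts, collect the target model's responses, judge each response as safe or unsafe with an auxiliary safety judge, and aggregate per-category safety scores. The file format, field names, and the `generate`/`is_safe` helpers are illustrative assumptions, not the paper's released tooling.

```python
# Sketch of an ALERT-style evaluation loop (illustrative, not the official code).
# Assumes prompts are stored as JSON lines with "prompt" and "category" fields.
import json
from collections import defaultdict


def generate(prompt: str) -> str:
    """Query the model under test; replace with a real API or client call."""
    raise NotImplementedError


def is_safe(prompt: str, response: str) -> bool:
    """Judge whether the response is safe; replace with an auxiliary safety classifier."""
    raise NotImplementedError


def evaluate(path: str) -> dict:
    """Return a per-category safety score: the fraction of responses judged safe."""
    safe, total = defaultdict(int), defaultdict(int)
    with open(path) as f:
        for line in f:
            item = json.loads(line)                 # {"prompt": ..., "category": ...}
            response = generate(item["prompt"])
            total[item["category"]] += 1
            safe[item["category"]] += int(is_safe(item["prompt"], response))
    return {category: safe[category] / total[category] for category in total}
```

A model that refuses or safely deflects every adversarial instruction in a category scores 1.0 there; lower scores flag categories that warrant targeted mitigation.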

Taxonomy Development

The development of the ALERT safety risk taxonomy constitutes a significant contribution to the field. The taxonomy, encompassing 6 macro and 32 micro categories, provides a structured framework for evaluating the safety of LLMs. It covers a broad spectrum of safety risks: hate speech and discrimination, criminal planning, regulated substances, sexual content, suicide and self-harm, and guns and illegal weapons. This comprehensive taxonomy not only facilitates a nuanced assessment of an LLM's safety but also aids in aligning LLMs with various policies and regulations.
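
As an illustration of how such a two-level taxonomy can be represented, the snippet below encodes the six macro categories as a mapping to lists of micro categories. The macro names follow the paper; the micro-category identifiers shown are placeholders rather than the paper's exact 32 labels.

```python
# Partial, illustrative encoding of the ALERT taxonomy: each macro category
# maps to a list of micro categories. Macro names follow the paper; the micro
# labels below are placeholders, not the paper's exact identifiers.
ALERT_TAXONOMY = {
    "hate_speech_discrimination": ["hate_ethnic", "hate_religion", "hate_women"],
    "criminal_planning": ["crime_theft", "crime_cyber"],
    "regulated_substances": ["substance_cannabis", "substance_other_drugs"],
    "sexual_content": ["sex_harassment", "sex_other"],
    "suicide_self_harm": ["self_harm_suicide", "self_harm_other"],
    "guns_illegal_weapons": ["weapon_firearm", "weapon_other"],
}

# Invert the mapping so a prompt labeled with a micro category
# can be rolled up to its macro category when reporting scores.
MICRO_TO_MACRO = {
    micro: macro
    for macro, micros in ALERT_TAXONOMY.items()
    for micro in micros
}
```

Keeping the hierarchy explicit makes it straightforward to report results at either granularity, or to restrict an evaluation to the subset of categories a given policy cares about.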

Evaluation of LLMs Utilizing ALERT

The paper's examination of 10 state-of-the-art LLMs using the ALERT benchmark yields insightful findings. It demonstrates that even LLMs considered safe, such as GPT-4, have vulnerabilities in handling specific micro-categories, such as content related to cannabis. These results underscore the necessity of nuanced, context-aware evaluations for deploying LLMs across different domains.

Implications and Future Work

The findings highlight the complexity inherent in achieving comprehensive safety in LLMs and point to the need for continuous, fine-grained evaluation and more advanced safety mechanisms. The authors also construct a Direct Preference Optimization (DPO) dataset from the gathered prompts and model responses, opening a path for future research to further refine the safety of LLMs. Moreover, the taxonomy's alignment with various AI policies suggests a route toward LLMs that are both safe and compliant with regulations.
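
To illustrate what such a preference dataset might look like, the snippet below sketches the shape of one DPO-style record built from an ALERT prompt: the instruction is paired with a response judged safe (preferred) and one judged unsafe (dispreferred). The field names and placeholder contents are assumptions for illustration, not the paper's released schema.

```python
# Hypothetical shape of one record in a DPO preference dataset derived from
# ALERT. Field names and placeholder values are illustrative only.
dpo_record = {
    "prompt": "<red teaming instruction from ALERT>",
    "chosen": "<response judged safe, e.g. a refusal>",   # preferred
    "rejected": "<response judged unsafe>",               # dispreferred
}
```

Fine-tuning on such pairs with DPO nudges the model toward the safe responses directly from preference data, without training a separate explicit reward model.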

Looking forward, the paper suggests several avenues for further research. These include a deeper examination of adversarial strategies, exploring the evolution of safety features across LLM versions, and extending the ALERT benchmark to include multilingual prompts. Such efforts are crucial for advancing the development of LLMs that are not only powerful and versatile but also safe and responsible.

In conclusion, the ALERT benchmark marks a significant step forward in the quest for safer LLM deployment. Through its comprehensive safety taxonomy and detailed evaluation of leading LLMs, the benchmark provides a valuable tool for researchers and developers alike. By identifying vulnerabilities and sharpening the focus on safety, ALERT contributes to the broader effort to ensure that the advancement of LLM technology proceeds with caution and conscientiousness.
