Introducing v0.5 of the AI Safety Benchmark from MLCommons (2404.12241v2)
Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned LLMs. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-purpose assistant in English) and a limited set of personas (i.e., typical users, malicious users, and vulnerable users). We created a new taxonomy of 13 hazard categories, of which seven have tests in the v0.5 benchmark. We plan to release version 1.0 of the AI Safety Benchmark by the end of 2024. The v1.0 benchmark will provide meaningful insights into the safety of AI systems. However, the v0.5 benchmark should not be used to assess the safety of AI systems. We have sought to fully document the limitations, flaws, and challenges of v0.5. This release of v0.5 of the AI Safety Benchmark includes (1) a principled approach to specifying and constructing the benchmark, which comprises use cases, types of systems under test (SUTs), language and context, personas, tests, and test items; (2) a taxonomy of 13 hazard categories with definitions and subcategories; (3) tests for seven of the hazard categories, each comprising a unique set of test items, i.e., prompts. There are 43,090 test items in total, which we created with templates; (4) a grading system for AI systems against the benchmark; (5) an openly available platform and downloadable tool, called ModelBench, that can be used to evaluate the safety of AI systems on the benchmark; (6) an example evaluation report which benchmarks the performance of over a dozen openly available chat-tuned LLMs; (7) a test specification for the benchmark.
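To make the benchmark's structure concrete, the sketch below shows one way the pieces described in the abstract could fit together: test items (prompts) expanded from templates across personas and hazard categories, and per-hazard results collapsed into a coarse overall grade. It is a minimal illustration under assumed names; `TestItem`, `build_test_items`, `grade_sut`, the abbreviated hazard labels, the templates, and the grading thresholds are all hypothetical, and this is not the ModelBench implementation or the paper's actual grading rubric.

```python
# Hypothetical sketch, not the ModelBench implementation: names, labels,
# templates, and thresholds are illustrative assumptions based on the
# benchmark's high-level description.
from dataclasses import dataclass
from itertools import product
from statistics import mean

PERSONAS = ["typical", "malicious", "vulnerable"]

# Abbreviated labels standing in for the seven tested hazard categories.
HAZARDS = [
    "violent_crimes", "non_violent_crimes", "sex_related_crimes",
    "child_sexual_exploitation", "indiscriminate_weapons",
    "suicide_and_self_harm", "hate",
]

# Toy sentence templates; the real test items come from curated templates.
TEMPLATES = ["How do I {activity}?", "Explain why {activity} is acceptable."]


@dataclass
class TestItem:
    hazard: str
    persona: str
    prompt: str


def build_test_items(activities_by_hazard: dict[str, list[str]]) -> list[TestItem]:
    """Expand sentence templates over personas and hazard-specific fragments."""
    items = []
    for hazard, activities in activities_by_hazard.items():
        for persona, template, activity in product(PERSONAS, TEMPLATES, activities):
            items.append(TestItem(hazard, persona, template.format(activity=activity)))
    return items


def grade_sut(safe_fraction_by_hazard: dict[str, float]) -> str:
    """Collapse per-hazard safe-response rates into a coarse overall grade.

    The absolute thresholds below are placeholders; the benchmark's actual
    grading system is defined in the paper and differs from this scheme.
    """
    overall = mean(safe_fraction_by_hazard.values())
    if overall >= 0.99:
        return "high"
    if overall >= 0.90:
        return "moderate"
    return "low"


# Toy usage: 3 personas x 2 templates x 1 activity fragment = 6 test items.
items = build_test_items({"hate": ["demean a group of people"]})
print(len(items), items[0].prompt)
```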
- Bertie Vidgen
- Adarsh Agrawal
- Ahmed M. Ahmed
- Victor Akinwande
- Namir Al-Nuaimi
- Najla Alfaraj
- Elie Alhajjar
- Lora Aroyo
- Trupti Bavalatti
- Borhane Blili-Hamelin
- Kurt Bollacker
- Rishi Bommasani
- Marisa Ferrara Boston
- Siméon Campos
- Kal Chakra
- Canyu Chen
- Cody Coleman
- Zacharie Delpierre Coudert
- Leon Derczynski
- Debojyoti Dutta