Emergent Mind

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents

(2401.10019)
Published Jan 18, 2024 in cs.CL and cs.AI

Abstract

LLMs have exhibited great potential in autonomously completing tasks across real-world applications. Despite this, these LLM agents introduce unexpected safety risks when operating in interactive environments. Instead of centering on LLM-generated content safety in most prior studies, this work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records. R-Judge comprises 162 records of multi-turn agent interaction, encompassing 27 key risk scenarios among 7 application categories and 10 risk types. It incorporates human consensus on safety with annotated safety labels and high-quality risk descriptions. Evaluation of 9 LLMs on R-Judge shows considerable room for enhancing the risk awareness of LLMs: The best-performing model, GPT-4, achieves 72.52% in contrast to the human score of 89.07%, while all other models score less than the random. Moreover, further experiments demonstrate that leveraging risk descriptions as environment feedback achieves substantial performance gains. With case studies, we reveal that correlated to parameter amount, risk awareness in open agent scenarios is a multi-dimensional capability involving knowledge and reasoning, thus challenging for current LLMs. R-Judge is publicly available at https://github.com/Lordog/R-Judge.

R-Judge example: dataset interaction, human annotation, serial evaluation paradigm, and automatic effectiveness assessment.

Overview

  • R-Judge serves as a new benchmark to evaluate LLMs' ability to recognize safety risks in various contexts.

  • Consists of 162 interaction records spanning 27 scenarios across 7 application categories, with 10 distinct risk types, including human-consensus annotated labels.

  • Evaluation of eight LLMs revealed a general shortfall in risk identification, with GPT-4 performing the best at 72.29% F1 score, below the human benchmark of 89.38%.

  • Study shows models' risk awareness improves when provided with clear risk descriptions, highlighting the importance of effective risk communication.

  • The R-Judge benchmark paves the way for future AI safety research by emphasizing behavioral safety and interaction dynamics in real-world applications.

Introduction to R-Judge

Understanding the capacity of LLMs to discern safety risks is crucial as they are increasingly deployed in interactive environments. To bridge this knowledge gap, a new benchmark named R-Judge has been introduced. R-Judge is designed to assess the proficiency of LLMs in evaluating safety risks within various application scenarios and through diverse risk typologies.

R-Judge Benchmark

R-Judge is composed of 162 interaction records derived from 27 scenarios across 7 application categories. The benchmark features 10 types of risks including privacy leaks and data loss. R-Judge is unique in its incorporation of human consensus on safety, with annotated labels and high-quality descriptions of risks available for each interaction record. The benchmark serves as a tool to measure the risk awareness levels in LLM agents when navigating tasks that may involve safety-critical decisions.

Evaluation and Findings

Eight prominent LLMs were evaluated using the R-Judge benchmark. The results disclosed that most models fell short in adequately identifying safety risks in open-ended scenarios. The highest F1 score was achieved by GPT-4 with 72.29%, which is still below the human benchmark of 89.38%. This indicates a significant scope for improving the risk awareness of LLM agents. The study found a marked performance improvement when models were provided with risk descriptions as feedback, emphasizing the value of clear risk communication to enhance agent safety.

Implications and Further Research

The introduction of R-Judge points to an important direction in AI safety research: benchmarks that focus more on behavioral safety. This elaborates beyond traditional content safety concerns and moves towards how LLM agents act in dynamic environments. The outcomes of the R-Judge evaluation can steer future advancements in agent safety, including performance optimization through feedback incorporation and the importance of tailoring safety mechanisms to specific application contexts.

In essence, R-Judge is not just a proving ground for the current generation of LLMs but also a foundation upon which future research and development can build to address the challenges of safety risk assessment in autonomous agents. The benchmark, along with accompanying tools and techniques, is openly accessible to researchers and developers for continued exploration and enhancement of LLM agent safety.

Create an account to read this summary for free:

Newsletter

Get summaries of trending comp sci papers delivered straight to your inbox:

Unsubscribe anytime.