R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (2401.10019v3)
Abstract: LLMs have exhibited great potential in autonomously completing tasks across real-world applications. Despite this, these LLM agents introduce unexpected safety risks when operating in interactive environments. Instead of centering on the harmlessness of LLM-generated content in most prior studies, this work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records. R-Judge comprises 569 records of multi-turn agent interaction, encompassing 27 key risk scenarios among 5 application categories and 10 risk types. It is of high-quality curation with annotated safety labels and risk descriptions. Evaluation of 11 LLMs on R-Judge shows considerable room for enhancing the risk awareness of LLMs: The best-performing model, GPT-4o, achieves 74.42% while no other models significantly exceed the random. Moreover, we reveal that risk awareness in open agent scenarios is a multi-dimensional capability involving knowledge and reasoning, thus challenging for LLMs. With further experiments, we find that fine-tuning on safety judgment significantly improve model performance while straightforward prompting mechanisms fail. R-Judge is publicly available at https://github.com/Lordog/R-Judge.
- Rishabh Bhardwaj and Soujanya Poria. 2023. Red-teaming large language models using chain of utterances for safety-alignment. ArXiv preprint, abs/2308.09662.
- Evaluating large language models trained on code. ArXiv preprint, abs/2107.03374.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
- MuTual: A dataset for multi-turn dialogue reasoning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1406–1416, Online. Association for Computational Linguistics.
- Safe rlhf: Safe reinforcement learning from human feedback. ArXiv, abs/2310.12773.
- Masterkey: Automated jailbreaking of large language model chatbots.
- Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b. ArXiv, abs/2311.00117.
- Pal: program-aided language models. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org.
- Critic: Large language models can self-correct with tool-interactive critiquing. ArXiv preprint, abs/2305.11738.
- Metagpt: Meta programming for multi-agent collaborative framework. ArXiv preprint, abs/2308.00352.
- Large language models cannot self-correct reasoning yet. ArXiv preprint, abs/2310.01798.
- A survey of safety and trustworthiness of large language models through the lens of verification and validation. arXiv preprint arXiv:2305.11391.
- Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
- Multi-step jailbreaking privacy attacks on chatgpt. ArXiv preprint, abs/2304.05197.
- Starcoder: may the source be with you! ArXiv preprint, abs/2305.06161.
- Swiftsage: A generative agent with fast and slow thinking for complex interactive tasks. ArXiv preprint, abs/2305.17390.
- Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167, Vancouver, Canada. Association for Computational Linguistics.
- Pattie Maes. 1995. Agents that reduce work and information overload. In Readings in human–computer interaction, pages 811–821. Elsevier.
- Flirt: Feedback loop in-context red teaming. ArXiv, abs/2308.04265.
- Testing language model agents safely in the wild. ArXiv preprint, abs/2311.10538.
- Webgpt: Browser-assisted question-answering with human feedback. ArXiv preprint, abs/2112.09332.
- OpenAI. 2022. Introducing chatgpt.
- OpenAI. 2023. GPT-4 technical report. ArXiv preprint, abs/2303.08774.
- Fine-tuning aligned language models compromises safety, even when users do not intend to! ArXiv, abs/2310.03693.
- Communicative agents for software development. ArXiv preprint, abs/2307.07924.
- Toran Bruce Richards. 2023. Auto-gpt: An autonomous gpt-4 experiment. https://github.com/Significant-Gravitas/Auto-GPT.
- Identifying the risks of LM agents with an LM-emulated sandbox. In The Twelfth International Conference on Learning Representations (ICLR).
- Toolformer: Language models can teach themselves to use tools. ArXiv preprint, abs/2302.04761.
- Quantifying language models’ sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. ArXiv preprint, abs/2310.11324.
- Reflexion: Language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems.
- Cognitive architectures for language agents. ArXiv preprint, abs/2309.02427.
- Safety assessment of chinese large language models. ArXiv, abs/2304.10436.
- XAgent Team. 2023. Xagent: An autonomous agent for complex task solving.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Can large language models really improve by self-critiquing their own plans? ArXiv preprint, abs/2310.08118.
- Voyager: An open-ended embodied agent with large language models. In Intrinsically-Motivated and Open-Ended Learning Workshop @NeurIPS2023.
- A survey on large language model based autonomous agents. ArXiv preprint, abs/2308.11432.
- Emergent abilities of large language models. Transactions on Machine Learning Research.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Michael Wooldridge and Nicholas R Jennings. 1995. Intelligent agents: Theory and practice. The knowledge engineering review, 10(2):115–152.
- The rise and potential of large language model based agents: A survey. ArXiv preprint, abs/2309.07864.
- How far are we from believable ai agents? a framework for evaluating the believability of human behavior simulation. ArXiv preprint, abs/2312.17115.
- Openagents: An open platform for language agents in the wild.
- Sc-safety: A multi-round open-ended question adversarial safety benchmark for large language models in chinese. ArXiv preprint, abs/2310.05818.
- Magic: Investigation of large language model powered multi-agent in cognition, adaptability, rationality and collaboration. arXiv e-prints, pages arXiv–2311.
- Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757.
- React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations.
- Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. ArXiv preprint, abs/2308.06463.
- Appagent: Multimodal agents as smartphone users.
- Safetybench: Evaluating the safety of large language models with multiple choice questions. arXiv preprint arXiv: 2309.07045.
- Igniting language intelligence: The hitchhiker’s guide from chain-of-thought reasoning to language agents. ArXiv preprint, abs/2311.11797.
- Automatic chain of thought prompting in large language models. In The Eleventh International Conference on Learning Representations.
- Safety and ethical concerns of large language models. In Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 4: Tutorial Abstracts), pages 9–16.
- Webarena: A realistic web environment for building autonomous agents. ArXiv preprint, abs/2307.13854.
- Agents: An open-source framework for autonomous language agents. ArXiv preprint, abs/2309.07870.
- Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. ArXiv preprint, abs/2306.04528.