Emergent Mind

AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents

(2406.13352)

Published Jun 19, 2024 in cs.CR and cs.LG

Abstract

AI agents aim to solve complex tasks by combining text-based reasoning with external tool calls. Unfortunately, AI agents are vulnerable to prompt injection attacks where data returned by external tools hijacks the agent to execute malicious tasks. To measure the adversarial robustness of AI agents, we introduce AgentDojo, an evaluation framework for agents that execute tools over untrusted data. To capture the evolving nature of attacks and defenses, AgentDojo is not a static test suite, but rather an extensible environment for designing and evaluating new agent tasks, defenses, and adaptive attacks. We populate the environment with 97 realistic tasks (e.g., managing an email client, navigating an e-banking website, or making travel bookings), 629 security test cases, and various attack and defense paradigms from the literature. We find that AgentDojo poses a challenge for both attacks and defenses: state-of-the-art LLMs fail at many tasks (even in the absence of attacks), and existing prompt injection attacks break some security properties but not all. We hope that AgentDojo can foster research on new design principles for AI agents that solve common tasks in a reliable and robust manner. We release the code for AgentDojo at https://github.com/ethz-spylab/agentdojo.

AgentDojo assesses AI utility and security in tool-calling environments, defining user and attacker goals.

Overview

AgentDojo is a dynamic environment framework designed to evaluate AI agents' robustness against adversarial attacks, particularly focusing on LLMs integrated with external tools.
The framework includes a rich set of realistic tasks and security test cases, allowing for comprehensive testing of both the utility and security of AI agents in diverse and demanding environments.
AgentDojo's extensible architecture supports the addition of new tasks, tools, attacks, and defenses, aiming to continuously evolve and adapt to emerging attack strategies and defense mechanisms.

AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents

"AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents" introduces AgentDojo, a robust evaluation framework tailored for assessing the adversarial robustness of AI agents. These agents, which integrate LLMs with external tools, are increasingly vulnerable to prompt injection attacks. AgentDojo is presented as a dynamic, extensible environment designed to test agents against evolving attacks and defenses, rather than a static test suite.

Key Contributions

1. Dynamic Evaluation Framework: AgentDojo is designed to evaluate AI agents that employ external tools to accomplish tasks over potentially untrusted data. By not being a static test suite, AgentDojo allows researchers to design new tasks, defenses, and adaptive attacks, providing a comprehensive and evolving testing environment.

2. Realistic Task Environment: The framework includes a rich set of environments populated with 97 realistic tasks and 629 security test cases. Tasks range from managing email clients and navigating e-banking websites to booking travel arrangements. These tasks are diverse and demanding, challenging both the utility and robustness of AI agents.

3. Extensible Design: AgentDojo's architecture supports the addition of new tasks, tools, attacks, and defenses. This design intentionally accommodates the evolving nature of attack methods and the corresponding defenses in AI security. For example, the introduction of a new tool in an environment or a novel attack vector can be seamlessly integrated into the framework.

Strong Numerical Results and Claims

1. Baseline Utility and Security Evaluation: Initial evaluations reveal that current state-of-the-art LLMs face significant challenges in performing many tasks within AgentDojo. Specifically, even without adversarial attacks, models like GPT-4o show a utility rate below 69%, indicating substantial room for improvement.

2. Prompt Injection Vulnerability: The paper highlights that existing prompt injection attacks have varying levels of success. For instance, attacks succeed in less than 25% of cases against the best-performing agents without specific defenses. When defenses like a secondary attack detector are deployed, the attack success rate drastically drops to approximately 8%.

3. Trade-offs in Defense Mechanisms: The robustness of various defenses was tested, showing that techniques like data delimiters and isolation mechanisms can enhance security but sometimes at the cost of utility. However, these defenses generally show significant efficacy, particularly against less sophisticated attacks.

Implications of the Research

Practical Implications: From a practical standpoint, the research underscores the importance of developing agents that can operate safely in adversarial environments. Tools and mechanisms to filter or detect malicious prompt injections need to be robust, as evidenced by the demonstrated vulnerabilities. Businesses that deploy AI agents in sensitive applications, such as finance or personal data management, must consider integrating such defenses.

Theoretical Implications: Theoretically, the research propels forward the discourse on securely integrating LLMs with external tools. It raises critical questions about the underlying model architectures and training paradigms that might inherently be vulnerable when interfaced with external, untrusted systems.

Future Developments: Looking ahead, novel research will likely focus on refining isolation mechanisms and improving the ability of LLMs to distinguish between instructions and data, possibly through architectural innovations or enhanced training regimes. Additionally, developing adaptive defenses that evolve in complexity in parallel with emerging attack strategies is a critical area of focus for future work.

Conclusion: AgentDojo sets a new standard for evaluating AI agents, providing a robust platform for continuous and dynamic testing of both attacks and defenses. By making the framework publicly available, the authors hope to stimulate further research and development in creating secure, reliable AI agents that can effectively perform in real-world, adversarial environments.

References

The comprehensive set of references, including foundational work on LLMs, AI security, and practical implementations of prompt injections and defenses, underpins the research's scientific rigor, providing a valuable resource for further exploration in the field.

AgentDojo’s contribution to AI security is invaluable, bridging the gap between theoretical vulnerabilities and practical defenses, thereby fostering an ecosystem of resilient AI applications.

Create an account to read this summary for free:

GitHub

GitHub - ethz-spylab/agentdojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents. (48 stars)

YouTube

https://twitter.com/edoardo_debe/status/1804153184337719591

https://twitter.com/sebkrier/status/1808780564872245253

https://twitter.com/FSFG/status/1804179077957329313