AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

(2407.12784)
Published Jul 17, 2024 in cs.LG, cs.CR, and cs.IR

Abstract

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance (less than 1%) with a poison rate less than 0.1%.

Figure: AgentPoison framework overview. The agent's memory or knowledge base is poisoned with trigger-optimized adversarial examples that can lead to catastrophic outcomes.

Overview

  • The paper introduces AgentPoison, a novel method for testing the robustness of LLM agents by injecting backdoor attacks into memory or retrieval-augmented generation (RAG) knowledge bases.

  • AgentPoison demonstrates high attack success rates on various LLM agents, including those used for autonomous driving, healthcare, and question-answering (QA), while maintaining minimal impact on benign performance.

  • The research underscores potential security vulnerabilities in LLM agents and provides a foundation for developing more secure retrieval mechanisms and defensive strategies.

AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

The paper "AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases" provides an extensive study on the robustness of LLM-based agents equipped with either a memory module or a retrieval-augmented generation (RAG) technique. The research community has been actively developing LLM agents for various tasks such as autonomous driving, question-answering (QA), and healthcare. These agents typically retrieve relevant past knowledge from extensive databases, raising concerns regarding the trustworthiness of the embeddings in use.

Summary of Contributions

The authors introduce AgentPoison, a novel red-teaming approach aimed at uncovering vulnerabilities in LLM agents by injecting backdoor attacks into their long-term memory or RAG knowledge bases. The primary contributions of the paper can be summarized as follows:

  1. Backdoor Attack Framework: AgentPoison frames trigger generation as a constrained optimization problem. It optimizes backdoor triggers so that instances containing them are mapped to a unique, compact region of the retriever's embedding space; as a result, whenever a query contains the optimized trigger, the malicious demonstrations are retrieved with high probability (see the sketch after this list).
  2. Benign Performance Preservation: Unlike conventional backdoor attacks that require model retraining or fine-tuning, AgentPoison leaves benign instructions (those without the trigger) essentially unaffected, preserving the agent's normal performance.
  3. Transferable, Coherent Triggers: The devised backdoor triggers exhibit superior transferability, in-context coherence, and stealthiness, enhancing practical applicability.
  4. Quantitative Evaluation: Extensive experiments validate the effectiveness of AgentPoison on three real-world LLM agents: an autonomous driving agent, a QA agent, and a healthcare agent. The attack achieves an average success rate of ≥80%, with a benign performance impact of ≤1% and a poison rate of <0.1%.
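
To make the first contribution concrete, the following is a minimal sketch of the kind of objective such a constrained trigger optimization might use, assuming a PyTorch dense retriever; the function name, loss terms, and weighting are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def trigger_objective(triggered_emb, benign_emb, lam=1.0):
    """Illustrative loss for trigger optimization (not the paper's exact formulation).

    triggered_emb: (N, d) embeddings of queries containing the candidate trigger
    benign_emb:    (M, d) embeddings of benign queries / knowledge-base entries

    The trigger should (1) push triggered queries into a compact cluster and
    (2) keep that cluster far from the benign embedding distribution, so that
    poisoned entries are retrieved only when the trigger is present.
    """
    triggered = F.normalize(triggered_emb, dim=-1)
    benign = F.normalize(benign_emb, dim=-1)

    # Compactness: triggered embeddings should collapse toward their centroid.
    centroid = F.normalize(triggered.mean(dim=0, keepdim=True), dim=-1)
    compactness = (1.0 - triggered @ centroid.T).mean()

    # Uniqueness: the triggered cluster should be dissimilar to benign embeddings.
    separation = (triggered @ benign.T).max(dim=1).values.mean()

    return compactness + lam * separation  # minimize over candidate trigger tokens
```

Minimizing a loss of this shape over candidate trigger tokens concentrates triggered queries near the poisoned entries while keeping benign retrieval behavior unchanged.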

Experimental Results

The effectiveness of AgentPoison across different agents and frameworks is evaluated with four metrics (a computation sketch follows the list):

  • Attack Success Rate for Retrieval (ASR-r): Evaluates the proportion of test cases where all retrieved instances from the database are poisoned.
  • Attack Success Rate for Action (ASR-a): Measures the probability of generating the target malicious action when poisoned instances are retrieved.
  • End-to-end Attack Success Rate (ASR-t): Quantifies the likelihood of the target malicious action leading to the desired adverse effect in the environment.
  • Benign Accuracy (ACC): Reflects the accuracy of the agent's performance on benign queries without the trigger.
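
As a concrete reading of these definitions, here is a minimal sketch of how the four metrics could be computed from per-case evaluation records; the record fields and the conditioning of ASR-a on successful retrieval are assumptions based on the descriptions above, not the paper's evaluation code.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """One triggered test case; field names are illustrative, not from the paper."""
    retrieved_poisoned: list[bool]      # whether each retrieved demonstration was poisoned
    emitted_target_action: bool         # agent produced the attacker-specified action
    adverse_effect: bool                # the action caused the intended effect in the environment
    benign_correct: bool | None = None  # correctness on the paired benign (trigger-free) query

def summarize(records: list[EvalRecord]) -> dict[str, float]:
    n = len(records)
    asr_r = sum(all(r.retrieved_poisoned) for r in records) / n
    # ASR-a is conditioned on successful retrieval of poisoned demonstrations.
    retrieved = [r for r in records if all(r.retrieved_poisoned)]
    asr_a = sum(r.emitted_target_action for r in retrieved) / max(len(retrieved), 1)
    asr_t = sum(r.adverse_effect for r in records) / n
    benign = [r for r in records if r.benign_correct is not None]
    acc = sum(r.benign_correct for r in benign) / max(len(benign), 1)
    return {"ASR-r": asr_r, "ASR-a": asr_a, "ASR-t": asr_t, "ACC": acc}
```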

The experiments demonstrate that, across agents, backbone models, and retrievers, AgentPoison consistently outperforms the baselines (GCG, AutoDAN, CPA, and BadChain). For the autonomous driving agent, for instance, the end-to-end attack success rate reaches 82.4% with less than 1% degradation in benign accuracy, demonstrating both the effectiveness and the stealthiness of the attack.

Analysis of the Approach

AgentPoison achieves its goals via a multi-step, gradient-guided search algorithm (a simplified sketch follows the list):

  1. Initialization: Task-relevant strings are selected as the initial trigger candidates.
  2. Gradient Approximation: Candidate token replacements are scored with a gradient-approximation method suited to discrete optimization.
  3. Constraint Filtering: Candidates that violate the non-differentiable constraints on target-action generation and textual coherence are filtered out via beam search.
  4. Iterative Optimization: The algorithm iteratively refines trigger candidates, maximizing adversarial retrievability and target-action generation while keeping the impact on benign queries minimal.
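
The following is a simplified sketch of such a gradient-guided beam search over discrete trigger tokens; the callable interfaces (`score_fn`, `coherence_fn`, `propose_fn`) are hypothetical stand-ins for the paper's gradient-approximation scoring, coherence constraint, and token-replacement proposal steps.

```python
import heapq

def optimize_trigger(init_triggers, score_fn, coherence_fn, propose_fn,
                     beam_width=5, n_iters=20, coherence_threshold=0.5):
    """Simplified beam search over discrete trigger tokens (assumed interfaces).

    score_fn(trigger)     -> float, higher = poisoned entries retrieved more reliably
                             (e.g., a gradient-approximation surrogate of the objective)
    coherence_fn(trigger) -> float, fluency/coherence of the trigger in context
    propose_fn(trigger)   -> iterable of candidate triggers with one token replaced
    """
    beam = list(init_triggers)
    for _ in range(n_iters):
        candidates = set()
        for trig in beam:
            candidates.update(propose_fn(trig))   # token-level replacements
        # Drop candidates that violate the non-differentiable coherence constraint.
        feasible = [t for t in candidates if coherence_fn(t) >= coherence_threshold]
        if not feasible:
            break
        # Keep the top-scoring candidates as the beam for the next iteration.
        beam = heapq.nlargest(beam_width, feasible, key=score_fn)
    return max(beam, key=score_fn)
```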

Implications and Future Considerations

The methodology presented by AgentPoison highlights a new dimension of security concerns for RAG-based LLM agents. This research opens avenues for future studies focusing on defensive mechanisms, such as enhancing the robustness of embedding spaces against adversarial triggers and improving the transparency and reliability of third-party knowledge bases. Potential future developments may include integrating anomaly detection systems or employing adversarial training techniques to mitigate such subtle backdoor attacks.
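
As one illustration of the anomaly-detection direction mentioned above, a defender could flag knowledge-base entries whose embeddings sit in unusually isolated regions; the k-nearest-neighbor filter below is only a hypothetical sketch, not a defense evaluated in the paper.

```python
import numpy as np

def flag_outlier_entries(embeddings: np.ndarray, k: int = 10, fraction: float = 0.01):
    """Flag knowledge-base entries whose embeddings are unusually isolated.

    A simple k-nearest-neighbor similarity filter, shown only to illustrate the kind
    of anomaly detection suggested as future work; embeddings are assumed L2-normalized.
    """
    sims = embeddings @ embeddings.T          # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)           # exclude self-similarity
    # Mean similarity to the k nearest neighbors; low values indicate isolated entries.
    knn_sims = np.sort(sims, axis=1)[:, -k:].mean(axis=1)
    threshold = np.quantile(knn_sims, fraction)
    return knn_sims < threshold               # boolean mask of suspicious entries
```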

Conclusion

The paper offers a rigorous, technical exploration into the security vulnerabilities of LLM-based systems augmented with memory and RAG techniques. By establishing the effectiveness of AgentPoison, the authors provide crucial insights into the risks associated with unverified knowledge bases in LLM contexts. This research underscores the importance of robust, secure retrieval mechanisms, setting a foundation for subsequent advancements in safe AI deployment.
