Papers

Topics

Authors

Recent

View all

Gemini 2.5 Flash

97 tokens/sec

GPT-4o

11 tokens/sec

Gemini 2.5 Pro Pro

47 tokens/sec

o3 Pro

5 tokens/sec

GPT-4.1 Pro

38 tokens/sec

DeepSeek R1 via Azure Pro

28 tokens/sec

2000 character limit reached

Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks (2310.10844v1)

Published 16 Oct 2023 in cs.CL, cs.CR, and cs.LG

Abstract: LLMs are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML, combining the perspectives of Natural Language Processing and Security. Prior work has shown that even safety-aligned LLMs (via instruction tuning and reinforcement learning through human feedback) can be susceptible to adversarial attacks, which exploit weaknesses and mislead AI systems, as evidenced by the prevalence of `jailbreak' attacks on models like ChatGPT and Bard. In this survey, we first provide an overview of LLMs, describe their safety alignment, and categorize existing research based on various learning structures: textual-only attacks, multi-modal attacks, and additional attack methods specifically targeting complex systems, such as federated learning or multi-agent systems. We also offer comprehensive remarks on works that focus on the fundamental sources of vulnerabilities and potential defenses. To make this field more accessible to newcomers, we present a systematic review of existing works, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24).

References (283)

Citations (104)

View on Semantic Scholar

Summary

The paper introduces a comprehensive categorization of LLM adversarial attacks, detailing text-based, multimodal, and integrated system vulnerabilities.
It analyzes specific attack vectors such as jailbreaks, prompt injections, and complex system exploits, emphasizing their practical risk implications.
The study outlines evolving defense strategies, including adversarial training and input filtering, to bolster the security and reliability of LLM applications.

Surveying the Landscape of Adversarial Attacks on LLMs

Adversarial Attack Categories and Their Impact on LLMs

Adversarial attacks present significant challenges for the robustness and security of LLMs, with implications for their integration into complex systems and applications. This survey categorizes these attacks into three primary classes: unimodal text-based attacks, multimodal attacks, and attacks targeting complex systems that incorporate LLMs. Each category reflects a unique vector through which these models can be compromised, from prompt injections and jailbreaks to exploiting multimodal inputs and the intricate interconnections within multi-agent systems. Understanding these attack vectors is crucial for developing effective defensive mechanisms.

Unimodal Attacks: Jailbreaks and Prompt Injections

Jailbreak Attacks: Aimed at bypassing safety alignments through creatively crafted prompts, these attacks force LLMs to generate prohibited output. Such vulnerabilities highlight the challenges in achieving full alignment with human preferences and the need for comprehensive safety measures.
Prompt Injection Attacks: These involve manipulating the model's inputs through adversarially crafted prompts, leading to undesired or deceptive outputs. Prompt injections exploit the instructional capabilities of LLMs, coercing them to prioritize injected instructions over their intended tasks.

Multimodal Attacks: Exploiting Additional Inputs

Multimodal attacks leverage the expanded input space of LLMs that process beyond text, such as images or audio. These attacks introduce adversarial perturbations across different modalities, exploiting vulnerabilities inherent to the processing of non-textual information. The complexity of defending against these attacks underscores the necessity of cross-modality security measures in LLMs.

Attacks on Complex Systems: Targeting LLM Integration

As LLMs become more embedded in systems involving multiple components or agents, the attack surface broadens. This survey identifies specific attacks targeting such integrations, including those exploiting retrieval mechanisms, federated learning architectures, and structured data. The interconnected nature of these systems amplifies the potential impact of successful attacks, necessitating advanced defensive strategies tailored to multi-component environments.

Causes of Vulnerabilities

The survey further explores the underlying causes of these vulnerabilities, from static model characteristics to the lack of comprehensive data coverage and alignment imperfections. These causes are pivotal in understanding how attacks exploit LLMs and serve as a foundation for developing robust defenses.

Defensive Mechanisms

In response to these adversarial threats, various defense strategies have been proposed, ranging from input and output filtering to adversarial training and the employment of human feedback mechanisms. These defenses aim to enhance the resilience of LLMs against adversarial manipulation, ensuring their reliability and safety in practice. However, the evolving nature of adversarial tactics necessitates continual adaptation and improvement of defensive measures.

Conclusion and Future Directions

This survey underscores the multifaceted nature of adversarial attacks against LLMs and the imperative for comprehensive defensive strategies. As LLMs continue to advance and integrate more deeply into various applications and systems, understanding and mitigating these adversarial threats will be critical for ensuring the integrity, security, and trustworthiness of AI-driven solutions. Future research should focus on advancing defensive mechanisms, exploring the interplay between different types of attacks and defenses, and fostering the development of LLMs that are both powerful and resistant to adversarial exploitation.

PDF Markdown

Tweets

https://twitter.com/Erf_Shayegani/status/1822000415203790980

https://twitter.com/HowardYHuang/status/1746992252939849823

YouTube

Show All Videos