Defending Against Indirect Prompt Injection Attacks With Spotlighting (2403.14720v1)
Abstract: LLMs, while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.
- J. Yi, Y. Xie, B. Zhu, K. Hines, E. Kiciman, G. Sun, X. Xie, F. Wu, “Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models”, arXiv preprint arXiv:2312.14197, 2023.
- A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,” arXiv preprint arXiv:1905.00537, 2020.
- P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, “SQuAD: 100,000+ Questions for Machine Comprehension of Text,” arXiv preprint arXiv:1606.05250, 2016.
- A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, “Learning Word Vectors for Sentiment Analysis,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, June 2011, pp. 142–150.
- International Telecommunication Union, “Q Series: Switching and Signalling No. 5,” 1988. [Online]. Available: https://www.itu.int/rec/T-REC-Q.140-Q.180-198811-I/en. [Accessed: Feb. 2, 2024].
- International Telecommunication Union, “Q Series: Switching and Signalling No. 6,” 1988. [Online]. Available: https://www.itu.int/rec/T-REC-Q.251-Q.300-198811-I/en. [Accessed: Feb. 2, 2024].
- OpenAI, “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023.
- K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, M. Fritz, “More than you’ve asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to Application-Integrated Large Language Models,” arXiv preprint arXiv:2302.12173, 2023.
- K. Greshake, “How We Broke LLMs: Indirect Prompt Injection,” Kai Greshake, 2022. [Online]. Available: https://kai-greshake.de/posts/llm-malware/. [Accessed: Feb. 21, 2024].
- Wunderwuzzi, “Hacking Google Bard - From Prompt Injection to Data Exfiltration,” Embrace The Red, 2023. [Online]. Available: https://embracethered.com/blog/posts/2023/google-bard-data-exfiltration/. [Accessed: Feb. 21, 2024].
- Anthropic Team, “Core Views on AI Safety: When, Why, What, and How,” 2023. [Online]. Available: https://www.anthropic.com/news/core-views-on-ai-safety. [Accessed: Feb. 21, 2024].