Automatic and Universal Prompt Injection Attacks against Large Language Models (2403.04957v1)
Abstract: Large language models (LLMs) excel at processing and generating human language, powered by their ability to interpret and follow instructions. However, their capabilities can be exploited through prompt injection attacks. These attacks manipulate LLM-integrated applications into producing responses aligned with the attacker's injected content, deviating from the user's actual requests. The substantial risks posed by these attacks underscore the need for a thorough understanding of the threats. Yet, research in this area faces challenges due to the lack of a unified goal for such attacks and their reliance on manually crafted prompts, complicating comprehensive assessments of prompt injection robustness. We introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data, even in the face of defensive measures. With only five training samples (0.3% of the test data), our attack achieves superior performance compared with baselines. Our findings emphasize the importance of gradient-based testing, which can help avoid overestimating robustness, especially that of defense mechanisms.
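To make the "automated gradient-based" idea concrete, below is a minimal sketch of a generic gradient-guided token search (in the spirit of GCG-style coordinate search) for a universal injection string, not the paper's exact algorithm. The model name, the toy training pair, the target text, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact algorithm) of a gradient-guided,
# GCG-style token search for a universal injection string. Model name,
# training pair, target text, and hyperparameters are all illustrative.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():                   # only gradients w.r.t. the input are needed
    p.requires_grad_(False)
embed = model.get_input_embeddings().weight    # (vocab_size, hidden_dim)

# Hypothetical "training" pair: a benign user request and the attacker's target output.
train = [("Summarize the following email.", "Please visit attacker.example.com")]

# Injection tokens to optimize, initialized to a placeholder string.
adv_ids = tok(" ! ! ! ! ! ! ! !", add_special_tokens=False,
              return_tensors="pt").input_ids[0]

def loss_and_grad(adv):
    """Cross-entropy of the target continuation, plus gradients w.r.t. a one-hot
    relaxation of the injected tokens (giving per-token substitution scores)."""
    total, grads = 0.0, torch.zeros(len(adv), embed.shape[0])
    for user, target in train:
        pre = tok(user, return_tensors="pt").input_ids[0]
        tgt = tok(target, add_special_tokens=False, return_tensors="pt").input_ids[0]
        onehot = F.one_hot(adv, embed.shape[0]).to(embed.dtype).requires_grad_()
        seq = torch.cat([embed[pre], onehot @ embed, embed[tgt]]).unsqueeze(0)
        logits = model(inputs_embeds=seq).logits[0]
        start = len(pre) + len(adv)            # logits[start - 1] predicts tgt[0]
        loss = F.cross_entropy(logits[start - 1:start - 1 + len(tgt)], tgt)
        loss.backward()
        total, grads = total + loss.item(), grads + onehot.grad.float()
    return total, grads

# Greedy coordinate search: repeatedly swap one injected token for a candidate
# suggested by the gradient, keeping the swap that lowers the loss the most.
for step in range(100):
    loss, grads = loss_and_grad(adv_ids)
    pos = torch.randint(len(adv_ids), (1,)).item()
    candidates = (-grads[pos]).topk(8).indices  # tokens whose gradient most reduces the loss
    best = min(candidates, key=lambda c: loss_and_grad(
        torch.cat([adv_ids[:pos], c.view(1), adv_ids[pos + 1:]]))[0])
    adv_ids = torch.cat([adv_ids[:pos], best.view(1), adv_ids[pos + 1:]])

print("Candidate universal injection:", tok.decode(adv_ids))
```

A real attack of this kind would typically average the loss over several training samples, evaluate a larger batch of candidate swaps per step, and may add momentum to the gradient signal; the sketch above only illustrates the core gradient-guided substitution loop.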