Defending Against Indirect Prompt Injection Attacks With Spotlighting (2403.14720v1)

Published 20 Mar 2024 in cs.CR, cs.CL, and cs.LG

Abstract: LLMs, while powerful, are built and trained to process a single text input. In common applications, multiple inputs can be processed by concatenating them together into a single stream of text. However, the LLM is unable to distinguish which sections of prompt belong to various input sources. Indirect prompt injection attacks take advantage of this vulnerability by embedding adversarial instructions into untrusted data being processed alongside user commands. Often, the LLM will mistake the adversarial instructions as user commands to be followed, creating a security vulnerability in the larger system. We introduce spotlighting, a family of prompt engineering techniques that can be used to improve LLMs' ability to distinguish among multiple sources of input. The key insight is to utilize transformations of an input to provide a reliable and continuous signal of its provenance. We evaluate spotlighting as a defense against indirect prompt injection attacks, and find that it is a robust defense that has minimal detrimental impact to underlying NLP tasks. Using GPT-family models, we find that spotlighting reduces the attack success rate from greater than {50}\% to below {2}\% in our experiments with minimal impact on task efficacy.

References (11)

Citations (22)

View on Semantic Scholar

Summary

The paper presents spotlighting as a novel ensemble of prompt engineering techniques to mitigate indirect prompt injection attacks.
Experiments demonstrate that spotlighting decreases attack success rates from over 50% to under 2%, with minimal impact on core NLP tasks.
Key implementations, including delimiting, datamarking, and encoding, offer progressively robust defenses for LLM security.

Defending Against Indirect Prompt Injection Attacks through Spotlighting Techniques

Introduction

The challenge of ensuring the security of LLMs against indirect prompt injection attacks (XPIA) is becoming increasingly significant as these models are integrated into a wider range of applications. XPIA exploits the inability of LLMs to distinguish between multiple sources of input, thereby posing a serious threat to both the integrity and reliability of LLM applications. The paper introduces "spotlighting" as an ensemble of prompt engineering techniques developed to enhance LLMs' ability to differentiate among various input sources effectively. This is achieved through the employment of specific input transformations, providing a continuous signal of input provenance. The researchers have empirically demonstrated that spotlighting significantly diminishes the attack success rate (ASR) for GPT-family models with minimal adverse effects on the efficacy of fundamental NLP tasks.

LLM systems, despite their advanced capabilities in natural language processing tasks, remain vulnerable to prompt injection attacks due to their inherent design. This vulnerability is particularly pronounced in applications processing external data sources, presenting an opportunity for malicious actors to embed adversarial instructions. Prior investigations into XPIA have shed light on its feasibility and the pressing need for robust defensive measures. Earlier responses to mitigate this risk have ranged from alignment tuning which incorporates desirable responses into training objectives, to post-training safeguards like prompt engineering. Yet, these measures have met with limited success against the subtlety and variability of XPIA.

Spotlighting: A Novel Defensive Strategy

The concept of spotlighting emerges from the necessity to provide LLMs with the means to discern between "safe" and "unsafe" blocks of tokens, thereby reducing susceptibility to malicious prompt injections. This paper elaborates on three specific implementations of spotlighting:

Delimiting: Here, special tokens are utilized to demarcate the input text, signaling the model to disregard instructions within these bounds.
Datamarking: Extending beyond simple delimiters, datamarking intersperses a special token throughout the input text, aiding the model in recognizing and isolating the text that should be viewed with scrutiny.
Encoding: This involves transforming the input text using a known encoding algorithm, relying on the model's capacity to decode and process the information while maintaining an awareness of its origin.

Experimental Insights

Upon evaluating the spotlighting techniques against a corpus of documents embedded with prompt injection attacks, the researchers observed a dramatic reduction in ASR across different models and tasks. Specifically, spotlighting reduced the ASR to below 2% from an initial rate exceeding 50%. These results underscore spotlighting's efficacy as a defense mechanism against indirect prompt injections. Additionally, further analysis revealed that while delimiting offered a moderate degree of protection, datamarking and encoding presented more robust defenses, with encoding positioned as the most preferable option given its profound impact on ASR reduction.

Future Perspectives and Conclusion

The advent of spotlighting as a defense technique against XPIA invites a reevaluation of current strategies in securing LLMs. Its success prompts considerations for future LLM design and training protocols, potentially leading to models inherently less susceptible to prompt injection attacks. Moreover, the paper's findings encourage ongoing research to explore more adaptive and dynamic implementation of spotlighting techniques capable of countering evolving attack methodologies. In conclusion, spotlighting represents a significant step forward in enhancing the security framework for LLM applications, mitigating the risks associated with indirect prompt injections while preserving the models' utility across a spectrum of NLP tasks.