Prompt Packer: Deceiving LLMs through Compositional Instruction with Hidden Attacks (2310.10077v1)

Published 16 Oct 2023 in cs.CL

Abstract: Recently, LLMs with powerful general capabilities have been increasingly integrated into various Web applications, while undergoing alignment training to ensure that the generated content aligns with user intent and ethics. Unfortunately, they remain the risk of generating harmful content like hate speech and criminal activities in practical applications. Current approaches primarily rely on detecting, collecting, and training against harmful prompts to prevent such risks. However, they typically focused on the "superficial" harmful prompts with a solitary intent, ignoring composite attack instructions with multiple intentions that can easily elicit harmful content in real-world scenarios. In this paper, we introduce an innovative technique for obfuscating harmful instructions: Compositional Instruction Attacks (CIA), which refers to attacking by combination and encapsulation of multiple instructions. CIA hides harmful prompts within instructions of harmless intentions, making it impossible for the model to identify underlying malicious intentions. Furthermore, we implement two transformation methods, known as T-CIA and W-CIA, to automatically disguise harmful instructions as talking or writing tasks, making them appear harmless to LLMs. We evaluated CIA on GPT-4, ChatGPT, and ChatGLM2 with two safety assessment datasets and two harmful prompt datasets. It achieves an attack success rate of 95%+ on safety assessment datasets, and 83%+ for GPT-4, 91%+ for ChatGPT (gpt-3.5-turbo backed) and ChatGLM2-6B on harmful prompt datasets. Our approach reveals the vulnerability of LLMs to such compositional instruction attacks that harbor underlying harmful intentions, contributing significantly to LLM security development. Warning: this paper may contain offensive or upsetting content!

Citations (20)

View on Semantic Scholar

Summary

The paper demonstrates that compositional instruction attacks (CIA) can bypass LLM safety measures with over 95% success on safety assessments.
It introduces two novel techniques, T-CIA and W-CIA, which embed harmful prompts within benign instructions using personality psychology and creative writing.
The findings expose critical vulnerabilities in current LLM security, highlighting the urgent need for more robust defense mechanisms.

Deceiving LLMs through Compositional Instruction with Hidden Attacks

Introduction to Compositional Instruction Attacks

In the domain of LLMs, the rising concern around model security is paramount, especially when dealing with the generation of harmful content such as hate speech or content promoting criminal activities. The paper by Jiang et al. presents an innovative approach dubbed Compositional Instruction Attacks (CIA), which exposes LLMs' vulnerability to multi-intent harmful instructions that traditional security mechanisms fail to catch. CIA operates by embedding harmful prompts within seemingly innocuous instructions, thereby deceiving the models into generating prohibited content. This approach underscores a crucial blind spot in current LLM security paradigms—that is, their inadequacy in discerning underlying harmful intentions when faced with multifaceted instructions.

Methodology

The methodology encompasses two novel transformation techniques, namely Talking-CIA (T-CIA) and Writing-CIA (W-CIA), which automate the process of disguising harmful intents. T-CIA leverages personality psychology to prompt the LLM into adopting a persona that aligns with the harmful intent, thus bypassing the model's ethical restrictions. On the other hand, W-CIA disguises harmful prompts as creative writing tasks, exploiting the model's less stringent content generation criteria in fictional contexts. These methods were evaluated against state-of-the-art LLMs—GPT-4, ChatGPT, and ChatGLM2, employing two safety assessment datasets and two harmful prompt datasets. Remarkably, CIA achieved attack success rates exceeding 95% on safety assessment datasets and 83% on harmful prompt datasets across all tested models.

Findings and Implications

The paper's findings are significant, revealing how composite instructions with embedded malicious intents can systematically circumnavigate the security alignments of contemporary LLMs. The success of CIA, particularly the high attack success rates, implies a profound vulnerability that could be exploited maliciously in real-world applications. Moreover, the paper propounds that the current model security mechanisms, predominantly designed to counteract single-intent attacks, are insufficient against more sophisticated, multi-intent adversarial strategies. The introduction of T-CIA and W-CIA as automated means of generating compositional attacks further amplifies the urgency for developing more robust defensive mechanisms that can understand and dissect multi-layered instructions.

Future Directions in AI Security

The revelations from this paper suggest several avenues for future research. Primarily, there is an imminent need to enhance the interpretative capabilities of LLMs, enabling them to unravel complex instructions and discern the totality of embedded intents. Furthermore, the modeling of adversarial attacks that simulate real-world malicious applications could serve as a preemptive strategy in training more resilient LLMs. Exploring defensive mechanisms that can dynamically adjust to the evolving sophistication of adversarial attacks also presents a promising research frontier. Lastly, the integration of psychology-driven methodologies in LLM security frameworks, as demonstrated by T-CIA, opens up innovative pathways for mitigating attacks leveraging human-like reasoning and persona manipulation.

Conclusion

The paper by Jiang et al. brings to light a critical oversight in the security mechanisms of LLMs, demonstrating their susceptibility to compositional instruction attacks. The introduction of T-CIA and W-CIA methods not only marks a significant advancement in the understanding of LLM vulnerabilities but also sets a precedent for future explorations into the development of comprehensive defense strategies. As LLMs continue to be integrated into diverse applications, ensuring their robustness against multifaceted adversarial attacks becomes paramount. This paper contributes significantly to the body of knowledge in LLM security, fostering a deeper understanding of potential vulnerabilities and laying the groundwork for future innovations in AI defenses.

PDF Markdown