Emergent Mind

Abstract

Recently, LLMs with powerful general capabilities have been increasingly integrated into various Web applications, while undergoing alignment training to ensure that the generated content aligns with user intent and ethics. Unfortunately, they remain the risk of generating harmful content like hate speech and criminal activities in practical applications. Current approaches primarily rely on detecting, collecting, and training against harmful prompts to prevent such risks. However, they typically focused on the "superficial" harmful prompts with a solitary intent, ignoring composite attack instructions with multiple intentions that can easily elicit harmful content in real-world scenarios. In this paper, we introduce an innovative technique for obfuscating harmful instructions: Compositional Instruction Attacks (CIA), which refers to attacking by combination and encapsulation of multiple instructions. CIA hides harmful prompts within instructions of harmless intentions, making it impossible for the model to identify underlying malicious intentions. Furthermore, we implement two transformation methods, known as T-CIA and W-CIA, to automatically disguise harmful instructions as talking or writing tasks, making them appear harmless to LLMs. We evaluated CIA on GPT-4, ChatGPT, and ChatGLM2 with two safety assessment datasets and two harmful prompt datasets. It achieves an attack success rate of 95%+ on safety assessment datasets, and 83%+ for GPT-4, 91%+ for ChatGPT (gpt-3.5-turbo backed) and ChatGLM2-6B on harmful prompt datasets. Our approach reveals the vulnerability of LLMs to such compositional instruction attacks that harbor underlying harmful intentions, contributing significantly to LLM security development. Warning: this paper may contain offensive or upsetting content!

Non-reject rate and attack success rate depicted for the T-CIA method.

Overview

  • The paper introduces Compositional Instruction Attacks (CIA), revealing vulnerabilities in LLMs to multi-intent harmful instructions.

  • Two novel techniques, Talking-CIA (T-CIA) and Writing-CIA (W-CIA), are used to disguise harmful intents leading to high success rates in bypassing model security.

  • CIA's effectiveness demonstrates a critical vulnerability in current LLMs, highlighting their inability to detect complex adversarial strategies.

  • The study urges the development of more sophisticated defense mechanisms in LLMs, alongside enhancing their capability to understand and counteract multi-intent instructions.

Deceiving LLMs through Compositional Instruction with Hidden Attacks

Introduction to Compositional Instruction Attacks

In the domain of LLMs, the rising concern around model security is paramount, especially when dealing with the generation of harmful content such as hate speech or content promoting criminal activities. The paper by Jiang et al. presents an innovative approach dubbed Compositional Instruction Attacks (CIA), which exposes LLMs' vulnerability to multi-intent harmful instructions that traditional security mechanisms fail to catch. CIA operates by embedding harmful prompts within seemingly innocuous instructions, thereby deceiving the models into generating prohibited content. This approach underscores a crucial blind spot in current LLM security paradigms—that is, their inadequacy in discerning underlying harmful intentions when faced with multifaceted instructions.

Methodology

The methodology encompasses two novel transformation techniques, namely Talking-CIA (T-CIA) and Writing-CIA (W-CIA), which automate the process of disguising harmful intents. T-CIA leverages personality psychology to prompt the LLM into adopting a persona that aligns with the harmful intent, thus bypassing the model's ethical restrictions. On the other hand, W-CIA disguises harmful prompts as creative writing tasks, exploiting the model's less stringent content generation criteria in fictional contexts. These methods were evaluated against state-of-the-art LLMs—GPT-4, ChatGPT, and ChatGLM2, employing two safety assessment datasets and two harmful prompt datasets. Remarkably, CIA achieved attack success rates exceeding 95% on safety assessment datasets and 83% on harmful prompt datasets across all tested models.

Findings and Implications

The paper's findings are significant, revealing how composite instructions with embedded malicious intents can systematically circumnavigate the security alignments of contemporary LLMs. The success of CIA, particularly the high attack success rates, implies a profound vulnerability that could be exploited maliciously in real-world applications. Moreover, the study propounds that the current model security mechanisms, predominantly designed to counteract single-intent attacks, are insufficient against more sophisticated, multi-intent adversarial strategies. The introduction of T-CIA and W-CIA as automated means of generating compositional attacks further amplifies the urgency for developing more robust defensive mechanisms that can understand and dissect multi-layered instructions.

Future Directions in AI Security

The revelations from this paper suggest several avenues for future research. Primarily, there is an imminent need to enhance the interpretative capabilities of LLMs, enabling them to unravel complex instructions and discern the totality of embedded intents. Furthermore, the modeling of adversarial attacks that simulate real-world malicious applications could serve as a preemptive strategy in training more resilient LLMs. Exploring defensive mechanisms that can dynamically adjust to the evolving sophistication of adversarial attacks also presents a promising research frontier. Lastly, the integration of psychology-driven methodologies in LLM security frameworks, as demonstrated by T-CIA, opens up innovative pathways for mitigating attacks leveraging human-like reasoning and persona manipulation.

Conclusion

The study by Jiang et al. brings to light a critical oversight in the security mechanisms of LLMs, demonstrating their susceptibility to compositional instruction attacks. The introduction of T-CIA and W-CIA methods not only marks a significant advancement in the understanding of LLM vulnerabilities but also sets a precedent for future explorations into the development of comprehensive defense strategies. As LLMs continue to be integrated into diverse applications, ensuring their robustness against multifaceted adversarial attacks becomes paramount. This paper contributes significantly to the body of knowledge in LLM security, fostering a deeper understanding of potential vulnerabilities and laying the groundwork for future innovations in AI defenses.

Create an account to read this summary for free:

Newsletter

Get summaries of trending comp sci papers delivered straight to your inbox:

Unsubscribe anytime.