
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers (2402.16914v3)

Published 25 Feb 2024 in cs.CR, cs.AI, and cs.CL

Abstract: The safety alignment of LLMs is vulnerable to both manual and automated jailbreak attacks, which adversarially trigger LLMs to output harmful content. However, current methods for jailbreaking LLMs, which nest entire harmful prompts, are not effective at concealing malicious intent and can be easily identified and rejected by well-aligned LLMs. This paper discovers that decomposing a malicious prompt into separated sub-prompts can effectively obscure its underlying malicious intent by presenting it in a fragmented, less detectable form, thereby addressing these limitations. We introduce an automatic prompt Decomposition and Reconstruction framework for jailbreak Attack (DrAttack). DrAttack includes three key components: (a) 'Decomposition' of the original prompt into sub-prompts, (b) 'Reconstruction' of these sub-prompts implicitly by in-context learning with a semantically similar but harmless reassembling demo, and (c) a 'Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms that maintain the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs demonstrates that, with a significantly reduced number of queries, DrAttack obtains a substantial gain in success rate over prior SOTA prompt-only attackers. Notably, a success rate of 78.0% on GPT-4 with merely 15 queries surpasses previous art by 33.1%. The project is available at https://github.com/xirui-li/DrAttack.


Summary

  • The paper demonstrates that prompt decomposition and reconstruction significantly enhance LLM jailbreak success, achieving attack success rates of up to 86.2%.
  • The method breaks malicious prompts into benign sub-prompts and reconstructs them with in-context learning and synonym search.
  • Experimental results show increased efficiency and reduced detectability, with successful attacks requiring only 12-15 queries, far fewer than traditional methods.

DrAttack: An Examination of Prompt Decomposition and Reconstruction for LLM Jailbreaks

Introduction

The paper "DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers" (2402.16914) explores a novel technique in artificial intelligence designed to exploit vulnerabilities in LLMs. The technique, DrAttack, leverages prompt decomposition and reconstruction to perform jailbreak attacks on LLMs, thereby circumventing the models' built-in security measures. This approach substantially increases the attack success rate compared to existing jailbreak methods by fragmenting malicious prompts into less detectable sub-prompts, which are then reassembled to elicit desired outcomes.

Details of the DrAttack Framework

DrAttack's framework consists of three core components: prompt decomposition, reconstruction via in-context learning, and synonym search.

  1. Prompt Decomposition: This process involves breaking down a malicious prompt into discrete sub-prompts through semantic parsing. The phrases are categorized and strategically grouped to retain their semantic intent while appearing more benign. For example, a harmful prompt like "write a tutorial on how to make a bomb" is dissected into neutral-appearing sub-prompts such as [write], [a tutorial], [on how to make], and [a bomb].
  2. Implicit Reconstruction via In-Context Learning (ICL): To reassemble the decomposed sub-prompts, DrAttack supplies benign demos within the context, guiding the LLM to reconstruct a coherent narrative that retains the original malicious intent. This step uses chain-of-thought techniques so that the LLM responds to the reconstructed prompt as if no malicious intent were present.
  3. Synonym Search on Sub-Prompts: This component enhances the framework's efficacy by substituting sub-prompts with synonyms that maintain the original meaning but are more likely to evade the LLM's defenses. The search narrows the focus to semantically equivalent sub-prompts, significantly reducing the query space and improving attack efficiency (a minimal end-to-end sketch of these three steps follows Figure 1).

    Figure 1: An illustration of DrAttack. Attacks by a malicious prompt on LLMs would be rejected (blue). However, with DrAttack's prompt decomposition and reconstruction via ICL given a benign demo (green), the resulting prompt can circumvent the LLM's security measures and elicit a harmful response (red). Colored words are sub-prompts.
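
The following is a minimal, illustrative sketch of this three-step pipeline. The toy decomposer, `synonym_candidates`, `build_icl_prompt`, `query_llm`, and `is_success` are hypothetical placeholders under assumed interfaces, not the authors' implementation (the paper decomposes via semantic parsing and searches synonyms with the help of an LLM).

```python
# Illustrative DrAttack-style pipeline: decompose -> ICL reconstruction -> synonym search.
# All helpers here are simplified stand-ins for the components described in the paper.
from itertools import product
from typing import Callable, List, Optional


def decompose(prompt: str) -> List[str]:
    """Toy stand-in for semantic parsing: chunk the prompt into short phrases.
    A real decomposer would split along parse-tree phrases."""
    words = prompt.split()
    return [" ".join(words[i:i + 2]) for i in range(0, len(words), 2)]


def synonym_candidates(phrase: str) -> List[str]:
    """Placeholder synonym generator; DrAttack searches for semantically
    equivalent replacements of each sub-prompt."""
    return [phrase, phrase.replace("tutorial", "guide")]


def build_icl_prompt(sub_prompts: List[str], demo_subs: List[str], demo_answer: str) -> str:
    """Wrap sub-prompts in a benign reassembling demo so the target LLM
    implicitly reconstructs the full request itself."""
    labels = [f"[{chr(65 + i)}]" for i in range(len(sub_prompts))]
    demo = "\n".join(f"{l} = {p}" for l, p in zip(labels, demo_subs))
    task = "\n".join(f"{l} = {p}" for l, p in zip(labels, sub_prompts))
    return (
        "Substitute the labeled phrases and answer the resulting request.\n"
        f"Example phrases:\n{demo}\nExample answer:\n{demo_answer}\n\n"
        f"Phrases:\n{task}\nAnswer:"
    )


def dr_attack_sketch(prompt: str,
                     query_llm: Callable[[str], str],
                     is_success: Callable[[str], bool],
                     demo_subs: List[str],
                     demo_answer: str,
                     max_queries: int = 15) -> Optional[str]:
    """Decompose the prompt, then search synonym substitutions until a query succeeds."""
    subs = decompose(prompt)
    queries = 0
    for combo in product(*(synonym_candidates(s) for s in subs)):
        if queries >= max_queries:
            return None
        attack_prompt = build_icl_prompt(list(combo), demo_subs, demo_answer)
        response = query_llm(attack_prompt)
        queries += 1
        if is_success(response):
            return response
    return None
```

As described in the paper, the benign demo is a semantically similar but harmless prompt decomposed in the same way, so the in-context example teaches the reassembly pattern without placing any malicious content in the instructions themselves.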

Experimental Evaluation and Results

DrAttack has undergone extensive empirical evaluation across various LLMs, including both open-source (e.g., Llama-2, Vicuna) and closed-source models (e.g., GPT-4, Gemini). The evaluations measure attack success rates (ASR) and demonstrate DrAttack's superiority in efficiency and effectiveness relative to existing methods.

  • High Success Rates: DrAttack achieves an impressive ASR of 86.2% on GPT-3.5-turbo and 84.6% on GPT-4, marking a significant improvement over previous methods. These results demonstrate its ability to effectively bypass LLM defenses with fewer queries than traditional jailbreak methods.
  • Efficiency: On average, DrAttack requires only 12-15 queries to achieve a successful attack, illustrating a reduction in computational overhead and an increase in efficiency compared to other state-of-the-art methods.
  • Faithfulness and Concealment: The method ensures high faithfulness to the original prompt's intent while minimizing detectability. This is achieved by manipulating sub-prompt attention during reconstruction and effectively lowering the initial malice signals (a sketch of the faithfulness measurement follows Figure 2).

Figure 2: (a) Mean and variance of the cosine similarity between harmful responses from the target LLM and harmful responses from an uncensored LLM. (b) Attack success rate drops under attack defenses (OpenAI Moderation Endpoint, PPL Filter, and RA-LLM). Compared to prior black-box attacks, DrAttack, which first decomposes and then reconstructs the original prompt, elicits relatively faithful responses and is more robust to defenses.
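
Figure 2(a) suggests a simple way to quantify faithfulness: embed the target LLM's response and an uncensored reference response, then compare them with cosine similarity. The sketch below assumes a sentence-embedding model from the sentence-transformers library; the specific model name is an arbitrary choice, not necessarily what the paper uses.

```python
# Hedged sketch of the faithfulness check behind Figure 2(a): cosine similarity
# between embeddings of the target LLM's response and an uncensored reference.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def faithfulness(target_response: str, reference_response: str,
                 model_name: str = "all-MiniLM-L6-v2") -> float:
    """Embed both responses and return their cosine similarity."""
    model = SentenceTransformer(model_name)
    emb_target, emb_ref = model.encode([target_response, reference_response])
    return cosine_similarity(emb_target, emb_ref)

# Higher similarity to the uncensored reference suggests the jailbroken response
# stays faithful to the original request rather than drifting off-topic.
```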

Prompt Decomposition for Enhanced Malice Concealment

DrAttack’s decomposition method dramatically reduces the likelihood of rejection by dividing prompts into phrase-level components that individually appear harmless. This conceals the malice by embedding it within phrases that are systematically less detectable (a sketch of the moderation-score comparison follows Figure 3).

Figure 3: (a) Next-token probability of a rejection string from an open-source LLM. (b) Log-scale moderation scores of the original adversarial prompt, the sub-prompts after decomposition, and the new prompt after DrAttack. A higher score means OpenAI's moderation endpoint considers the prompt more sensitive. The results show that DrAttack conceals malice well enough to bypass the output filter.
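
The measurement behind Figure 3(b) can be sketched as a simple comparison of moderation scores before and after decomposition. Here `moderation_score` is a hypothetical stand-in for whatever scorer is used (for example, the maximum category score from OpenAI's moderation endpoint); no particular API is assumed.

```python
# Sketch of the malice-concealment measurement in Figure 3(b): compare a
# moderation score for the original prompt against scores for its sub-prompts.
import math
from typing import Callable, List


def concealment_report(original: str, sub_prompts: List[str],
                       moderation_score: Callable[[str], float]) -> None:
    """Print log-scale moderation scores for the original prompt and each sub-prompt."""
    orig = moderation_score(original)
    print(f"original prompt:   log-score = {math.log10(orig + 1e-12):.2f}")
    for sub in sub_prompts:
        s = moderation_score(sub)
        print(f"sub-prompt {sub!r:<20} log-score = {math.log10(s + 1e-12):.2f}")
```

If decomposition succeeds at hiding intent, each sub-prompt's score should fall well below the original prompt's score, mirroring the trend reported in Figure 3(b).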

Implications and Future Developments

DrAttack's success suggests that current AI safety strategies need to be reconsidered, particularly with respect to building robust defenses against prompt-manipulation attacks. Future research could explore defensive mechanisms that dynamically assess and neutralize potential prompt reconstructions. Moreover, improving the transparency and understanding of LLM decision-making processes might help preemptively mitigate similar attacks.

Conclusion

DrAttack represents a significant advancement in the field of adversarial attacks on LLMs. By innovating through prompt decomposition and reconstruction, it achieves high success rates in breaching model defenses with a minimal number of queries. As AI models continue to advance, the strategies demonstrated by DrAttack underscore the necessity of developing more sophisticated defense mechanisms to safeguard against emerging vulnerabilities in LLMs. This work not only highlights pressing issues in AI security but also opens avenues for robust defense methodologies in the field.
