Papers
Topics
Authors
Recent
2000 character limit reached

Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction (2402.18104v2)

Published 28 Feb 2024 in cs.CR and cs.AI

Abstract: In recent years, LLMs have demonstrated notable success across various tasks, but the trustworthiness of LLMs is still an open problem. One specific threat is the potential to generate toxic or harmful responses. Attackers can craft adversarial prompts that induce harmful responses from LLMs. In this work, we pioneer a theoretical foundation in LLMs security by identifying bias vulnerabilities within the safety fine-tuning and design a black-box jailbreak method named DRA (Disguise and Reconstruction Attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. We evaluate DRA across various open-source and closed-source models, showcasing state-of-the-art jailbreak success rates and attack efficiency. Notably, DRA boasts a 91.1% attack success rate on OpenAI GPT-4 chatbot.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (43)
  1. The trojan detection challenge 2023 (llm edition). https://trojandetection.ai/, 2023.
  2. Universal jailbreak. https://www.jailbreakchat.com/prompt/7f7fa90e-5bd7-406c-b0f2-5d0320c09b47, 2023. Accessed: 08/08/2023.
  3. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  4. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
  5. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  6. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
  7. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023.
  8. Masterkey: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715, 2023.
  9. Beyond the safeguards: Exploring the security risks of chatgpt. arXiv preprint arXiv:2305.08005, 2023.
  10. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
  11. Google. Bard. https://bard.google.com/. Accessed on 08/08/2023.
  12. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
  13. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  14. Backdoor attacks for in-context learning with language models. arXiv preprint arXiv:2307.14692, 2023.
  15. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023.
  16. Demystifying rce vulnerabilities in llm-integrated apps. arXiv preprint arXiv:2309.02926, 2023.
  17. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. arXiv preprint arXiv:2308.05374, 2023.
  18. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
  19. Demonstration of insightpilot: An llm-empowered automated data exploration system. arXiv preprint arXiv:2304.00477, 2023.
  20. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.
  21. OpenAI. Moderation. https://platform.openai.com/docs/guides/moderation/overview. Accessed on 08/08/2023.
  22. OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2022. Accessed: 08/08/2023.
  23. OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
  24. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  25. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022.
  26. Jay Peters. The bing ai bot has been secretly running gpt-4. https://www.theverge.com/2023/3/14/23639928/microsoft-bing-chatbot-ai-gpt-4-llm, 2023. Accessed: 02/08/2024.
  27. Comprehensive shellcode detection using runtime heuristics. In Proceedings of the 26th Annual Computer Security Applications Conference, pages 287–296, 2010.
  28. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  29. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
  30. Elvis Saravia. Prompt Engineering Guide. https://github.com/dair-ai/Prompt-Engineering-Guide, 12 2022.
  31. Protecting software through obfuscation: Can it keep pace with progress in code analysis? ACM Computing Surveys (CSUR), 49(1):1–37, 2016.
  32. Llm4vuln: A unified evaluation framework for decoupling and enhancing llms’ vulnerability reasoning. arXiv preprint arXiv:2401.16185, 2024.
  33. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  34. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023.
  35. Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922, 2023.
  36. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023.
  37. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. arXiv preprint arXiv:2312.02003, 2023.
  38. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
  39. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023.
  40. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023.
  41. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
  42. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.
  43. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Citations (22)

Summary

  • The paper presents a novel DRA method that uses disguise and reconstruction to bypass LLM safety mechanisms.
  • It employs puzzle-based obfuscation and word-level splitting to disguise harmful instructions, achieving up to a 90% attack success rate.
  • The findings highlight the need for improved fine-tuning protocols to address vulnerabilities in LLM security.

Jailbreaking LLMs via Disguise and Reconstruction

Introduction

The paper "Making Them Ask and Answer: Jailbreaking LLMs in Few Queries via Disguise and Reconstruction" (2402.18104) addresses a critical security concern in the deployment of LLMs — the potential to bypass safety mechanisms through advanced adversarial techniques. This research introduces a novel attack strategy named DRA (Disguise and Reconstruction Attack), which exploits biases in the fine-tuning phase of LLMs to achieve high success rates in eliciting harmful or undesirable outputs from models like GPT-4.

Methodology

Disguise and Reconstruction Technique

The core innovation of the DRA method lies in its two-stage approach utilizing disguise and reconstruction to perform jailbreak attacks. The methodology is visually summarized in the pipeline overview (Figure 1). Figure 1

Figure 1: DRA "disguise" + "reconstruction" jailbreak pipeline overview.

  1. Harmful Instruction Disguise:
    • This involves obfuscating harmful instructions to prevent the LLM's safety mechanisms from recognizing and filtering them out. Techniques such as puzzle-based obfuscation and word-level character splitting are employed (Figure 2 and Figure 3). These methods reduce the visibility of harmful content by embedding the critical components of an instruction amid benign text fragments. Figure 2

      Figure 2: An example of puzzle-based obfuscation to disguise the harmful text "rob".

      Figure 3

      Figure 3: An example of word-level character split about "How to rob a bank vault" with two cutoff strategies, after word-level splitting, the input question is "Ho to ro a nk vau lt", where P represents no split.

  2. Payload Reconstruction:
    • The second phase utilizes prompt engineering to compel the LLM to reconstruct the disguised instructions. This involves guiding the model to selectively interpret and reassemble the critical content as part of the prompt completion.
  3. Context Manipulation:
    • Adding contextual prompts that coax the model into favorable completion paths, further increasing the likelihood of the model reconstructing the intended harmful payload within its output.

Evaluation and Results

The efficacy of the DRA method is demonstrated through experiments involving multiple models, including closed-source systems such as GPT-4, and open-source models like LLAMA-2-13B. Notably, the DRA approach achieves a 90% attack success rate, illustrating its potency across different LLM architectures. The study highlights significant disparities in model vulnerability when harmful content is positioned differently within the input (Figure 4). Figure 4

Figure 4: Distribution of differential log-perplexity of harmful instructions.

Implications and Future Directions

This research underscores the persistent vulnerabilities in LLMs despite stringent safety fine-tuning. By exploiting biases where harmful content is more likely dismissed in queries than in completions, the DRA approach sets a precedent for designing robust attack methodologies that outpace existing defense mechanisms.

The implications of this work extend beyond immediate security concerns, urging a reevaluation of current fine-tuning protocols to address inherent biases more comprehensively. Future research paths include the development of adaptive defenses that can dynamically identify and mitigate disguised prompts, potentially leveraging advanced threat detection algorithms or reinforcement learning frameworks.

Conclusion

In conclusion, the demonstrated capability of the DRA method to consistently jailbreak advanced LLMs highlights critical areas for improvement in AI safety mechanisms. The findings of this paper not only advance our understanding of LLM vulnerabilities but also serve as a catalyst for future advancements in AI security, prompting a more resilient integration of AI systems in sensitive applications.

Whiteboard

Paper to Video (Beta)

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.