Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction (2402.18104v2)

Published 28 Feb 2024 in cs.CR and cs.AI

Abstract: In recent years, LLMs have demonstrated notable success across various tasks, but the trustworthiness of LLMs is still an open problem. One specific threat is the potential to generate toxic or harmful responses. Attackers can craft adversarial prompts that induce harmful responses from LLMs. In this work, we pioneer a theoretical foundation in LLMs security by identifying bias vulnerabilities within the safety fine-tuning and design a black-box jailbreak method named DRA (Disguise and Reconstruction Attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. We evaluate DRA across various open-source and closed-source models, showcasing state-of-the-art jailbreak success rates and attack efficiency. Notably, DRA boasts a 91.1% attack success rate on OpenAI GPT-4 chatbot.

References (43)

Citations (22)

View on Semantic Scholar

Summary

The paper presents DRA, a novel method that disguises harmful instructions and reconstructs payloads, achieving a 90% bypass rate on GPT-4.
It details a three-stage attack process involving harmful prompt disguise, payload reconstruction, and context manipulation.
The study underscores the urgent need for enhanced safety measures in LLM fine-tuning to mitigate ethical breaches and harmful outputs.

Overview of "Making Them Ask and Answer: Jailbreaking LLMs in Few Queries via Disguise and Reconstruction"

The paper "Making Them Ask and Answer: Jailbreaking LLMs in Few Queries via Disguise and Reconstruction" by Tong Liu et al. addresses a critical security issue in modern LLMs—the susceptibility to jailbreaking attacks that induce harmful outputs from these models. The authors propose a novel attack methodology named DRA (Disguise and Reconstruction Attack), which stealthily bypasses the security fine-tuning of LLMs to generate harmful responses with a high success rate.

Motivation and Background

LLMs have demonstrated significant capabilities across multiple domains but remain vulnerable to adversarial attacks that manipulate their output. This paper explores the threat of generating unintended and potentially harmful content through sophisticated prompt engineering, highlighting the limitations of current safety measures.

Methodology: DRA

The proposed methodology, DRA, capitalizes on the biases inherent in the fine-tuning process of LLMs. The method involves three stages:

Harmful Instruction Disguise: This step conceals harmful prompts using techniques like puzzle-based obfuscation and word-level character splits, ensuring the model's input filter does not perceive them as threats.
Payload Reconstruction: Leveraging prompt engineering, this stage reconstructs the disguised harmful instructions at the model’s completion segment, exploiting the fine-tuning biases where models are less safeguarded.
Context Manipulation: By carefully crafting the prompt to manipulate context, this step coaxes the model into generating the intended harmful output.

Empirical Evaluation

The DRA approach was tested on several advanced LLMs including both open-source (such as LLAMA-2 and Vicuna) and closed-source models like GPT-4. Results demonstrated a 90% success rate in GPT-4 chatbots, showcasing the strategy's efficacy. Notably, DRA achieved superior success rates with minimal queries compared to existing methods, highlighting its efficiency and adaptability.

Implications and Future Directions

The findings present profound implications for the development and deployment of LLMs, especially concerning security and ethical content generation. The demonstrated vulnerabilities highlight the need for robust defense mechanisms that extend beyond traditional safety fine-tuning. The authors suggest that future research should focus on developing comprehensive strategies to mitigate these inherent biases in LLM architectures.

DRA not only broadens the understanding of current security vulnerabilities in LLMs but also sets a new direction for enhancing AI safety. As LLM usage expands, ensuring their outputs remain beneficial and ethical becomes paramount, requiring ongoing evaluation and evolution of their safeguarding protocols.

PDF Markdown

Related Papers

YouTube

Show All Videos