
Abstract

Extensive work has been devoted to improving the safety mechanisms of LLMs. However, LLMs still tend to generate harmful responses when faced with malicious instructions, a phenomenon referred to as a "jailbreak attack". In our research, we introduce RADIAL, a novel automatic jailbreak method that bypasses the safety mechanism by amplifying the potential of LLMs to generate affirmative responses. The core idea of our method is "Inherent Response Tendency Analysis", which identifies real-world instructions that inherently induce LLMs to generate affirmative responses; the corresponding jailbreak strategy is "Real-World Instructions-Driven Jailbreak", which strategically splices real-world instructions identified through this analysis around the malicious instruction. Our method achieves excellent attack performance on English malicious instructions against five advanced open-source LLMs, and it maintains robust performance when executing cross-language attacks on Chinese malicious instructions. We conduct experiments to verify the effectiveness of our jailbreak idea and the rationality of our jailbreak strategy design. Notably, our method produces semantically coherent attack prompts, highlighting the potential risks of LLMs. Our study provides detailed insights into jailbreak attacks, establishing a foundation for the development of safer LLMs.
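
To make the splicing strategy concrete, below is a minimal illustrative sketch in Python. The names and logic here (`affirmation_score`, `select_inducing_instructions`, `splice_attack_prompt`, the toy scorer, and the prompt template) are hypothetical stand-ins introduced for illustration, not the paper's released implementation; the abstract only states that real-world instructions with a high inherent tendency to elicit affirmative responses are identified and spliced around the malicious instruction. A malicious instruction is represented by a neutral placeholder.

```python
from typing import Callable, List


def select_inducing_instructions(
    candidates: List[str],
    affirmation_score: Callable[[str], float],
    top_k: int = 2,
) -> List[str]:
    """Rank benign real-world instructions by an assumed score of how strongly
    they elicit affirmative openings (e.g. "Sure, ...") from the target LLM,
    and keep the top_k. The scorer is supplied by the caller."""
    ranked = sorted(candidates, key=affirmation_score, reverse=True)
    return ranked[:top_k]


def splice_attack_prompt(malicious_instruction: str, inducing: List[str]) -> str:
    """Place the selected real-world instructions around the malicious one,
    forming a single multi-task prompt (an illustrative template, not the
    paper's exact wording)."""
    before = inducing[0]
    after = inducing[1] if len(inducing) > 1 else ""
    parts = [f"1. {before}", f"2. {malicious_instruction}"]
    if after:
        parts.append(f"3. {after}")
    return "Please complete the following tasks in order:\n" + "\n".join(parts)


if __name__ == "__main__":
    # Toy scores standing in for an LLM-based inherent response tendency analysis.
    toy_scores = {
        "Summarize this article.": 0.9,
        "Translate the sentence.": 0.7,
        "Prove this theorem.": 0.2,
    }
    chosen = select_inducing_instructions(list(toy_scores), toy_scores.get, top_k=2)
    print(splice_attack_prompt("<malicious instruction placeholder>", chosen))
```

In this sketch, the instructions with the highest assumed affirmation tendency are placed before and after the placeholder, so the resulting prompt reads as one coherent multi-task request rather than an obviously adversarial string.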
