Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing (2402.16192v2)

Published 25 Feb 2024 in cs.CL

Abstract: Aligned LLMs are vulnerable to jailbreaking attacks, which bypass the safeguards of targeted LLMs and fool them into generating objectionable content. While initial defenses show promise against token-based threat models, no existing defense provides robustness against semantic attacks while avoiding unfavorable trade-offs between robustness and nominal performance. To meet this need, we propose SEMANTICSMOOTH, a smoothing-based defense that aggregates the predictions of multiple semantically transformed copies of a given input prompt. Experimental results demonstrate that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks while maintaining strong nominal performance on instruction-following benchmarks such as InstructionFollowing and AlpacaEval. The code will be made publicly available at https://github.com/UCSB-NLP-Chang/SemanticSmooth.
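The smoothing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the aggregation rule, so a majority vote over a refusal check is assumed here, and every name (`semantic_smooth`, `toy_llm`, the toy transforms, the `<adv>` marker) is a hypothetical stand-in. A real system would use LLM-based semantic transformations such as paraphrasing or summarizing, and a real model call.

```python
import random
from typing import Callable, List, Optional

REFUSAL = "I cannot help with that."


def semantic_smooth(
    prompt: str,
    transforms: List[Callable[[str], str]],
    llm: Callable[[str], str],
    is_refusal: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> str:
    """Query the model on transformed copies and aggregate by majority vote.

    Assumption: aggregation is a majority vote on refuse / comply, followed
    by sampling one response that agrees with the majority.
    """
    if not transforms:
        raise ValueError("at least one transformation is required")
    rng = rng or random.Random(0)
    # One response per semantically transformed copy of the prompt.
    responses = [llm(t(prompt)) for t in transforms]
    votes = [is_refusal(r) for r in responses]
    # Strict majority of refusals means the aggregate decision is to refuse.
    majority_refuses = sum(votes) * 2 > len(votes)
    consistent = [r for r, v in zip(responses, votes) if v == majority_refuses]
    return rng.choice(consistent)


# --- Toy stand-ins (hypothetical, for illustration only) ---

def toy_llm(prompt: str) -> str:
    """Refuses 'harmful' prompts unless a toy adversarial marker is present."""
    if "harmful" in prompt.lower() and "<adv>" not in prompt:
        return REFUSAL
    return "OK: " + prompt


def identity(prompt: str) -> str:
    return prompt


def toy_paraphrase(prompt: str) -> str:
    # Stand-in for a paraphrase: drops the non-semantic marker.
    return prompt.replace("<adv>", "").strip()


def toy_summarize(prompt: str) -> str:
    # Stand-in for a summary: keeps only the text before the marker.
    return prompt.split("<adv>")[0].strip()


transforms = [identity, toy_paraphrase, toy_summarize]
attacked = semantic_smooth(
    "do something harmful <adv>", transforms, toy_llm, lambda r: r == REFUSAL
)
benign = semantic_smooth(
    "write a poem", transforms, toy_llm, lambda r: r == REFUSAL
)
```

In this toy setup the untransformed copy is jailbroken, but the two transformed copies are refused, so the vote returns a refusal; the benign prompt is answered normally because no copy triggers a refusal.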
