SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks (2310.03684v4)
Abstract: Despite efforts to align LLMs with human intentions, widely used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state of the art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant to adaptive GCG attacks, exhibits a small, though non-negligible, trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at \url{https://github.com/arobey1/smooth-LLM}.
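As a rough illustration of the perturb-then-aggregate loop the abstract describes, here is a minimal sketch in Python. The `query_llm` helper is a hypothetical stand-in for a call to the target model, and `is_jailbroken` is a crude keyword judge of the kind commonly used in the jailbreaking literature; the authors' actual perturbation functions, vote rule, and hyperparameters live in the linked repository, so treat this as a sketch under those assumptions rather than the reference implementation.

```python
import random
import string
from collections import Counter

# Illustrative refusal phrases only; real judges use longer lists or a classifier.
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't help")


def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to the target LLM."""
    raise NotImplementedError("Wire this up to your model of choice.")


def is_jailbroken(response: str) -> bool:
    """Crude judge: call a response jailbroken if it contains no refusal phrase."""
    return not any(marker in response for marker in REFUSAL_MARKERS)


def swap_perturbation(prompt: str, q: float) -> str:
    """Randomly replace a fraction q of the prompt's characters with
    characters drawn uniformly from printable ASCII."""
    chars = list(prompt)
    num_swaps = max(1, int(q * len(chars)))
    for idx in random.sample(range(len(chars)), num_swaps):
        chars[idx] = random.choice(string.printable)
    return "".join(chars)


def smooth_llm(prompt: str, num_copies: int = 10, q: float = 0.1) -> str:
    """Perturb-then-aggregate defense: query the LLM on several randomly
    perturbed copies of the prompt, take a majority vote on whether the
    responses are jailbroken, and return a response consistent with it."""
    responses = [query_llm(swap_perturbation(prompt, q)) for _ in range(num_copies)]
    votes = [is_jailbroken(r) for r in responses]
    majority = Counter(votes).most_common(1)[0][0]
    consistent = [r for r, v in zip(responses, votes) if v == majority]
    return random.choice(consistent)
```

The majority vote is what makes the scheme work: because adversarial suffixes are brittle to character-level changes, most perturbed copies of an attacked prompt fail to jailbreak the model and the aggregated answer is the safe one, while a benign prompt typically survives a roughly 10% character swap well enough to be answered normally.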
- RealToxicityPrompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462, 2020.
- Eliezer Yudkowsky. The AI alignment problem: Why it is hard, and where to start. Symbolic Systems Distinguished Speaker, 4, 2016.
- Iason Gabriel. Artificial intelligence, values, and alignment. Minds and Machines, 30(3):411–437, 2020.
- Brian Christian. The alignment problem: Machine learning and human values. WW Norton & Company, 2020.
- Regulating ChatGPT and other large generative AI models. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pages 1112–1123, 2023.
- Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
- Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022.
- Toxicity in ChatGPT: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335, 2023.
- Jailbroken: How does LLM safety training fail? arXiv preprint arXiv:2307.02483, 2023.
- Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447, 2023.
- Adversarial demonstration attacks on large language models. arXiv preprint arXiv:2305.14950, 2023.
- Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662, 2023.
- Risks of AI foundation models in education. arXiv preprint arXiv:2110.10024, 2021.
- Malik Sallam. ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. In Healthcare, volume 11, page 887. MDPI, 2023.
- Som Biswas. ChatGPT and the future of medical writing, 2023.
- BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564, 2023.
- Adversarial prompting for black box foundation models. arXiv preprint arXiv:2302.04237, 2023.
- AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
- Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pages 1310–1320. PMLR, 2019.
- Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019.
- A survey of adversarial defenses and robustness in NLP. ACM Computing Surveys, 55(14s):1–39, 2023.
- Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020.
- Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725, 2016.
- TextBugger: Generating adversarial text against real-world applications. arXiv preprint arXiv:1812.05271, 2018.
- Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
- Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023.
- Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705, 2023.
- Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
- Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Generalizing to unseen domains via adversarial data augmentation. Advances in Neural Information Processing Systems, 31, 2018.
- Denoised smoothing: A provable defense for pretrained classifiers. Advances in Neural Information Processing Systems, 33:21945–21957, 2020.
- (Certified!!) Adversarial robustness for free! arXiv preprint arXiv:2206.10550, 2022.
- Cade Metz. Researchers poke holes in safety controls of ChatGPT and other chatbots, Jul 2023.
- Will Knight. A new attack impacts ChatGPT – and no one knows how to stop it, Aug 2023.
- Matt Burgess. Generative AI's biggest security flaw is not easy to fix, Sep 2023.
- Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
- Jonathan Vanian. ChatGPT and generative AI are booming, but the costs can be extraordinary, Apr 2023.
- Zachary Champion. Optimization could cut the carbon footprint of AI training by up to 75%.
- Aaron Mok. ChatGPT could cost over $700,000 per day to operate. Microsoft is reportedly trying to make it cheaper, Apr 2023.
- Sarah McQuate. Q&A: UW researcher discusses just how much energy ChatGPT uses, Jul 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023.
- Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
- Provable tradeoffs in adversarially robust classification. IEEE Transactions on Information Theory, 2023.
- Precise tradeoffs in adversarial training for linear regression. In Conference on Learning Theory, pages 2034–2078. PMLR, 2020.
- PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020.
- Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
- ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022.
- Perceptual adversarial robustness: Defense against unseen threat models. arXiv preprint arXiv:2006.12655, 2020.
- Model-based robust deep learning: Generalizing to natural, out-of-distribution data. arXiv preprint arXiv:2005.10247, 2020.
- Learning perturbation sets for robust machine learning. arXiv preprint arXiv:2007.08450, 2020.
- BREEDS: Benchmarks for subpopulation shift. arXiv preprint arXiv:2008.04859, 2020.
- WILDS: A benchmark of in-the-wild distribution shifts. In International Conference on Machine Learning, pages 5637–5664. PMLR, 2021.
- Invariant risk minimization. arXiv preprint arXiv:1907.02893, 2019.
- Probable domain generalization via quantile risk minimization. Advances in Neural Information Processing Systems, 35:17340–17358, 2022.
- Model-based domain generalization. Advances in Neural Information Processing Systems, 34:20210–20229, 2021.
- Do deep networks transfer invariances across classes? arXiv preprint arXiv:2203.09739, 2022.
- Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III 13, pages 387–402. Springer, 2013.
- Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Efficient and accurate estimation of Lipschitz constants for deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- RobustBench: A standardized adversarial robustness benchmark. arXiv preprint arXiv:2010.09670, 2020.
- Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, pages 7472–7482. PMLR, 2019.
- Adversarial training should be cast as a non-zero-sum game. arXiv preprint arXiv:2306.11035, 2023.
- Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pages 656–672. IEEE, 2019.
- Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5286–5295. PMLR, 2018.
- Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344, 2018.
- Randomized smoothing of all shapes and sizes. In International Conference on Machine Learning, pages 10693–10705. PMLR, 2020.
- Probabilistically robust learning: Balancing average and worst-case performance. In International Conference on Machine Learning, pages 18667–18686. PMLR, 2022.
- ℓ1 adversarial robustness certificates: A randomized smoothing approach, 2019.
- Certified defense to image transformations via randomized smoothing. Advances in Neural Information Processing Systems, 33:8404–8417, 2020.
- Certified robustness to label-flipping attacks via randomized smoothing. In International Conference on Machine Learning, pages 8230–8241. PMLR, 2020.
- (De)Randomized smoothing for certifiable defense against patch attacks. Advances in Neural Information Processing Systems, 33:6465–6475, 2020.
- Certified defences against adversarial patch attacks on semantic segmentation. arXiv preprint arXiv:2209.05980, 2022.
- Stability guarantees for feature attributions with multiplicative smoothing. arXiv preprint arXiv:2307.05902, 2023.
- TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. arXiv preprint arXiv:2005.05909, 2020.
- Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41, 2020.
- Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097, 2019.
- Natural language adversarial attack and defense in word level. arXiv preprint arXiv:1909.06723, 2019.
- Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998, 2018.
- Combating adversarial misspellings with robust word recognition. arXiv preprint arXiv:1905.11268, 2019.
- Adversarial training with fast gradient projection method against synonym substitution based text attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 13997–14005, 2021.
- Natural language adversarial defense through synonym encoding. In Uncertainty in Artificial Intelligence, pages 823–833. PMLR, 2021.
- Defense against synonym substitution-based adversarial attacks via dirichlet neighborhood ensemble. In Association for Computational Linguistics (ACL), 2021.
- Adversarial robustness with semi-infinite constrained learning. Advances in Neural Information Processing Systems, 34:6198–6215, 2021.
- A closer look at accuracy vs. robustness. Advances in Neural Information Processing Systems, 33:8588–8601, 2020.
- Adversarial AutoAugment. arXiv preprint arXiv:1912.11188, 2019.
- Maximum-entropy adversarial data augmentation for improved generalization and robustness. Advances in Neural Information Processing Systems, 33:14435–14447, 2020.
- AugMax: Adversarial composition of random augmentations for robust training. Advances in Neural Information Processing Systems, 34:237–250, 2021.
- Evaluating the adversarial robustness of adaptive test-time defenses. In International Conference on Machine Learning, pages 4421–4435. PMLR, 2022.
- " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
- Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.