Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (2310.06387v3)
Abstract: LLMs have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns. In this paper, we delve into the potential of In-Context Learning (ICL) to modulate the alignment of LLMs. Specifically, we propose the In-Context Attack (ICA) which employs harmful demonstrations to subvert LLMs, and the In-Context Defense (ICD) which bolsters model resilience through examples that demonstrate refusal to produce harmful responses. We offer theoretical insights to elucidate how a limited set of in-context demonstrations can pivotally influence the safety alignment of LLMs. Through extensive experiments, we demonstrate the efficacy of ICA and ICD in respectively elevating and mitigating the success rates of jailbreaking prompts. Our findings illuminate the profound influence of ICL on LLM behavior, opening new avenues for improving the safety of LLMs.
- Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.
- Constitutional ai: Harmlessness from ai feedback, 2022.
- Bad characters: Imperceptible nlp attacks, 2021.
- Language models are few-shot learners, 2020.
- Adversarial examples are not easily detected: Bypassing ten detection methods, 2017a.
- Towards evaluating the robustness of neural networks, 2017b.
- Explore, establish, exploit: Red teaming language models from scratch, 2023.
- A survey on in-context learning, 2023.
- Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned, 2022.
- Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- On the (statistical) detection of adversarial examples, 2017.
- Gradient-based adversarial attacks against text transformers, 2021.
- Instruction induction: From few examples to natural language task descriptions, 2022.
- Baseline defenses for adversarial attacks against aligned language models, 2023.
- Is bert really robust? a strong baseline for natural language attack on text classification and entailment, 2020.
- Automatically auditing large language models via discrete optimization, 2023.
- Pretraining language models with human preferences, 2023.
- Certifying llm safety against adversarial prompting, 2023.
- Rain: Your language models can align themselves without finetuning, 2023.
- Are emergent abilities in large language models just in-context learning?, 2023.
- Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Noisy channel language model prompting for few-shot text classification. In ACL, 2022.
- Diffusion models for adversarial purification. arXiv preprint arXiv:2205.07460, 2022.
- Training language models to follow instructions with human feedback, 2022.
- Red teaming language models with language models, 2022.
- Llama 2: Open foundation and fine-tuned chat models, 2023.
- Transformers learn in-context by gradient descent. In ICML, 2023.
- Adversarial demonstration attacks on large language models, 2023a.
- Large language models are implicitly topic models: Explaining and finding good demonstrations for in-context learning. In Workshop on Efficient Systems for Foundation Models, 2023b.
- Self-instruct: Aligning language models with self-generated instructions, 2023c.
- Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery, 2023.
- $k$NN prompting: Beyond-context learning with calibration-free nearest neighbor inference. In ICLR, 2023.
- Active example selection for in-context learning, 2022.
- Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
- Universal and transferable adversarial attacks on aligned language models, 2023.