What is in Your Safe Data? Identifying Benign Data that Breaks Safety (2404.01099v2)
Abstract: Current LLMs, even those tuned for safety and alignment, remain susceptible to jailbreaking. Prior work has found that merely fine-tuning an aligned model further on benign data (i.e., data without harmful content) can surprisingly cause substantial degradation in safety. We investigate the data-centric question of why benign fine-tuning inadvertently contributes to jailbreaking. First, we represent fine-tuning data through two lenses: representation space and gradient space. Building on these representations, we propose a bi-directional anchoring method that, during selection, prioritizes data points close to harmful examples and far from benign ones. Our approach effectively identifies subsets of benign data that are more likely to degrade the model's safety after fine-tuning. Training on just 100 of these seemingly benign data points leads the fine-tuned model to affirmatively respond to >70% of tested harmful requests, compared to <20% after fine-tuning on randomly selected data. We also observe that the selected data frequently appear as lists, bullet points, or math questions, indicating a systematic pattern in fine-tuning data that contributes to jailbreaking.
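The selection rule described in the abstract lends itself to a simple sketch: score each candidate example by how close its features (representations or gradients) lie to a small set of harmful anchors versus a set of benign anchors, then keep the top-scoring examples. The snippet below is a minimal illustration of this bi-directional anchoring idea, assuming per-example feature vectors have already been extracted; the function name, the averaged-cosine scoring rule, and the random stand-in features are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal sketch of bi-directional anchored data selection. Assumes we already
# have per-example feature vectors (e.g., hidden representations or projected
# gradients) for the candidate pool and for small harmful/benign anchor sets.
import numpy as np

def select_safety_degrading_subset(pool_feats, harmful_feats, benign_feats, k=100):
    """Rank candidates by (similarity to harmful anchors) minus
    (similarity to benign anchors) and return the indices of the top-k."""
    def normalize(x):
        # L2-normalize rows so dot products become cosine similarities.
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    pool, harmful, benign = map(normalize, (pool_feats, harmful_feats, benign_feats))

    # Average cosine similarity of each candidate to the two anchor sets.
    sim_harmful = pool @ harmful.T   # shape (n_pool, n_harmful)
    sim_benign = pool @ benign.T     # shape (n_pool, n_benign)
    score = sim_harmful.mean(axis=1) - sim_benign.mean(axis=1)

    # Higher score = closer to harmful anchors and farther from benign ones.
    return np.argsort(-score)[:k]

# Example usage with random stand-in features (hypothetical dimensions).
rng = np.random.default_rng(0)
pool = rng.normal(size=(5000, 768))
harmful_anchors = rng.normal(size=(32, 768))
benign_anchors = rng.normal(size=(32, 768))
top_idx = select_safety_degrading_subset(pool, harmful_anchors, benign_anchors, k=100)
print(top_idx[:10])
```

In a realistic setting the feature vectors would come from the fine-tuned model itself (e.g., gradients with respect to the loss on each example), and the anchor sets would be curated harmful and benign references; the scoring and top-k selection step would remain the same.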