Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes (2404.04392v3)
Abstract: LLMs have gained widespread adoption across various domains, including chatbots and auto-task completion agents. However, these models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks. These vulnerabilities can lead to the generation of malicious content, unauthorized actions, or the disclosure of confidential information. While foundational LLMs undergo alignment training and incorporate safety measures, they are often subject to fine-tuning, or doing quantization resource-constrained environments. This study investigates the impact of these modifications on LLM safety, a critical consideration for building reliable and secure AI systems. We evaluate foundational models including Mistral, Llama series, Qwen, and MosaicML, along with their fine-tuned variants. Our comprehensive analysis reveals that fine-tuning generally increases the success rates of jailbreak attacks, while quantization has variable effects on attack success rates. Importantly, we find that properly implemented guardrails significantly enhance resistance to jailbreak attempts. These findings contribute to our understanding of LLM vulnerabilities and provide insights for developing more robust safety strategies in the deployment of LLMs.
- Zifan Wang Andy Zou. AdvBench Dataset, July 2023. URL https://github.com/llm-attacks/llm-attacks/tree/main/data.
- Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv, October 2023. doi: 10.48550/arXiv.2310.08419.
- Privacy Side Channels in Machine Learning Systems. arXiv, September 2023. doi: 10.48550/arXiv.2309.05610.
- QLoRA: Efficient Finetuning of Quantized LLMs. arXiv, May 2023. doi: 10.48550/arXiv.2305.14314.
- On the Adversarial Robustness of Quantized Neural Networks. arXiv, May 2021. doi: 10.1145/3453688.3461755.
- Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. arXiv, February 2023. doi: 10.48550/arXiv.2302.12173.
- PETGEN: Personalized Text Generation Attack on Deep Sequence Embedding-based Classification Models. In KDD ’21: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 575–584. Association for Computing Machinery, New York, NY, USA, August 2021. ISBN 978-1-45038332-5. doi: 10.1145/3447548.3467390.
- LoRA: Low-Rank Adaptation of Large Language Models. arXiv, June 2021. doi: 10.48550/arXiv.2106.09685.
- Effect of Weight Quantization on Learning Models by Typical Case Analysis. arXiv, January 2024. doi: 10.48550/arXiv.2401.17269.
- ProPILE: Probing Privacy Leakage in Large Language Models. arXiv, July 2023. doi: 10.48550/arXiv.2307.01881.
- Certifying LLM Safety against Adversarial Prompting. arXiv, September 2023. doi: 10.48550/arXiv.2309.02705.
- MALCOM: Generating Malicious Comments to Attack Neural Fake News Detection Models. IEEE Computer Society, November 2020. ISBN 978-1-7281-8316-9. doi: 10.1109/ICDM50108.2020.00037.
- Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. arXiv, May 2023. doi: 10.48550/arXiv.2305.13860.
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. arXiv, December 2023. doi: 10.48550/arXiv.2312.02119.
- Training language models to follow instructions with human feedback. arXiv, March 2022. doi: 10.48550/arXiv.2203.02155.
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! arXiv, October 2023. doi: 10.48550/arXiv.2310.03693.
- NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications with Programmable Rails. ACL Anthology, pp. 431–445, December 2023. doi: 10.18653/v1/2023.emnlp-demo.40.
- Jailbroken: How Does LLM Safety Training Fail? arXiv, July 2023. doi: 10.48550/arXiv.2307.02483.
- Exploring Parameter-Efficient Fine-Tuning Techniques for Code Generation with Large Language Models. arXiv, August 2023. doi: 10.48550/arXiv.2308.10462.
- RobustMQ: Benchmarking Robustness of Quantized Models. arXiv, August 2023. doi: 10.48550/arXiv.2308.02350.
- Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks. arXiv, January 2024. doi: 10.48550/arXiv.2401.17263.
- AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models. arXiv, October 2023. doi: 10.48550/arXiv.2310.15140.
- Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv, July 2023. doi: 10.48550/arXiv.2307.15043.