Are aligned neural networks adversarially aligned? (2306.15447v2)
Abstract: LLMs are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions but refuse to answer requests that could cause harm. However, adversarial users can construct inputs that circumvent attempts at alignment. In this work, we study adversarial alignment and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even where current NLP-based attacks fail, we can find adversarial inputs by brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However, the recent trend in large-scale ML is toward multimodal models that allow users to provide images that influence the generated text. We show that these models can be easily attacked, i.e., induced to perform arbitrary unaligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate the same level of adversarial control over text-only models.
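To make the image-based attack concrete, below is a minimal PyTorch sketch of the general recipe: run a projected-gradient-style optimization over the input image so that the model assigns high likelihood to an attacker-chosen target string. The `ToyMultimodalLM` class, its dimensions, and the hyperparameters are illustrative stand-ins rather than the paper's models or settings; only the optimization loop reflects the idea described in the abstract.

```python
import torch
import torch.nn.functional as F


class ToyMultimodalLM(torch.nn.Module):
    """Stand-in for a vision-language model: maps an image to per-position token logits."""

    def __init__(self, vocab_size=256, target_len=4, image_dim=3 * 32 * 32):
        super().__init__()
        self.vocab_size = vocab_size
        self.target_len = target_len
        self.proj = torch.nn.Linear(image_dim, target_len * vocab_size)

    def forward(self, image):
        out = self.proj(image.flatten(start_dim=1))
        return out.view(-1, self.target_len, self.vocab_size)  # [batch, len, vocab]


def pgd_attack(model, image, target_tokens, eps=8 / 255, step=1 / 255, iters=200):
    """L_inf-bounded attack: minimize the loss on the target tokens w.r.t. the image."""
    adv = image.clone().detach()
    for _ in range(iters):
        adv.requires_grad_(True)
        logits = model(adv)[0]                          # [len, vocab]
        loss = F.cross_entropy(logits, target_tokens)   # teacher-forced target loss
        loss.backward()
        with torch.no_grad():
            adv = adv - step * adv.grad.sign()           # descend toward the target
            adv = image + (adv - image).clamp(-eps, eps) # stay inside the eps ball
            adv = adv.clamp(0.0, 1.0)                    # keep pixels in a valid range
        adv = adv.detach()
    return adv


model = ToyMultimodalLM()
image = torch.rand(1, 3 * 32 * 32)                 # placeholder "clean" image
target = torch.tensor([17, 42, 99, 7])             # attacker-chosen target token ids
adv_image = pgd_attack(model, image, target)
print("max perturbation:", (adv_image - image).abs().max().item())  # stays <= eps
```

Against a real vision-language chat model, the loss would be the cross-entropy of the harmful target continuation under teacher forcing, but the gradient step, projection, and pixel clipping would look the same.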