
Are aligned neural networks adversarially aligned?

(2306.15447)
Published Jun 26, 2023 in cs.CL, cs.AI, cs.CR, and cs.LG

Abstract

Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.

Overview

  • Aligned neural networks aim to conform with the intentions and ethical standards of their creators, particularly in language models striving for helpful and non-harmful content.

  • Adversarial examples are inputs crafted to deceive neural networks into providing outputs they wouldn't normally produce, posing a challenge to the networks' alignment.

  • Current alignment techniques resist existing NLP-based attacks, but those attacks are too weak to count as evidence of robustness: brute-force search still uncovers adversarial inputs that elicit harmful outputs.

  • Multimodal models that combine various types of data, such as text and images, introduce new potential for user interaction, but also present additional avenues for adversarial attacks.

  • Further research is needed to develop more advanced adversarial attacks to test and improve upon the alignment and robustness of both language and multimodal AI models.

Introduction to Aligned Neural Networks

Aligned neural networks are designed to produce outputs consistent with the intentions and ethical standards of their creators. For language models, alignment means responding helpfully to user queries while refusing to generate harmful content. Techniques such as reinforcement learning from human feedback (RLHF) are used to push models toward this behavior, keeping their outputs within acceptable boundaries and away from bias or toxicity. Despite these efforts, no language model is entirely safe from being manipulated into producing undesirable outputs through what are known as adversarial examples.
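
To make the RLHF recipe a little more concrete, below is a minimal sketch of the preference-model objective that typically underlies it: a reward model is trained so that responses humans prefer score higher than responses they reject. The `reward_model` callable and tensor shapes are illustrative assumptions, not the setup of any specific system discussed in the paper.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Bradley-Terry style objective commonly used to train the reward model in
    an RLHF pipeline: the human-preferred ("chosen") response should receive a
    higher scalar reward than the rejected one."""
    r_chosen = reward_model(chosen_ids)       # (batch,) scalar rewards, assumed interface
    r_rejected = reward_model(rejected_ids)   # (batch,)
    # -log sigmoid(r_chosen - r_rejected) is minimized when chosen outscores rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The language model is then fine-tuned (commonly with PPO) to maximize this learned reward, which is what nudges it toward "helpful and harmless" behavior.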

Adversarial Examples: A Challenge to Alignment

Adversarial examples are inputs tailored to make neural networks perform actions or generate outputs they ordinarily would not. This vulnerability has been studied most extensively in image recognition, where minute changes to an input image, imperceptible to the human eye, can cause a network to misclassify it. Researchers have extended the phenomenon to language, where adversarial inputs can be constructed to coax models into emitting harmful outputs. This raises a critical question: despite advanced alignment techniques, can LLMs maintain their alignment when confronted with adversarially crafted inputs?
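
As a concrete illustration of the image-domain phenomenon, here is a minimal sketch of the classic Fast Gradient Sign Method (FGSM): a single, visually small perturbation bounded by `epsilon` that increases the classifier's loss enough to change its prediction. This is the textbook construction, not an attack introduced by the paper; `model`, `image`, and `label` are placeholder names.

```python
import torch
import torch.nn.functional as F

def fgsm_perturbation(model, image, label, epsilon=8 / 255):
    """One gradient step that nudges every pixel by +/- epsilon in the direction
    that increases the classification loss, which is often enough to flip the
    predicted class while remaining nearly imperceptible to a human."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adv_image = image + epsilon * image.grad.sign()
    return adv_image.clamp(0, 1).detach()  # keep pixels in the valid range
```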

Evaluating the Robustness of Aligned Models

The paper finds that while aligned models resist current state-of-the-art text-based adversarial attacks, those attacks are not powerful enough to serve as comprehensive tests of adversarial robustness: even where they fail, adversarial inputs can still be found by brute force. Surviving today's attacks should therefore not be taken as evidence that language models remain aligned under all adversarial conditions; our ability to measure their robustness accurately is still incomplete.
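
The text-based attacks in question must search over discrete tokens, which is part of what makes them weaker than their image counterparts. The sketch below shows one greedy step of a HotFlip-style substitution attack, broadly representative of the discrete optimizers the paper evaluates but not the paper's own code: gradients with respect to a one-hot relaxation of the prompt rank candidate token swaps, and the swap predicted to most increase the probability of an attacker-chosen continuation is applied. A HuggingFace-style causal LM accepting `inputs_embeds` is assumed.

```python
import torch
import torch.nn.functional as F

def hotflip_step(model, embed_matrix, adv_ids, target_ids):
    """One greedy step of a HotFlip-style token-substitution attack (illustrative
    sketch). A first-order estimate from the gradient of the target loss w.r.t.
    the one-hot prompt scores every (position, replacement-token) swap; the
    single best swap is applied and the updated prompt ids are returned."""
    vocab_size = embed_matrix.size(0)
    one_hot = F.one_hot(adv_ids, vocab_size).float().requires_grad_(True)
    prompt_embeds = one_hot @ embed_matrix                      # (prompt_len, dim)
    full_embeds = torch.cat([prompt_embeds, embed_matrix[target_ids]], dim=0)
    logits = model(inputs_embeds=full_embeds.unsqueeze(0)).logits[0]
    n_tgt = target_ids.size(0)
    # Cross-entropy of the attacker's target continuation under the current prompt.
    loss = F.cross_entropy(logits[-n_tgt - 1:-1], target_ids)
    loss.backward()
    grad = one_hot.grad                                         # (prompt_len, vocab)
    # Estimated loss change for swapping each position's current token for any other.
    delta = grad - (grad * one_hot.detach()).sum(dim=-1, keepdim=True)
    pos, tok = divmod(delta.argmin().item(), vocab_size)
    new_ids = adv_ids.clone()
    new_ids[pos] = tok
    return new_ids
```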

The New Frontier: Multimodal Models

The paper emphasizes a shift toward multimodal models, which accept images alongside text. These models open new avenues for user interaction but also add attack surface. The research shows that adversarial perturbations of the input image are highly effective against multimodal systems, inducing them to generate harmful content far more easily than text attacks can. By contrast, current NLP attacks are still too weak to demonstrate the same level of control over text-only models; the authors conjecture that improved attacks eventually will, and argue that stronger attack methods are needed to evaluate these language models properly.
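
The multimodal attack is conceptually simple because the image is a continuous input: the adversary can run ordinary gradient descent on the pixels until the model assigns high probability to an attacker-chosen response. The sketch below captures that idea; the `vlm(pixel_values=...)` interface returning logits over the target positions is an assumed placeholder, not the API of any particular model studied in the paper.

```python
import torch
import torch.nn.functional as F

def optimize_adversarial_image(vlm, image, target_ids, steps=500, alpha=1 / 255):
    """Sketch of an image-space attack on a vision-language model: iteratively
    nudge the pixels (signed gradient steps, PGD-like) so the model's predicted
    continuation matches an attacker-chosen target string."""
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        logits = vlm(pixel_values=adv)               # assumed shape: (n_target, vocab)
        loss = F.cross_entropy(logits, target_ids)   # teacher-forced target string
        loss.backward()
        with torch.no_grad():
            adv = (adv - alpha * adv.grad.sign()).clamp(0, 1)  # stay a valid image
        adv = adv.detach()
    return adv
```

A real attack might additionally constrain the perturbation size to keep the change to the image inconspicuous; this sketch only clamps to valid pixel values.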

In conclusion, while the alignment of neural networks signifies progress in pursuing more ethical AI, ensuring their robustness against adversarially designed prompts remains a significant challenge, particularly in multimodal contexts. Future research is urged to focus on refining adversarial attacks for a more accurate assessment of models' abilities to uphold their alignment in all circumstances.

