Concept Alignment (2401.08672v1)
Abstract: Discussion of AI alignment (alignment between humans and AI systems) has focused on value alignment, broadly referring to creating AI systems that share human values. We argue that before we can even attempt to align values, it is imperative that AI systems and humans align the concepts they use to understand the world. We integrate ideas from philosophy, cognitive science, and deep learning to explain the need for concept alignment, not just value alignment, between humans and machines. We summarize existing accounts of how humans and machines currently learn concepts, and we outline opportunities and challenges in the path towards shared concepts. Finally, we explain how we can leverage the tools already being developed in cognitive science and AI research to accelerate progress towards concept alignment.