Learning Human-like Representations to Enable Learning Human Values (2312.14106v3)
Abstract: How can we build AI systems that can learn any set of individual human values both quickly and safely, while avoiding harm or violations of societal standards for acceptable behavior during the learning process? We explore the effects of representational alignment between humans and AI agents on learning human values. Making AI systems learn human-like representations of the world has many known benefits, including improving generalization, robustness to domain shifts, and few-shot learning performance. We demonstrate that this kind of representational alignment can also support safely learning and exploring human values in the context of personalization. We begin with a theoretical prediction, show that it applies to learning human morality judgments, and then show that our results generalize to ten different aspects of human values -- including ethics, honesty, and fairness -- by training AI agents on each set of values in a multi-armed bandit setting, where rewards reflect human value judgments over the chosen action. Using a set of textual action descriptions, we collect value judgments from humans, as well as similarity judgments from both humans and multiple LLMs, and demonstrate that representational alignment enables both safe exploration and improved generalization when learning human values.
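To make the bandit setup concrete, below is a minimal sketch of one way a learner can exploit a similarity structure over actions of the kind the paper elicits from human and LLM similarity judgments. This is illustrative rather than the paper's actual algorithm: the arm count, the random embeddings standing in for elicited similarities, the similarity-weighted UCB rule, and the Gaussian reward noise are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

n_arms = 10     # hypothetical number of candidate actions
n_rounds = 500

# Hypothetical ground-truth human value judgment per action (hidden from the agent).
true_values = rng.uniform(0.0, 1.0, n_arms)

# Stand-in similarity matrix over actions. In the paper's setting this would be
# estimated from human (or LLM) similarity judgments over textual action
# descriptions; here it is derived from random embeddings purely for illustration.
embeddings = rng.normal(size=(n_arms, 4))
dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
similarity = np.exp(-dists)  # values in (0, 1], with 1s on the diagonal

counts = np.zeros(n_arms)  # pulls per arm
means = np.zeros(n_arms)   # running mean reward per arm

def generalized_estimate(arm: int) -> float:
    """Similarity-weighted value estimate: rewards observed on similar actions
    inform unexplored ones, mimicking generalization under aligned representations."""
    weights = similarity[arm] * counts
    total = weights.sum()
    return 0.5 if total == 0 else float(weights @ means / total)

for t in range(n_rounds):
    # UCB-style scores: the estimate borrows strength from similar arms, while
    # the exploration bonus shrinks as an arm accumulates its own observations.
    scores = [
        generalized_estimate(a) + np.sqrt(2.0 * np.log(t + 1) / (counts[a] + 1))
        for a in range(n_arms)
    ]
    arm = int(np.argmax(scores))
    reward = true_values[arm] + rng.normal(0.0, 0.1)  # noisy "human value judgment"
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]

print("highest-value arm:", int(np.argmax(true_values)))
print("most-pulled arm:  ", int(np.argmax(counts)))
```

The design point this illustrates is the abstract's central claim: when the similarity matrix reflects human-like structure, reward information propagates to untried actions, so the agent can rule out low-value (potentially harmful) arms with fewer direct pulls, supporting both safer exploration and better generalization.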