
Social Choice for AI Alignment: Dealing with Diverse Human Feedback

(2404.10271)
Published Apr 16, 2024 in cs.LG , cs.AI , cs.CL , cs.CY , and cs.GT

Abstract

Foundation models such as GPT-4 are fine-tuned to avoid unsafe or otherwise problematic behavior, so that, for example, they refuse to comply with requests for help with committing crimes or with producing racist text. One approach to fine-tuning, called reinforcement learning from human feedback, learns from humans' expressed preferences over multiple outputs. Another approach is constitutional AI, in which the input from humans is a list of high-level principles. But how do we deal with potentially diverging input from humans? How can we aggregate the input into consistent data about "collective" preferences or otherwise use it to make collective choices about model behavior? In this paper, we argue that the field of social choice is well positioned to address these questions, and we discuss ways forward for this agenda, drawing on discussions in a recent workshop on Social Choice for AI Ethics and Safety held in Berkeley, CA, USA in December 2023.

Figure: RLCHF enhances standard RLHF by integrating a social welfare function, F, to aggregate preferences.

Overview

  • The paper addresses the challenge of aligning LLMs with diverse human values through the application of social choice theory, contrasting current methods like reinforcement learning from human feedback (RLHF) and Constitutional AI.

  • It critiques the RLHF and Constitutional AI approaches for their insufficient consideration of human diversity and proposes using social choice theory to better aggregate diverse human feedback into AI behavior guidelines.

  • It introduces Reinforcement Learning from Collective Human Feedback (RLCHF) and Supervised Learning from Simulated Collective Decisions (SLSCD), novel frameworks that embed social choice principles directly into the AI training process, aiming for more representative AI systems.

  • The paper emphasizes the need for multidisciplinary efforts to integrate social choice theory insights into AI development, offering a path forward for creating AI systems aligned with a broad spectrum of human values and preferences.

Social Choice for AI Alignment: A Framework for Incorporating Diverse Human Feedback

Introduction to the Challenge

Developing and fine-tuning LLMs with human feedback has surfaced significant challenges, especially regarding the diversity and potential divergence of human input. The paper discusses the relevance and application of social choice theory as a structured approach to these problems. Specifically, it scrutinizes the difficulties inherent in reinforcement learning from human feedback (RLHF) and proposes a more principled method for aligning LLMs with collective human values and preferences.

Value Alignment and RLHF: Current State

Value alignment in AI systems focuses on ensuring AI behaves in a way that aligns with human values. Reinforcement learning from human feedback (RLHF) has been critical in aligning pretrained LLMs with these values. However, the RLHF approach faces significant limitations, including challenges in dealing with unrepresentative data, oversimplified models of human decision-making, and lack of consideration for human diversity. This paper critically evaluates RLHF and Constitutional AI as prevailing methodologies, highlighting their insufficiency in effectively capturing and reflecting collective human preferences in AI behavior.
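
For context, the reward-modeling step that these critiques target typically fits a scalar reward model to pairwise preference labels with a Bradley-Terry style loss. The sketch below is illustrative and not from the paper; `RewardModel` and the synthetic embeddings are placeholders. Note that it silently pools all annotators' labels into a single dataset, which is precisely the modeling choice the paper questions.

```python
# Minimal sketch (not from the paper) of standard RLHF reward modeling:
# fit a scalar reward model to pairwise preference labels with a
# Bradley-Terry / logistic loss. All names and data are illustrative.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a fixed-size response embedding to a scalar."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: -log sigmoid(r(chosen) - r(rejected))
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Synthetic embeddings of (chosen, rejected) response pairs, pooled across
# all annotators -- the naive aggregation the paper questions.
torch.manual_seed(0)
chosen, rejected = torch.randn(32, 16), torch.randn(32, 16)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```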

Constitutional AI and the Promise of Social Choice Theory

The paper contrasts RLHF with Constitutional AI (CAI), presenting CAI's approach of employing high-level human-written principles for AI training. It argues that both methods inadequately address the aggregation of diverse human input into a coherent set of guidelines for AI behavior, a gap effectively bridged by social choice theory. By leveraging social choice, the paper claims we can avoid naïve aggregation pitfalls, such as cyclical preferences or inconsistencies, ensuring AI systems better represent collective human judgments.
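
To see the pitfall concretely, a classic Condorcet cycle shows how taking pairwise majorities over annotators can produce an intransitive "collective" preference. The toy example below is illustrative and not drawn from the paper.

```python
# Minimal sketch (illustrative): a Condorcet cycle showing why naive
# pairwise-majority aggregation over annotators can yield an intransitive
# "collective" preference over candidate model responses A, B, C.
from itertools import combinations

# Three annotators rank three candidate responses (best first).
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a strict majority of annotators rank x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2

for x, y in combinations("ABC", 2):
    if majority_prefers(x, y):
        print(f"majority prefers {x} over {y}")
    else:
        print(f"majority prefers {y} over {x}")
# Output shows A > B, C > A, and B > C: a cycle, so no consistent
# collective ranking exists without a principled aggregation rule.
```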

The Role of Computational Social Choice in AI Alignment

Computational social choice offers a rich toolkit for aggregating individual preferences, judgments, or principles into collective decisions. This paper argues for its application in tackling key questions in AI alignment, such as identifying relevant stakeholders for feedback, formatting and aggregating diverse types of feedback, and making collective decisions on AI behavior from this feedback. Through computational social choice, concerns about fairness, accuracy, and inclusivity in feedback collection and aggregation can be systematically addressed.
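
As one concrete example of this toolkit, the sketch below applies the Borda count, a classical positional voting rule, to hypothetical annotator rankings over candidate responses. The rankings and the choice of rule are illustrative, not taken from the paper.

```python
# Minimal sketch (illustrative): Borda count, one classical social choice
# rule, applied to annotator rankings over candidate model responses.
from collections import defaultdict

def borda(rankings):
    """Each candidate scores (n_candidates - 1 - position) per ballot."""
    scores = defaultdict(int)
    for ballot in rankings:
        n = len(ballot)
        for pos, candidate in enumerate(ballot):
            scores[candidate] += n - 1 - pos
    return sorted(scores.items(), key=lambda kv: -kv[1])

rankings = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
]
print(borda(rankings))  # [('A', 5), ('B', 3), ('C', 1)]
```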

Novel Frameworks: From RLHF to RLCHF and Beyond

The paper introduces Reinforcement Learning from Collective Human Feedback (RLCHF) as a novel framework that integrates social choice directly into the RLHF process: individual judgments are aggregated into collective feedback before the model is fine-tuned, potentially yielding fairer and more representative AI systems. It also explores Supervised Learning from Simulated Collective Decisions (SLSCD), an approach that uses social choice theory not just to aggregate preferences but to simulate collective decisions that guide AI behavior directly.
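
As a rough illustration of the RLCHF idea, the sketch below applies a social welfare function to per-annotator pairwise labels before any reward-model training. Simple majority stands in for the aggregation rule F, which the paper deliberately leaves open; all names and data are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's implementation): apply a
# social welfare function F to per-annotator preference labels *before*
# reward-model fine-tuning. Here F is simple pairwise majority.
from collections import Counter

def aggregate_pairwise(labels_per_annotator):
    """labels_per_annotator: list of dicts mapping (prompt_id, resp_a, resp_b)
    to 'a' or 'b'. Returns one collective label per comparison (majority wins)."""
    tallies = {}
    for labels in labels_per_annotator:
        for key, choice in labels.items():
            tallies.setdefault(key, Counter())[choice] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in tallies.items()}

annotators = [
    {("p1", "r1", "r2"): "a", ("p2", "r3", "r4"): "b"},
    {("p1", "r1", "r2"): "a", ("p2", "r3", "r4"): "a"},
    {("p1", "r1", "r2"): "b", ("p2", "r3", "r4"): "b"},
]
collective = aggregate_pairwise(annotators)
print(collective)  # {('p1', 'r1', 'r2'): 'a', ('p2', 'r3', 'r4'): 'b'}
# These collective labels would then replace the pooled per-annotator labels
# in the usual reward-model training step sketched earlier.
```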

Key Concepts in Social Choice Relevant to AI

Highlighting the relevance of concepts such as independence of clones, strategic voting, anonymity, and principles as voters, the paper discusses how these traditional social choice considerations can meaningfully inform the design and implementation of AI systems aligned with collective human values. It also hints at exploring cooperative AI to manage the potential interaction between multiple AIs trained with differing collective inputs.

Addressing Behavioral and Multi-Agent Considerations

The paper acknowledges the complexity introduced by human behavioral factors in preference elicitation and the potential for strategic manipulation of feedback. It suggests further research to understand and mitigate these effects. Additionally, it contemplates the scenario of navigating interactions between multiple AIs aligned to different collective preferences, emphasizing the need for cooperation and conflict avoidance.

Conclusion and Path Forward

In conclusion, the paper urges a multidisciplinary effort, crossing the boundaries between AI ethics, safety research, and social choice theory, to develop principled and practical methods for incorporating diverse human preferences into AI systems. By systematically applying insights from social choice, the field can make significant strides toward creating AI systems that are truly aligned with the broad spectrum of human values and preferences.
