
Social Choice for AI Alignment: Dealing with Diverse Human Feedback

(2404.10271)
Published Apr 16, 2024 in cs.LG , cs.AI , cs.CL , cs.CY , and cs.GT

Abstract

Foundation models such as GPT-4 are fine-tuned to avoid unsafe or otherwise problematic behavior, so that, for example, they refuse to comply with requests for help with committing crimes or with producing racist text. One approach to fine-tuning, called reinforcement learning from human feedback, learns from humans' expressed preferences over multiple outputs. Another approach is constitutional AI, in which the input from humans is a list of high-level principles. But how do we deal with potentially diverging input from humans? How can we aggregate the input into consistent data about "collective" preferences or otherwise use it to make collective choices about model behavior? In this paper, we argue that the field of social choice is well positioned to address these questions, and we discuss ways forward for this agenda, drawing on discussions in a recent workshop on Social Choice for AI Ethics and Safety held in Berkeley, CA, USA in December 2023.

Figure: RLCHF enhances standard RLHF by integrating a social welfare function, F, to aggregate preferences.

Overview

  • The paper addresses the challenge of aligning LLMs with diverse human values through the application of social choice theory, contrasting current methods like reinforcement learning from human feedback (RLHF) and Constitutional AI.

  • It critiques the RLHF and Constitutional AI approaches for their insufficient consideration of human diversity and proposes using social choice theory to better aggregate diverse human feedback into AI behavior guidelines.

  • It introduces Reinforcement Learning from Collective Human Feedback (RLCHF) and Supervised Learning from Simulated Collective Decisions (SLSCD), novel frameworks that embed social choice principles directly into the AI training process, aiming for more representative AI systems.

  • The paper emphasizes the need for multidisciplinary efforts to integrate social choice theory insights into AI development, offering a path forward for creating AI systems aligned with a broad spectrum of human values and preferences.

Social Choice for AI Alignment: A Framework for Incorporating Diverse Human Feedback

Introduction to the Challenge

Developing and fine-tuning LLMs with human feedback has surfaced significant challenges, especially regarding the diversity and potential divergence of human input. The paper discusses the relevance and application of social choice theory as a structured approach to these problems. Specifically, it scrutinizes the difficulties inherent in reinforcement learning from human feedback (RLHF) and proposes a more principled method for aligning LLMs with collective human values and preferences.

Value Alignment and RLHF: Current State

Value alignment in AI systems focuses on ensuring AI behaves in a way that aligns with human values. Reinforcement learning from human feedback (RLHF) has been critical in aligning pretrained LLMs with these values. However, the RLHF approach faces significant limitations, including challenges in dealing with unrepresentative data, oversimplified models of human decision-making, and lack of consideration for human diversity. This paper critically evaluates RLHF and Constitutional AI as prevailing methodologies, highlighting their insufficiency in effectively capturing and reflecting collective human preferences in AI behavior.
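
For context, the reward-modeling step that these critiques target typically fits a scalar reward model to pairwise preference labels with a Bradley-Terry style loss. The sketch below is illustrative and not from the paper; `RewardModel` and the synthetic embeddings are placeholders. Note that it silently pools all annotators' labels into a single dataset, which is precisely the modeling choice the paper questions.

```python
# Minimal sketch (not from the paper) of standard RLHF reward modeling:
# fit a scalar reward model to pairwise preference labels with a
# Bradley-Terry / logistic loss. All names and data are illustrative.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a fixed-size response embedding to a scalar."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: -log sigmoid(r(chosen) - r(rejected))
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Synthetic embeddings of (chosen, rejected) response pairs, pooled across
# all annotators -- the naive aggregation the paper questions.
torch.manual_seed(0)
chosen, rejected = torch.randn(32, 16), torch.randn(32, 16)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```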

Constitutional AI and the Promise of Social Choice Theory

The paper contrasts RLHF with Constitutional AI (CAI), presenting CAI's approach of employing high-level human-written principles for AI training. It argues that both methods inadequately address the aggregation of diverse human input into a coherent set of guidelines for AI behavior, a gap effectively bridged by social choice theory. By leveraging social choice, the paper claims we can avoid naïve aggregation pitfalls, such as cyclical preferences or inconsistencies, ensuring AI systems better represent collective human judgments.
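
To see the pitfall concretely, a classic Condorcet cycle shows how taking pairwise majorities over annotators can produce an intransitive "collective" preference. The toy example below is illustrative and not drawn from the paper.

```python
# Minimal sketch (illustrative): a Condorcet cycle showing why naive
# pairwise-majority aggregation over annotators can yield an intransitive
# "collective" preference over candidate model responses A, B, C.
from itertools import combinations

# Three annotators rank three candidate responses (best first).
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a strict majority of annotators rank x above y."""
    votes = sum(r.index(x) < r.index(y) for r in rankings)
    return votes > len(rankings) / 2

for x, y in combinations("ABC", 2):
    if majority_prefers(x, y):
        print(f"majority prefers {x} over {y}")
    else:
        print(f"majority prefers {y} over {x}")
# Output shows A > B, C > A, and B > C: a cycle, so no consistent
# collective ranking exists without a principled aggregation rule.
```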

The Role of Computational Social Choice in AI Alignment

Computational social choice offers a rich toolkit for aggregating individual preferences, judgments, or principles into collective decisions. This paper argues for its application in tackling key questions in AI alignment, such as identifying relevant stakeholders for feedback, formatting and aggregating diverse types of feedback, and making collective decisions on AI behavior from this feedback. Through computational social choice, concerns about fairness, accuracy, and inclusivity in feedback collection and aggregation can be systematically addressed.
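
As one concrete example of this toolkit, the sketch below applies the Borda count, a classical positional voting rule, to hypothetical annotator rankings over candidate responses. The rankings and the choice of rule are illustrative, not taken from the paper.

```python
# Minimal sketch (illustrative): Borda count, one classical social choice
# rule, applied to annotator rankings over candidate model responses.
from collections import defaultdict

def borda(rankings):
    """Each candidate scores (n_candidates - 1 - position) per ballot."""
    scores = defaultdict(int)
    for ballot in rankings:
        n = len(ballot)
        for pos, candidate in enumerate(ballot):
            scores[candidate] += n - 1 - pos
    return sorted(scores.items(), key=lambda kv: -kv[1])

rankings = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
]
print(borda(rankings))  # [('A', 5), ('B', 3), ('C', 1)]
```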

Novel Frameworks: From RLHF to RLCHF and Beyond

The paper introduces Reinforcement Learning from Collective Human Feedback (RLCHF) as a novel framework that integrates social choice directly into the RLHF process: individual judgments are aggregated into collective feedback before the model is fine-tuned, potentially yielding fairer and more representative AI systems. It also explores Supervised Learning from Simulated Collective Decisions (SLSCD), an approach that uses social choice theory not just to aggregate preferences but to simulate collective decisions that guide AI behavior directly.
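
As a rough illustration of the RLCHF idea, the sketch below applies a social welfare function to per-annotator pairwise labels before any reward-model training. Simple majority stands in for the aggregation rule F, which the paper deliberately leaves open; all names and data are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's implementation): apply a
# social welfare function F to per-annotator preference labels *before*
# reward-model fine-tuning. Here F is simple pairwise majority.
from collections import Counter

def aggregate_pairwise(labels_per_annotator):
    """labels_per_annotator: list of dicts mapping (prompt_id, resp_a, resp_b)
    to 'a' or 'b'. Returns one collective label per comparison (majority wins)."""
    tallies = {}
    for labels in labels_per_annotator:
        for key, choice in labels.items():
            tallies.setdefault(key, Counter())[choice] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in tallies.items()}

annotators = [
    {("p1", "r1", "r2"): "a", ("p2", "r3", "r4"): "b"},
    {("p1", "r1", "r2"): "a", ("p2", "r3", "r4"): "a"},
    {("p1", "r1", "r2"): "b", ("p2", "r3", "r4"): "b"},
]
collective = aggregate_pairwise(annotators)
print(collective)  # {('p1', 'r1', 'r2'): 'a', ('p2', 'r3', 'r4'): 'b'}
# These collective labels would then replace the pooled per-annotator labels
# in the usual reward-model training step sketched earlier.
```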

Key Concepts in Social Choice Relevant to AI

Highlighting the relevance of concepts such as independence of clones, strategic voting, anonymity, and principles as voters, the paper discusses how these traditional social choice considerations can meaningfully inform the design and implementation of AI systems aligned with collective human values. It also hints at exploring cooperative AI to manage the potential interaction between multiple AIs trained with differing collective inputs.

Addressing Behavioral and Multi-Agent Considerations

The paper acknowledges the complexity introduced by human behavioral factors in preference elicitation and the potential for strategic manipulation of feedback. It suggests further research to understand and mitigate these effects. Additionally, it contemplates the scenario of navigating interactions between multiple AIs aligned to different collective preferences, emphasizing the need for cooperation and conflict avoidance.

Conclusion and Path Forward

In conclusion, the paper urges a multidisciplinary effort, crossing the boundaries between AI ethics, safety research, and social choice theory, to develop principled and practical methods for incorporating diverse human preferences into AI systems. By systematically applying insights from social choice, the field can make significant strides toward creating AI systems that are truly aligned with the broad spectrum of human values and preferences.
