- Leveraging Arrow's and Sen's impossibility theorems, the paper demonstrates that inherent limitations of RLHF prevent universal AI alignment.
- It employs social choice theory to show that any voting rule used to aggregate reinforcer feedback in RLHF must either violate basic democratic norms or concentrate decision-making power in a single reinforcer, rather than delivering genuine democratic consensus.
- The findings motivate policy responses, including disclosing voting rules in model cards and pursuing narrowly targeted alignment strategies, to mitigate the risks of misaligned AI systems.
AI Alignment and Social Choice: Fundamental Limitations
The paper "AI Alignment and Social Choice: Fundamental Limitations and Policy Implications" (2310.16048) investigates the inherent limitations of aligning AI systems with human values through democratic processes, particularly within the Reinforcement Learning with Human Feedback (RLHF) framework. It leverages social choice theory to demonstrate fundamental barriers in achieving universal AI alignment while adhering to democratic norms and respecting individual preferences.
RLHF and Democratic Alignment
The paper addresses the critical question of whose values should be embedded in AI systems, given the prevalent use of RLHF in training LLMs. RLHF fine-tunes a model's outputs using preference feedback from human reinforcers, and when multiple reinforcers are involved their individual preferences must be aggregated into a single training signal. The paper highlights a gap in understanding RLHF's limitations at scale, particularly around selecting representative human reinforcers, and examines whether any voting rule can enable a group of reinforcers, representative of diverse users, to train an AI model via RLHF in a way that respects democratic norms.
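To make the aggregation step concrete, here is a minimal sketch (not taken from the paper) of one possible voting rule: a simple majority vote over a panel of reinforcers that produces a single preference label for reward-model training. All names in the snippet are hypothetical illustrations.

```python
# Minimal sketch (not from the paper): aggregating pairwise feedback from a
# panel of reinforcers into one preference label for reward-model training.
from collections import Counter

def majority_label(votes):
    """Return the response preferred by a simple majority of reinforcers.

    votes: list of "A" / "B" strings, one per reinforcer, indicating which of
    two candidate model responses that reinforcer preferred.
    """
    counts = Counter(votes)
    if counts["A"] == counts["B"]:
        return None  # tie: the voting rule must also specify a tie-breaking policy
    return "A" if counts["A"] > counts["B"] else "B"

# Three reinforcers compare two candidate completions for the same prompt.
panel_votes = ["A", "A", "B"]
print(majority_label(panel_votes))  # -> "A": the label fed to the reward model
```

Even this simple rule already embeds a choice (majority plus a tie-breaking policy), and the paper's point is that every such choice runs into the impossibility results discussed next.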
Impossibility Theorems and AI Alignment
The core argument rests on two impossibility theorems from social choice theory: Arrow's theorem and Sen's theorem. Arrow's theorem states that, when there are three or more alternatives, no voting rule can simultaneously satisfy Pareto efficiency, transitivity, independence of irrelevant alternatives, and non-dictatorship. In the context of RLHF, the paper argues, this implies that any voting rule over reinforcer feedback that is Pareto efficient, transitive, and independent of irrelevant alternatives must grant all decision-making authority to a single reinforcer. Sen's theorem (the liberal paradox) extends this: a social choice mechanism satisfying universal domain and Pareto optimality cannot respect the individual rights of more than one individual.
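The failure mode that Arrow's theorem generalizes can be seen directly in a small example. The sketch below (illustrative only, not from the paper) shows a Condorcet cycle: three reinforcers with cyclic rankings over three candidate responses make pairwise majority voting intransitive, so no single "best" response exists.

```python
# Illustrative sketch: a Condorcet cycle showing why pairwise majority voting
# among reinforcers can violate transitivity.
from itertools import combinations

# Each reinforcer ranks three candidate model responses from best to worst.
rankings = [
    ["A", "B", "C"],  # reinforcer 1
    ["B", "C", "A"],  # reinforcer 2
    ["C", "A", "B"],  # reinforcer 3
]

def majority_prefers(x, y, rankings):
    """True if a strict majority of reinforcers rank x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

for x, y in combinations("ABC", 2):
    if majority_prefers(x, y, rankings):
        print(f"majority prefers {x} over {y}")
    else:
        print(f"majority prefers {y} over {x}")
# Result: A beats B, B beats C, yet C beats A -- the group preference is
# cyclic, so majority voting cannot produce a transitive ranking here.
```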
Implications for AI Governance and Policy
The paper discusses the policy implications of these theoretical limitations for the governance of AI systems built using RLHF. One key implication is the non-uniqueness of AI alignment: because each developer chooses its voting rule privately, with no shared voting protocol across developers, the resulting AI models need not be consistently aligned even when the reinforcers hired by different developers hold the same preferences. To enable comparisons between RLHF models and promote transparency, the paper recommends disclosing the voting rule in model cards. More fundamentally, it argues that building universally aligned AI agents with RLHF is impossible, since any agent built this way will be misaligned with every user in some dimension. The paper therefore advocates developing AI agents narrowly aligned with the specific preferences of particular user groups, explicitly communicated during the reinforcement process, yielding a family of task-specific aligned models rather than a single generally aligned model.
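The non-uniqueness point can be illustrated with a small sketch (again hypothetical, not from the paper): the same reinforcer preference profile aggregated by two standard voting rules, plurality and Borda count, selects two different "aligned" choices, so two developers with identical reinforcer preferences but different private voting rules would ship differently aligned models.

```python
# Illustrative sketch: identical reinforcer preferences, two voting rules,
# two different outcomes -- the non-uniqueness of RLHF alignment.
from collections import Counter

# Seven reinforcers rank three candidate responses; the same profile is used
# by two hypothetical developers who merely chose different voting rules.
profile = 3 * [["A", "B", "C"]] + 2 * [["B", "C", "A"]] + 2 * [["C", "B", "A"]]

def plurality_winner(profile):
    """Response ranked first by the largest number of reinforcers."""
    return Counter(r[0] for r in profile).most_common(1)[0][0]

def borda_winner(profile):
    """Response with the highest Borda score (2 points for 1st, 1 for 2nd, 0 for 3rd)."""
    scores = Counter()
    for ranking in profile:
        for points, option in zip((2, 1, 0), ranking):
            scores[option] += points
    return scores.most_common(1)[0][0]

print(plurality_winner(profile))  # -> "A"
print(borda_winner(profile))      # -> "B": same preferences, different outcome
```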
Conclusion
The paper concludes that aligning AI models in a scalable manner while respecting democratic norms faces fundamental limitations. The use of RLHF in real-world applications has exploded in recent months, yet regulations and policies for deploying RLHF models at scale remain nascent. Because there will always be private user preferences that an RLHF model built via democratic norms will violate, the paper's results suggest that aligned AI development may be better served by incentivizing smaller developers to align their models to a narrow set of users rather than attempting to build universally aligned AI.