- Leveraging Arrow's and Sen's impossibility theorems, the paper demonstrates that inherent limitations of RLHF prevent universal AI alignment.
- It employs social choice theory to show that any voting rule used to aggregate reinforcer feedback in RLHF must either violate basic democratic norms or concentrate decision-making power in a single reinforcer, rather than delivering genuine democratic consensus.
- The findings motivate policy responses, including disclosing voting rules in model cards and pursuing narrowly targeted alignment strategies, to mitigate the risks of misaligned AI systems.
AI Alignment and Social Choice: Fundamental Limitations
The paper "AI Alignment and Social Choice: Fundamental Limitations and Policy Implications" (2310.16048) investigates the inherent limitations of aligning AI systems with human values through democratic processes, particularly within the Reinforcement Learning with Human Feedback (RLHF) framework. It leverages social choice theory to demonstrate fundamental barriers in achieving universal AI alignment while adhering to democratic norms and respecting individual preferences.
RLHF and Democratic Alignment
The paper addresses the critical question of whose values should be embedded in AI systems, given the prevalent use of RLHF in training LLMs. RLHF fine-tunes a model's outputs using preference feedback from human reinforcers, and when multiple reinforcers are involved their individual preferences must be aggregated into a single training signal. The paper highlights a gap in understanding RLHF's limitations at scale, particularly around selecting representative human reinforcers, and examines whether any voting rule can enable a group of reinforcers, representative of diverse users, to train an AI model via RLHF in a way that respects democratic norms.
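To make the aggregation step concrete, here is a minimal sketch (not taken from the paper) of one possible voting rule: a simple majority vote over a panel of reinforcers that produces a single preference label for reward-model training. All names in the snippet are hypothetical illustrations.

```python
# Minimal sketch (not from the paper): aggregating pairwise feedback from a
# panel of reinforcers into one preference label for reward-model training.
from collections import Counter

def majority_label(votes):
    """Return the response preferred by a simple majority of reinforcers.

    votes: list of "A" / "B" strings, one per reinforcer, indicating which of
    two candidate model responses that reinforcer preferred.
    """
    counts = Counter(votes)
    if counts["A"] == counts["B"]:
        return None  # tie: the voting rule must also specify a tie-breaking policy
    return "A" if counts["A"] > counts["B"] else "B"

# Three reinforcers compare two candidate completions for the same prompt.
panel_votes = ["A", "A", "B"]
print(majority_label(panel_votes))  # -> "A": the label fed to the reward model
```

Even this simple rule already embeds a choice (majority plus a tie-breaking policy), and the paper's point is that every such choice runs into the impossibility results discussed next.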
Impossibility Theorems and AI Alignment
The core argument rests on two impossibility theorems from social choice theory: Arrow's theorem and Sen's theorem. Arrow's theorem states that, when there are three or more alternatives, no voting rule can simultaneously satisfy Pareto efficiency, transitivity, independence of irrelevant alternatives, and non-dictatorship. In the context of RLHF, the paper argues, this implies that any voting rule over reinforcer feedback that is Pareto efficient, transitive, and independent of irrelevant alternatives must grant all decision-making authority to a single reinforcer. Sen's theorem (the liberal paradox) extends this: a social choice mechanism satisfying universal domain and Pareto optimality cannot respect the individual rights of more than one individual.
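The failure mode that Arrow's theorem generalizes can be seen directly in a small example. The sketch below (illustrative only, not from the paper) shows a Condorcet cycle: three reinforcers with cyclic rankings over three candidate responses make pairwise majority voting intransitive, so no single "best" response exists.

```python
# Illustrative sketch: a Condorcet cycle showing why pairwise majority voting
# among reinforcers can violate transitivity.
from itertools import combinations

# Each reinforcer ranks three candidate model responses from best to worst.
rankings = [
    ["A", "B", "C"],  # reinforcer 1
    ["B", "C", "A"],  # reinforcer 2
    ["C", "A", "B"],  # reinforcer 3
]

def majority_prefers(x, y, rankings):
    """True if a strict majority of reinforcers rank x above y."""
    wins = sum(r.index(x) < r.index(y) for r in rankings)
    return wins > len(rankings) / 2

for x, y in combinations("ABC", 2):
    if majority_prefers(x, y, rankings):
        print(f"majority prefers {x} over {y}")
    else:
        print(f"majority prefers {y} over {x}")
# Result: A beats B, B beats C, yet C beats A -- the group preference is
# cyclic, so majority voting cannot produce a transitive ranking here.
```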
Implications for AI Governance and Policy
The paper discusses the policy implications of these theoretical limitations for the governance of AI systems built using RLHF. One key implication is the non-uniqueness of AI alignment: because each developer chooses its voting rule privately, with no shared voting protocol across developers, the resulting AI models need not be consistently aligned even when the reinforcers hired by different developers hold the same preferences. To enable comparisons between RLHF models and promote transparency, the paper recommends disclosing the voting rule in model cards. More fundamentally, it argues that building universally aligned AI agents with RLHF is impossible, since any agent built this way will be misaligned with every user in some dimension. The paper therefore advocates developing AI agents narrowly aligned with the specific preferences of particular user groups, explicitly communicated during the reinforcement process, yielding a family of task-specific aligned models rather than a single generally aligned model.
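The non-uniqueness point can be illustrated with a small sketch (again hypothetical, not from the paper): the same reinforcer preference profile aggregated by two standard voting rules, plurality and Borda count, selects two different "aligned" choices, so two developers with identical reinforcer preferences but different private voting rules would ship differently aligned models.

```python
# Illustrative sketch: identical reinforcer preferences, two voting rules,
# two different outcomes -- the non-uniqueness of RLHF alignment.
from collections import Counter

# Seven reinforcers rank three candidate responses; the same profile is used
# by two hypothetical developers who merely chose different voting rules.
profile = 3 * [["A", "B", "C"]] + 2 * [["B", "C", "A"]] + 2 * [["C", "B", "A"]]

def plurality_winner(profile):
    """Response ranked first by the largest number of reinforcers."""
    return Counter(r[0] for r in profile).most_common(1)[0][0]

def borda_winner(profile):
    """Response with the highest Borda score (2 points for 1st, 1 for 2nd, 0 for 3rd)."""
    scores = Counter()
    for ranking in profile:
        for points, option in zip((2, 1, 0), ranking):
            scores[option] += points
    return scores.most_common(1)[0][0]

print(plurality_winner(profile))  # -> "A"
print(borda_winner(profile))      # -> "B": same preferences, different outcome
```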
Conclusion
The paper concludes that aligning AI models in a scalable manner while respecting democratic norms faces fundamental limitations. The use of RLHF in real-world applications has exploded in recent months, yet regulations and policies for deploying RLHF models at scale remain nascent. Because there will always be private user preferences that an RLHF model built via democratic norms will violate, the paper's results suggest that aligned AI development may be better served by incentivizing smaller developers to align their models to a narrow set of users rather than attempting to build universally aligned AI.