
Abstract

Recent advances in general-purpose AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. However, the lack of clear definitions and scope for human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve it. In particular, ML- and philosophy-oriented alignment research often views AI alignment as a static, unidirectional process (i.e., aiming to ensure that AI systems' objectives match those of humans) rather than an ongoing, mutual alignment problem [429]. This perspective largely neglects the long-term interaction and dynamic changes of alignment. To address these gaps, we introduce a systematic review of over 400 papers published between 2019 and January 2024, spanning multiple domains such as Human-Computer Interaction (HCI), Natural Language Processing (NLP), Machine Learning (ML), and others. We characterize, define, and scope human-AI alignment. From this, we present a conceptual framework of "Bidirectional Human-AI Alignment" to organize the literature from a human-centered perspective. This framework encompasses both 1) conventional studies of aligning AI to humans, ensuring that AI produces the intended outcomes determined by humans, and 2) a proposed concept of aligning humans to AI, which aims to help individuals and society adjust to AI advances both cognitively and behaviorally. Additionally, we articulate key findings derived from the literature analysis, including discussions about human values, interaction techniques, and evaluations. To pave the way for future studies, we envision three key challenges and propose examples of potential future solutions.

Figure: Paper counts for each dimension in the bidirectional human-AI alignment framework.

Overview

  • The paper critiques current unidirectional approaches to AI alignment and proposes a dynamic, bidirectional framework that encompasses aligning AI to humans and aligning humans to AI.

  • It presents a systematic review of over 400 papers, adhering to PRISMA guidelines, and explores themes such as human values, interaction techniques, evaluation methods, and future research directions.

  • Key challenges identified include the specification of appropriate values, the dynamic co-evolution of AI and human values, and the development of safe, modulable AI systems to manage evolving societal needs.

Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions

In contemporary AI development, aligning AI systems with human values, goals, and ethical principles is paramount. The reviewed paper posits that current approaches to AI alignment are static and unidirectional, necessitating a re-evaluation towards a bidirectional and dynamic perspective. This proposal is encapsulated in a systematic review spanning four core themes: clarification of definitions, a conceptual framework, an elucidation of key findings, and proposed future research directions. Here, we examine each of these aspects in turn.

Core Definitions and Systematic Review Methodology

The paper addresses the ambiguity in human-AI alignment through three core questions: "With whom to align?", "What is the goal of alignment?", and "What values should be aligned with?". The authors argue for a pluralistic approach, acknowledging the diversity in human values while seeking comprehensive definitions and scopes. They adapt the Schwartz Theory of Basic Values, enhancing it to better accommodate AI contexts.
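For reference, Schwartz's original taxonomy comprises ten basic values organized under four higher-order dimensions. A minimal sketch of that structure as a data mapping follows; it reflects the original theory, not the paper's AI-specific adaptation:

```python
# Schwartz's ten basic values, grouped by their four higher-order
# dimensions (original theory; the paper's AI-oriented adaptation
# is not reproduced here).
SCHWARTZ_VALUES = {
    "openness_to_change": ["self-direction", "stimulation"],
    "self_enhancement": ["achievement", "power"],
    "conservation": ["security", "conformity", "tradition"],
    "self_transcendence": ["benevolence", "universalism"],
}
# "hedonism" is the tenth value; it straddles openness to change
# and self-enhancement in Schwartz's circular model.
```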

The systematic review methodology adhered to PRISMA guidelines and included papers from domains such as HCI, NLP, ML, and beyond, published between 2019 and January 2024. Over 400 papers were reviewed, with a qualitative coding scheme refined rigorously and iteratively. The interdisciplinary team screened papers against a core set of questions to ensure relevance and rigor.

Bidirectional Human-AI Alignment Framework

The framework proposed encompasses two primary directions: aligning AI to humans and aligning humans to AI.

Aligning AI to Humans

This direction concerns ensuring AI systems adhere to human values and involves two primary research questions:

  1. Human Values and Specifications: This theme addresses which values to align (e.g., individual, social, and interaction-based values) and how humans specify these values through interaction techniques such as explicit feedback (e.g., stating principles, rating/ranking outputs) and implicit signals (e.g., discarded options, language analysis).
  2. Integrating Human Specifications into AI: This theme examines how specified values are incorporated into AI development, spanning instruction-data curation, model learning, and inference-stage mechanisms; customizing values to specific individuals or groups (e.g., finetuning, interactive alignment); and evaluating AI systems' adherence through methodologies such as human-in-the-loop and automatic evaluation. One common pattern for turning ranking feedback into a training signal is sketched after this list.
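
To make the ranking-based specification concrete, here is a minimal sketch of one common integration pattern: training a Bradley-Terry-style reward model from pairwise human preferences, so that ranking feedback becomes a differentiable training signal. The class name, embedding dimension, and toy data are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response embedding to a scalar 'value adherence' score."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.score(emb).squeeze(-1)

def preference_loss(model, emb_chosen, emb_rejected):
    # Bradley-Terry objective: maximize P(chosen > rejected)
    # = sigmoid(score_chosen - score_rejected).
    return -torch.nn.functional.logsigmoid(
        model(emb_chosen) - model(emb_rejected)
    ).mean()

# Toy usage: random embeddings stand in for encoded responses.
model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
emb_chosen, emb_rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(model, emb_chosen, emb_rejected)
loss.backward()
opt.step()
```

The resulting scalar scorer can then be used downstream, e.g., as a reward signal during model learning or as a reranker at inference time.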

Aligning Humans to AI

This direction explores the human cognitive and behavioral adaptation needed for effective human-AI interaction and involves:

  1. Human Cognitive Adjustment to AI: This includes helping humans learn to perceive, explain, and critique AI systems through education efforts such as AI literacy, sensemaking, and critical-thinking frameworks.
  2. Human Adaptive Behavior to AI: This involves understanding how humans respond and adapt to AI advances and collaborate with AI systems, as well as assessing the societal impacts of AI technologies.

Analysis of Key Findings

Human Values for Alignment

The analysis identifies a crucial aspect: the relative priority of values matters as much as the values themselves. Human values are not fixed and may change dynamically, thus requiring frameworks that can accommodate these shifts. Additionally, AI systems should not attempt to embody all human values universally. Future research should aim to fully specify values through comprehensive datasets and adaptive algorithms.
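
As a toy illustration of the priority point: holding per-value scores fixed and shifting only the weights changes which response an alignment pipeline would select. All names and numbers below are hypothetical.

```python
# Hypothetical: the same per-value scores yield different "best"
# responses under different priority weightings.
values = ["helpfulness", "harmlessness", "privacy"]

scores = {  # made-up per-value scores for two candidate responses
    "response_a": {"helpfulness": 0.9, "harmlessness": 0.5, "privacy": 0.6},
    "response_b": {"helpfulness": 0.6, "harmlessness": 0.9, "privacy": 0.8},
}

def aggregate(score, weights):
    return sum(weights[v] * score[v] for v in values)

safety_first = {"helpfulness": 0.2, "harmlessness": 0.5, "privacy": 0.3}
utility_first = {"helpfulness": 0.6, "harmlessness": 0.2, "privacy": 0.2}

for weights in (safety_first, utility_first):
    best = max(scores, key=lambda r: aggregate(scores[r], weights))
    print(weights, "->", best)
# safety_first selects response_b; utility_first selects response_a.
```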

Interaction Techniques

The comparison of interaction techniques used in HCI and NLP/ML domains revealed that while both domains leverage structured ratings and implicit signals, HCI tends to employ richer, multi-modal interactions. This indicates a potential for interdisciplinary collaboration to enhance value specification processes.

Evaluation Gaps

AI evaluations focus on algorithmic performance, whereas human evaluations emphasize user experience and qualitative insights. Bridging these gaps will involve developing methodologies that convert qualitative human feedback into quantitative data usable by AI systems, ensuring a comprehensive evaluation strategy.
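
A minimal sketch of one such bridging step, assuming a Likert-plus-free-text feedback form: the qualitative inputs are mapped to a bounded scalar that an automatic evaluation pipeline could consume. The label scale and keyword cues are illustrative assumptions, not a method from the paper.

```python
# Hypothetical mapping of mixed qualitative feedback to a scalar signal.
LIKERT = {"strongly disagree": -2, "disagree": -1, "neutral": 0,
          "agree": 1, "strongly agree": 2}

NEG_CUES = {"confusing", "wrong", "unsafe"}
POS_CUES = {"clear", "helpful", "accurate"}

def feedback_to_score(likert_label: str, free_text: str) -> float:
    base = LIKERT[likert_label] / 2.0  # normalize to [-1, 1]
    words = set(free_text.lower().split())
    # Nudge the score by crude keyword evidence from the open comment.
    nudge = 0.1 * (len(words & POS_CUES) - len(words & NEG_CUES))
    return max(-1.0, min(1.0, base + nudge))

print(feedback_to_score("agree", "mostly clear but one step was confusing"))
# base 0.5, one positive and one negative cue cancel -> 0.5
```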

Future Directions

Three significant challenges are identified for long-term human-AI alignment:

  1. Specification Game: This requires resolving how to fully specify appropriate values and integrate them into AI systems. Future work can draw from political theory and develop democratic methods for value specification.
  2. Dynamic Co-evolution of Alignment: As AI evolves, it must co-adapt with changing human values and societal needs. This involves updating AI systems iteratively using limited data and designing adaptive mechanisms that do not compromise previously aligned values (a minimal sketch of one such anchored-update pattern follows this list).
  3. Safeguarding Co-adaptation: Ensuring future AI systems do not engage in risky behaviors requires developing modulable AI architectures and robust override protocols. Humans must be trained to identify and manage these behaviors effectively.
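
To make the second challenge concrete, the sketch below shows a generic anchored-regularization update (in the spirit of L2-SP or EWC-style penalties): the model adapts to a small batch of new value data while being penalized for drifting from its previously aligned weights. This is an illustrative pattern, not a method proposed by the paper; the function names and hyperparameters are hypothetical.

```python
import torch

def alignment_update(model, loss_fn, new_batch, anchor_params,
                     lam=1.0, lr=1e-4):
    """One cautious update on limited new data, anchored to the
    previously aligned weights via a quadratic drift penalty."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    task_loss = loss_fn(model, new_batch)
    drift = sum(((p - a) ** 2).sum()
                for p, a in zip(model.parameters(), anchor_params))
    (task_loss + lam * drift).backward()
    opt.step()
    opt.zero_grad()

# Usage: snapshot the aligned weights, then adapt on a small batch.
model = torch.nn.Linear(4, 1)
anchor = [p.detach().clone() for p in model.parameters()]
batch = (torch.randn(8, 4), torch.randn(8, 1))
mse = lambda m, b: torch.nn.functional.mse_loss(m(b[0]), b[1])
alignment_update(model, mse, batch, anchor, lam=0.5)
```

The drift term trades plasticity against stability: larger `lam` preserves more of the existing alignment at the cost of slower adaptation to newly specified values.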

Conclusion

The presented bidirectional human-AI alignment framework offers a foundational reference for future research. It lays out an extensive roadmap for dynamic and interactive alignment solutions, reflecting the need for interdisciplinary collaboration to navigate the complexities of AI systems in concert with human and societal evolution. The comprehensive analysis and proposed future directions provide a guiding vision toward achieving sustainable and adaptive AI alignment in the long term.
