Large Language Model Alignment: A Survey (2309.15025v1)

Published 26 Sep 2023 in cs.CL and cs.AI

Abstract: Recent years have witnessed remarkable progress made in LLMs. Such advancements, while garnering significant attention, have concurrently elicited various concerns. The potential of these models is undeniably vast; however, they may yield texts that are imprecise, misleading, or even detrimental. Consequently, it becomes paramount to employ alignment techniques to ensure these models to exhibit behaviors consistent with human values. This survey endeavors to furnish an extensive exploration of alignment methodologies designed for LLMs, in conjunction with the extant capability research in this domain. Adopting the lens of AI alignment, we categorize the prevailing methods and emergent proposals for the alignment of LLMs into outer and inner alignment. We also probe into salient issues including the models' interpretability, and potential vulnerabilities to adversarial attacks. To assess LLM alignment, we present a wide variety of benchmarks and evaluation methodologies. After discussing the state of alignment research for LLMs, we finally cast a vision toward the future, contemplating the promising avenues of research that lie ahead. Our aspiration for this survey extends beyond merely spurring research interests in this realm. We also envision bridging the gap between the AI alignment research community and the researchers engrossed in the capability exploration of LLMs for both capable and safe LLMs.

Citations (127)

View on Semantic Scholar

Summary

The paper establishes a comprehensive taxonomy of LLM alignment, detailing outer alignment, inner alignment, and mechanistic interpretability.
It examines vulnerabilities such as adversarial, privacy, and backdoor attacks while presenting evaluation benchmarks for alignment effectiveness.
The survey identifies future research directions, emphasizing scalable oversight and ethical, value-aligned AI for advancing AGI.

The paper "LLM Alignment: A Survey" provides an extensive exploration of alignment methodologies for LLM, incorporating capability research within the AI alignment domain. It categorizes alignment approaches into outer and inner alignment, examines model interpretability, and discusses vulnerabilities to adversarial attacks. The paper also presents benchmarks and evaluation methods for LLM alignment and contemplates future research directions.

The authors highlight the rapid advancements in LLM technology, exemplified by models like ChatGPT and GPT-4, and their potential as a path toward AGI (Artificial General Intelligence). They also acknowledge ethical risks associated with LLM development, including the perpetuation of harmful information such as biases and toxic content, leakage of private data, and the generation of misleading or false information. The authors note the potential for misuse of LLM and negative impacts on the environment, information dissemination, and employment.

The survey addresses long-term concerns about misaligned AGI posing existential risks, where an AI agent surpassing human intelligence could develop its own goals and monopolize resources, potentially leading to catastrophic outcomes for humanity. As a solution, the authors emphasize the importance of AI alignment, ensuring that AI systems produce outputs aligned with human values.

The authors observe that existing reviews predominantly focus on outer alignment, overlooking inner alignment and mechanistic interpretability. They aim to bridge this gap by providing a comprehensive overview of LLM alignment from the perspective of AI alignment, encompassing both outer alignment and areas such as inner alignment and mechanistic interpretability.

The paper proposes a taxonomy for LLM alignment, structured around outer alignment, inner alignment, and mechanistic interpretability. It discusses the necessity for LLM alignment research, introduces the origins of AI alignment and related concepts, elaborates on theoretical and technical approaches to aligning LLM, and discusses potential side effects and vulnerabilities of current alignment methods, including adversarial attacks. Methodologies and benchmarks for LLM alignment evaluation are presented, and future trends in LLM alignment research are considered.

The established risks of LLM are categorized as social and ethical risks, including undesirable content, such as biased, toxic, or privacy-sensitive content, and unfaithful content, like misinformation, hallucination, and inconsistency. Malicious uses of LLM, such as disinformation campaigns and generating code for cyber attacks, as well as negative impacts on society, like high energy consumption and carbon emissions, are also discussed.

The potential risks associated with advanced LLM include awareness, deception, self-preservation, and power-seeking, considered consequences of instrumental convergence. These behaviors have already been observed in current LLM, suggesting that advanced LLM may produce undesired behaviors, posing significant risks.

The definition of AI alignment is explored, with the paper defining it as ensuring that both the outer and inner objectives of AI agents align with human values, distinguishing between objectives defined by AI designers and those optimized within AI agents.

Key research agendas of AI alignment, including outer alignment, inner alignment, and interpretability, are identified. Outer alignment involves choosing the right loss functions or reward functions to ensure that the training objectives of AI systems match human values. Inner alignment ensures that AI systems are trained to achieve the goals set by their designers, addressing the challenge of AI systems developing unpredictable behaviors. Interpretability, encompassing transparency and explainability, facilitates human understanding of the inner workings, decisions, and actions of AI systems.

Related concepts such as the orthogonality thesis and the instrumental convergence thesis are introduced as fundamental AGI assumptions that address the necessity of alignment of AI objectives with human values and the potential subgoals any AI agents might chase.

Outer alignment is explored, focusing on the alignment of goals of LLM to human values, categorized into helpfulness, honesty, and harmlessness. Approaches to outer alignment are categorized into non-recursive oversight methods and scalable oversight methods, based on the upper bound of capabilities reachable in supervision.

Non-recursive oversight methods, including RL (Reinforcement Learning)-based and SL (Supervised Learning)-based methods, learn training goals directly from labeled human feedback data. RLHF (Reinforcement Learning from Human Feedback) is identified as a commonly used non-recursive oversight method, using human preferences as a proxy to specify human values and train a reward model over human preferences to optimize LLM. Variants of RLHF and other RL-based methods are discussed. SL-based methods, including those with text-based and ranking-based feedback signals, convert human intents and preferences into feedback signals to achieve alignment. Challenges of non-recursive oversight, including the difficulty in obtaining quality feedback and the inability of humans to provide feedback for complex tasks, are noted.

Scalable oversight is presented as a promising methodology to tackle the fundamental challenge of non-recursive oversight in the scalability to complex tasks/superhuman models. Task decomposition, Constitutional AI, debate, market making, and proxy tasks are discussed as scalable oversight approaches. Task decomposition involves breaking down a complex task into simpler subtasks, while Constitutional AI uses human-written principles to guide AI systems in generating training instances. Debate involves agents proposing answers and presenting arguments for and against them, with a human judge selecting the most accurate answer. Market making involves training models to predict human answers and generate arguments to change the model's mind. Proxy tasks use self-consistency to oversee superhuman models. Challenges of scalable oversight, such as the assumption that tasks can be parallelized and that model intentions are transparent to humans, are considered.

The paper defines inner alignment, differentiating between base optimizers, base objectives, mesa-optimizers, and mesa-objectives. Inner alignment failures are categorized into proxy alignment, approximate alignment, and suboptimality alignment. Methodologies for inner alignment, such as relaxed adversarial training, are explored, along with empirical experiment proposals for inner alignment, including reward side-channels, cross-episodic objectives, objective unidentifiability, zero-shot objectives, and robust reward learning.

Mechanistic interpretability is examined, focusing on elucidating the internal mechanisms by which a machine learning model transforms inputs into outputs, providing causal and functional explanations for predictions. The goal of mechanistic interpretability is to reverse engineer the reasoning process, decomposing neural networks into interpretable parts and flows of information. Mechanistic interpretability on self-attention, MLP (Multi-Layer Perceptron), and neurons is discussed, along with the challenges of mechanistic interpretability, such as the superposition hypothesis and non-linear representations.

Attacks on aligned LLMs are categorized into privacy attacks, backdoor attacks, and adversarial attacks. Privacy attacks exploit machine learning models to extract private or sensitive information about the training data. Backdoor attacks aim to cause the model to produce specific, incorrect outputs when certain backdoor triggers are detected. Adversarial attacks compromise the performance of machine learning models by introducing small perturbations to the input data.

Alignment evaluation is structured across multiple levels, focusing on factuality, ethics, toxicity, stereotype and bias, and general evaluation. Task-specific evaluation and LLM-centered evaluation are distinguished, with the third level providing fine-grained classifications and showcasing related works.

Future directions for LLM alignment are discussed, including theoretical research in decision theory, corrigibility, and world models, as well as scalable oversight, empirical research into deceptive alignment, automated LLM alignment, explainability and transparency, dynamic evaluation of LLM alignment via adversarial attacks, and field building of LLM alignment by bridging between the LLM and AI alignment communities.

The paper concludes by emphasizing the importance of aligning advanced AI systems to human values and the need for collaborative approaches to harness the full potential of LLM ethically and beneficially.

PDF Markdown

YouTube

Show All Videos

Large Language Model Alignment: A Survey (2309.15025v1)

Summary

Related Papers

YouTube