
The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

(arXiv:2312.01552)
Published Dec 4, 2023 in cs.CL and cs.AI

Abstract

The alignment tuning process of LLMs typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al. 2023), shows that using merely 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be "superficial." This raises questions about how exactly alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterparts. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. Most distribution shifts occur with stylistic tokens. This direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and our results with URIAL suggest that deeper analysis and theoretical understanding of alignment are crucial to future LLM research.

Figure: Comparison of tuning-free alignment methods in eliciting answers from base LLMs using various prompting techniques.

Overview

  • The paper questions how deeply alignment tuning through SFT and RLHF changes LLMs, finding that the changes are largely superficial and concentrated in token selection.

  • Evidence is presented showing that aligned LLMs mainly adopt responsible AI assistant language styles, with significant changes in stylistic rather than substantive tokens.

  • A novel, tuning-free method called URIAL uses in-context learning and strategic prompts to align base LLMs efficiently without modifying their parameters.

  • URIAL's performance is tested with the just-eval-instruct dataset and a multi-dimensional evaluation protocol, showing it can rival or exceed traditional tuning methods.

  • The findings encourage a shift in LLM alignment strategies, focusing on leveraging existing knowledge and in-context learning, thus conserving resources.

LLMs have shown impressive capabilities in following user instructions and preferences when further tuned through methods such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). However, recent work has raised questions about how deep the changes introduced by alignment tuning really are, suggesting that its impact might be somewhat "superficial." This discussion forms the foundation for a study that closely investigates alignment tuning by comparing the token distributions of base LLMs against their fine-tuned counterparts.

The study's analysis uncovers a striking similarity in token selection during decoding for the majority of positions between base and aligned LLMs, with significant shifts observed chiefly among stylistic tokens like discourse markers and safety disclaimers, rather than content-driven tokens. This indicates that much of what alignment tuning achieves is primarily the adoption of the language style characteristic of responsible AI assistants, capitalizing on knowledge the base LLMs already possess.
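
As a concrete illustration, this kind of token-distribution-shift analysis can be reproduced in spirit with a short script that decodes greedily with the aligned model and checks, at every position, whether the base model would also have ranked the chosen token highly. The model names, the top-k threshold, and the assumption that both checkpoints share a tokenizer are illustrative choices here, not the paper's exact setup:

```python
# Sketch of a token-distribution-shift measurement between a base LLM and its
# aligned counterpart. Model names and the top-k cutoff are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
aligned = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # assumed shared tokenizer

def shifted_positions(prompt: str, max_new_tokens: int = 128, top_k: int = 3):
    """Greedily decode with the aligned model, then flag positions where the
    chosen token falls outside the base model's top-k candidates."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = aligned.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    full = out[0]
    shifted = []
    for pos in range(ids.shape[1], full.shape[0]):
        context = full[:pos].unsqueeze(0)
        with torch.no_grad():
            base_logits = base(context).logits[0, -1]
        base_topk = torch.topk(base_logits, top_k).indices.tolist()
        token_id = full[pos].item()
        if token_id not in base_topk:  # a "shifted" token: the base model
            shifted.append((pos, tok.decode(token_id)))  # would not have picked it
    return shifted
```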

Moving beyond conventional fine-tuning practices, the paper introduces a novel, tuning-free alignment method, Untuned LLMs with Restyled In-context Alignment (URIAL). URIAL leverages the base LLM's in-context learning capability, employing a few carefully curated stylistic examples and a dedicated system prompt to align the LLM without modifying its parameters. Evaluating URIAL against SFT and RLHF methods shows that it can match or surpass their performance when applied to strong base LLMs, suggesting that with strategic prompting and in-context learning, tuning-free methods can effectively close the alignment gap.
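
A minimal sketch of how such a prompt might be assembled is shown below; the system preamble, the "# Query:"/"# Answer:" markers, and the example answers are placeholders written in the paper's spirit, not the released URIAL prompt:

```python
# Sketch: building a URIAL-style prompt for an untuned base LLM.
SYSTEM = (
    "Below is a conversation between a curious user and a helpful, respectful, "
    "and honest AI assistant. The assistant gives detailed, well-structured, "
    "and safe answers."
)

# A small, constant set of restyled (query, answer) pairs written in the target
# assistant style; the paper reports that as few as three such examples suffice.
STYLISTIC_EXAMPLES = [
    ("What is the capital of France?",
     "The capital of France is Paris. Beyond being the seat of government, it is "
     "also the country's cultural center, home to landmarks such as the Louvre."),
    ("How do I stay motivated while studying?",
     "Here are a few strategies many people find helpful: set small, concrete "
     "goals; take regular breaks; and reward yourself after each milestone."),
    # a third constant example would follow in the same style
]

def build_urial_prompt(user_query: str) -> str:
    """Concatenate the system preamble, the in-context examples, and the new
    query into a single completion prompt for the base model."""
    parts = [SYSTEM, ""]
    for query, answer in STYLISTIC_EXAMPLES:
        parts += [f"# Query:\n{query}", f"# Answer:\n{answer}", ""]
    parts += [f"# Query:\n{user_query}", "# Answer:\n"]
    return "\n".join(parts)
```

The resulting string is fed to the untuned base model as an ordinary completion prompt; no parameters are updated, and generation stops when the model begins a new "# Query:" block.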

The study grounds its assessment of tuning-free alignment in a rigorously designed multi-aspect, interpretable evaluation protocol and a dataset named just-eval-instruct. This evaluation spans dimensions such as helpfulness, clarity, factuality, depth, engagement, and safety, providing a granular and insightful review of LLM outputs. The results URIAL achieves underscore the potential of inference-time alignment methods as a promising alternative to more resource-intensive tuning approaches.
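
To make the protocol concrete, the sketch below shows one way such per-aspect judge scores could be aggregated; the six aspect names come from the paper, while the 1-5 scale and the example numbers are assumptions for illustration:

```python
# Sketch: averaging per-aspect judge scores over an evaluated set of responses.
from statistics import mean

# The six aspects scored in the paper's evaluation protocol.
ASPECTS = ["helpfulness", "clarity", "factuality", "depth", "engagement", "safety"]

def aggregate(scored_examples):
    """Average per-aspect scores (assumed integers on a 1-5 scale) over a list
    of evaluated (instruction, response) pairs."""
    return {aspect: mean(ex[aspect] for ex in scored_examples) for aspect in ASPECTS}

# Hypothetical scores for two responses, purely for illustration:
print(aggregate([
    {"helpfulness": 5, "clarity": 5, "factuality": 4, "depth": 4, "engagement": 4, "safety": 5},
    {"helpfulness": 4, "clarity": 5, "factuality": 5, "depth": 3, "engagement": 4, "safety": 5},
]))
```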

In essence, this study critically re-examines the necessity of parameter tuning for aligning LLMs and opens the door to more efficient, resource-conservative methodologies. It highlights the underappreciated capacity of base LLMs to align through in-context learning and emphasizes the pivotal role of high-quality, strategically crafted prompts. These findings carry significant weight for future research in LLM analysis and alignment, suggesting a shift towards methods that amplify the knowledge already present in LLMs rather than adding further layers of fine-tuning.
