RAIN: Your Language Models Can Align Themselves without Finetuning (2309.07124v2)
Abstract: LLMs often demonstrate inconsistencies with human preferences. Previous research typically gathered human preference data and then aligned the pre-trained models using reinforcement learning or instruction tuning, a.k.a. the finetuning step. In contrast, aligning frozen LLMs without requiring alignment data is more appealing. This work explores the potential of the latter setting. We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide rewinding and generation for AI safety. Notably, RAIN operates without the need for extra alignment data and abstains from any training, gradient computation, or parameter updates. Experimental results evaluated by GPT-4 and humans demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate of LLaMA 30B from 82% with vanilla inference to 97%, while maintaining the helpfulness rate. On the TruthfulQA dataset, RAIN improves the truthfulness of the already-well-aligned LLaMA-2-chat 13B model by 5%.
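To make the generate / self-evaluate / rewind loop described in the abstract concrete, here is a minimal sketch in Python. It is not the paper's actual algorithm (RAIN's search over candidate token sets is more elaborate); it only illustrates the control flow of rewindable inference under self-evaluation. All function names (`propose_segment`, `self_evaluate`, `rain_like_inference`) and the stub implementations are hypothetical placeholders; in practice both the proposal and the evaluation would be calls to the same frozen LLM, with no finetuning or gradient computation.

```python
import random

# Hypothetical stubs standing in for a frozen pre-trained LLM.
def propose_segment(prompt: str, prefix: list[str]) -> list[str]:
    """Sample a short candidate continuation (a few 'tokens')."""
    vocab = ["helpful", "harmless", "harmful", "rude", "polite", "answer"]
    return [random.choice(vocab) for _ in range(3)]

def self_evaluate(prompt: str, text: list[str]) -> float:
    """Score the partial response with the model itself, e.g. by prompting it
    to judge harmlessness and reading off the verdict. Higher is better."""
    bad = {"harmful", "rude"}
    return 1.0 - sum(t in bad for t in text) / max(len(text), 1)

def rain_like_inference(prompt: str, max_segments: int = 5,
                        threshold: float = 0.8, max_rewinds: int = 10) -> list[str]:
    """Simplified rewindable auto-regressive inference: extend the response
    segment by segment; if the model's own evaluation of the extended prefix
    falls below `threshold`, rewind (discard the segment) and resample."""
    response: list[str] = []
    segments = rewinds = 0
    while segments < max_segments and rewinds < max_rewinds:
        candidate = propose_segment(prompt, response)
        score = self_evaluate(prompt, response + candidate)
        if score >= threshold:
            response += candidate   # accept: commit the segment and move forward
            segments += 1
        else:
            rewinds += 1            # rewind: drop the segment and try again
    return response

if __name__ == "__main__":
    random.seed(0)
    print(" ".join(rain_like_inference("How do I respond politely?")))
```

The paper's actual procedure aggregates self-evaluation signals while searching over candidate token sets rather than applying a single accept/reject threshold, but the sketch captures the key idea: evaluation results steer which partial generations are kept and which are rewound, with no parameter updates.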
Authors: Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang Zhang