Abstract

LLMs often demonstrate inconsistencies with human preferences. Previous research typically gathered human preference data and then aligned the pre-trained models using reinforcement learning or instruction tuning, a.k.a. the finetuning step. In contrast, aligning frozen LLMs without requiring alignment data is more appealing. This work explores the potential of the latter setting. We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide rewind and generation for AI safety. Notably, RAIN operates without the need of extra data for model alignment and abstains from any training, gradient computation, or parameter updates. Experimental results evaluated by GPT-4 and humans demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate of LLaMA 30B from 82% of vanilla inference to 97%, while maintaining the helpfulness rate. On the TruthfulQA dataset, RAIN improves the truthfulness of the already-well-aligned LLaMA-2-chat 13B model by 5%.

Overview

  • The paper introduces a novel inference method called Rewindable Auto-regressive INference (RAIN) that enables pre-trained LLMs to self-adjust and produce aligned outputs without further training or extra data.

  • RAIN leverages LLMs' intrinsic capability for self-critique to guide output regeneration, aligning it with human values by incorporating a rewind mechanism in the generation process.

  • The method involves a tree-like search structure, combining forward and backward searches, to dynamically yield more aligned outputs, resembling human contemplation and decision-making.

  • Empirical evidence shows that RAIN significantly improves alignment in LLMs such as LLaMA, increasing the harmlessness rate without losing helpfulness, and that it is robust against attempts to elicit harmful content.

  • The authors position RAIN as a significant advance in language model alignment, enabling safer use of pre-trained LLMs while avoiding much of the computational cost typically required for finetuning.

Overview of Novel Inference Method

In the realm of language model alignment—ensuring a language model's output conforms to human values—most existing techniques require extensive finetuning and data annotation. A newly introduced inference method, however, sidesteps these resource-intensive processes. This method, named Rewindable Auto-regressive INference (RAIN), allows pre-trained LLMs to self-adjust during inference by incorporating self-evaluation and rewind mechanisms, effectively producing aligned outputs without model retraining or the need for additional data.

Aligning Pre-trained Language Models

Historically, aligning LLMs to human preferences necessitated finetuning steps utilizing significant amounts of human-collected preference data. However, the RAIN approach is a departure from this paradigm. It leverages the inherent abilities of LLMs to judge their generated content and to guide subsequent regenerations based on those judgments. This process enables the model to rewind and adjust if the content produced is deemed inconsistent with the desired criteria, thus inherently aligning the model's outputs with human preferences.
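To make this concrete, below is a minimal sketch of what such a self-evaluation step could look like, assuming a Hugging Face causal LM. The checkpoint name, the prompt wording, and the `self_evaluate` helper are illustrative assumptions rather than the paper's exact implementation.

```python
# Hedged sketch: ask the frozen model to judge its own draft and read the
# answer from next-token probabilities. Prompt template is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"  # assumed checkpoint; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def self_evaluate(context: str, draft: str) -> float:
    """Score a draft response by asking the same model whether it is harmless.

    Returns P("A") / (P("A") + P("B")) for an A=harmless / B=harmful choice,
    read from the next-token logits. Higher means the model judges its own
    draft more favorably.
    """
    prompt = (
        f"{context}{draft}\n\n"
        "Question: Is the response above harmless and helpful?\n"
        "(A) Yes (B) No\nAnswer: ("
    )
    ids = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]  # next-token distribution
    # Token lookup for the bare "A"/"B" symbols; vocab details vary by tokenizer.
    a_id = tok.convert_tokens_to_ids("A")
    b_id = tok.convert_tokens_to_ids("B")
    probs = torch.softmax(logits[[a_id, b_id]], dim=-1)
    return probs[0].item()
```

The resulting score is what guides the rewind: candidate continuations that score poorly are backtracked and regenerated, as illustrated in the search sketch under "How RAIN Operates" below.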

How RAIN Operates

RAIN's modus operandi bears resemblance to human contemplative behavior: analyzing and weighing consequences before finalizing a decision. Node attributes are dynamically adjusted during a search over a tree-like structure in which each node represents a token sequence. RAIN combines forward and backward searches: forward steps expand the search tree with new token sets, while backward steps rewind and prepare for further searches. By judiciously using the updated node attributes, RAIN steers the generation process toward more aligned directions. The process is further refined using similarity measures among token sets, allowing efficient exploration of an otherwise vast search space.
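The sketch below illustrates this forward/backward loop in simplified form. The `Node` structure, the UCB-style selection rule, and the helpers `propose_continuations` and `self_evaluate` (e.g., the scoring sketch above) are assumptions made for illustration; the paper's actual attribute updates and similarity-based refinements are more involved.

```python
# Hedged sketch of a rewindable search over token-set nodes: descend the tree,
# expand a leaf with candidate continuations, score one, then rewind to the
# root while updating the attributes of the visited path.
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    tokens: str                              # token set appended at this node
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)
    value: float = 0.0                       # running alignment-score estimate
    visits: int = 0

def select_child(node: Node, c: float = 1.0) -> Node:
    # UCB-style choice: prefer children that score well or are under-explored.
    def ucb(ch: Node) -> float:
        if ch.visits == 0:
            return float("inf")
        return ch.value / ch.visits + c * math.sqrt(math.log(node.visits + 1) / ch.visits)
    return max(node.children, key=ucb)

def rain_step(root: Node, context: str, propose_continuations, self_evaluate,
              n_candidates: int = 3) -> None:
    """One forward-then-backward pass: descend, expand, score, rewind."""
    # Forward: walk down existing children, then expand a leaf with candidates.
    node, text = root, context + root.tokens
    while node.children:
        node = select_child(node)
        text += node.tokens
    for cand in propose_continuations(text, n_candidates):
        node.children.append(Node(tokens=cand, parent=node))
    leaf = random.choice(node.children)
    score = self_evaluate(context, text + leaf.tokens)
    # Backward: rewind to the root, updating attributes along the visited path.
    cur: Node | None = leaf
    while cur is not None:
        cur.visits += 1
        cur.value += score
        cur = cur.parent
```

Repeatedly calling `rain_step` and finally committing to the highest-value child of the root (then restarting the search from that extended prefix) corresponds to the rewind-and-regenerate behavior described above.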

Experimental Validation

RAIN's effectiveness is underscored by empirical results. Tested models such as LLaMA showed significant improvements in alignment tasks, increasing the harmlessness rate without sacrificing helpfulness (for example, from 82% to 97% for LLaMA 30B on the HH dataset). RAIN also proved more resilient against attempts to induce the model to generate harmful responses, even though it was not designed as an adversarial defense. Both the alignment gains and this robustness grow notably with model size. While RAIN does add computational overhead relative to vanilla auto-regressive inference, the increase in generation time is manageable, particularly given the safety benefits obtained.

Conclusion

The research illustrates the capacity of LLMs to self-align without external data or finetuning. RAIN represents a significant step forward in the practical alignment of language models, enhancing safety while minimizing the computational requirements traditionally associated with such tasks. It paves the way for more efficient and safer use of pre-trained language models in various applications.
