Pretraining Language Models with Human Preferences (2302.08582v2)

Published 16 Feb 2023 in cs.CL and cs.LG

Abstract: Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning distribution over tokens conditional on their human preference scores given by a reward model. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.

Summary

The paper demonstrates that conditional training during pretraining can reduce toxic and noncompliant outputs by up to an order of magnitude.
It evaluates five methods, including unlikelihood training and reward weighted regression, to balance alignment with retained LM capabilities.
Findings suggest a paradigm shift where integrating human feedback early in training enhances safety, robustness, and overall model reliability.

Pretraining Language Models with Human Preferences: Overview

The paper "Pretraining Language Models with Human Preferences" explores objectives for training language models (LMs) whose outputs are aligned with human preferences from the outset. Conventionally, alignment is considered only during post-training finetuning; this paper instead investigates aligning LMs during the pretraining phase itself. It does so by modifying the standard pretraining objective, evaluating five strategies for pretraining LMs with human feedback, and analyzing their performance on three tasks: minimizing toxicity, preventing leakage of personally identifiable information (PII), and ensuring that generated code complies with style guidelines.

The central claim of the paper is that incorporating human feedback into the pretraining of LMs can lead to a significant reduction in undesirable outputs without compromising the core capabilities of the models, challenging the existing paradigm of only aligning LMs during finetuning.

Methods

The authors propose and examine five objectives for pretraining LMs with human feedback:

Conditional Training: This approach extends maximum likelihood estimation (MLE) by labeling each segment of the training data with a control token that reflects the segment's human preference score. The model learns a distribution over tokens conditional on that control token, so it can be steered toward preferred text at generation time.
Dataset Filtering: The training data is preprocessed to exclude any instances whose human preference score falls below a specified threshold, and the model is then pretrained on the remaining data with standard MLE.
Unlikelihood Training: This technique adds an unlikelihood term to the objective, which explicitly lowers the probability the model assigns to tokens in undesirable text.
Reward Weighted Regression (RWR): RWR incorporates human preference scores directly into the training objective by weighting token log likelihoods with exponentiated reward values.
Advantage-Weighted Regression (AWR): A variant of RWR, AWR employs a value function to adjust the segment-level rewards used in RWR, introducing a learned advantage estimator.
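
To make three of these objectives more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The control-token strings, the toy reward function, the thresholds, the document-level averaging used for filtering, and the temperature parameter beta are all assumptions chosen for illustration; a real setup would use a learned reward model and a tokenizer-aware training loop.

```python
import math
from typing import Callable, List

# Hypothetical control-token strings used to mark each segment.
GOOD, BAD = "<|good|>", "<|bad|>"


def toy_reward(segment: str) -> float:
    """Stand-in reward model: 0.0 for clean text, -1.0 per blocklisted word."""
    blocklist = {"idiot", "stupid"}  # illustrative only
    words = [w.strip(".,!?").lower() for w in segment.split()]
    return -float(sum(w in blocklist for w in words))


def tag_segments(
    segments: List[str], reward_fn: Callable[[str], float], threshold: float
) -> List[str]:
    """Conditional training data prep: prepend a control token to each segment."""
    return [(GOOD if reward_fn(s) >= threshold else BAD) + s for s in segments]


def filter_documents(
    documents: List[List[str]], reward_fn: Callable[[str], float], threshold: float
) -> List[List[str]]:
    """Dataset filtering: keep documents whose mean segment reward meets the threshold."""
    kept = []
    for doc in documents:
        if doc and sum(reward_fn(s) for s in doc) / len(doc) >= threshold:
            kept.append(doc)
    return kept


def rwr_loss(token_log_probs: List[float], segment_reward: float, beta: float) -> float:
    """Reward-weighted regression: negative log likelihood scaled by exp(reward / beta)."""
    weight = math.exp(segment_reward / beta)
    return -weight * sum(token_log_probs)


if __name__ == "__main__":
    doc = ["You are kind.", "You are an idiot!"]
    print(tag_segments(doc, toy_reward, threshold=0.0))
    print(filter_documents([[doc[0]], [doc[1]]], toy_reward, threshold=0.0))
    print(rwr_loss([-1.0, -2.0], segment_reward=-1.0, beta=1.0))
```

In this sketch, a conditionally trained model would be prompted with the "good" control token at generation time, whereas filtering discards low-scoring text entirely and RWR keeps it but shrinks its contribution to the loss.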

The efficacy of each method is evaluated against standard MLE in achieving both alignment (reducing undesired model outputs) and preserving the LM's general capabilities, as measured by the KL divergence from well-performing models like GPT-3 and task-specific evaluations.
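
The KL-based capability measure can be sketched as a Monte Carlo estimate over text sampled from the reference model. This is an illustrative sketch only: the function name, the direction of the divergence (reference to evaluated model), and the assumption that per-sample log probabilities from both models are already available are not taken from the paper.

```python
from typing import Sequence


def kl_estimate(
    ref_log_probs: Sequence[float], model_log_probs: Sequence[float]
) -> float:
    """Estimate KL(ref || model) from samples drawn from the reference model.

    ref_log_probs[i] is the log probability the reference model assigns to
    sample i, and model_log_probs[i] is the log probability the evaluated
    model assigns to the same sample.
    """
    if not ref_log_probs or len(ref_log_probs) != len(model_log_probs):
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [r - m for r, m in zip(ref_log_probs, model_log_probs)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # The evaluated model assigns lower probability to each reference sample.
    print(kl_estimate([-1.0, -2.0], [-1.5, -2.5]))
```

A lower estimate means the evaluated model assigns probability to the reference model's samples more like the reference model itself does, which is the sense in which the measure serves as a proxy for retained capabilities.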

Results and Implications

The paper's experiments show that conditional training consistently offers the best alignment-capability trade-off among the methods studied, reducing undesired content across all tested tasks (toxicity, PII, and PEP8 compliance) without impairing downstream performance on benchmarks such as GLUE. In many settings, conditional training decreases the rate of undesirable content in LM outputs by up to an order of magnitude, and it outperforms standard pretraining followed by finetuning with feedback.

Furthermore, conditional training avoids the degeneration and reduced diversity that some alignment mechanisms inadvertently produce. It also improves adversarial robustness: models pretrained with conditional objectives are notably less susceptible to adversarially chosen prompts than baseline MLE-pretrained models.

On the basis of these results, the paper argues for a shift in LM training practice: incorporating human preferences from the initial stages of training can be more effective than current approaches that postpone alignment to later stages such as finetuning or rule-based filtering. Doing so avoids having to unlearn undesirable behavior acquired during large-scale text imitation and reduces the risk of performance degradation from abrupt post-pretraining interventions.

Future Directions

The reduction of undesirable behaviors through pretraining with human feedback opens several directions for future work. Practically, the work suggests improving current alignment methods by refining reward functions, evaluating alignment on tasks beyond the initial three, and applying conditional training to a wider range of LM architectures. Theoretically, further research could investigate the trade-offs between generalization and robustness that conditional pretraining entails, particularly as models scale in parameters and data volume. Integrating more granular and dynamic human feedback throughout pretraining could also help LMs remain safe and capable in unpredictable deployment environments.

In summary, the proposed shift to pretraining methods that incorporate human preferences challenges the status quo of LM alignment, introducing strategies that enhance safety and reliability while preserving model capabilities.

GitHub

GitHub - tomekkorbak/pretraining-with-human-feedback: Code accompanying the paper Pretraining Language Models with Human Preferences (182 stars)