Contrastive Decoding: Open-ended Text Generation as Optimization

(2210.15097)
Published Oct 27, 2022 in cs.CL, cs.AI, and cs.LG

Abstract

Given a language model (LM), maximum probability is a poor decoding objective for open-ended generation, because it produces short and repetitive text. On the other hand, sampling can often produce incoherent text that drifts from the original topics. We propose contrastive decoding (CD), a reliable decoding approach that optimizes a contrastive objective subject to a plausibility constraint. The contrastive objective returns the difference between the likelihood under a large LM (called the expert, e.g. OPT-13B) and a small LM (called the amateur, e.g. OPT-125M), and the constraint ensures that the outputs are plausible. CD is inspired by the fact that the failures of larger LMs (e.g., repetition, incoherence) are even more prevalent in smaller LMs, and that this difference signals which texts should be preferred. CD requires zero additional training, and produces higher quality text than decoding from the larger LM alone. It also works across model scales (OPT-13B and GPT2-1.5B) and significantly outperforms four strong decoding algorithms (e.g., nucleus, top-k) in automatic and human evaluations across wikipedia, news and story domains.

Overview

  • Introduces contrastive decoding (CD) for improving open-ended text generation by leveraging discrepancies between a large LM (expert) and a small LM (amateur).

  • Describes how CD scores text by the difference in log probabilities between the large and small LM, subject to a plausibility constraint, avoiding repetition and incoherence without extra training (formalized below).

  • Reports that CD outperforms strong sampling baselines on coherence, with human evaluations also favoring CD, and that its benefits hold at larger model scales.

  • Highlights the efficiency of CD due to its use of pre-trained models and suggests future research avenues for its application in various language tasks.
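
The objective sketched above can be written compactly as follows; the notation is paraphrased from the paper, with p_EXP and p_AMA denoting the expert and amateur next-token distributions and \alpha a plausibility-threshold hyperparameter:

    V_head(x_{<i}) = \{ x_i \in V : p_{EXP}(x_i \mid x_{<i}) \ge \alpha \cdot \max_{w \in V} p_{EXP}(w \mid x_{<i}) \}

    L_{CD}(x_{cont}; x_{pre}) = \log p_{EXP}(x_{cont} \mid x_{pre}) - \log p_{AMA}(x_{cont} \mid x_{pre})

Decoding maximizes L_{CD} while restricting every generated token to the plausible set V_head, which prevents the contrast from rewarding tokens that the expert itself considers unlikely.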

Introduction to Contrastive Decoding

The paper introduces contrastive decoding (CD), an approach designed to address common failure modes in open-ended text generation with language models (LMs). Maximum-likelihood decoding tends to produce short, repetitive text, while straightforward sampling often drifts into incoherence and away from the initial context. CD leverages both a large LM (the "expert") and a small LM (the "amateur"), using the discrepancy between their predictions to guide generation toward coherent text without sacrificing lexical diversity.

Understanding the Approach

CD builds on the observation that smaller LMs exhibit failure modes such as repetition and incoherence even more often than their larger counterparts. It scores text by the difference in log probabilities assigned by the large and small LMs and searches for continuations that maximize this difference, subject to a plausibility constraint that restricts candidates to tokens the expert itself considers sufficiently likely. This effectively filters out undesirable patterns. Remarkably, the approach requires no additional training on top of the existing pre-trained models and adapts readily across model scales and architectures, such as the OPT and GPT-2 series.
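
To make the token-level procedure concrete, here is a minimal sketch using Hugging Face transformers and PyTorch. It is an illustrative simplification, not the paper's exact experimental setup: the model choices (gpt2-xl as expert, gpt2 as amateur), the alpha value, and the greedy selection of the highest-scoring token are assumptions made for brevity; the paper optimizes the contrastive objective over longer continuations rather than one token at a time.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Expert and amateur must share a vocabulary; the GPT-2 family does.
tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
expert = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()
amateur = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def contrastive_decode(prompt, max_new_tokens=64, alpha=0.1):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Next-token log-probabilities under the expert and the amateur.
        exp_logp = torch.log_softmax(expert(ids).logits[0, -1], dim=-1)
        ama_logp = torch.log_softmax(amateur(ids).logits[0, -1], dim=-1)

        # Plausibility constraint: keep only tokens whose expert probability
        # is at least alpha times that of the expert's most likely token.
        keep = exp_logp >= exp_logp.max() + torch.log(torch.tensor(alpha))

        # Contrastive score: expert log-prob minus amateur log-prob,
        # restricted to the plausible set (everything else gets -inf).
        score = torch.where(keep, exp_logp - ama_logp,
                            torch.full_like(exp_logp, float("-inf")))

        next_id = score.argmax().view(1, 1)
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(contrastive_decode("The history of the city of Florence"))

Because the amateur assigns relatively high probability to repetitive or generic continuations, subtracting its log-probabilities pushes the search away from exactly the failure modes described above, while the plausibility mask keeps the output anchored to tokens the expert finds likely.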

Empirical Validation

The method surpasses several strong baselines, including nucleus, top-k, and typical sampling, across domains such as Wikipedia, news, and storytelling. Automatic evaluations show that CD achieves higher coherence scores while maintaining fluency comparable to the other methods, and human evaluators likewise prefer CD. The gap between CD and the sampling baselines narrows somewhat as model size increases, but CD retains its advantage at both scales tested, indicating the benefits carry over to larger models.

Advantages and Extensions

Because CD contrasts probabilities from models of different capacities, it exploits their discrepancies without any retraining or fine-tuning, which makes it efficient to deploy in practice. The paper also suggests several promising extensions, such as contrasting early and later checkpoints of the same LM, or applying the contrastive objective to task-oriented language generation.

In conclusion, contrastive decoding uses existing LMs of different capacities to improve the quality of open-ended text generation. Its ability to produce text that stays closer to the given topic while preserving fluent, natural language represents a meaningful step forward for generative AI.
