Improving language models by retrieving from trillions of tokens

(2112.04426)
Published Dec 8, 2021 in cs.CL and cs.LG

Abstract

We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a $2$ trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains comparable performance to GPT-3 and Jurassic-1 on the Pile, despite using 25$\times$ fewer parameters. After fine-tuning, RETRO performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen Bert retriever, a differentiable encoder and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than what is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.

Overview

  • The paper introduces RETRO (Retrieval-Enhanced Transformer), which augments transformer-based language models with retrieval: relevant document chunks are fetched from a large corpus and integrated during the model's forward pass.

  • Empirical evaluations on multiple datasets, including Wikipedia and PubMed, show gains in LAMBADA accuracy and reductions in perplexity and bits-per-byte, indicating that the retrieved context is used effectively and that evaluation text is compressed more efficiently.

  • Theoretically and practically, RETRO's dynamic context integration suggests a way around some limitations of purely parametric transformers, with promising applications in conversational agents, question-answering systems, and text summarization; future directions include optimizing the retrieval process and applying the approach to multitask learning.

An Overview of RETRO: Enhancing Language Models with Retrieval-Augmented Transformer Blocks

This paper augments large-scale language models by extending the transformer architecture with retrieval, yielding the Retrieval-Enhanced Transformer (RETRO). It examines RETRO's effect on model performance and parameter efficiency, introduces a methodology for integrating retrieval mechanisms within transformer models, and supports its findings with detailed numerical evaluations across multiple datasets.

RETRO Architecture and Methodology

The RETRO model diverges from conventional transformer-based models by incorporating a retrieval mechanism that accesses relevant document chunks during the forward pass. The approach rests on the following components:

  1. Frozen kNN Retriever: A frozen, pre-trained BERT model embeds text chunks, and approximate nearest-neighbour search over these embeddings retrieves the document chunks most similar to the input; the retriever is not fine-tuned while the language model trains.
  2. Chunked Cross-Attention (CCA): This mechanism lets the model attend to the encoded retrieved neighbours on a per-chunk basis, so each chunk of the input draws on the context retrieved for it (a minimal code sketch of this mechanism follows the list).
  3. RETRO Blocks: Integrated within the transformer layers, these blocks combine the input representations with the retrieved context and then pass the result through feed-forward networks. This design lets the model benefit from the size of the retrieval database without a corresponding growth in parameters.
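
The following is a minimal PyTorch sketch of the chunked cross-attention idea. It is an illustrative simplification, not the authors' implementation: it omits the causal shift between chunks, relative positional encodings, and the neighbour encoder, and all class names, variable names, and dimensions here are hypothetical.

```python
# Minimal, simplified sketch of chunked cross-attention (CCA).
# Assumes neighbours have already been retrieved and encoded elsewhere.
import torch
import torch.nn as nn


class ChunkedCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, chunk_len: int):
        super().__init__()
        self.chunk_len = chunk_len
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, neighbours: torch.Tensor) -> torch.Tensor:
        # hidden:     (batch, seq_len, d_model) intermediate decoder states
        # neighbours: (batch, n_chunks, retrieved_len, d_model) encoded
        #             retrieved chunks, one group of neighbours per input chunk
        b, seq_len, d = hidden.shape
        n_chunks = seq_len // self.chunk_len
        usable = n_chunks * self.chunk_len

        # Split the sequence into fixed-size chunks and fold the chunk
        # dimension into the batch so each chunk is processed independently.
        chunks = hidden[:, :usable].reshape(b * n_chunks, self.chunk_len, d)
        memory = neighbours.reshape(b * n_chunks, -1, d)

        # Each chunk attends only to its own retrieved neighbours.
        out, _ = self.attn(query=chunks, key=memory, value=memory)
        out = out.reshape(b, usable, d)

        # Residual connection; any trailing tokens past the last full chunk
        # pass through unchanged.
        return torch.cat([hidden[:, :usable] + out, hidden[:, usable:]], dim=1)


# Toy usage with hypothetical sizes: 4 chunks of 8 tokens,
# 2 retrieved neighbours of length 8 per chunk.
cca = ChunkedCrossAttention(d_model=64, n_heads=4, chunk_len=8)
states = torch.randn(2, 32, 64)
retrieved = torch.randn(2, 4, 2 * 8, 64)
print(cca(states, retrieved).shape)  # torch.Size([2, 32, 64])
```

In the paper, the corresponding operation additionally shifts the attending positions so that each token only uses neighbours retrieved for previously seen content, preserving autoregressive causality.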

Empirical Performance Analysis

The evaluation of RETRO spans multiple datasets, including Wikipedia and OpenWebText, as well as more domain-specific sources such as arXiv and PubMed abstracts.

Key Numerical Results:

  • LAMBADA Accuracy: Consistently high accuracy was observed across different model sizes (172M, 425M, 1.5B, 7.5B parameters), indicating effective context retrieval mechanisms.
  • Perplexity Metrics: Notable improvements were reported on corpora such as Wikitext103:
      • 172M RETRO [ON] vs. baseline: 0.70 vs 0.50
      • 1.5B RETRO [ON] vs. baseline: 0.65 vs 0.60
  • Bits-Per-Byte (bpb) Reduction: Significant bpb reductions were observed on large evaluation sets, reflecting better compression of the evaluation text (a standard conversion between bpb and perplexity is given after this list):
      • On the Wikipedia September 2021 evaluation set, larger models reach bpb values of roughly 0.60 to 0.85 depending on the retrieval configuration.
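
For reference, perplexity and bits-per-byte are two views of the same underlying quantity, the model's average negative log-likelihood; a standard conversion (not specific to this paper) is

$$\text{bpb} = \frac{T}{B}\cdot\frac{\ell}{\ln 2}, \qquad \text{ppl} = e^{\ell},$$

where $\ell$ is the average negative log-likelihood per token in nats, $T$ the number of tokens, and $B$ the number of UTF-8 bytes in the evaluation set. Lower values of either metric mean the evaluation text is modeled, and hence compressed, more effectively.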

Implications and Future Work

Theoretical Implications:

The RETRO architecture demonstrates that retrieval-augmented approaches can mitigate some scaling limitations faced by traditional transformers. The chunked cross-attention mechanism adds dynamic context integration that could pave the way for more adaptive language models.

Practical Implications:

On a practical level, integrating RETRO blocks could improve real-world applications such as conversational agents, question-answering systems, and text summarization tools. This enhancement is particularly relevant for domains requiring access to large, dynamic knowledge bases.

Future Developments:

Future research could investigate optimizing the retrieval mechanisms further, focusing on faster kNN retrieval processes and refining the chunk selection strategies. Additionally, exploring RETRO's application in multitask learning scenarios and its potential in low-resource languages provides promising directions for the continued evolution of language models.

In conclusion, the paper positions RETRO as a formidable enhancement over traditional transformer models by effectively integrating retrieval mechanisms, demonstrating substantial improvements in model performance and parameter efficiency. The exploration into retrieval-augmented architectures such as RETRO holds substantial promise for future advancements in the field of natural language processing.
