REALM: Retrieval-Augmented Language Model Pre-Training

(arXiv:2002.08909)
Published Feb 10, 2020 in cs.CL and cs.LG

Abstract

Language model pre-training has been shown to capture a surprising amount of world knowledge, crucial for NLP tasks such as question answering. However, this knowledge is stored implicitly in the parameters of a neural network, requiring ever-larger networks to cover more facts. To capture knowledge in a more modular and interpretable way, we augment language model pre-training with a latent knowledge retriever, which allows the model to retrieve and attend over documents from a large corpus such as Wikipedia, used during pre-training, fine-tuning and inference. For the first time, we show how to pre-train such a knowledge retriever in an unsupervised manner, using masked language modeling as the learning signal and backpropagating through a retrieval step that considers millions of documents. We demonstrate the effectiveness of Retrieval-Augmented Language Model pre-training (REALM) by fine-tuning on the challenging task of Open-domain Question Answering (Open-QA). We compare against state-of-the-art models for both explicit and implicit knowledge storage on three popular Open-QA benchmarks, and find that we outperform all previous methods by a significant margin (4-16% absolute accuracy), while also providing qualitative benefits such as interpretability and modularity.

Overview

  • REALM introduces a novel approach to language model pre-training by augmenting it with a learned textual knowledge retriever, enabling the model to draw on external documents when making predictions.

  • The framework addresses the limitations of traditional language models by offering a scalable and interpretable way to store and recall knowledge in an external document corpus.

  • REALM achieves superior performance on Open-domain Question Answering benchmarks, demonstrating that it can effectively incorporate and exploit external knowledge.

  • The paper discusses the implications of REALM for future research, highlighting its potential for dynamic knowledge bases, exploration in various domains, and unsupervised alignments between learned representations and external knowledge.

Exploration of Retrieval-Augmented Language Model Pre-Training (REALM)

Introduction

The paper presents Retrieval-Augmented Language Model Pre-Training (REALM), a framework that augments language model pre-training with a learned textual knowledge retriever, and introduces an unsupervised method for training that retriever alongside the language model. This contrasts with traditional language models such as BERT, RoBERTa, and T5, which encapsulate knowledge implicitly within their parameters. REALM instead modularizes knowledge storage, making it both interpretable and scalable, by drawing on external documents at prediction time. The framework exhibits superior performance on Open-domain Question Answering (Open-QA) benchmarks, evidencing its capacity to incorporate and exploit external world knowledge effectively.

Background

The motivation behind REALM arises from the limits of storing knowledge implicitly in the network parameters of current language models. As these models are trained on extensive corpora, the amount of knowledge they can encapsulate grows only with network size, which makes the stored information difficult to scale, inspect, and update. The paper therefore argues for a more scalable and explicit mechanism for storing and recalling knowledge.

Approach

REALM decomposes the prediction of an output y given an input x into two distinct steps: retrieval and prediction. A neural knowledge retriever selects relevant documents from a large corpus such as Wikipedia, and a knowledge-augmented encoder then predicts the output from the input together with the retrieved documents. The model is trained by maximizing the marginal likelihood of this generative process, so both the retriever and the encoder are updated through backpropagation. The central challenge, and the paper's key contribution, is backpropagating through a retrieval step that spans millions of documents; this is made tractable by approximating retrieval with Maximum Inner Product Search (MIPS) over a cached, periodically refreshed index of document embeddings.
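
Concretely, the paper treats the retrieved document z as a latent variable and maximizes the marginal likelihood of the output over the knowledge corpus Z. A sketch of the formulation, following the paper's notation:

    p(y \mid x) = \sum_{z \in \mathcal{Z}} p(y \mid z, x) \, p(z \mid x)

    p(z \mid x) = \frac{\exp f(x, z)}{\sum_{z'} \exp f(x, z')}, \qquad f(x, z) = \mathrm{Embed_{input}}(x)^{\top} \mathrm{Embed_{doc}}(z)

Because summing over the full corpus is intractable, the sum is approximated by the top-k documents under f(x, z), which is precisely the Maximum Inner Product Search problem; and because p(z | x) depends on the embedding functions, the masked-language-modeling loss backpropagates into the retriever as well as the encoder.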

Experiments and Results

REALM demonstrates outstanding performance when fine-tuned on Open-QA tasks, surpassing state-of-the-art models on popular benchmarks such as Natural Questions-Open, WebQuestions, and CuratedTrec, with gains of 4 to 16 percentage points in absolute accuracy. These results are a strong indicator of REALM's enhanced capability to incorporate and exploit external knowledge effectively.

Implications and Future Directions

The demonstrated ability of REALM to utilize external documents in language model pre-training suggests several promising directions for future research. The modular knowledge approach opens up the possibility of dynamic knowledge bases that can be updated without retraining the model from scratch, improving the model's adaptability to new information (see the sketch below). Furthermore, the successful integration of retrieval not only at inference time but also during pre-training paves the way for extensions to other settings such as structured knowledge bases, multimedia data, and multilingual corpora.
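
As a minimal illustration of that modularity, the sketch below (hypothetical helper names, not the authors' code, with the document encoder stubbed out) re-embeds an updated document collection with a frozen encoder and rebuilds a brute-force inner-product index, so that new facts become retrievable without touching the model weights:

    import numpy as np

    def fake_embed(text, dim=128):
        # Stand-in for REALM's frozen document encoder (a BERT-style model in
        # the paper); here just a deterministic pseudo-random vector per text.
        rng = np.random.default_rng(abs(hash(text)) % (2**32))
        return rng.normal(size=dim)

    def rebuild_index(documents, embed_doc):
        # Re-embed an (updated) corpus and stack the vectors into a dense
        # matrix that acts as a brute-force inner-product (MIPS) index.
        return np.stack([embed_doc(d) for d in documents])

    def retrieve_top_k(query_vec, index, k=5):
        # Score every document by inner product with the query embedding and
        # return the indices and scores of the k best matches.
        scores = index @ query_vec
        top = np.argsort(-scores)[:k]
        return top, scores[top]

    # Swapping in an updated corpus only requires re-running rebuild_index;
    # the model weights are untouched.
    docs = ["Paris is the capital of France.",
            "REALM retrieves documents during pre-training."]
    index = rebuild_index(docs, fake_embed)
    query_vec = fake_embed("capital of France")  # REALM uses a separate input encoder
    ids, scores = retrieve_top_k(query_vec, index, k=1)
    print(docs[ids[0]], float(scores[0]))

In REALM itself the same idea appears during pre-training: document embeddings are cached and the MIPS index is refreshed asynchronously so that retrieval keeps pace with the evolving encoders.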

Another intriguing aspect is the unsupervised alignment the retriever learns between the pre-training corpus and the knowledge corpus. These alignments offer a new lens through which to analyze and interpret the interactions between learned representations and external knowledge sources.

Summary

In sum, Retrieval-Augmented Language Model Pre-Training (REALM) marks a significant step forward in the unsupervised pre-training of language models. By combining the strengths of neural retrievers with the rich representational capabilities of modern language models, REALM not only pushes the boundaries of what is achievable in Open-QA but also opens new avenues for research in knowledge-intensive applications of AI. By allowing external knowledge to be updated and diversified without retraining, the framework offers a robust approach to the challenges of scalability and adaptability in how neural networks store knowledge.
