Emergent Mind

Learning to Tokenize for Generative Retrieval

(2304.04171)
Published Apr 9, 2023 in cs.IR

Abstract

Conventional document retrieval techniques are mainly based on the index-retrieve paradigm. Pipelines built on this paradigm are difficult to optimize in an end-to-end manner. As an alternative, generative retrieval represents documents as identifiers (docids) and retrieves documents by generating docids, enabling end-to-end modeling of document retrieval tasks. However, how one should define document identifiers remains an open question. Current approaches rely on fixed rule-based docids, such as the title of a document or the result of clustering BERT embeddings, which often fail to capture the complete semantic information of a document. We propose GenRet, a document tokenization learning method that addresses the challenge of defining document identifiers for generative retrieval. GenRet learns to tokenize documents into short discrete representations (i.e., docids) via a discrete auto-encoding approach. GenRet has three components: (i) a tokenization model that produces docids for documents; (ii) a reconstruction model that learns to reconstruct a document based on a docid; and (iii) a sequence-to-sequence retrieval model that generates relevant document identifiers directly for a given query. By using an auto-encoding framework, GenRet learns semantic docids in a fully end-to-end manner. We also develop a progressive training scheme to capture the autoregressive nature of docids and to stabilize training. We conduct experiments on the NQ320K, MS MARCO, and BEIR datasets to assess the effectiveness of GenRet. GenRet establishes a new state of the art on the NQ320K dataset. In particular, compared to generative retrieval baselines, GenRet achieves significant improvements on unseen documents. GenRet also outperforms comparable baselines on MS MARCO and BEIR, demonstrating the method's generalizability.

Overview

  • Introduces GenRet, a novel approach for document retrieval that uses discrete auto-encoding to generate semantically meaningful document identifiers (docids) for efficient retrieval.

  • Details a comprehensive training scheme for GenRet, including a progressive training methodology and diverse clustering techniques for docid generation and re-assignment.

  • Demonstrates GenRet's superior performance over state-of-the-art models on benchmark datasets like NQ320K, showing notable improvements in retrieving unseen documents.

  • Highlights the theoretical and practical implications of GenRet, including the resolution of lexical mismatch problems and the potential for scalable document retrieval systems.

Learning to Tokenize for Generative Retrieval: A Novel Approach for Document Identification

Introduction to Generative Retrieval and the Problem Space

The landscape of document retrieval has shifted significantly with the advent of pre-trained language models (LMs), moving from the traditional index-retrieve paradigm to approaches such as dense retrieval (DR) models. These models leverage LMs to learn dense representations of queries and documents, substantially alleviating the lexical mismatch problem. However, DR models have their own limitations, chiefly their index-retrieve pipeline and the misalignment between their learning objectives and the pre-training objectives of LMs.
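The DR scoring step described above can be sketched with a toy example. Real systems encode text with a pre-trained LM and search an approximate-nearest-neighbor index; here, hand-made vectors stand in for learned embeddings, and all names are illustrative:

```python
# Toy sketch of dense-retrieval scoring: rank documents by inner-product
# similarity between a query embedding and document embeddings.
# The vectors below are hand-made stand-ins for LM-encoded representations.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_vec, doc_vecs, k=2):
    """Return the k docids whose embeddings score highest against the query."""
    scored = sorted(doc_vecs.items(), key=lambda kv: -dot(query_vec, kv[1]))
    return [doc_id for doc_id, _ in scored[:k]]

doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.8, 0.3],
    "d3": [0.2, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]
print(retrieve(query, doc_vecs, k=2))  # -> ['d1', 'd3']
```

Note that retrieval here is a similarity search over a pre-built index of vectors, which is exactly the index-retrieve pipeline that generative retrieval seeks to replace.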

A new paradigm, generative retrieval, has emerged as an alternative: it characterizes each document with an identifier (docid) and retrieves documents by generating their docids end-to-end. This is a promising avenue for better leveraging large LMs, but it introduces the challenge of defining document identifiers that accurately capture document semantics.

Overview of GenRet

To tackle the nuances of generating semantically meaningful docids, the paper introduces GenRet, a novel document tokenization learning method optimized for generative retrieval tasks. GenRet adopts a discrete auto-encoding framework, coupled with a sequence-to-sequence retrieval model, to tokenize documents into concise, discrete representations. This approach includes several key components:

  • A tokenization model that generates docids for documents.
  • A reconstruction model that leverages these docids to reconstruct the original documents, ensuring the semantic integrity of the identified docids.
  • An end-to-end optimized generative model that accurately retrieves documents for a given query by autoregressively generating relevant docids.
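The core of the tokenization component is quantizing a document representation into a short sequence of discrete codes. The sketch below shows that idea with a fixed toy codebook per step and nearest-neighbor assignment; GenRet learns its codebooks end-to-end and conditions each step on previously generated tokens, both of which this simplification omits:

```python
# Sketch of discrete tokenization: map a document embedding to a short
# discrete docid by picking the nearest codebook entry at each step.
# The codebooks are fixed and toy-sized here; GenRet learns them end-to-end.

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))

def tokenize(doc_vec, codebooks):
    """Produce a docid: one discrete token per step/codebook."""
    return [nearest_code(doc_vec, cb) for cb in codebooks]

codebooks = [  # two decoding steps, three candidate codes each (illustrative)
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.9, 0.1], [0.1, 0.9]],
]
doc_vec = [0.95, 0.05]
print(tokenize(doc_vec, codebooks))  # -> [1, 1]
```

The reconstruction model then provides the training signal that forces these discrete codes to retain the document's semantics, since a docid that loses information cannot reconstruct its document.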

Methodology and Implementation

The efficacy of GenRet is attributed to its comprehensive training scheme, which uses a progressive training methodology to capture the autoregressive nature of docid generation. Training combines three losses: a reconstruction loss that ensures docids capture document semantics, a commitment loss that prevents the model from forgetting previously learned assignments, and a retrieval loss that directly optimizes retrieval performance. Additionally, GenRet addresses the challenge of docid diversity through a parameter initialization strategy and a novel docid re-assignment procedure based on diverse clustering techniques.
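The additive structure of this objective can be sketched as follows. The weights and loss values below are illustrative placeholders, not the paper's actual hyperparameters:

```python
# Hedged sketch of combining the three training signals into one objective.
# GenRet's exact formulations and weights differ; this only illustrates the
# weighted-sum structure described above (all numbers are illustrative).

def total_loss(rec_loss, commit_loss, ret_loss,
               w_rec=1.0, w_commit=0.25, w_ret=1.0):
    """Weighted sum of reconstruction, commitment, and retrieval losses."""
    return w_rec * rec_loss + w_commit * commit_loss + w_ret * ret_loss

# 1.0*0.8 + 0.25*0.4 + 1.0*1.2 = 2.1
print(total_loss(rec_loss=0.8, commit_loss=0.4, ret_loss=1.2))  # -> 2.1
```

Under the progressive scheme, one docid position is learned at a time, with earlier positions held stable by the commitment term while later positions are being trained.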

Experimental Results and Implications

GenRet was evaluated against state-of-the-art models on several benchmark datasets, including NQ320K, MS MARCO, and BEIR. It establishes a new state of the art on NQ320K and delivers significant improvements, especially in retrieving unseen documents. GenRet's ability to outperform previous methods in this generalization setting reflects its robustness and versatility across retrieval tasks.

Theoretical and Practical Contributions

This work makes several notable contributions to document retrieval. GenRet's discrete auto-encoding framework is a pioneering approach to learning semantic docids, a significant step toward resolving the lexical mismatch problem inherent in traditional retrieval methods. The proposed progressive training scheme and diverse clustering techniques further enhance the model's capability to produce and exploit semantically rich docids. From a practical standpoint, GenRet's design offers a scalable path toward effective and efficient document retrieval systems.

Looking Ahead

Despite the demonstrable advancements introduced by GenRet, the exploration of document tokenization for generative retrieval is in its nascent stages. Future research directions could include expanding the model's scalability to accommodate larger document collections and further refining the tokenization learning process. Additionally, integrating generative pre-training within document tokenization presents a promising avenue for enhancing the semantic understanding of LMs.

In conclusion, GenRet marks a significant step forward in the quest for optimizing document retrieval tasks. Its innovative approach to learning document identifiers opens up new possibilities for leveraging generative models in information retrieval, setting the stage for future advancements in this exciting field.
