
How Does Generative Retrieval Scale to Millions of Passages?

(arXiv:2305.11841)
Published May 19, 2023 in cs.IR and cs.CL

Abstract

Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer. Although many different approaches have been proposed to improve the effectiveness of generative retrieval, they have only been evaluated on document corpora on the order of 100k in size. We conduct the first empirical study of generative retrieval techniques across various corpus scales, ultimately scaling up to the entire MS MARCO passage ranking task with a corpus of 8.8M passages and evaluating model sizes up to 11B parameters. We uncover several findings about scaling generative retrieval to millions of passages; notably, the central importance of using synthetic queries as document representations during indexing, the ineffectiveness of existing proposed architecture modifications when accounting for compute cost, and the limits of naively scaling model parameters with respect to retrieval performance. While we find that generative retrieval is competitive with state-of-the-art dual encoders on small corpora, scaling to millions of passages remains an important and unsolved challenge. We believe these findings will be valuable for the community to clarify the current state of generative retrieval, highlight the unique challenges, and inspire new research directions.

Overview

  • The study is the first systematic empirical evaluation of generative retrieval across corpus sizes, including a large-scale assessment on the full 8.8M-passage MS MARCO corpus.

  • Findings emphasize the significance of synthetic queries as document representations, with their importance growing alongside corpus size, and question the benefit of increasing model parameters beyond a certain scale.

  • When compute cost is accounted for, naively scaling model parameters outperforms proposed architectural modifications such as atomic docids and the PAWA decoder.

  • The research highlights the central role of synthetic queries and points to optimizing the computational trade-offs of model scaling as a direction for future work.

Empirical Evaluation of Generative Retrieval Techniques at Scale

Introduction

In the ongoing evolution of information retrieval systems, generative retrieval models have emerged as a promising alternative to traditional dual encoders. These models forgo a conventional external index, instead encoding the corpus within the model's parameters and directly generating document identifiers (docids) for a given query. This study, conducted by researchers affiliated with Google Research and the University of Waterloo, represents the first systematic empirical evaluation of generative retrieval across corpus scales, culminating in an assessment on the full MS MARCO passage corpus of 8.8M passages with model sizes up to 11 billion parameters.
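To make the paradigm concrete, below is a minimal sketch of generative retrieval at inference time: a seq2seq model maps query text to a docid string, with beam search constrained by a token trie so the model can only emit identifiers that exist in the corpus. The checkpoint, toy corpus, and trie helper are illustrative assumptions, not the paper's released code; in the paper's setup the model is first fine-tuned on (document representation, docid) pairs, including synthetic queries.

```python
# Minimal sketch of generative retrieval inference (assumptions: a T5
# checkpoint, "naive" docids rendered as digit strings, and a prefix
# trie constraining decoding to valid identifiers).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Toy corpus: docid -> passage text (hypothetical, for illustration).
corpus = {"0": "a passage about cats", "1": "a passage about dogs"}

# Build a trie of token ids over all valid docid strings.
trie = {}
for docid in corpus:
    node = trie
    for tok in tokenizer(docid, add_special_tokens=True).input_ids:
        node = node.setdefault(tok, {})

def allowed_tokens(batch_id, generated):
    # Walk the trie along the tokens generated so far (skipping the
    # decoder start token) and offer only its children as next tokens.
    node = trie
    for tok in generated.tolist()[1:]:
        node = node.get(tok, {})
    return list(node.keys()) or [tokenizer.eos_token_id]

# Retrieval = constrained beam search from query text to docid strings.
inputs = tokenizer("which pet barks?", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=8,
    num_beams=2,
    num_return_sequences=2,
    prefix_allowed_tokens_fn=allowed_tokens,
)
for seq in out:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```

A fine-tuned model would rank docids by beam score; here the untrained checkpoint simply illustrates the constrained-decoding mechanics.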

Findings on Synthetic Queries and Model Scaling

The research underscores the pivotal role of synthetic queries as document representations, particularly as corpus size increases. Unlike the other architecture modifications studied, synthetic queries consistently improve retrieval performance. The research also finds limited benefit from increasing parameter counts beyond a certain scale, challenging the notion that generative retrieval's effectiveness scales straightforwardly with model size.

Synthetic Queries as Central to Success

A central finding of the study is the importance of synthetic queries for retrieval effectiveness, especially as corpus size grows. Among the strategies explored, only synthetic query generation, used to approximate at indexing time the queries a document should answer, remained effective and necessary for performance as the corpus expanded. The study also indicates that the gains from synthetic queries exceed those from the architectural modifications it evaluates.
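As an illustration of this indexing strategy, the sketch below generates synthetic queries for a passage using a publicly available doc2query-T5 checkpoint and pairs them with the passage's docid as training examples. The specific model, sampling settings, and number of queries per passage are assumptions for illustration; the paper used its own query-generation setup.

```python
# Sketch of "synthetic queries as document representations": generate
# questions a passage could answer, then pair each question with the
# passage's docid as an indexing-time training example.
from transformers import T5ForConditionalGeneration, T5Tokenizer

name = "castorini/doc2query-t5-base-msmarco"  # one available option
tokenizer = T5Tokenizer.from_pretrained(name)
model = T5ForConditionalGeneration.from_pretrained(name)

docid = "84213"  # hypothetical naive docid
passage = ("The Manhattan Project was a research and development "
           "undertaking during World War II that produced the first "
           "nuclear weapons.")

inputs = tokenizer(passage, return_tensors="pt", truncation=True)
# Sample several diverse queries per passage; the paper found more
# synthetic-query coverage increasingly important at larger corpus sizes.
outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,
    top_k=10,
    num_return_sequences=5,
)
for o in outputs:
    query = tokenizer.decode(o, skip_special_tokens=True)
    print(query, "->", docid)  # each (query, docid) pair is a training example
```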

Compute Cost and the Naive Scaling Advantage

A notable outcome of this investigation is the compute efficiency of naively scaling model parameters relative to more sophisticated strategies like atomic identifiers or PAWA decoder enhancements. When computational cost is taken into account, simply scaling the base model is the stronger way to improve retrieval performance, provided the added compute is acceptable. This is most pronounced in the experiments on the full MS MARCO corpus, where the straightforward approach of scaling to T5-XL and training on synthetic queries with Naive IDs outperformed more complex configurations.
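A rough back-of-the-envelope sketch of why this compute comparison matters: Naive IDs reuse the model's existing vocabulary, while atomic IDs allocate one new vocabulary entry (and embedding row) per document, which dominates the parameter count at MS MARCO scale. The token split shown and the parameter arithmetic below are illustrative, not figures from the paper.

```python
# Schematic contrast between the two docid schemes discussed above.
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")

# Naive ID: the docid is just a digit string, tokenized as ordinary
# text. No new parameters; cost is a few decoding steps per document.
print(tokenizer.tokenize("84213"))  # e.g. a handful of subword pieces

# Atomic ID: one dedicated vocabulary entry per document, so decoding
# is a single step, but the output softmax/embedding matrices grow by
# one row per document. For 8.8M passages at T5-Base width (768):
num_docs = 8_800_000
d_model = 768
extra_params = num_docs * d_model
print(f"atomic IDs add ~{extra_params / 1e9:.1f}B embedding parameters")
# ~6.8B added parameters, dwarfing T5-Base's ~220M backbone, which is
# why per-compute comparisons favor naive scaling of the base model.
```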

Practical Implications and Future Research Directions

The insights from this investigation carry practical implications for the development and application of generative retrieval models. Firstly, the critical role of synthetic queries in retrieval performance underscores the need for strong query-generation models, especially as the technique is applied to larger and more complex corpora.

Secondly, the nuanced understanding of computational trade-offs in model scaling provides valuable guidance for future research endeavors. It suggests that while increasing model parameters can yield performance gains, there exists a point of diminishing returns. Consequently, future research might focus on optimizing parameter efficiency and exploring alternative scaling strategies that maximally leverage computational resources.

Concluding Remarks

This empirical study advances our understanding of how generative retrieval behaves across corpus scales. By methodically evaluating the impact of synthetic queries and model scaling strategies, it delineates a path toward more effective generative retrieval systems. As the field evolves, these findings should inform the development of more efficient, scalable, and accurate information retrieval systems.
