
Abstract

Learned sparse representations form an attractive class of contextual embeddings for text retrieval. That is so because they are effective models of relevance and are interpretable by design. Despite their apparent compatibility with inverted indexes, however, retrieval over sparse embeddings remains challenging. That is due to the distributional differences between learned embeddings and term frequency-based lexical models of relevance such as BM25. Recognizing this challenge, a great deal of research has gone into, among other things, designing retrieval algorithms tailored to the properties of learned sparse representations, including approximate retrieval systems. In fact, this task featured prominently in the latest BigANN Challenge at NeurIPS 2023, where approximate algorithms were evaluated on a large benchmark dataset by throughput and recall. In this work, we propose a novel organization of the inverted index that enables fast yet effective approximate retrieval over learned sparse embeddings. Our approach organizes inverted lists into geometrically-cohesive blocks, each equipped with a summary vector. During query processing, we quickly determine if a block must be evaluated using the summaries. As we show experimentally, single-threaded query processing using our method, Seismic, reaches sub-millisecond per-query latency on various sparse embeddings of the MS MARCO dataset while maintaining high recall. Our results indicate that Seismic is one to two orders of magnitude faster than state-of-the-art inverted index-based solutions and further outperforms the winning (graph-based) submissions to the BigANN Challenge by a significant margin.

Seismic's design partitions inverted lists into geometrically cohesive blocks, each equipped with a summary vector, for efficient approximation of query-document similarity.

Overview

  • The paper introduces Seismic, an algorithm that improves search efficiency over learned sparse embeddings by reorganizing the traditional inverted index and combining it with a forward index.

  • Seismic uses static pruning, partitioning of inverted lists into geometrically cohesive blocks, per-block summary vectors, and dual-threshold query processing to speed up retrieval while keeping the index compact and scalable.

  • Experiments on the MS MARCO dataset show that Seismic substantially reduces query latency while maintaining high accuracy, demonstrating its potential to scale to large datasets and advance retrieval technology.

Efficient Inverted Indexes for Learned Sparse Representations

Introduction

The paper presents a new approximate retrieval algorithm named Seismic that improves search efficiency over learned sparse embeddings. The underlying challenge is that traditional inverted index-based retrieval techniques, such as WAND or MaxScore, are ill-suited to learned sparse embeddings, whose weight distributions differ markedly from those of term frequency-based models like BM25. Seismic reorganizes the inverted index and combines it with a forward index, optimizing the search process through strategic blocking and summarization of inverted lists.

Methodology

Seismic introduces a novel framework for indexing and retrieval that operates on geometrically cohesive blocks within an inverted index, each supplemented by a summary vector. The method can be delineated through the following components:

  • Static Pruning and Blocking: The inverted list of each dictionary term is truncated to keep only its highest-weight entries, reducing index size. Each pruned list is then partitioned into blocks via a clustering algorithm, so that documents within a block are geometrically cohesive (see the indexing sketch after this list).
  • Summary Vectors: A summary is built for each block to approximate the maximum inner product a query could achieve with any document in that block. During querying, these summaries are used to quickly determine whether a block may contain candidate documents, thereby accelerating query processing.
  • Forward Index: Alongside the inverted index, a forward index is used to store exact document representations, facilitating precise computation of the inner product when a document needs to be scored.
  • Query Processing: Query processing applies two thresholds and maintains a min-heap of the current top-scoring documents, using the summary vectors to skip blocks that are unlikely to contain candidates (see the query-processing sketch after this list).
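
The following is a minimal Python sketch of how such an index might be built. The function and parameter names (`build_index`, `lambda_cut`, `block_size`) are illustrative rather than the paper's, and fixed-size chunking stands in for the geometric clustering Seismic actually uses to form blocks; the component-wise max summary is one simple way to realize the upper-bound idea described above.

```python
from collections import defaultdict

def build_index(docs, lambda_cut=0.1, block_size=16):
    """Illustrative index construction: per-term pruning, blocking, summaries.

    `docs` maps doc_id -> {term_id: weight}, i.e. a learned sparse embedding.
    `lambda_cut` (fraction of postings kept per term) and `block_size` are
    made-up knobs standing in for the paper's hyperparameters.
    """
    # Build raw inverted lists: term -> [(doc_id, weight), ...].
    inverted = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, weight in vec.items():
            inverted[term].append((doc_id, weight))

    index = {}
    for term, postings in inverted.items():
        # Static pruning: keep only the highest-weight postings of this term.
        postings.sort(key=lambda p: p[1], reverse=True)
        postings = postings[: max(1, int(len(postings) * lambda_cut))]

        # Blocking: split the pruned list into blocks. Seismic clusters
        # geometrically; fixed-size chunks are used here for brevity.
        blocks = []
        for start in range(0, len(postings), block_size):
            chunk = postings[start : start + block_size]
            # Summary vector: component-wise max over the block's documents,
            # an optimistic bound on any document's inner product with a query.
            summary = defaultdict(float)
            for doc_id, _ in chunk:
                for t, w in docs[doc_id].items():
                    summary[t] = max(summary[t], w)
            blocks.append({"docs": [d for d, _ in chunk], "summary": dict(summary)})
        index[term] = blocks
    return index
```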

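A matching sketch of query processing, again with hypothetical names (`search`, `q_cut`, `heap_factor`): the heaviest query terms select which inverted lists to traverse, block summaries are used to skip blocks that cannot improve the current top-k, and surviving documents are scored exactly against the forward index. The exact skip rule and default values are assumptions for illustration, not the paper's settings.

```python
import heapq

def dot(query, vec):
    """Sparse inner product between two {term_id: weight} dictionaries."""
    return sum(w * vec.get(t, 0.0) for t, w in query.items())

def search(query, index, forward, k=10, q_cut=5, heap_factor=0.9):
    """Illustrative query processing with summary screening and a top-k heap.

    `forward` maps doc_id -> sparse vector (the forward index). `q_cut`
    (how many query terms to traverse) and `heap_factor` (how aggressively
    blocks are skipped) are made-up knobs.
    """
    # Threshold 1: traverse only the inverted lists of the heaviest query terms.
    top_terms = sorted(query.items(), key=lambda p: p[1], reverse=True)[:q_cut]

    heap, seen = [], set()  # min-heap of (score, doc_id) for the current top-k
    for term, _ in top_terms:
        for block in index.get(term, []):
            # Threshold 2: skip the block if its optimistic summary score
            # cannot (approximately) beat the current k-th best score.
            bound = dot(query, block["summary"])
            if len(heap) == k and heap_factor * bound <= heap[0][0]:
                continue
            # Otherwise score every document in the block exactly,
            # using its full representation from the forward index.
            for doc_id in block["docs"]:
                if doc_id in seen:
                    continue
                seen.add(doc_id)
                score = dot(query, forward[doc_id])
                if len(heap) < k:
                    heapq.heappush(heap, (score, doc_id))
                elif score > heap[0][0]:
                    heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```

With `heap_factor = 1` the skip test is a true upper-bound check; values below 1 skip more blocks and trade a little recall for speed, which mirrors the approximate nature of the method.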
Experimental Results

Seismic is evaluated against several strong baselines on the MS MARCO dataset with various learned sparse embeddings, including SPLADE and Efficient SPLADE (E-SPLADE). The results are promising:

  • Latency and Accuracy: Seismic offers substantial improvements in query latency, reaching sub-millisecond levels while maintaining competitive retrieval accuracy. Compared with other state-of-the-art solutions, it achieves latency reductions of an order of magnitude or more, depending on the embedding and configuration.
  • Scalability: With respect to index size and build time, Seismic is efficient, producing compact indexes that can be built quickly and therefore scale to large datasets.

Theoretical Implications and Practical Applications

The development of Seismic contributes significantly both theoretically and practically. Theoretically, it challenges existing assumptions about the structures required for efficient inverted index-based retrieval by introducing a novel block-summary paradigm. Practically, it opens up new possibilities for implementing efficient and scalable information retrieval systems capable of handling modern sparse embeddings, which are increasingly prevalent due to their effectiveness and interpretability.

Future Directions

Potential future work includes exploring additional compression techniques for summaries and inverted lists to further enhance efficiency. Another interesting avenue could be the adaptation of Seismic's methodology to other forms of vector embeddings or different domains requiring efficient retrieval mechanisms.

Overall, Seismic represents a significant advancement in the field of information retrieval, particularly in the context of searching over learned sparse representations, and sets the stage for further innovations in this area.
