
Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders

(2405.03651)
Published May 6, 2024 in cs.IR and cs.LG

Abstract

Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than embedding-based models (dual-encoders) at estimating query-item relevance. Existing approaches perform k-NN search with CE by approximating the CE similarity with a vector embedding space fit either with dual-encoders (DE) or CUR matrix factorization. DE-based retrieve-and-rerank approaches suffer from poor recall on new domains and the retrieval with DE is decoupled from the CE. While CUR-based approaches can be more accurate than the DE-based approach, they require a prohibitively large number of CE calls to compute item embeddings, thus making it impractical for deployment at scale. In this paper, we address these shortcomings with our proposed sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity. We compute item embeddings offline by factorizing a sparse matrix containing query-item CE scores for a set of train queries. Our method produces a high-quality approximation while requiring only a fraction of CE calls as compared to CUR-based methods, and allows for leveraging DE to initialize the embedding space while avoiding compute- and resource-intensive finetuning of DE via distillation. At test time, the item embeddings remain fixed and retrieval occurs over rounds, alternating between a) estimating the test query embedding by minimizing error in approximating CE scores of items retrieved thus far, and b) using the updated test query embedding for retrieving more items. Our k-NN search method improves recall by up to 5% (k=1) and 54% (k=100) over DE-based approaches. Additionally, our indexing approach achieves a speedup of up to 100x over CUR-based and 5x over DE distillation methods, while matching or improving k-NN search recall over baselines.

Figure: Comparison of top-k recall and task performance against indexing time for different methods on SciDocs.

Overview

  • The paper explores a method for improving efficiency in AI retrieval systems using sparse matrix factorization to approximate similarity calculations, reducing the resource demands of cross-encoders in large-scale applications.

  • A novel approach is proposed that constructs a sparse matrix of CE scores and uses matrix factorization to compute latent item embeddings, enabling faster retrieval with significantly fewer cross-encoder calls.

  • The method demonstrates up to a 100× speedup over previous methods, improved recall rates in k-nearest neighbor searches, and is effective across different datasets, suggesting broad applicability in various AI retrieval systems.

Deeper Insights: Optimizing Cross-Encoder $k$-NN Search with Sparse Matrix Factorization

Introduction to the Problem

Retrieval systems, particularly those used in AI applications like classification and entity linking, rely on the ability to search a dataset for the items most relevant to a given query. This typically involves cross-encoders (CE), which compute similarity by jointly encoding a query-item pair and directly outputting a similarity score. However, cross-encoders, usually built on transformer models, are computationally and resource intensive because scoring each candidate item requires a separate forward pass. This paper presents a method that leverages sparse matrix factorization to efficiently approximate these similarity calculations, making cross-encoders practical in large-scale applications.
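The cost asymmetry above is easiest to see side by side. The sketch below is a toy illustration (not the paper's models): `cross_encoder_score` is a stand-in for a transformer forward pass over the joint pair, while the dual-encoder scores all items with one precomputable matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

def dual_encoder_score(q_emb, item_embs):
    # Dual encoder: query and items are embedded independently, so item
    # embeddings can be precomputed; similarity is a cheap dot product.
    return item_embs @ q_emb

def cross_encoder_score(query, item):
    # Cross-encoder: the (query, item) pair is encoded jointly, so every
    # score needs its own forward pass -- a toy stand-in for a transformer.
    joint = np.concatenate([query, item, query * item])
    return float(np.tanh(joint).sum())

q = rng.normal(size=8)
items = rng.normal(size=(100, 8))

de_scores = dual_encoder_score(q, items)                   # one matmul for all items
ce_scores = [cross_encoder_score(q, it) for it in items]   # one "forward pass" per item
```

Scoring a corpus of n items thus costs n cross-encoder forward passes, versus a single matrix-vector product for the dual encoder, which is why exhaustive CE scoring does not scale.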

Existing Challenges with Current Methods

Current approaches fall into two primary categories: dual-encoder (DE) based methods and CUR matrix factorization. DE-based methods, while more efficient, often generalize poorly to new datasets and require resource-intensive fine-tuning. CUR matrix factorization offers an improvement in accuracy but is computationally expensive and too slow for practical use at scale due to the high number of CE calls needed to compute item embeddings.
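The DE-based baseline is the standard retrieve-and-rerank pipeline, sketched below as a toy example (the `ce_score` function is a hypothetical stand-in for an expensive cross-encoder call). Note how stage 1 never consults the CE, which is the decoupling the paper criticizes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, d, m, k = 1000, 16, 50, 10

item_embs = rng.normal(size=(n_items, d))   # precomputed DE item embeddings
q_emb = rng.normal(size=d)                  # DE embedding of the query

def ce_score(q, item):
    # Toy stand-in for an expensive cross-encoder forward pass.
    return float(np.tanh(q @ item))

# Stage 1: cheap DE retrieval of m candidates -- entirely decoupled from the CE.
candidates = np.argsort(item_embs @ q_emb)[::-1][:m]

# Stage 2: expensive CE reranking of only those m candidates (m CE calls).
reranked = sorted(candidates, key=lambda i: ce_score(q_emb, item_embs[i]),
                  reverse=True)
top_k = reranked[:k]
# If the true CE top-k lies outside the DE candidate set, it can never be
# recovered in stage 2 -- the poor-recall failure mode on new domains.
```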

Innovations in the Proposed Method

This research proposes a novel approach using sparse-matrix factorization that overcomes the inefficiencies of previous methods. The process involves:

  • Constructing a Sparse Query-Item Matrix: A sparse matrix of CE scores is constructed using a selected subset of query-item pairs.
  • Factorization for Item Embeddings: The sparse matrix is then factorized to compute latent embeddings for items that approximate the cross-encoder outputs.
  • Efficient Test-Time Retrieval: At test time, the system refines a test query's embedding in multiple rounds to improve approximation of the CE scores, followed by retrieval of items based on these refined embeddings.

This approach, called Axn, significantly limits the number of required CE calls, offering up to a 100× speedup compared to CUR-based methods and improving the practicality of deploying CE-based retrieval at scale.

Core Results and Findings

Empirical evaluation of the proposed method shows strong results:

  • Enhanced Recall Rates: up to 5% improvement in $k$=1 and 54% in $k$=100 nearest-neighbor recall compared to DE-based methods.
  • Significant Speed Improvements: The method achieves up to 100× and 5× speedup over CUR-based and DE distillation-based approaches respectively.
  • Versatility across Application Domains: Effective across different datasets and tasks, including entity linking and information retrieval benchmarks.

Implications and Future Prospects

The approach opens various avenues for future research and practical applications:

  • Broad Applicability: It could be adapted to other AI retrieval systems where CE models are used, improving their efficiency and scalability.
  • Integration with Existing Systems: The method allows for integration with existing DE models, representing a bridge between current systems and more sophisticated retrieval methods.
  • Future Enhancements: Future work could focus on combining these approaches with other optimizations like early-exit strategies for neural models to further speed up the retrieval while maintaining accuracy.

Conclusion

This paper introduces a strategic advancement in the field of AI retrieval systems, addressing both the efficiency and scalability of cross-encoder based nearest neighbor searches. By incorporating sparse matrix factorization, the proposed method significantly reduces the computational overhead, making the practical deployment of high-performing cross-encoder models a feasible option for large-scale applications. While there are still areas for enhancement and deeper integration with other system components, the research presents a substantial step forward in tackling the previously prohibitive resource demands of cross-encoder models in AI applications.
