
Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders

(2405.03651)
Published May 6, 2024 in cs.IR and cs.LG

Abstract

Cross-encoder (CE) models which compute similarity by jointly encoding a query-item pair perform better than embedding-based models (dual-encoders) at estimating query-item relevance. Existing approaches perform k-NN search with CE by approximating the CE similarity with a vector embedding space fit either with dual-encoders (DE) or CUR matrix factorization. DE-based retrieve-and-rerank approaches suffer from poor recall on new domains and the retrieval with DE is decoupled from the CE. While CUR-based approaches can be more accurate than the DE-based approach, they require a prohibitively large number of CE calls to compute item embeddings, thus making it impractical for deployment at scale. In this paper, we address these shortcomings with our proposed sparse-matrix factorization based method that efficiently computes latent query and item embeddings to approximate CE scores and performs k-NN search with the approximate CE similarity. We compute item embeddings offline by factorizing a sparse matrix containing query-item CE scores for a set of train queries. Our method produces a high-quality approximation while requiring only a fraction of CE calls as compared to CUR-based methods, and allows for leveraging DE to initialize the embedding space while avoiding compute- and resource-intensive finetuning of DE via distillation. At test time, the item embeddings remain fixed and retrieval occurs over rounds, alternating between a) estimating the test query embedding by minimizing error in approximating CE scores of items retrieved thus far, and b) using the updated test query embedding for retrieving more items. Our k-NN search method improves recall by up to 5% (k=1) and 54% (k=100) over DE-based approaches. Additionally, our indexing approach achieves a speedup of up to 100x over CUR-based and 5x over DE distillation methods, while matching or improving k-NN search recall over baselines.

Figure: Comparison of top-k recall and task performance against indexing time for different methods on SciDocs.

Overview

  • The paper explores a method for improving efficiency in AI retrieval systems using sparse matrix factorization to approximate similarity calculations, reducing the resource demands of cross-encoders in large-scale applications.

  • A novel approach is proposed that constructs a sparse matrix of CE scores and uses matrix factorization to compute latent item embeddings, enabling faster retrieval with significantly fewer cross-encoder calls.

  • The method demonstrates up to a 100× speedup over previous methods, improved recall rates in k-nearest neighbor searches, and is effective across different datasets, suggesting broad applicability in various AI retrieval systems.

Deeper Insights: Optimizing Cross-Encoder $k$-NN Search with Sparse Matrix Factorization

Introduction to the Problem

Retrieval systems, particularly those used in AI applications like classification and entity linking, rely on the ability to search a dataset for the items most relevant to a given query. This typically involves cross-encoders (CE), which compute similarity by jointly encoding a query-item pair and directly outputting a similarity score. However, cross-encoders, usually built on transformer models, are computationally and resource intensive because scoring each candidate item requires a separate forward pass. This paper presents a method that leverages sparse matrix factorization to efficiently approximate these similarity calculations, making cross-encoders practical in large-scale applications.
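The cost asymmetry above is easiest to see side by side. The sketch below is a toy illustration (not the paper's models): `cross_encoder_score` is a stand-in for a transformer forward pass over the joint pair, while the dual-encoder scores all items with one precomputable matrix product.

```python
import numpy as np

rng = np.random.default_rng(0)

def dual_encoder_score(q_emb, item_embs):
    # Dual encoder: query and items are embedded independently, so item
    # embeddings can be precomputed; similarity is a cheap dot product.
    return item_embs @ q_emb

def cross_encoder_score(query, item):
    # Cross-encoder: the (query, item) pair is encoded jointly, so every
    # score needs its own forward pass -- a toy stand-in for a transformer.
    joint = np.concatenate([query, item, query * item])
    return float(np.tanh(joint).sum())

q = rng.normal(size=8)
items = rng.normal(size=(100, 8))

de_scores = dual_encoder_score(q, items)                   # one matmul for all items
ce_scores = [cross_encoder_score(q, it) for it in items]   # one "forward pass" per item
```

Scoring a corpus of n items thus costs n cross-encoder forward passes, versus a single matrix-vector product for the dual encoder, which is why exhaustive CE scoring does not scale.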

Existing Challenges with Current Methods

Current approaches fall into two primary categories: dual-encoder (DE) based methods and CUR matrix factorization. DE-based methods, while more efficient, often generalize poorly to new datasets and require resource-intensive fine-tuning. CUR matrix factorization offers an improvement in accuracy but is computationally expensive and too slow for practical use at scale due to the high number of CE calls needed to compute item embeddings.
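The DE-based baseline is the standard retrieve-and-rerank pipeline, sketched below as a toy example (the `ce_score` function is a hypothetical stand-in for an expensive cross-encoder call). Note how stage 1 never consults the CE, which is the decoupling the paper criticizes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, d, m, k = 1000, 16, 50, 10

item_embs = rng.normal(size=(n_items, d))   # precomputed DE item embeddings
q_emb = rng.normal(size=d)                  # DE embedding of the query

def ce_score(q, item):
    # Toy stand-in for an expensive cross-encoder forward pass.
    return float(np.tanh(q @ item))

# Stage 1: cheap DE retrieval of m candidates -- entirely decoupled from the CE.
candidates = np.argsort(item_embs @ q_emb)[::-1][:m]

# Stage 2: expensive CE reranking of only those m candidates (m CE calls).
reranked = sorted(candidates, key=lambda i: ce_score(q_emb, item_embs[i]),
                  reverse=True)
top_k = reranked[:k]
# If the true CE top-k lies outside the DE candidate set, it can never be
# recovered in stage 2 -- the poor-recall failure mode on new domains.
```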

Innovations in the Proposed Method

This research proposes a novel approach using sparse-matrix factorization that overcomes the inefficiencies of previous methods. The process involves:

  • Constructing a Sparse Query-Item Matrix: A sparse matrix of CE scores is constructed using a selected subset of query-item pairs.
  • Factorization for Item Embeddings: The sparse matrix is then factorized to compute latent embeddings for items that approximate the cross-encoder outputs.
  • Efficient Test-Time Retrieval: At test time, the system refines a test query's embedding in multiple rounds to improve approximation of the CE scores, followed by retrieval of items based on these refined embeddings.

This approach, called Axn, significantly limits the number of required CE calls, offering up to a 100× speedup compared to CUR-based methods and improving the practicality of deploying CE-based retrieval at scale.

Core Results and Findings

Empirical evaluation of the proposed method shows strong results:

  • Enhanced Recall Rates: up to 5% improvement in $k$=1 and 54% in $k$=100 nearest-neighbor recall compared to DE-based methods.
  • Significant Speed Improvements: The method achieves up to 100× and 5× speedup over CUR-based and DE distillation-based approaches respectively.
  • Versatility across Application Domains: Effective across different datasets and tasks, including entity linking and information retrieval benchmarks.

Implications and Future Prospects

The approach opens various avenues for future research and practical applications:

  • Broad Applicability: It could be adapted to other AI retrieval systems where CE models are used, improving their efficiency and scalability.
  • Integration with Existing Systems: The method allows for integration with existing DE models, representing a bridge between current systems and more sophisticated retrieval methods.
  • Future Enhancements: Future work could focus on combining these approaches with other optimizations like early-exit strategies for neural models to further speed up the retrieval while maintaining accuracy.

Conclusion

This paper introduces a strategic advancement in the field of AI retrieval systems, addressing both the efficiency and scalability of cross-encoder based nearest neighbor searches. By incorporating sparse matrix factorization, the proposed method significantly reduces the computational overhead, making the practical deployment of high-performing cross-encoder models a feasible option for large-scale applications. While there are still areas for enhancement and deeper integration with other system components, the research presents a substantial step forward in tackling the previously prohibitive resource demands of cross-encoder models in AI applications.
