
Shallow Cross-Encoders for Low-Latency Retrieval

(arXiv:2403.20222)
Published Mar 29, 2024 in cs.IR and cs.CL

Abstract

Transformer-based Cross-Encoders achieve state-of-the-art effectiveness in text retrieval. However, Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window. At the same time, keeping search latencies low is important for user satisfaction and energy usage. In this paper, we show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings, since they can estimate the relevance of more documents in the same time budget. We further show that shallow transformers may benefit from the generalized Binary Cross-Entropy (gBCE) training scheme, which has recently demonstrated success for recommendation tasks. Our experiments with TREC Deep Learning passage ranking query sets demonstrate significant improvements of shallow models over full-scale models in low-latency scenarios. For example, when the latency limit is 25ms per query, MonoBERT-Large (a cross-encoder based on a full-scale BERT model) is only able to achieve NDCG@10 of 0.431 on TREC DL 2019, while TinyBERT-gBCE (a cross-encoder based on TinyBERT trained with gBCE) reaches NDCG@10 of 0.652, a +51% gain over MonoBERT-Large. We also show that shallow Cross-Encoders are effective even when used without a GPU (e.g., with CPU inference, NDCG@10 decreases only by 3% compared to GPU inference with 50ms latency), which makes Cross-Encoders practical to run even without specialized hardware acceleration.

Figure: Tradeoff between latency and NDCG on the TREC DL 2020 query set.

Overview

  • This paper examines shallow Cross-Encoders, i.e., Cross-Encoders built on transformers with only a few layers, for document retrieval under low-latency constraints, and argues that they can outperform full-scale models when the time budget is tight.

  • The paper applies the generalized Binary Cross-Entropy (gBCE) training scheme, adapted from recommender systems, to shallow Cross-Encoders. The scheme pairs a larger number of negative samples with a recalibrated loss function to make training more effective.

  • Experimental results show that models like TinyBERT, when trained with the gBCE scheme, achieve large gains under tight latency constraints, for example a +51% NDCG@10 improvement over MonoBERT-Large on TREC DL 2019 within a 25ms-per-query latency cap.

  • The study underscores the implications of deploying shallow Cross-Encoders in resource-constrained environments or where rapid response is crucial, and suggests pathways for future research to further optimize these models for practical search applications.

Shallow Cross-Encoders for Low-Latency Retrieval in Document Ranking Tasks

Introduction to the Tradeoffs Between Model Size and Latency in Cross-Encoders

Recent advances in deep learning have propelled Transformer-based models, like BERT and T5, to the forefront of text retrieval tasks due to their remarkable effectiveness. Among these, Cross-Encoders, which process query-document pairs jointly, excel in document re-ranking tasks. However, their computational demands cause significant latency, which hurts user satisfaction and works against sustainability goals through increased energy consumption. Existing work has explored more efficient architectures such as Bi-Encoders, but these approaches bring their own complications, such as complex training pipelines and limited generalization across tasks and domains.

This paper presents an in-depth exploration of shallow Cross-Encoders, transformer rankers with only a few layers, positing that they can outperform full-scale models in scenarios constrained by low-latency requirements. The underlying premise is that shallow models, by virtue of their reduced complexity, can evaluate a greater number of candidate documents within the same time frame, potentially yielding better overall effectiveness even though each individual query-document score is less precise.
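To make this tradeoff concrete, the sketch below shows one way a re-ranker could stop scoring candidates once a per-query time budget is exhausted. It is not the authors' implementation: the sentence-transformers CrossEncoder API and the public cross-encoder/ms-marco-TinyBERT-L-2-v2 checkpoint are stand-ins for an arbitrary shallow Cross-Encoder.

```python
# Minimal sketch of latency-budgeted re-ranking with a shallow Cross-Encoder.
# Assumptions: the sentence-transformers CrossEncoder API and a public
# TinyBERT-sized checkpoint; this is not the paper's implementation.
import time
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2", max_length=256)

def rerank_within_budget(query, candidates, budget_ms=25.0, batch_size=32):
    """Score as many first-stage candidates as the latency budget allows,
    then return scored documents by descending score, with any unscored
    candidates appended in their original order."""
    scores = {}
    start = time.perf_counter()
    for i in range(0, len(candidates), batch_size):
        if (time.perf_counter() - start) * 1000.0 >= budget_ms:
            break  # budget exhausted: leave remaining candidates unscored
        batch = candidates[i:i + batch_size]
        batch_scores = model.predict([(query, doc) for doc in batch])
        scores.update(zip(batch, batch_scores))
    reranked = sorted(scores, key=scores.get, reverse=True)
    return reranked + [doc for doc in candidates if doc not in scores]
```

A shallower model completes more batches before the budget runs out, which is exactly the effect the paper exploits: more candidates re-scored per query at a fixed latency.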

A Novel Training Approach: gBCE Scheme

One of the primary challenges in leveraging shallow Cross-Encoders lies in training them effectively without resorting to complex strategies such as Knowledge Distillation. The paper instead applies the generalized Binary Cross-Entropy (gBCE) training scheme, which has demonstrated success in recommender systems, as a straightforward and reproducible method for shallow Cross-Encoders. The approach combines an increased number of negative samples with a modified loss function that mitigates the overconfidence commonly observed in models trained with negative sampling.

The gBCE scheme is intended to counteract the overconfidence introduced by training with sampled negatives, thereby offsetting some of the capacity lost by using fewer layers. By tempering the model's confidence through a calibration parameter and enriching the training regimen with a larger, diverse set of negative examples, the scheme makes training shallow models more effective. The experimental results support this design, showing significant performance improvements under latency constraints.
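As a rough illustration, the following sketch implements a gBCE-style loss following its formulation in the gSASRec recommendation work, where the positive probability is raised to a power β derived from the negative sampling rate α and a calibration parameter t. The tensor shapes and the default value of t are illustrative assumptions, not the paper's exact training configuration.

```python
# Sketch of a gBCE-style loss, following the generalised Binary Cross-Entropy
# formulation from the gSASRec recommendation paper; shapes and the default
# calibration parameter t are illustrative, not the paper's exact settings.
import torch
import torch.nn.functional as F

def gbce_loss(pos_logits, neg_logits, alpha, t=0.75):
    """pos_logits: (batch,) scores of positive (relevant) documents.
    neg_logits: (batch, k) scores of k sampled negative documents.
    alpha:      negative sampling rate (sampled negatives / all negatives).
    t:          calibration parameter in [0, 1]; t = 0 recovers plain BCE."""
    beta = alpha * (t * (1.0 - 1.0 / alpha) + 1.0 / alpha)
    # Positive term: raising the positive probability to the power beta
    # (i.e. scaling its log) counteracts the overconfidence that negative
    # sampling induces in the predicted relevance probabilities.
    pos_prob = torch.sigmoid(pos_logits).clamp(min=1e-7)
    pos_term = -beta * torch.log(pos_prob)
    # Negative term: ordinary BCE against a zero label for each sampled negative.
    neg_term = F.binary_cross_entropy_with_logits(
        neg_logits, torch.zeros_like(neg_logits), reduction="none"
    ).sum(dim=-1)
    return (pos_term + neg_term).mean()
```

With t = 0, β = 1 and the loss reduces to standard Binary Cross-Entropy; with fewer sampled negatives (smaller α) and larger t, β shrinks and the positive term is down-weighted, compensating for the negatives that were never sampled.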

Exploring Efficiency and Effectiveness Tradeoffs

Through rigorous experimentation, the paper validates the hypothesis that shallow Cross-Encoders offer superior efficiency/effectiveness tradeoffs in low-latency settings. Specifically, models like TinyBERT, trained with the gBCE scheme, reach or even surpass the performance of much larger models, such as MonoBERT-Large, under stringent latency restrictions. For instance, within a 25ms latency limit, TinyBERT-gBCE achieves NDCG@10 of 0.652 on the TREC DL 2019 query set, a +51% gain over MonoBERT-Large's 0.431.
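For reference, NDCG@10, the evaluation metric used throughout, can be sketched as follows. Evaluation toolkits differ in the gain function (linear vs. exponential); this generic sketch uses the exponential gain and is not tied to the paper's evaluation code, so absolute values may differ slightly from reported numbers.

```python
# Generic sketch of NDCG@10 with an exponential gain.
import math

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(rels[:k]))

def ndcg_at_10(ranked_rels, all_rels):
    """ranked_rels: relevance labels in the order produced by the ranker.
    all_rels: relevance labels of every judged document for the query."""
    ideal = dcg_at_k(sorted(all_rels, reverse=True))
    return dcg_at_k(ranked_rels) / ideal if ideal > 0 else 0.0

# Example: a ranking that places a highly relevant (label 3) passage first.
print(ndcg_at_10([3, 0, 1, 2, 0], [3, 2, 1, 1, 0]))
```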

Moreover, the study investigates the individual and combined effects of the key components of the gBCE training strategy, demonstrating their positive impact on performance, particularly when only a limited number of documents can be evaluated per query. This analysis sheds light on the interplay between model size, training methodology, and retrieval latency, contributing valuable insights to the ongoing discourse on optimizing language models for practical search applications.

Practical Implications and Future Directions

The findings have practical implications for deploying neural text retrieval systems, especially in environments constrained by hardware resources or requiring rapid response times. The ability of shallow Cross-Encoders to operate efficiently without GPU acceleration, at the cost of only a small drop in effectiveness (about 3% NDCG@10 at a 50ms latency budget compared to GPU inference), extends their applicability to a broader range of real-world scenarios, from on-device applications to cost-sensitive cloud deployments.
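As a sketch of what CPU-only scoring looks like in practice, the snippet below times a batch of query-passage pairs on CPU. Again, the sentence-transformers API and the public TinyBERT-sized checkpoint are illustrative assumptions rather than the paper's TinyBERT-gBCE model.

```python
# Sketch of CPU-only inference with a shallow Cross-Encoder; the checkpoint
# is a public TinyBERT-sized model used for illustration, not the paper's
# TinyBERT-gBCE weights.
import time
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-TinyBERT-L-2-v2",
                     max_length=256, device="cpu")

query = "effects of caffeine on sleep"
pairs = [(query, f"candidate passage number {i} ...") for i in range(100)]

start = time.perf_counter()
scores = model.predict(pairs, batch_size=32)
elapsed_ms = (time.perf_counter() - start) * 1000.0
print(f"Scored {len(pairs)} pairs on CPU in {elapsed_ms:.1f} ms "
      f"({elapsed_ms / len(pairs):.2f} ms per pair)")
```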

Looking forward, the research opens up numerous avenues for further investigation, such as refining the gBCE training scheme, exploring other model architectures, and devising novel strategies to bridge the gap between small and full-scale models in high-latency contexts. As the field moves forward, the balance between computational efficiency and retrieval effectiveness will continue to be a critical area of focus, with shallow Cross-Encoders poised to play a pivotal role in shaping the next generation of search systems.
