
Shallow Cross-Encoders for Low-Latency Retrieval (2403.20222v1)

Published 29 Mar 2024 in cs.IR and cs.CL

Abstract: Transformer-based Cross-Encoders achieve state-of-the-art effectiveness in text retrieval. However, Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window. However, keeping search latencies low is important for user satisfaction and energy usage. In this paper, we show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings since they can estimate the relevance of more documents in the same time budget. We further show that shallow transformers may benefit from the generalized Binary Cross-Entropy (gBCE) training scheme, which has recently demonstrated success for recommendation tasks. Our experiments with TREC Deep Learning passage ranking query sets demonstrate significant improvements in shallow and full-scale models in low-latency scenarios. For example, when the latency limit is 25ms per query, MonoBERT-Large (a cross-encoder based on a full-scale BERT model) is only able to achieve NDCG@10 of 0.431 on TREC DL 2019, while TinyBERT-gBCE (a cross-encoder based on TinyBERT trained with gBCE) reaches NDCG@10 of 0.652, a +51% gain over MonoBERT-Large. We also show that shallow Cross-Encoders are effective even when used without a GPU (e.g., with CPU inference, NDCG@10 decreases only by 3% compared to GPU inference with 50ms latency), which makes Cross-Encoders practical to run even without specialized hardware acceleration.

Summary

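  • The paper applies the generalized Binary Cross-Entropy (gBCE) training scheme, previously successful in recommendation tasks, to shallow Cross-Encoders; under a 25ms latency limit, TinyBERT-gBCE reaches NDCG@10 of 0.652 on TREC DL 2019, a +51% gain over MonoBERT-Large (0.431).
  • The methodology combines an increased number of negative samples with a calibrated loss function to mitigate the overconfidence that negative sampling tends to cause.
  • Experimental results show that shallow models such as TinyBERT-gBCE can outperform full-scale counterparts in low-latency document re-ranking, because they score more documents in the same time budget.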

Shallow Cross-Encoders for Low-Latency Retrieval in Document Ranking Tasks

Introduction to the Tradeoffs Between Model Size and Latency in Cross-Encoders

Transformer-based models such as BERT and T5 achieve state-of-the-art effectiveness in text retrieval. Among these, Cross-Encoders, which process each query-document pair jointly, excel at document re-ranking. However, they are computationally expensive: within a small latency window they can score only a small number of documents, and higher latency hurts both user satisfaction and energy usage. More efficient architectures such as Bi-Encoders have been explored as alternatives, but they bring their own complications, including more complex training and weaker generalization across tasks and domains.

This paper explores shallow Cross-Encoders, that is, Cross-Encoders built on Transformers with a limited number of layers, and argues that they can outperform full-scale models when latency is tightly constrained. The premise is that a shallow model scores each query-document pair faster and can therefore evaluate more documents within the same time budget. This can lead to better overall ranking effectiveness even though each individual relevance estimate is less precise.
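The time-budget argument above can be illustrated with simple arithmetic. The following is a minimal sketch only: the function name, the fixed overhead, and the per-document timings are hypothetical values chosen for illustration, not measurements from the paper.

```python
def docs_within_budget(budget_ms: float, overhead_ms: float, per_doc_ms: float) -> int:
    """Return how many candidates a re-ranker can score before the budget runs out."""
    usable_ms = budget_ms - overhead_ms
    if usable_ms <= 0 or per_doc_ms <= 0:
        return 0
    return int(usable_ms // per_doc_ms)


# Hypothetical numbers (assumptions, not results reported in the paper).
BUDGET_MS = 25.0     # per-query latency limit
OVERHEAD_MS = 5.0    # assumed fixed cost, e.g. first-stage retrieval
PER_DOC_MS = {"deep_model": 2.0, "shallow_model": 0.125}

capacity = {
    name: docs_within_budget(BUDGET_MS, OVERHEAD_MS, cost)
    for name, cost in PER_DOC_MS.items()
}

for name, n_docs in capacity.items():
    print(f"{name}: can re-rank {n_docs} candidates within {BUDGET_MS} ms")
```

Under these assumed timings the shallow model re-ranks a much deeper candidate list, which is where the potential recall advantage comes from.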

A Novel Training Approach: gBCE Scheme

One of the main challenges in using shallow Cross-Encoders is training them effectively without resorting to complex strategies such as Knowledge Distillation. The paper applies the generalized Binary Cross-Entropy (gBCE) training scheme, which has demonstrated success in recommendation systems, as a straightforward and replicable way to train shallow Cross-Encoders. The scheme uses an increased number of negative samples together with a modified loss function that mitigates the overconfidence often observed in models trained with negative sampling.

In this setting, gBCE is intended to compensate for the reduced capacity of Cross-Encoders with fewer layers. It adjusts the model's confidence through a calibration parameter and trains on a larger, more diverse set of negative examples, and the paper argues that shallow models can be trained more effectively this way. The experimental results support the scheme, showing significant improvements in effectiveness under latency constraints.
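As a rough illustration of the idea, the sketch below follows the form in which gBCE is usually written in the recommendation literature: the sigmoid of the positive score is raised to a power β before the log, while the negative terms are left unchanged. The function names, the averaging over k+1 terms, and passing β directly as an argument are assumptions made here; the summary does not state how the paper sets its calibration parameter.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def gbce_loss(pos_score: float, neg_scores: Sequence[float], beta: float) -> float:
    """gBCE for one query with one positive and k sampled negatives.

    loss = -( beta * log(sigmoid(s_pos)) + sum_i log(1 - sigmoid(s_neg_i)) ) / (k + 1)

    With beta = 1 this reduces to ordinary binary cross-entropy.
    """
    pos_term = beta * log_sigmoid(pos_score)              # log(sigmoid(s_pos) ** beta)
    neg_term = sum(log_sigmoid(-s) for s in neg_scores)   # log(1 - sigmoid(s)) == log(sigmoid(-s))
    return -(pos_term + neg_term) / (len(neg_scores) + 1)


# Illustrative scores only (not taken from the paper).
bce = gbce_loss(0.0, [0.0], beta=1.0)    # plain BCE
gbce = gbce_loss(0.0, [0.0], beta=0.5)   # positive term down-weighted
print(f"BCE={bce:.4f}  gBCE(beta=0.5)={gbce:.4f}")
```

Choosing β below 1 shrinks the contribution of the positive term, which is one way to counteract the inflated positive probabilities that arise when only a sample of negatives is seen during training.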

Exploring Efficiency and Effectiveness Tradeoffs

The experiments support the hypothesis that shallow Cross-Encoders offer better efficiency/effectiveness tradeoffs in low-latency settings. Under strict latency limits, models such as TinyBERT trained with gBCE surpass much larger models such as MonoBERT-Large. For instance, with a 25ms latency limit on the TREC DL 2019 query set, TinyBERT-gBCE reaches NDCG@10 of 0.652 while MonoBERT-Large reaches only 0.431, a +51% gain.
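For readers who want to check the arithmetic, the sketch below computes the relative gain from the two NDCG@10 values quoted in the abstract and shows one common way to compute NDCG@k. The linear-gain formulation and the simplification of building the ideal ranking from the retrieved list alone are assumptions of this example, not details confirmed by the paper.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_gains: Sequence[float], k: int = 10) -> float:
    """NDCG@k; the ideal ranking is taken from the same list (a simplification)."""
    ideal = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


# NDCG@10 values quoted in the abstract (TREC DL 2019, 25ms limit).
MONOBERT_LARGE = 0.431
TINYBERT_GBCE = 0.652

relative_gain = (TINYBERT_GBCE - MONOBERT_LARGE) / MONOBERT_LARGE
print(f"relative gain: {relative_gain:+.0%}")

# Toy graded-relevance list, for illustration only.
print(f"toy NDCG@10: {ndcg_at_k([0, 1]):.4f}")
```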

The paper also examines the individual and combined effects of the two key components of the gBCE training strategy, the increased number of negatives and the calibrated loss, and finds that both help, particularly when only a limited number of documents can be evaluated. This analysis clarifies how model size, training methodology, and retrieval latency interact, and it contributes to the ongoing discussion of how to make Transformer-based rankers practical for search applications.

Practical Implications and Future Directions

The findings matter for deploying neural text retrieval systems in environments with limited hardware or strict response-time requirements. Shallow Cross-Encoders remain effective without GPU acceleration: with CPU inference and a 50ms latency limit, NDCG@10 decreases by only 3% compared to GPU inference. This extends their applicability to a broader range of real-world scenarios, from on-device applications to cost-sensitive cloud deployments.

Looking forward, the research opens several avenues for further investigation, such as refining the gBCE training scheme, exploring other model architectures, and devising strategies to close the gap between small and full-scale models when higher latency is acceptable. The balance between computational efficiency and retrieval effectiveness will remain a critical area of focus, and shallow Cross-Encoders are a promising option for latency-sensitive search systems.
