Emergent Mind


We present a comparative study between cross-encoder and LLM rerankers in the context of re-ranking effective SPLADE retrievers. We conduct a large evaluation on TREC Deep Learning datasets and out-of-domain datasets such as BEIR and LoTTE. In the first set of experiments, we show how cross-encoder rerankers are hard to distinguish when it comes to re-ranking SPLADE on MS MARCO. Observations shift in the out-of-domain scenario, where both the type of model and the number of documents to re-rank have an impact on effectiveness. Then, we focus on listwise rerankers based on LLMs -- especially GPT-4. While GPT-4 demonstrates impressive (zero-shot) performance, we show that traditional cross-encoders remain very competitive. Overall, our findings aim to provide a more nuanced perspective on the recent excitement surrounding LLM-based re-rankers -- by positioning them as another factor to consider in balancing effectiveness and efficiency in search systems.


  • This study compares cross-encoders and LLM-based rerankers in information retrieval, using SPLADE models as first-stage retrievers.

  • Cross-encoders demonstrate substantial improvements in retrieval quality but are limited by computational requirements.

  • LLM-based rerankers, especially GPT-4, show competitive or superior performance but face challenges with operational costs and efficiency in handling large document sets.

  • The paper suggests a potential hybrid approach combining cross-encoders and LLMs to leverage their unique strengths.

Evaluating the Efficiency and Effectiveness of Cross-Encoders and LLM-Based Rerankers in Information Retrieval


The landscape of Information Retrieval (IR) has been dramatically reshaped with the introduction of neural reranking methods, particularly with the advent of LLMs for task-specific applications. This study provides a comprehensive comparison between two dominant paradigms within the domain of neural reranking: cross-encoders and LLM-based rerankers, using SPLADE models as effective first-stage retrievers. Through extensive evaluation on in-domain (TREC Deep Learning datasets) and out-of-domain datasets (BEIR and LoTTE), this research illuminates the nuanced advantages and limitations of employing cross-encoders versus LLM-based methods for reranking, providing key insights into their operational efficiency and effectiveness across varied IR contexts.

Cross-Encoders and LLM-Based Rerankers: A Comparative Analysis

The Efficacy of Cross-Encoders

Cross-encoders, exemplified by models such as DeBERTa-v3 and ELECTRA, have been the cornerstone of reranking in IR systems because they model the interaction between a query and a document jointly. When coupled with efficient retrievers like SPLADE-v3, these models substantially improve retrieval quality on both in-domain and out-of-domain datasets. On MS MARCO the different cross-encoders are hard to tell apart, whereas out of domain both the choice of model and the number of top documents reranked (top_k) affect effectiveness. Their computational requirements also grow with top_k, which makes large-scale or real-time applications challenging.
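To make the role of top_k concrete, here is a minimal Python sketch of pointwise reranking over the first top_k candidates returned by a first-stage retriever. It is an illustration under stated assumptions, not the paper's implementation: `rerank_top_k` and `toy_overlap_score` are hypothetical names, and the toy term-overlap scorer merely stands in for a real cross-encoder such as DeBERTa-v3.

```python
from typing import Callable, List, Tuple

Doc = Tuple[str, str]  # (doc_id, text)


def toy_overlap_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query terms found in the text."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / max(len(q_terms), 1)


def rerank_top_k(
    query: str,
    candidates: List[Doc],
    score_pair: Callable[[str, str], float],
    top_k: int,
) -> List[Doc]:
    """Rescore only the first top_k candidates; the rest keep their first-stage order."""
    head, tail = candidates[:top_k], candidates[top_k:]
    # One scorer call per (query, document) pair, so cost grows linearly with top_k.
    scored = [
        (score_pair(query, text), rank, (doc_id, text))
        for rank, (doc_id, text) in enumerate(head)
    ]
    # Sort by descending score; ties fall back to the first-stage rank.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc for _, _, doc in scored] + tail


first_stage = [
    ("d1", "cats and dogs"),
    ("d2", "sparse retrieval with splade"),
    ("d3", "splade sparse neural retrieval models"),
]
query = "splade sparse retrieval"
print([d for d, _ in rerank_top_k(query, first_stage, toy_overlap_score, top_k=2)])
# ['d2', 'd1', 'd3']
print([d for d, _ in rerank_top_k(query, first_stage, toy_overlap_score, top_k=3)])
# ['d2', 'd3', 'd1']
```

With top_k=2 the relevant document d3 is never rescored and stays behind d1; raising top_k to 3 fixes the ordering at the price of one more scorer call. That is the effectiveness versus cost tension described above, in miniature.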

LLMs as Rerankers: The GPT-3.5 Turbo and GPT-4 Phenomenon

LLMs, especially GPT-4, have shown a surprising capability in reranking tasks even in a zero-shot setting. The study indicates that GPT-4's performance is competitive with, and in certain scenarios superior to, traditional cross-encoders. Nonetheless, two significant caveats accompany the use of GPT models for reranking: the prohibitive operational cost of models like GPT-4, and the inefficiency that results from the model's limited capacity to handle large sets of documents at once. These factors pose substantial barriers to the practical deployment of LLMs in real-world IR systems.
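One common way to work around the limit on how many documents a listwise LLM reranker can handle at once is a sliding window that moves from the bottom of the candidate list to the top. The summary does not state which strategy the paper uses, so the Python sketch below is only an assumption-laden illustration: `sliding_window_rerank` is a hypothetical helper, and `toy_rank_window` replaces the actual LLM call with a simple term-overlap ordering.

```python
from typing import Callable, List


def toy_rank_window(query: str, window_docs: List[str]) -> List[int]:
    """Hypothetical stand-in for an LLM call: returns window indices, best first."""
    q_terms = set(query.lower().split())
    overlap = [len(q_terms & set(doc.lower().split())) for doc in window_docs]
    return sorted(range(len(window_docs)), key=lambda i: (-overlap[i], i))


def sliding_window_rerank(
    query: str,
    docs: List[str],
    rank_window: Callable[[str, List[str]], List[int]],
    window: int = 4,
    step: int = 2,
) -> List[str]:
    """Rerank a long list with a listwise ranker that only sees `window` docs per call."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    ranked = list(docs)
    end = len(ranked)
    while end > 0:
        start = max(0, end - window)
        order = rank_window(query, ranked[start:end])
        # Apply the permutation only if it is valid; LLM output may be malformed.
        if sorted(order) == list(range(end - start)):
            ranked[start:end] = [ranked[start + i] for i in order]
        if start == 0:
            break
        end -= step  # overlapping windows let strong documents move upward
    return ranked


docs = ["apple", "banana", "cherry", "date", "splade", "splade reranking"]
print(sliding_window_rerank("splade reranking", docs, toy_rank_window))
# ['splade reranking', 'splade', 'apple', 'banana', 'cherry', 'date']
```

Each window costs one model call, so the number of calls (and with an API such as GPT-4, the bill) grows with the length of the candidate list, which is the efficiency concern raised above.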

The Implications and Future Directions

The nuanced analysis provides several critical insights for the deployment of neural rerankers in IR systems:

  • Effectiveness and Efficiency Balance: While LLMs (particularly GPT-4) offer competitive or superior performance metrics, cross-encoders like DeBERTa-v3 provide a more balanced trade-off between effectiveness and operational efficiency.
  • Resilience to Varied IR Contexts: The comparative efficacy of cross-encoders and LLM-based rerankers is context-dependent, with each exhibiting strengths in different IR scenarios—cross-encoders being more versatile across domains and LLMs showing exceptional prowess in specific contexts.
  • Future of Reranking Pipelines: The analysis suggests the potential of combining cross-encoders and LLMs in cascading reranking pipelines to leverage the unique strengths of both approaches, pointing towards a hybrid future in neural reranking methodologies.
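The cascading pipeline mentioned in the last point can be sketched as follows: a pointwise scorer (standing in for a cross-encoder) reorders a larger candidate pool, and a listwise ranker (standing in for an LLM) is then applied only to the few documents at the top. This is a minimal Python sketch under assumptions; `cascade_rerank`, `toy_pointwise`, `toy_listwise`, and the cut-offs `k_cross` and `k_llm` are hypothetical and not taken from the paper.

```python
from typing import Callable, List


def toy_pointwise(query: str, doc: str) -> float:
    """Hypothetical cross-encoder stand-in: number of query terms present in the doc."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def toy_listwise(query: str, docs: List[str]) -> List[int]:
    """Hypothetical LLM stand-in: prefers docs containing the exact query phrase."""
    return sorted(range(len(docs)), key=lambda i: (query.lower() not in docs[i].lower(), i))


def cascade_rerank(
    query: str,
    docs: List[str],
    pointwise_score: Callable[[str, str], float],
    listwise_rank: Callable[[str, List[str]], List[int]],
    k_cross: int = 50,
    k_llm: int = 5,
) -> List[str]:
    """Stage 1: pointwise rerank of the top k_cross. Stage 2: listwise rerank of the top k_llm."""
    # sorted() is stable, so ties keep their first-stage order.
    head = sorted(docs[:k_cross], key=lambda d: -pointwise_score(query, d))
    tail = docs[k_cross:]
    top = head[:k_llm]
    order = listwise_rank(query, top)
    # Fall back to the stage-1 order if the listwise output is not a valid permutation.
    if sorted(order) == list(range(len(top))):
        top = [top[i] for i in order]
    return top + head[k_llm:] + tail


docs = [
    "weather report",
    "reranking with splade",
    "splade reranking study",
    "splade",
    "cooking",
]
print(cascade_rerank("splade reranking", docs, toy_pointwise, toy_listwise, k_cross=4, k_llm=2))
# ['splade reranking study', 'reranking with splade', 'splade', 'weather report', 'cooking']
```

The design intent is that the expensive listwise model sees only `k_llm` documents per query, while the cheaper pointwise model handles the wider pool.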


This study offers a granular investigation into the comparative merits of cross-encoders and LLM-based rerankers when applied on top of SPLADE models as first-stage retrievers. Its nuanced conclusion is that neither class of models universally outperforms the other across all IR tasks and settings. Instead, the choice between them should be informed by the specific requirements and constraints of the application, balancing computational efficiency against retrieval effectiveness.
