A Thorough Comparison of Cross-Encoders and LLMs for Reranking SPLADE (2403.10407v1)
Abstract: We present a comparative study between cross-encoder and LLM rerankers in the context of re-ranking effective SPLADE retrievers. We conduct a large-scale evaluation on TREC Deep Learning datasets and on out-of-domain datasets such as BEIR and LoTTE. In the first set of experiments, we show that cross-encoder rerankers are hard to distinguish from one another when re-ranking SPLADE on MS MARCO. Observations shift in the out-of-domain scenario, where both the type of model and the number of documents to re-rank have an impact on effectiveness. We then focus on listwise rerankers based on LLMs, especially GPT-4. While GPT-4 demonstrates impressive (zero-shot) performance, we show that traditional cross-encoders remain very competitive. Overall, our findings aim to provide a more nuanced perspective on the recent excitement surrounding LLM-based re-rankers, positioning them as another factor to consider in balancing effectiveness and efficiency in search systems.
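The systems compared in the paper all follow the standard retrieve-then-rerank pattern: a SPLADE first stage produces a candidate list, and a reranker re-scores the top-k candidates, with the re-ranking depth itself affecting out-of-domain effectiveness. The sketch below illustrates only the cross-encoder (pointwise) variant; the `rerank` helper, the public MiniLM checkpoint, and the toy passages are illustrative assumptions rather than the paper's exact models or data. The listwise LLM rerankers discussed in the paper work differently: they prompt a model such as GPT-4 to permute the whole candidate list.

```python
# Minimal retrieve-then-rerank sketch. A SPLADE (or any first-stage) retriever
# is assumed to have already produced candidate passages; only the cross-encoder
# re-scoring step is shown. Checkpoint name and helper function are illustrative.
from sentence_transformers import CrossEncoder


def rerank(query: str, candidates: list[str], top_k: int = 50) -> list[tuple[str, float]]:
    """Jointly score (query, passage) pairs with a cross-encoder and sort by score."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    pairs = [(query, passage) for passage in candidates[:top_k]]
    scores = model.predict(pairs)  # one relevance score per (query, passage) pair
    ranked = sorted(zip(candidates[:top_k], scores), key=lambda x: x[1], reverse=True)
    return [(passage, float(score)) for passage, score in ranked]


if __name__ == "__main__":
    # Toy candidates standing in for a SPLADE result list.
    hits = rerank("what is splade?", [
        "SPLADE is a sparse neural retriever that expands queries and documents.",
        "BM25 is a classical lexical ranking function.",
    ])
    for passage, score in hits:
        print(f"{score:.3f}\t{passage}")
```

In a full pipeline, the candidate list would come from a SPLADE index rather than a hard-coded list, and the loaded cross-encoder would be reused across queries rather than re-instantiated per call.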
- Yi: Open Foundation Models by 01.AI. arXiv:2403.04652 [cs.CL]
- Scaling Instruction-Finetuned Language Models. arXiv:2210.11416 [cs.LG]
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators. arXiv:2003.10555 [cs.CL]
- Overview of the TREC 2019 deep learning track. arXiv:2003.07820 [cs.IR]
- Overview of the TREC 2021 Deep Learning Track. In Text Retrieval Conference. https://api.semanticscholar.org/CorpusID:261242374
- Overview of the TREC 2022 Deep Learning Track. In Text Retrieval Conference. https://api.semanticscholar.org/CorpusID:261302277
- Overview of the TREC 2023 Deep Learning Track. In Text REtrieval Conference (TREC). NIST, TREC. https://www.microsoft.com/en-us/research/publication/overview-of-the-trec-2023-deep-learning-track/
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:52967399
- From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective. arXiv:2205.04733 [cs.IR]
- Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline. In European Conference on Information Retrieval. arXiv:2101.08751 [cs]
- DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. arXiv:2111.09543 [cs.CL]
- Carlos Lassance and Stéphane Clinchant. 2023. Naver Labs Europe (SPLADE) @ TREC Deep Learning 2022. arXiv:2302.12574 [cs.IR]
- SPLADE-v3: New baselines for SPLADE. arXiv:2403.06789 [cs.IR]
- Tie-Yan Liu. 2009. Learning to Rank for Information Retrieval. Found. Trends Inf. Retr. 3, 3 (mar 2009), 225–331. https://doi.org/10.1561/1500000016
- Fine-Tuning LLaMA for Multi-Stage Text Retrieval. arXiv:2310.08319 [cs]
- Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage Re-ranking with BERT. arXiv:1901.04085 [cs.IR]
- GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
- The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models. arXiv:2101.05667 [cs.IR]
- RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze! arXiv:2312.02724 [cs]
- Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting. arXiv:2306.17563 [cs]
- Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Found. Trends Inf. Retr. 3, 4 (apr 2009), 333–389. https://doi.org/10.1561/1500000019
- Okapi at TREC-3. In TREC (NIST Special Publication, Vol. 500-225), Donna K. Harman (Ed.). National Institute of Standards and Technology (NIST), 109–126. http://dblp.uni-trier.de/db/conf/trec/trec94.html#RobertsonWJHG94
- ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. In North American Chapter of the Association for Computational Linguistics. https://api.semanticscholar.org/CorpusID:244799249
- Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agent. arXiv:2304.09542 [cs]
- Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models. arXiv:2310.07712 [cs]
- BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models. arXiv:2104.08663 (2021). https://api.semanticscholar.org/CorpusID:233296016
- Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288 [cs.CL]
- Zephyr: Direct Distillation of LM Alignment. arXiv:2310.16944 [cs.LG]
- TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. SIGIR Forum 54, 1, Article 1 (feb 2021), 12 pages. https://doi.org/10.1145/3451964.3451965
- RankingGPT: Empowering Large Language Models in Text Ranking with Progressive Enhancement. arXiv:2311.16720 [cs]
- RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 2308–2313. https://doi.org/10.1145/3539618.3592047
- A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models. arXiv:2310.09497 [cs]