DeeperImpact: Optimizing Sparse Learned Index Structures (2405.17093v2)
Abstract: Much recent work has focused on sparse learned indexes that use deep neural architectures to significantly improve retrieval quality while keeping the efficiency benefits of the inverted index. While such sparse learned structures achieve effectiveness far beyond that of traditional inverted index-based rankers, a gap in effectiveness remains relative to the best dense retrievers, and even to sparse methods that leverage more expensive optimizations such as query expansion and query term weighting. We focus on narrowing this gap by revisiting and optimizing DeepImpact, a sparse retrieval approach that uses DocT5Query for document expansion followed by a BERT language model to learn impact scores for document terms. We first reinvestigate the expansion process and find that the recently proposed Doc2Query-- query filtration does not enhance retrieval quality when used with DeepImpact. Instead, substituting T5 with a fine-tuned Llama 2 model for query prediction results in a considerable improvement. Subsequently, we study training strategies that have proven effective for other models, in particular the use of hard negatives, distillation, and pre-trained CoCondenser model initialization. Our results substantially narrow the effectiveness gap with the most effective versions of SPLADE.
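To make the idea of per-term impact scores stored in an inverted index concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that each expanded document has already been assigned a learned impact per term, and that a document's query score is the sum of the impacts of the query terms it contains. The function names `build_index` and `search`, the document IDs, and the impact values are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def build_index(doc_impacts):
    """Build a toy inverted index: term -> list of (doc_id, impact) postings.

    `doc_impacts` maps doc_id -> {term: impact}. In a real system the impacts
    would come from a trained model; here they are made-up numbers.
    """
    index = defaultdict(list)
    for doc_id, impacts in doc_impacts.items():
        for term, impact in impacts.items():
            index[term].append((doc_id, impact))
    return index


def search(index, query_terms, k=10):
    """Score documents by summing the stored impacts of matching query terms."""
    scores = defaultdict(float)
    for term in set(query_terms):  # count each distinct query term once
        for doc_id, impact in index.get(term, []):
            scores[doc_id] += impact
    # Highest score first; ties broken by doc_id for a deterministic order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Hypothetical impacts for three already-expanded documents.
doc_impacts = {
    "d1": {"sparse": 3, "index": 5, "retrieval": 2},
    "d2": {"dense": 4, "retrieval": 6},
    "d3": {"index": 1, "expansion": 7},
}

index = build_index(doc_impacts)
results = search(index, ["index", "retrieval"])
print(results)
```

Because scoring reduces to looking up postings and adding precomputed values, query processing of this kind can reuse standard inverted-index machinery, which is the efficiency benefit the abstract refers to. A production engine would add index compression and dynamic pruning, which this sketch omits.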
Authors: Soyuj Basnet, Jerry Gou, Antonio Mallia, Torsten Suel