RankMamba: Benchmarking Mamba's Document Ranking Performance in the Era of Transformers (2403.18276v2)
Abstract: The Transformer architecture has achieved great success in multiple applied machine learning communities, such as NLP, computer vision (CV) and information retrieval (IR). The Transformer's core mechanism, attention, requires $O(n^2)$ time complexity in training and $O(n)$ time complexity in inference. Many works have been proposed to improve the attention mechanism's scalability, such as Flash Attention and Multi-query Attention. A different line of work aims to design new mechanisms to replace attention. Recently, a notable model architecture, Mamba, which is based on state space models, has achieved transformer-equivalent performance on multiple sequence modeling tasks. In this work, we examine Mamba's efficacy through the lens of a classical IR task: document ranking. A reranker model takes a query and a document as input and predicts a scalar relevance score. This task demands the LLM's ability to comprehend lengthy contextual inputs and to capture the interaction between query and document tokens. We find that (1) Mamba models achieve competitive performance compared to transformer-based models trained with the same recipe, but (2) they have lower training throughput than efficient transformer implementations such as Flash Attention. We hope this study can serve as a starting point for exploring Mamba models in other classical IR tasks. Our code implementation and trained checkpoints are made public to facilitate reproducibility (https://github.com/zhichaoxu-shufe/RankMamba).
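The reranking setup described in the abstract is the standard cross-encoder formulation: the query and document are fed to the backbone as one concatenated input, and the model emits a single relevance logit used to sort candidates. The sketch below illustrates that scoring step under stated assumptions; it uses a generic Hugging Face sequence-classification head with a placeholder `roberta-base` backbone rather than the paper's released Mamba or transformer checkpoints (see the linked repository for the actual implementation).

```python
# Minimal cross-encoder reranking sketch (illustrative only; not the paper's code).
# Assumes a Hugging Face sequence-classification backbone emitting one relevance logit.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "roberta-base"  # placeholder backbone, not a trained RankMamba checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.eval()

def score(query: str, document: str) -> float:
    """Encode the (query, document) pair jointly and return a scalar relevance score."""
    inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 1): one relevance logit per pair
    return logits.item()

# Rerank a candidate list for one query by descending relevance score.
query = "what is a state space model"
candidates = [
    "Mamba is built on selective state space models.",
    "BM25 is a classical lexical ranking function.",
]
ranked = sorted(candidates, key=lambda doc: score(query, doc), reverse=True)
print(ranked)
```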
- “Asking clarifying questions in open-domain information-seeking conversations” In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 475–484
- “Efficient index-based snippet generation” In ACM Transactions on Information Systems (TOIS) 32.2 ACM New York, NY, USA, 2014, pp. 1–24
- “Pythia: A suite for analyzing large language models across training and scaling” In International Conference on Machine Learning, 2023, pp. 2397–2430 PMLR
- Guy E Blelloch “Prefix sums and their applications” In Synthesis of Parallel Algorithms, pp. 35–60
- “Understanding performance of long-document ranking models through comprehensive evaluation and leaderboarding” In arXiv preprint arXiv:2207.01262, 2022
- Iz Beltagy, Matthew E Peters and Arman Cohan “Longformer: The long-document transformer” In arXiv preprint arXiv:2004.05150, 2020
- “Scaling instruction-finetuned language models” In arXiv preprint arXiv:2210.11416, 2022
- “Overview of the TREC 2019 deep learning track” In arXiv preprint arXiv:2003.07820, 2020
- “Overview of the TREC 2020 deep learning track” In arXiv preprint arXiv:2102.07662, 2021
- Tri Dao “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning” In The Twelfth International Conference on Learning Representations, 2024 URL: https://openreview.net/forum?id=mZn2Xyh9Ec
- “Deeper text understanding for IR with contextual neural language modeling” In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 985–988
- “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) Minneapolis, Minnesota: Association for Computational Linguistics, 2019, pp. 4171–4186 DOI: 10.18653/v1/N19-1423
- “Mamba: Linear-time sequence modeling with selective state spaces” In arXiv preprint arXiv:2312.00752, 2023
- Luyu Gao, Zhuyun Dai and Jamie Callan “Rethink training of BERT rerankers in multi-stage retrieval pipeline” In Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28–April 1, 2021, Proceedings, Part II 43, 2021, pp. 280–286 Springer
- Ankit Gupta, Albert Gu and Jonathan Berant “Diagonal state spaces are as effective as structured state spaces” In Advances in Neural Information Processing Systems 35, 2022, pp. 22982–22994
- Albert Gu, Karan Goel and Christopher Re “Efficiently Modeling Long Sequences with Structured State Spaces” In International Conference on Learning Representations, 2021
- “HiPPO: Recurrent memory with optimal polynomial projections” In Advances in Neural Information Processing Systems 33, 2020, pp. 1474–1487
- “Combining recurrent, convolutional, and continuous-time models with linear state space layers” In Advances in Neural Information Processing Systems 34, 2021, pp. 572–585
- “On the parameterization and initialization of diagonal state space models” In Advances in Neural Information Processing Systems 35, 2022, pp. 35971–35983
- “Intra-document cascading: learning to select passages for neural document ranking” In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1349–1358
- “Long short-term memory” In Neural computation 9.8 MIT press, 1997, pp. 1735–1780
- “LoRA: Low-Rank Adaptation of Large Language Models” In International Conference on Learning Representations, 2021
- “What Language Model to Train if You Have One Million GPU Hours?” In Findings of the Association for Computational Linguistics: EMNLP 2022 Abu Dhabi, United Arab Emirates: Association for Computational Linguistics, 2022, pp. 765–782 DOI: 10.18653/v1/2022.findings-emnlp.54
- “Decoupled Weight Decay Regularization” In International Conference on Learning Representations, 2019 URL: https://openreview.net/forum?id=Bkg6RiCqY7
- “PARADE: Passage Representation Aggregation for Document Reranking” In ACM Transactions on Information Systems 42.2 ACM New York, NY, USA, 2023, pp. 1–26
- “Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations” In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 2356–2362
- “RoBERTa: A robustly optimized BERT pretraining approach” In arXiv preprint arXiv:1907.11692, 2019
- Jimmy Lin, Rodrigo Nogueira and Andrew Yates “Pretrained transformers for text ranking: BERT and beyond” Springer Nature, 2022
- “Fine-tuning LLaMA for multi-stage text retrieval” In arXiv preprint arXiv:2310.08319, 2023
- “Parallelizing Linear Recurrent Neural Nets Over Sequence Length” In International Conference on Learning Representations, 2018
- “Distributed representations of words and phrases and their compositionality” In Advances in Neural Information Processing Systems 26, 2013
- “Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models” In Findings of the Association for Computational Linguistics: ACL 2022 Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 1864–1874 DOI: 10.18653/v1/2022.findings-acl.146
- “RWKV: Reinventing RNNs for the Transformer Era” In Findings of the Association for Computational Linguistics: EMNLP 2023 Singapore: Association for Computational Linguistics, 2023, pp. 14048–14077 DOI: 10.18653/v1/2023.findings-emnlp.936
- “Deep Contextualized Word Representations” In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) New Orleans, Louisiana: Association for Computational Linguistics, 2018, pp. 2227–2237 DOI: 10.18653/v1/N18-1202
- “Exploring the limits of transfer learning with a unified text-to-text transformer” In The Journal of Machine Learning Research 21.1 JMLR.org, 2020, pp. 5485–5551
- “The probabilistic relevance framework: BM25 and beyond” In Foundations and Trends® in Information Retrieval 3.4 Now Publishers, Inc., 2009, pp. 333–389
- Jimmy TH Smith, Andrew Warrington and Scott Linderman “Simplified State Space Layers for Sequence Modeling” In The Eleventh International Conference on Learning Representations, 2022
- “LLaMA: Open and efficient foundation language models” In arXiv preprint arXiv:2302.13971, 2023
- “Attention is all you need” In Advances in Neural Information Processing Systems 30, 2017
- “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding” In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 353–355 DOI: 10.18653/v1/W18-5446
- “What language model architecture and pretraining objective works best for zero-shot generalization?” In International Conference on Machine Learning, 2022, pp. 22964–22984 PMLR
- “An in-depth investigation of user response simulation for conversational search” In arXiv preprint arXiv:2304.07944, 2023
- “Zero-shot clarifying question generation for conversational search” In Proceedings of the ACM Web Conference 2023, 2023, pp. 3288–3298
- “A lightweight constrained generation alternative for query-focused summarization” In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 1745–1749
- “Counterfactual Editing for Search Result Explanation” In arXiv preprint arXiv:2301.10389, 2023
- Zhichao Xu “Context-aware Decoding Reduces Hallucination in Query-focused Summarization” In arXiv preprint arXiv:2312.14335, 2023
- Zhichao Xu, Hansi Zeng and Qingyao Ai “Understanding the effectiveness of reviews in e-commerce top-n recommendation” In Proceedings of the 2021 ACM SIGIR International Conference on Theory of Information Retrieval, 2021, pp. 149–155
- Puxuan Yu, Razieh Rahimi and James Allan “Towards explainable search results: a listwise explanation generator” In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2022, pp. 669–680
- “OPT: Open pre-trained transformer language models” In arXiv preprint arXiv:2205.01068, 2022
- “RankT5: Fine-tuning T5 for text ranking with ranking losses” In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 2308–2313
- Hansi Zeng, Zhichao Xu and Qingyao Ai “A zero attentive relevance matching network for review modeling in recommendation system” In Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28–April 1, 2021, Proceedings, Part I 43, 2021, pp. 724–739 Springer