
Abstract

The Transformer architecture has achieved great success in multiple applied machine learning communities, such as natural language processing (NLP), computer vision (CV), and information retrieval (IR). The Transformer's core mechanism -- attention -- requires $O(n^2)$ time complexity in training and $O(n)$ time complexity in inference. Many works have been proposed to improve the attention mechanism's scalability, such as Flash Attention and Multi-query Attention. A different line of work aims to design new mechanisms to replace attention. Recently, a notable model structure -- Mamba, which is based on state space models -- has achieved transformer-equivalent performance in multiple sequence modeling tasks. In this work, we examine Mamba's efficacy through the lens of a classical IR task -- document ranking. A reranker model takes a query and a document as input and predicts a scalar relevance score. This task demands the language model's ability to comprehend lengthy contextual inputs and to capture the interaction between query and document tokens. We find that (1) Mamba models achieve competitive performance compared to transformer-based models with the same training recipe; (2) they also have lower training throughput than efficient transformer implementations such as flash attention. We hope this study can serve as a starting point to explore Mamba models in other classical IR tasks. Our code implementation and trained checkpoints are made public to facilitate reproducibility (https://github.com/zhichaoxu-shufe/RankMamba).
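To make the reranking setup described above concrete, the sketch below scores a single (query, document) pair with an off-the-shelf cross-encoder via HuggingFace Transformers. The checkpoint name, example texts, and truncation length are illustrative choices, not the configuration used in the paper.

```python
# Minimal cross-encoder reranker sketch: score a (query, document) pair with a
# sequence-classification head. Checkpoint and max_length are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # any reranker checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

query = "what is a state space model"
document = "State space models describe a system whose latent state evolves over time..."

# Query and document are concatenated into one input so attention (or an SSM)
# can capture token-level interactions between them.
inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze(-1)  # scalar relevance score
print(float(score))
```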

Figure: Training throughput comparison between models with approximately 330M and more than 700M parameters, using LoRA vs. flash attention.

Overview

  • The study evaluates Mamba, a new model structure based on Selective State Space Models (SSMs), against transformer-based models for document ranking tasks in information retrieval (IR).

  • Mamba aims to match the performance of transformer models while offering superior computational efficiency by addressing the quadratic computational complexity of the transformer's attention mechanism.

  • Benchmarking involves rigorous comparison with a variety of transformer architectures, focusing on encoder-only, decoder-only, and encoder-decoder frameworks.

  • Findings reveal Mamba models to be competitive, even surpassing transformer models in some cases, but highlight a lower training throughput as an area for future optimization.

Benchmarking Mamba's Document Ranking Performance Against Transformers

Comparative Evaluation of Document Ranking Models

In the sphere of information retrieval (IR), the emergence of transformer-based language models has significantly reshaped the way we understand and process natural language data. The study conducted by Zhichao Xu et al. focuses on evaluating the performance of a recent model structure, Mamba, within the context of the classical IR task of document ranking. The outcomes of this exploration provide nuanced insights into how these language models compare in terms of efficiency and efficacy.

Background and Model Overview

Transformer architectures have driven advances across various machine learning applications, owing in large part to their capacity to capture long-range dependencies within sequences. Despite their success, the quadratic computational complexity of the attention mechanism has prompted efforts to devise more scalable alternatives. A noteworthy development in this endeavor is the Mamba model, which operates on the principles of Selective State Space Models (SSMs) to achieve transformer-equivalent performance while aiming for superior computational efficiency.
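As a rough intuition for why SSMs scale linearly with sequence length, the toy recurrence below unrolls a discretized state space model one step at a time. The parameters here are random placeholders; the actual Mamba layer additionally makes them input-dependent (the "selective" part) and computes the scan with a hardware-aware parallel algorithm rather than a Python loop.

```python
# Toy discretized state-space recurrence: h_t = A_bar @ h_{t-1} + B_bar * x_t, y_t = C @ h_t.
# Parameters are random placeholders; Mamba makes B, C, and the step size functions of
# the input ("selective") and runs the scan in parallel on GPU.
import numpy as np

rng = np.random.default_rng(0)
d_state, seq_len = 16, 10

A_bar = 0.9 * np.eye(d_state)         # discretized state matrix (placeholder)
B_bar = rng.normal(size=(d_state,))   # input projection (placeholder)
C = rng.normal(size=(d_state,))       # output projection (placeholder)

x = rng.normal(size=(seq_len,))       # one input channel
h = np.zeros(d_state)
y = np.zeros(seq_len)
for t in range(seq_len):              # O(n) in sequence length, vs. O(n^2) attention
    h = A_bar @ h + B_bar * x[t]
    y[t] = C @ h
print(y)
```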

Research Questions and Methodology

The core objective of the study was to ascertain whether Mamba models could offer performance on par with or superior to transformer-based models in document ranking tasks. The investigation entailed a rigorous benchmarking process, pitting Mamba against a diverse array of transformer-based models, including encoder-only, decoder-only, and encoder-decoder frameworks across different scales. The benchmark focused on models with varying pre-training objectives, sizes, and attention mechanisms, employing established training recipes and evaluating their performance through the lens of the document ranking task. This task necessitates a model's ability to discern and quantify the relevance between queries and documents, demanding both comprehensive understanding and contextual interpretation capabilities from the underlying language model.
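The summary above does not spell out the exact fine-tuning objective. A common recipe for training rerankers, shown here purely as an assumed illustration rather than the paper's documented loss, is a cross-entropy objective over the score of one relevant document and several sampled negatives per query.

```python
# Hypothetical ranking loss: given scores for one positive document and several
# negatives per query, apply cross-entropy with the positive in slot 0.
# Shown as an assumption, not the paper's documented training recipe.
import torch
import torch.nn.functional as F

def ranking_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """pos_score: (batch,) scores of relevant docs; neg_scores: (batch, n_neg)."""
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)  # (batch, 1 + n_neg)
    target = torch.zeros(logits.size(0), dtype=torch.long)           # positive is index 0
    return F.cross_entropy(logits, target)

# Example: a batch of 2 queries with 3 negatives each.
loss = ranking_loss(torch.tensor([2.1, 1.5]),
                    torch.tensor([[0.3, -0.2, 0.1], [0.9, 0.0, -1.0]]))
print(loss.item())
```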

Key Findings

The empirical analysis revealed several critical findings:

  • Encoder-only transformer models demonstrated robust performance in document ranking tasks, with roberta-large notably outperforming its counterparts in terms of the MRR metric on the MS MARCO Dev set.
  • Mamba models showcased competitive performance, sometimes matching or surpassing the transformer-based models' effectiveness. This is a considerable achievement, emphasizing Mamba's potential in handling complex IR tasks.
  • However, Mamba models were observed to have lower training throughput than transformer implementations that incorporate efficient attention kernels such as Flash Attention (a minimal example of enabling such a kernel is sketched after this list).
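For reference on the efficiency baseline, recent versions of HuggingFace Transformers can enable the FlashAttention-2 kernel with a single loading argument; the checkpoint below is illustrative, and the flash-attn package plus a supported GPU are assumed.

```python
# Loading a decoder-only model with the FlashAttention-2 kernel enabled
# (requires the flash-attn package and a supported GPU); the checkpoint is illustrative.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",               # illustrative checkpoint
    torch_dtype=torch.bfloat16,               # FA2 requires fp16/bf16
    attn_implementation="flash_attention_2",
)
```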

Implications and Future Directions

The findings from this study underscore Mamba models' viability as a potent alternative to transformer-based models for document ranking tasks, hinting at their broader applicability across classical IR tasks. Nonetheless, the noted deficiency in training throughput for Mamba models compared to some transformer models signifies a potential area for future optimization. This limitation does not diminish Mamba’s achievements but rather highlights a trajectory for enhancing its implementation to fully leverage its efficiency and scalability advantages.

Advancing Mamba's computational efficiency without compromising its performance could redefine the benchmarks for language model deployments in IR, offering a blend of efficacy and efficiency. As the IR field continues to evolve, the exploration of models like Mamba, which challenge the status quo and push the boundaries of computational efficiency, remains crucial in our ongoing quest to develop more capable, scalable, and efficient language processing systems.

Concluding Remarks

The study’s exploration into Mamba models within the domain of document ranking presents a promising avenue for future research. The competitive performance of Mamba models, juxtaposed with their current limitations in training throughput, offers a nuanced perspective on the potential and challenges of deploying SSM-based models in IR tasks. As we move forward, refining these models and overcoming their limitations will be paramount in harnessing their full potential, paving the way for their broader application across the diverse ecosystem of IR tasks.
