
Abstract

State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, by incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention. Although SSMs exhibit competitive performance, their in-context learning (ICL) capabilities, a remarkable emergent property of modern language models that enables task execution without parameter optimization, remain underexplored compared to Transformers. In this study, we evaluate the ICL performance of SSMs, focusing on Mamba, against Transformer models across various tasks. Our results show that SSMs perform comparably to Transformers in standard regression ICL tasks, while outperforming them in tasks like sparse parity learning. However, SSMs fall short in tasks involving non-standard retrieval functionality. To address these limitations, we introduce a hybrid model, MambaFormer, that combines Mamba with attention blocks, surpassing individual models in tasks where they struggle independently. Our findings suggest that hybrid architectures offer promising avenues for enhancing ICL in language models.

A hybrid architecture combining Transformer attention and Mamba blocks achieves superior performance without positional encoding.

Overview

  • The paper evaluates state-space models (SSMs), focusing on Mamba, on in-context learning (ICL) tasks and compares them with Transformer networks.

  • Mamba displays competitive ICL performance and excels at sparse parity learning, but is less effective on retrieval-heavy tasks such as multi-query associative recall (MQAR).

  • A novel hybrid model, MambaFormer, which combines Mamba with Transformer attention blocks, is introduced and performs well on ICL tasks where each individual model fails.

  • The study presents strong numerical evidence of MambaFormer's comprehensive ICL ability, suggesting the potential of hybrid architectures for future research.

Introduction

State-space models (SSMs) like Mamba have emerged as potential alternatives to Transformer networks for tasks such as language modeling. The paper examines the capabilities of SSMs, particularly Mamba, on in-context learning (ICL) tasks compared to Transformers. The study also explores a hybrid model named MambaFormer, which incorporates both architectures, aiming to capitalize on the strengths of each.

ICL Performance of SSMs

Large language models (LLMs), typified by Transformers, are known for their ICL capabilities: they can execute tasks from a handful of in-context examples without any parameter updates. SSMs, despite their efficiency advantages, have been studied far less in this regard. This paper assesses the ICL potential of SSMs, particularly the Mamba model, across a spectrum of tasks. SSMs display competitive ICL performance, matching Transformers on most tasks. Notably, Mamba excels at sparse parity learning but shows limitations in retrieval-centric tasks such as vector-valued multi-query associative recall (MQAR).
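
To make the task setup concrete, the sketch below shows one plausible way an in-context sparse parity problem can be posed: each task hides a subset of coordinates, every in-context example is a random ±1 vector labeled by its parity over that subset, and the model must predict the label of a final query from the context alone. This is an illustrative reconstruction, not the authors' data pipeline; the function name and the label-embedding convention are assumptions.

```python
import torch

def sparse_parity_prompt(n_examples=32, dim=20, k=3, seed=0):
    """Build one in-context sparse parity task (illustrative sketch only).

    Each task fixes a hidden subset of k coordinates; the label of an
    input x in {-1, +1}^dim is the product (parity) of x over that subset.
    The model sees (x_1, y_1), ..., (x_n, y_n) in context and must
    predict y for the final query x_{n+1} without any weight updates.
    """
    g = torch.Generator().manual_seed(seed)
    subset = torch.randperm(dim, generator=g)[:k]            # hidden parity coordinates
    x = torch.randint(0, 2, (n_examples + 1, dim), generator=g) * 2 - 1
    y = x[:, subset].prod(dim=1)                              # +1 / -1 parity labels
    # Interleave inputs and labels as a single token sequence; labels are
    # embedded as vectors whose first coordinate carries the label.
    y_tok = torch.zeros(n_examples + 1, dim)
    y_tok[:, 0] = y
    seq = torch.stack([x.float(), y_tok], dim=1).reshape(-1, dim)
    return seq[:-1], y[-1]                                    # context tokens, query target

context, target = sparse_parity_prompt()
print(context.shape, target.item())   # e.g. torch.Size([65, 20]) and +1 or -1
```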

Hybrid Model: MambaFormer

To address SSMs' shortcomings, the research introduces a novel hybrid model, MambaFormer, which merges Mamba blocks with multi-head attention layers from Transformers. MambaFormer succeeds on tasks where either architecture fails on its own: while Mamba is proficient at sparse parity learning, where Transformers falter, the hybrid model performs well across all evaluated tasks, including retrieval.
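
As a rough sketch of how such a hybrid stack could be wired up, the code below places a Mamba block at the front of the network (standing in for positional encoding) and then interleaves attention and Mamba blocks. It assumes the `mamba_ssm` package's `Mamba` module and PyTorch's built-in `nn.MultiheadAttention`; the class names, layer counts, dimensions, and normalization choices are illustrative guesses rather than the paper's exact MambaFormer configuration.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba   # assumes the mamba-ssm package is installed

class AttentionBlock(nn.Module):
    """Pre-norm causal self-attention block with a residual connection (illustrative)."""
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        out, _ = self.attn(h, h, h, attn_mask=mask)
        return x + out

class MambaBlock(nn.Module):
    """Pre-norm Mamba (SSM) block with a residual connection (illustrative)."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mamba = Mamba(d_model=d_model)

    def forward(self, x):
        return x + self.mamba(self.norm(x))

class MambaFormerSketch(nn.Module):
    """Hybrid stack: a leading Mamba block (in place of positional encoding)
    followed by interleaved attention and Mamba blocks."""
    def __init__(self, d_model=64, n_layers=4):
        super().__init__()
        layers = [MambaBlock(d_model)]                  # replaces positional encoding
        for _ in range(n_layers):
            layers += [AttentionBlock(d_model), MambaBlock(d_model)]
        self.layers = nn.Sequential(*layers)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                               # x: (batch, seq_len, d_model)
        return self.norm(self.layers(x))

# Example (requires a CUDA build of mamba-ssm):
# out = MambaFormerSketch()(torch.randn(2, 65, 64).cuda())
```

Because the first block is a Mamba layer, the sequence order is injected by the recurrent state itself, which is why the hybrid can dispense with explicit positional encoding.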

Strong Numerical Results & Key Insights

The paper presents strong numerical results, notably in Table 1, which summarizes performance across models and tasks using a simple labeling scheme (✓ for success, ✗ for failure, and ▲ for a performance gap). MambaFormer earns a ✓ on every task, signaling its all-around proficiency. On harder ICL tasks such as decision tree learning and sparse parity, the hybrid model leverages both of its components and shows clear gains over either individual architecture.

Conclusion and Future Directions

Concluding that SSMs and Transformers each possess distinct advantages for ICL tasks, the paper proposes that hybrid architectures like MambaFormer be explored further to enhance the ICL capabilities of language models. The researchers acknowledge the limitations of focusing on non-language ICL tasks and smaller model scales, but see no fundamental obstacle to Mamba's ICL performance. Future research may compare SSM and Transformer architectures on more general ICL tasks in language settings and at larger parameter scales, potentially offering new insights into LLM architecture design.
