Abstract

Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer either from quadratic computation complexity or from limited extrapolation ability in length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that it substantially outperforms state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K-length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and shows improved token predictions up to 1M context length. As a linear-time sequence model, Samba enjoys 3.73x higher throughput than Transformers with grouped-query attention when processing user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available at https://github.com/microsoft/Samba.

Figure: Layer-wise integration of Mamba with different MLP and SWA configurations.

Overview

  • The paper introduces 'Samba,' a hybrid neural architecture leveraging Mamba (a selective state space model) and Sliding Window Attention (SWA) to efficiently handle sequences with unlimited context length.

  • Samba combines the efficiency of state space models with the precision of attention mechanisms, achieving superior performance and linear computational complexity, making it effective for tasks such as mathematics and coding.

  • Extensive experimental evaluations show that Samba not only excels in various benchmarks related to commonsense reasoning, language understanding, and arithmetic but also offers significant efficiency improvements in prompt processing and token generation.

Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

The paper "Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling" addresses a persistent challenge in sequence modeling: efficiently handling sequences with theoretically unlimited context length. Traditional models such as Transformers either exhibit quadratic computational complexity or struggle to extrapolate beyond their training lengths. Samba, the proposed hybrid neural architecture, combines Mamba, a selective state space model (SSM), with Sliding Window Attention (SWA) layer by layer to deliver both efficiency and strong performance on long sequences.

Technical Contributions

Samba harmonizes the advantages of both Mamba and SWA and achieves linear computational complexity for unlimited-length sequences. The key contributions of the paper include:

  1. Hybrid Architecture: Combining Mamba, an SSM-based model, with SWA ensures the efficient handling of long-range dependencies. Mamba provides a recurrent backbone for sequence modeling, while SWA focuses on precise memory recall capabilities.
  2. Scalability: The model scales up to 3.8 billion parameters and is trained on 3.2 trillion tokens.
  3. Performance: Samba outperforms state-of-the-art models based on pure attention or SSMs across a variety of benchmarks. This is especially notable in challenging tasks such as mathematics and coding (GSM8K, HumanEval).
  4. Efficiency: Samba delivers substantially higher throughput than Transformers with grouped-query attention, achieving 3.73 times faster prompt processing at 128K tokens and a 3.64 times speedup when generating 64K tokens (a back-of-the-envelope comparison of attention cost follows this list).
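
The efficiency claim in item 4 rests on the fact that attention restricted to a fixed window scales linearly rather than quadratically with sequence length. The following back-of-the-envelope Python sketch (not the paper's code; the head dimension is an assumption) illustrates the gap in attention-score computation for a 128K-token prompt with a 2048-token window:

```python
# Rough FLOP comparison of attention-score computation, illustrating why
# sliding-window attention (SWA) scales linearly in sequence length.
# Numbers are illustrative only; they are not the paper's measurements.

def full_attention_cost(seq_len: int, head_dim: int) -> int:
    # Each query attends to every key: O(L^2 * d) multiply-adds per head.
    return seq_len * seq_len * head_dim

def sliding_window_cost(seq_len: int, window: int, head_dim: int) -> int:
    # Each query attends to at most `window` keys: O(L * w * d) per head.
    return seq_len * min(seq_len, window) * head_dim

L, w, d = 128_000, 2048, 128  # prompt length, SWA window, head dim (assumed)
ratio = full_attention_cost(L, d) / sliding_window_cost(L, w, d)
print(f"Full attention needs ~{ratio:.0f}x more score computation than SWA at 128K tokens")
# ~62x on attention scores alone; the end-to-end 3.73x prompt-throughput gain reported
# in the paper is smaller because Mamba/MLP layers and kernel efficiency also matter.
```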

Methodology

Samba's architecture consists of three key components (a minimal stacking sketch follows the list):

  • Mamba Layers: An SSM-based layer that captures recurrent sequence structures. The Mamba layer selectively compresses the input sequence into recurrent hidden states using input-dependent gating mechanisms.
  • SWA Layers: Sliding Window Attention layers enable precise memory recall through a sliding attention window, maintaining linear computational complexity.
  • MLP Layers: Multi-Layer Perceptron layers contribute to nonlinear transformations and factual knowledge recall.
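
Putting these components together, the following is a minimal, self-contained sketch of how a Samba-style block could interleave the three sub-layers with pre-norm and residual connections. It is not the official implementation: the Mamba layer is assumed to come from the open-source mamba_ssm package, the sliding-window attention is a naive masked softmax attention rather than an optimized kernel, and the sub-layer ordering is illustrative (the paper compares several layer-wise arrangements, as discussed in the Analysis section):

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed dependency providing the selective SSM layer

class SlidingWindowAttention(nn.Module):
    """Causal self-attention restricted to the last `window` positions."""
    def __init__(self, dim: int, n_heads: int, window: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        L = x.size(1)
        idx = torch.arange(L, device=x.device)
        # True = blocked: future tokens and tokens outside the sliding window.
        blocked = (idx[None, :] > idx[:, None]) | (idx[:, None] - idx[None, :] >= self.window)
        out, _ = self.attn(x, x, x, attn_mask=blocked)
        return out

class SambaBlock(nn.Module):
    """One hybrid block: Mamba -> SWA -> MLP, each pre-normed with a residual."""
    def __init__(self, dim: int, n_heads: int = 8, window: int = 2048):
        super().__init__()
        self.norm1, self.mamba = nn.LayerNorm(dim), Mamba(d_model=dim)
        self.norm2, self.swa = nn.LayerNorm(dim), SlidingWindowAttention(dim, n_heads, window)
        self.norm3, self.mlp = nn.LayerNorm(dim), nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mamba(self.norm1(x))  # recurrent, selective compression of history
        x = x + self.swa(self.norm2(x))    # precise recall within the sliding window
        x = x + self.mlp(self.norm3(x))    # nonlinear transformation / knowledge recall
        return x

# Usage sketch: a full model would stack many such blocks over token embeddings.
model = nn.Sequential(*[SambaBlock(dim=1024) for _ in range(4)])
```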

Training Configurations and Implementations

Samba is trained at four parameter scales (421M, 1.3B, 1.7B, and 3.8B) on datasets such as SlimPajama and the Phi-2 training data. The SWA layers use a window size of 2048, chosen based on empirical observations of training efficiency and model performance. Across scales, the models combine the different layer components to balance complexity and accuracy.
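
As a concrete reference point, a hypothetical configuration object capturing these settings might look as follows; the field names and any values other than the parameter scales, datasets, 2048-token window, and 4K training sequence length are assumptions, not the paper's actual configuration:

```python
from dataclasses import dataclass

@dataclass
class SambaTrainConfig:
    # Hypothetical configuration; only the listed scales, datasets, 2048-token
    # SWA window, and 4K training sequence length come from the summary/abstract.
    model_scale: str              # one of "421M", "1.3B", "1.7B", "3.8B"
    dataset: str = "SlimPajama"   # the 3.8B run uses the Phi-2 training data instead
    swa_window: int = 2048        # empirically chosen sliding-window size
    train_seq_len: int = 4096     # 4K-token training sequences

configs = [SambaTrainConfig(scale) for scale in ("421M", "1.3B", "1.7B", "3.8B")]
```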

Experimental Results

Samba demonstrates superior performance in downstream tasks compared to both transformer-based and SSM-based models.

  • Commonsense Reasoning: On benchmarks such as ARC and WinoGrande, Samba achieves higher accuracy than baseline models such as Mistral and Llama-3.
  • Language Understanding: Samba shows marked gains on tasks such as MMLU and SQuAD.
  • Truthfulness and Arithmetic: Samba excels on the TruthfulQA and GSM8K benchmarks, indicating robust handling of diverse tasks.
  • Efficiency: Processing speed scales gracefully with sequence length, and performance remains consistent across context lengths.

Analysis and Discussion

The study provides detailed analyses on several aspects:

  1. Architecture and Hybridization: The performance variations of different hybridization strategies (Mamba-SWA-MLP vs. Mamba-MLP) highlight the effectiveness of combining SSM-based layers with SWA for tasks requiring memory recall and sequence extrapolation.
  2. Parameter Allocation: The optimal number of query and key-value heads for SWA layers is explored to balance computational efficiency and model efficacy.
  3. Entropy Analysis: Examining the entropy of attention distributions reveals specialization patterns within the hybrid architecture, where different layers focus on global information integration and precise retrieval (a sketch of such an entropy computation follows this list).
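
To make the entropy analysis in item 3 concrete, here is a minimal sketch of how one might compute it; the function name and tensor layout are assumptions rather than the paper's code. For each query, the entropy of its attention distribution over keys is computed and averaged: low entropy suggests sharp, retrieval-like attention, while high entropy suggests diffuse, global information integration:

```python
import torch

def mean_attention_entropy(attn_probs: torch.Tensor, eps: float = 1e-9) -> float:
    """attn_probs: (batch, heads, query_len, key_len); each row sums to 1."""
    entropy = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)  # per-query entropy
    return entropy.mean().item()

# Usage sketch: capture attention probabilities from each SWA layer (e.g. with
# forward hooks) and compare mean entropies across layers to see how they specialize.
```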

Future Directions

The paper suggests possible future developments:

  • Dynamic Architecture Adaptations: Task-adaptive dynamic architectures could further enhance performance by selectively activating different model components.
  • Advanced Hybridization Techniques: Improving retrieval capabilities of Samba through more sophisticated hybridization strategies without compromising efficiency.

Implications

The implications of Samba are both practical and theoretical:

  • Practical: Samba can be directly applied to tasks requiring extensive context understanding, such as document summarization and natural language processing applications with long sequences.
  • Theoretical: The design of hybrid state space models presents a novel approach to combining the strengths of attention mechanisms and state space models, paving the way for new architectures in sequence modeling.

In conclusion, the paper presents Samba as a significant advancement in efficient unlimited-context language modeling, effectively merging state space models with sliding window attention to achieve strong performance and high processing efficiency.
