
Zamba: A Compact 7B SSM Hybrid Model

(arXiv:2405.16712)
Published May 26, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which achieves competitive performance against leading open-weight models at a comparable scale. Zamba is trained on 1T tokens from openly available datasets and is the best non-transformer model at this scale. Zamba pioneers a unique architecture combining a Mamba backbone with a single shared attention module, thus obtaining the benefits of attention at minimal parameter cost. Due to its architecture, Zamba is significantly faster at inference than comparable transformer models and requires substantially less memory for generation of long sequences. Zamba is pretrained in two phases: the first phase is based on existing web datasets, while the second one consists of annealing the model over high-quality instruct and synthetic datasets, and is characterized by a rapid learning rate decay. We open-source the weights and all checkpoints for Zamba, through both phase 1 and annealing phases.

Figure: The Zamba architecture, combining standard Mamba blocks with a single shared attention-and-MLP block for memory-efficient performance gains.

Overview

  • The Zamba model introduced by Glorioso et al. integrates State-Space Models (SSMs) and transformers in a compact 7B parameter architecture, achieving high efficiency in natural language processing tasks.

  • Zamba employs a Mamba backbone enhanced by a shared global self-attention layer, maintaining constant parameter costs while optimizing memory and computational efficiency.

  • Benchmark results show that Zamba, though trained on fewer tokens than leading models, performs strongly, underscoring the potential of efficient hybrid models for long-sequence processing.

Zamba: A Compact 7B SSM Hybrid Model

The paper "Zamba: A Compact 7B SSM Hybrid Model" by Glorioso et al. introduces Zamba, a novel 7B parameter model that combines the benefits of State-Space Models (SSMs) and transformers. Trained on 1T openly available tokens, Zamba makes a significant contribution to the landscape of low-parameter, high-efficiency natural language models.

Introduction

Transformers have been the cornerstone of advances in NLP, driven by their scalable architecture and self-attention mechanisms. However, their quadratic computational cost in relation to sequence length remains a bottleneck. This has led to various investigations into alternative architectures, notably SSMs, which promise more efficient sequence mixing via linear dynamical systems. The innovative contribution of Zamba lies in its hybrid architecture, which integrates a Mamba-based SSM backbone with a shared global self-attention module to mitigate the limitations inherent in SSMs without the heavy memory costs of full transformer models.

Architecture

Zamba leverages a Mamba backbone, an SSM architecture known for its input-dependent linear dynamical system. The standout feature of Zamba is the incorporation of a shared global self-attention (GSA) layer, which runs periodically across the Mamba layers. This design maintains constant parameter costs while reaping attention’s benefits for in-context learning and retrieval.
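To make the parameter-sharing idea concrete, the following PyTorch sketch stacks placeholder Mamba-style blocks and routes the residual stream through one shared attention-and-MLP block every few layers. All names (`ZambaLikeStack`, `MambaBlockPlaceholder`), the block period, the dimensions, and the per-invocation output projections are illustrative assumptions rather than the authors' implementation; in particular, the placeholder block stands in for a real selective-SSM (Mamba) layer.

```python
# Minimal sketch of a Mamba backbone with one *shared* attention + MLP block.
# Assumptions: block period, dimensions, and a toy stand-in for the real
# selective-SSM block. Illustrates the parameter-sharing idea only.
import torch
import torch.nn as nn


class MambaBlockPlaceholder(nn.Module):
    """Stand-in for a real Mamba (selective SSM) block; here just a gated MLP."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                       # x: (batch, seq, d_model)
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        return x + self.out_proj(h * torch.sigmoid(gate))   # residual update


class SharedAttentionBlock(nn.Module):
    """Single attention + MLP block whose weights are reused at every invocation."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(2 * d_model)
        self.attn = nn.MultiheadAttention(2 * d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, 2 * d_model),
        )

    def forward(self, x_cat):                   # x_cat: (batch, seq, 2*d_model)
        h = self.norm(x_cat)
        h = x_cat + self.attn(h, h, h, need_weights=False)[0]
        return h + self.mlp(h)


class ZambaLikeStack(nn.Module):
    """Mamba backbone that calls one shared attention block every `period` layers."""
    def __init__(self, d_model: int = 512, n_layers: int = 12, period: int = 6):
        super().__init__()
        self.period = period
        self.mamba_layers = nn.ModuleList(
            [MambaBlockPlaceholder(d_model) for _ in range(n_layers)]
        )
        self.shared_attn = SharedAttentionBlock(d_model)        # one copy of weights
        # Per-invocation projections back to the residual width (an assumption here).
        n_invocations = (n_layers + period - 1) // period
        self.out_projs = nn.ModuleList(
            [nn.Linear(2 * d_model, d_model) for _ in range(n_invocations)]
        )

    def forward(self, emb):                     # emb: original input embeddings
        x = emb
        for i, layer in enumerate(self.mamba_layers):
            if i % self.period == 0:
                # Concatenate original embeddings with the residual stream,
                # run the *shared* attention block, project back, and add.
                x_cat = torch.cat([x, emb], dim=-1)
                x = x + self.out_projs[i // self.period](self.shared_attn(x_cat))
            x = layer(x)
        return x


tokens = torch.randn(2, 16, 512)                # (batch, seq, d_model) toy embeddings
print(ZambaLikeStack()(tokens).shape)           # torch.Size([2, 16, 512])
```

Because the attention and MLP weights exist only once, adding more invocations increases compute but not parameter count, which is the crux of the design.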

Mamba’s dynamics are formulated as:

$$h_{t+1} = \exp(A\,\delta_t)\,h_t + B_t x_t, \qquad y_t = C_t h_{t+1}$$

where $x_t$ is the input, $h_t$ the internal state, and $y_t$ the output. The parameters $\delta_t$, $B_t$, and $C_t$ are input-dependent, giving the model flexibility akin to transformers' attention mechanism. The GSA layer, when invoked, concatenates the current residual stream with the initial model inputs and processes the result through a single self-attention and MLP block whose weights are shared across all invocations. This architectural approach optimizes both memory and computational efficiency.
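As a sanity check on the recurrence above, here is a toy sequential evaluation in NumPy. The diagonal state matrix, randomly generated input-dependent parameters, and plain Python loop are simplifying assumptions; real Mamba implementations produce $\delta_t$, $B_t$, and $C_t$ from learned projections of the input and use fused, hardware-efficient scan kernels.

```python
# Toy, sequential evaluation of the recurrence above (assumptions: a diagonal
# state matrix A, random input-dependent delta/B/C, and a plain Python loop
# rather than the fused parallel-scan kernels used in practice).
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in, seq_len = 16, 4, 32

A = -np.abs(rng.standard_normal(d_state))        # diagonal of A (negative => stable)
x = rng.standard_normal((seq_len, d_in))

h = np.zeros(d_state)
ys = []
for t in range(seq_len):
    # Input-dependent parameters; in Mamba these come from learned projections
    # of x_t, here they are random stand-ins.
    delta_t = np.abs(rng.standard_normal())      # step size delta_t > 0
    B_t = rng.standard_normal((d_state, d_in))
    C_t = rng.standard_normal((d_in, d_state))

    h = np.exp(A * delta_t) * h + B_t @ x[t]     # h_{t+1} = exp(A delta_t) h_t + B_t x_t
    ys.append(C_t @ h)                           # y_t = C_t h_{t+1}

y = np.stack(ys)                                 # (seq_len, d_in)
print(y.shape)
```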

Training Process

Training was bifurcated into two phases:

  1. Phase 1 (Pretraining): Conducted on 1T tokens from openly available web datasets such as The Pile, RefinedWeb, and C4, with a slowly decaying learning rate to maintain a stable training regime.
  2. Annealing Phase: Rapid learning-rate decay over high-quality instruct and synthetic datasets, using a blend of roughly 60% original pretraining data and 40% new high-quality data.

Zamba’s dataset for phase 1 involved minimal filtering and deduplication strategies. The annealing phase incorporated a curriculum learning approach, inspired by recent studies showing that high-quality data can dramatically enhance pretraining efficacy. Zamba's performance metrics were significantly improved in the annealing phase, reinforcing the notion that high-quality curated data can optimize large language model performance.
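A minimal sketch of what such a two-phase annealing recipe could look like is given below. Only the 60/40 data blend and the idea of a rapid learning-rate decay come from the description above; the peak learning rate, step counts, cosine shape, and sampling mechanics are placeholder assumptions.

```python
# Illustrative sketch of the annealing recipe described above (assumptions:
# the exact decay shape, peak learning rate, step counts, and sampling
# mechanics are stand-ins, not values from the paper).
import math
import random


def annealing_lr(step: int, total_steps: int, peak_lr: float = 1.5e-4,
                 min_lr: float = 1e-6) -> float:
    """Rapid decay from the phase-1 learning rate down toward zero over the anneal."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


def sample_source(rng: random.Random) -> str:
    """60/40 blend of original pretraining data and new high-quality data."""
    return "pretraining_web" if rng.random() < 0.6 else "high_quality_or_synthetic"


rng = random.Random(0)
total_steps = 10_000
for step in (0, 5_000, 10_000):
    print(step, f"lr={annealing_lr(step, total_steps):.2e}", sample_source(rng))
```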

Evaluation and Results

Zamba was benchmarked against leading open models such as Llama 2, Mistral, and Gemma across diverse language and reasoning tasks. Although these models edge it out overall, Zamba performs strongly given its much smaller token budget (~1T versus up to ~15T for some competitors). Zero-shot evaluations show that the annealed Zamba model closely trails the leading models, outperforming Llama 2 on several benchmarks and approaching the performance of top-tier models trained on closed datasets.

In inference and generation efficiency, Zamba excels. It achieves lower forward-pass latency and memory usage than comparable transformers, thanks to its compact shared-attention design and efficient Mamba kernels, positioning it as an attractive model for long-sequence processing.
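A back-of-envelope calculation illustrates why fewer attention invocations shrink the generation-time memory footprint: only layers that actually perform attention need a key-value cache that grows with sequence length, while Mamba blocks carry a fixed-size state. The layer counts, hidden size, sequence length, and every-sixth-block attention period below are illustrative assumptions, not Zamba's published configuration.

```python
# Back-of-envelope KV-cache comparison motivating the memory claim above
# (assumptions: illustrative layer counts and hidden size, fp16 cache, and a
# hybrid that invokes attention at every 6th block; not Zamba's exact config).
def kv_cache_bytes(n_attn_layers: int, seq_len: int, d_model: int,
                   bytes_per_elem: int = 2) -> int:
    # Each attention invocation caches keys and values: 2 tensors of (seq_len, d_model).
    return n_attn_layers * 2 * seq_len * d_model * bytes_per_elem


d_model, seq_len = 4096, 32_768
transformer = kv_cache_bytes(n_attn_layers=32, seq_len=seq_len, d_model=d_model)
hybrid = kv_cache_bytes(n_attn_layers=32 // 6, seq_len=seq_len, d_model=d_model)

print(f"pure transformer KV cache: {transformer / 2**30:.1f} GiB")   # ~16.0 GiB
print(f"hybrid (attention every 6th block): {hybrid / 2**30:.1f} GiB")  # ~2.5 GiB
# The Mamba blocks keep a fixed-size recurrent state, so their memory does not
# grow with seq_len at all.
```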

Implications and Future Work

The results from Zamba show that hybrid architectures blending SSM and transformer components are viable alternatives to pure transformers, particularly for applications requiring efficient inference and low memory usage. Zamba's success with a modest training budget (~$200k) and limited computational resources underscores the accessibility of competitive LLM training beyond industry giants.

Beyond these results, the release of all training checkpoints, spanning both the pretraining and annealing phases, facilitates deeper research into learning dynamics and architectural effects, fostering a more informed understanding of hybrid model training.

Conclusion

The Zamba model, introduced by Glorioso et al., stands out for its innovative use of a Mamba-based backbone combined with shared global self-attention, achieving competitive performance with notable efficiency in inference and memory usage. This model's release paves the way for more accessible, high-performance LLM development and offers a rich data source for academic and practical exploration into hybrid model architectures.
