
Zamba: A Compact 7B SSM Hybrid Model (2405.16712v1)

Published 26 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which achieves competitive performance against leading open-weight models at a comparable scale. Zamba is trained on 1T tokens from openly available datasets and is the best non-transformer model at this scale. Zamba pioneers a unique architecture combining a Mamba backbone with a single shared attention module, thus obtaining the benefits of attention at minimal parameter cost. Due to its architecture, Zamba is significantly faster at inference than comparable transformer models and requires substantially less memory for generation of long sequences. Zamba is pretrained in two phases: the first phase is based on existing web datasets, while the second one consists of annealing the model over high-quality instruct and synthetic datasets, and is characterized by a rapid learning rate decay. We open-source the weights and all checkpoints for Zamba, through both phase 1 and annealing phases.

Authors (7)
  1. Paolo Glorioso
  2. Quentin Anthony
  3. Yury Tokpanov
  4. James Whittington
  5. Jonathan Pilault
  6. Adam Ibrahim
  7. Beren Millidge

Summary

  • The paper introduces Zamba, a hybrid 7B model trained on 1T tokens that integrates a Mamba-based SSM backbone with a single shared global self-attention block to boost efficiency and performance.
  • It employs a two-phase training process that combines gradual learning rate decay in pretraining with curriculum-based annealing using high-quality data.
  • Evaluation benchmarks reveal that Zamba achieves competitive inference speed and memory usage, making it well-suited for long-sequence NLP tasks.

Zamba: A Compact 7B SSM Hybrid Model

The paper "Zamba: A Compact 7B SSM Hybrid Model" by Glorioso et al. introduces Zamba, a novel 7B parameter model that combines the benefits of State-Space Models (SSMs) and transformers. Trained on 1T openly available tokens, Zamba makes a significant contribution to the landscape of low-parameter, high-efficiency natural LLMs.

Introduction

Transformers have been the cornerstone of advances in NLP, driven by their scalable architecture and self-attention mechanisms. However, their quadratic computational cost in relation to sequence length remains a bottleneck. This has led to various investigations into alternative architectures, notably SSMs, which promise more efficient sequence mixing via linear dynamical systems. The innovative contribution of Zamba lies in its hybrid architecture, which integrates Mamba-based SSM with a shared global self-attention module to mitigate the limitations inherent in SSMs without the heavy memory costs of full transformer models.
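
To make the cost gap concrete, the following back-of-the-envelope comparison contrasts the quadratic scaling of self-attention with the linear scaling of an SSM scan. The FLOP formulas are deliberately rough (projections and normalizations are ignored), and the model width and state size are hypothetical values, not taken from the paper.

```python
def attention_flops(seq_len: int, d_model: int) -> float:
    """Rough FLOPs for the score and value products of one self-attention
    layer: Q.K^T and scores.V each cost about seq_len^2 * d_model MACs."""
    return 2.0 * seq_len**2 * d_model


def ssm_scan_flops(seq_len: int, d_model: int, state_dim: int = 16) -> float:
    """Rough FLOPs for one SSM layer's recurrent scan: each token updates a
    (d_model x state_dim) hidden state, so cost grows linearly in seq_len."""
    return 2.0 * seq_len * d_model * state_dim


if __name__ == "__main__":
    d = 4096  # hypothetical model width
    for L in (2_048, 16_384, 131_072):
        ratio = attention_flops(L, d) / ssm_scan_flops(L, d)
        print(f"seq_len={L:>7}: attention / SSM-scan FLOP ratio ~ {ratio:,.0f}x")
```

Under these simplifications the ratio is simply seq_len / state_dim, so the gap widens linearly as sequences grow longer.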

Architecture

Zamba leverages a Mamba backbone, an SSM architecture built around an input-dependent linear dynamical system. The standout feature of Zamba is a shared global self-attention (GSA) block that is applied periodically between the Mamba layers. Because the block's weights are reused at every invocation, the design captures attention's benefits for in-context learning and retrieval at minimal additional parameter cost.

Mamba's dynamics are formulated as

$h_{t+1} = \exp(A \delta_t)\, h_t + B_t x_t$

$y_t = C_t h_{t+1}$

where $x_t$ is the input, $h_t$ the internal state, and $y_t$ the output. The parameters $\delta_t$, $B_t$, and $C_t$ are input-dependent, giving the recurrence a flexibility akin to the attention mechanism of transformers. The GSA block, invoked periodically, concatenates the current residual stream with the initial model inputs and processes the result through a single self-attention layer and MLP whose weights are shared across all invocations. This design keeps both memory and compute overhead low while restoring global context mixing.
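
The layer wiring can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the module names, dimensions, invocation period, and the naive per-token recurrence loop standing in for Zyphra's optimized Mamba kernels are all assumptions made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySelectiveSSM(nn.Module):
    """Toy stand-in for a Mamba block: an input-dependent linear recurrence
    h_{t+1} = exp(A * delta_t) * h_t + B_t * x_t,  y_t = C_t * h_{t+1}.
    Real Mamba uses a hardware-aware parallel scan; this loop is for clarity."""

    def __init__(self, d_model: int):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(d_model))       # negative for a decaying state
        self.proj_delta = nn.Linear(d_model, d_model)     # input-dependent step size
        self.proj_B = nn.Linear(d_model, d_model)         # input-dependent input gate
        self.proj_C = nn.Linear(d_model, d_model)         # input-dependent readout

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, d_model)
        h = torch.zeros_like(x[:, 0])
        outputs = []
        for t in range(x.shape[1]):
            xt = x[:, t]
            delta = F.softplus(self.proj_delta(xt))
            h = torch.exp(self.A * delta) * h + self.proj_B(xt) * xt
            outputs.append(self.proj_C(xt) * h)
        return torch.stack(outputs, dim=1)


class SharedAttentionBlock(nn.Module):
    """Single attention + MLP block whose weights are reused at every
    invocation point. Its input is the current residual stream concatenated
    with the original input embeddings, projected back to d_model.
    (Causal masking and normalization are omitted for brevity.)"""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.in_proj = nn.Linear(2 * d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, resid: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        x = self.in_proj(torch.cat([resid, emb], dim=-1))
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = x + attn_out
        return x + self.mlp(x)


class ToyZambaStack(nn.Module):
    """Mamba backbone with ONE shared attention block applied every
    `share_every` layers: attention's benefits at minimal parameter cost."""

    def __init__(self, d_model: int, n_layers: int, share_every: int = 6):
        super().__init__()
        self.mamba_layers = nn.ModuleList(
            [ToySelectiveSSM(d_model) for _ in range(n_layers)]
        )
        self.shared_attn = SharedAttentionBlock(d_model)   # a single set of weights
        self.share_every = share_every

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        resid = emb
        for i, mamba in enumerate(self.mamba_layers):
            if i % self.share_every == 0:
                resid = resid + self.shared_attn(resid, emb)   # same weights each time
            resid = resid + mamba(resid)
        return resid
```

A quick smoke test is `ToyZambaStack(d_model=64, n_layers=12)(torch.randn(2, 16, 64))`. The key point of the design is that `SharedAttentionBlock` is constructed once, so adding more invocation points costs compute and activations but essentially no additional parameters.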

Training Process

Training was bifurcated into two phases:

  1. Phase 1 (Pretraining): Conducted on 1T tokens from open web datasets such as The Pile, RefinedWeb, and C4, with a slowly decaying learning rate to maintain a stable training regime.
  2. Annealing Phase: This phase employed rapid learning rate decay over high-quality and synthetic datasets. A blend of original pretraining data (60%) and new high-quality data (40%) facilitated improved tuning.

Zamba's phase-1 dataset involved only minimal filtering and deduplication. The annealing phase adopted a curriculum-learning approach, inspired by recent work showing that high-quality data can dramatically improve pretraining efficacy. Zamba's benchmark scores improved markedly during annealing, reinforcing the view that carefully curated, high-quality data can substantially boost LLM performance.
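
The two-phase learning-rate behavior described above can be sketched as follows; the warmup length, peak rate, decay shape, and decay constants are placeholders rather than the values reported in the paper.

```python
import math


def phase1_lr(step: int, total_steps: int, peak: float = 1.5e-4,
              warmup: int = 2_000, floor: float = 1.5e-5) -> float:
    """Phase 1 (pretraining): linear warmup followed by a slow cosine decay
    toward a non-zero floor. All rates here are placeholder values."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


def annealing_lr(step: int, total_steps: int, start: float = 1.5e-5,
                 decay_rate: float = 8.0) -> float:
    """Annealing phase: a much shorter run over the 60/40 blend of original
    and high-quality/synthetic data with a rapid (here exponential) decay.
    The starting rate and decay constant are placeholders."""
    return start * math.exp(-decay_rate * step / max(1, total_steps))


if __name__ == "__main__":
    for s in (0, 1_000, 50_000, 100_000):
        print(f"phase-1   step {s:>7}: lr = {phase1_lr(s, 100_000):.2e}")
    for s in (0, 2_500, 5_000, 10_000):
        print(f"annealing step {s:>7}: lr = {annealing_lr(s, 10_000):.2e}")
```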

Evaluation and Results

Zamba was benchmarked against leading open models such as Llama 2, Mistral, and Gemma across diverse language and reasoning tasks. Although it falls slightly short of these models, Zamba performs strongly given its much smaller training corpus of roughly 1T tokens (versus up to 15T for some competitors). Zero-shot evaluations show that the annealed Zamba model closely trails the leading models, outperforms Llama 2 on several benchmarks, and approaches the efficiency of top-tier models trained on closed datasets.

Zamba excels in inference and generation efficiency: its single shared attention block and efficient Mamba kernels give it lower forward-pass latency and a substantially smaller memory footprint than comparable transformers, making it an attractive model for long-sequence processing.
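
A rough KV-cache calculation illustrates why generation memory drops when attention is applied at only a few points in the stack rather than at every layer. The layer counts, head dimensions, and sequence length below are hypothetical and serve only to show the scaling; the SSM layers add only a small, sequence-length-independent state on top of this.

```python
def kv_cache_bytes(n_attn_points: int, seq_len: int, n_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache (keys + values, one sequence, fp16/bf16) for a model
    that applies full self-attention at `n_attn_points` places in its stack."""
    return 2 * n_attn_points * seq_len * n_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    seq_len = 32_768  # hypothetical generation length
    dense = kv_cache_bytes(n_attn_points=32, seq_len=seq_len)  # attention at every layer
    hybrid = kv_cache_bytes(n_attn_points=6, seq_len=seq_len)  # attention at a few points
    print(f"dense transformer   : {dense / 2**30:.2f} GiB of KV cache")
    print(f"SSM-attention hybrid: {hybrid / 2**30:.2f} GiB of KV cache")
```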

Implications and Future Work

The results from Zamba show that hybrid architectures blending SSMs and attention are viable alternatives to pure transformers, particularly for applications that require efficient inference and low memory usage. Zamba's success on a modest training budget (~$200k) with limited computational resources underscores that competitive LLM training is accessible beyond industry giants.

Future directions should explore:

  • Scaling the Zamba architecture beyond 7B parameters.
  • Extending annealing and pretraining datasets to enhance model robustness.
  • Investigating potential optimizations in the GSA block placement and its impact on long-range dependencies.

By releasing all training checkpoints, Zamba facilitates deeper research into learning dynamics and architectural impacts, fostering a more informed understanding of hybrid model training.

Conclusion

The Zamba model, introduced by Glorioso et al., stands out for its innovative use of a Mamba-based backbone combined with shared global self-attention, achieving competitive performance with notable efficiency in inference and memory usage. This model's release paves the way for more accessible, high-performance LLM development and offers a rich data source for academic and practical exploration into hybrid model architectures.
