
Zamba: A Compact 7B SSM Hybrid Model (2405.16712v1)

Published 26 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model which achieves competitive performance against leading open-weight models at a comparable scale. Zamba is trained on 1T tokens from openly available datasets and is the best non-transformer model at this scale. Zamba pioneers a unique architecture combining a Mamba backbone with a single shared attention module, thus obtaining the benefits of attention at minimal parameter cost. Due to its architecture, Zamba is significantly faster at inference than comparable transformer models and requires substantially less memory for generation of long sequences. Zamba is pretrained in two phases: the first phase is based on existing web datasets, while the second one consists of annealing the model over high-quality instruct and synthetic datasets, and is characterized by a rapid learning rate decay. We open-source the weights and all checkpoints for Zamba, through both phase 1 and annealing phases.

Authors (7)
  1. Paolo Glorioso
  2. Quentin Anthony
  3. Yury Tokpanov
  4. James Whittington
  5. Jonathan Pilault
  6. Adam Ibrahim
  7. Beren Millidge

Summary

  • The paper introduces Zamba, a hybrid 7B model trained on 1T tokens that integrates a Mamba-based SSM backbone with a single shared global self-attention block to boost efficiency and performance.
  • It employs a two-phase training process that combines gradual learning rate decay in pretraining with curriculum-based annealing using high-quality data.
  • Evaluation benchmarks reveal that Zamba achieves competitive inference speed and memory usage, making it well-suited for long-sequence NLP tasks.

Zamba: A Compact 7B SSM Hybrid Model

The paper "Zamba: A Compact 7B SSM Hybrid Model" by Glorioso et al. introduces Zamba, a novel 7B parameter model that combines the benefits of State-Space Models (SSMs) and transformers. Trained on 1T openly available tokens, Zamba makes a significant contribution to the landscape of low-parameter, high-efficiency natural LLMs.

Introduction

Transformers have been the cornerstone of advances in NLP, driven by their scalable architecture and self-attention mechanisms. However, their quadratic computational cost in relation to sequence length remains a bottleneck. This has led to various investigations into alternative architectures, notably SSMs, which promise more efficient sequence mixing via linear dynamical systems. The innovative contribution of Zamba lies in its hybrid architecture, which integrates Mamba-based SSM with a shared global self-attention module to mitigate the limitations inherent in SSMs without the heavy memory costs of full transformer models.
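
To make the cost gap concrete, the following back-of-the-envelope comparison contrasts the quadratic scaling of self-attention with the linear scaling of an SSM scan. The FLOP formulas are deliberately rough (projections and normalizations are ignored), and the model width and state size are hypothetical values, not taken from the paper.

```python
def attention_flops(seq_len: int, d_model: int) -> float:
    """Rough FLOPs for the score and value products of one self-attention
    layer: Q.K^T and scores.V each cost about seq_len^2 * d_model MACs."""
    return 2.0 * seq_len**2 * d_model


def ssm_scan_flops(seq_len: int, d_model: int, state_dim: int = 16) -> float:
    """Rough FLOPs for one SSM layer's recurrent scan: each token updates a
    (d_model x state_dim) hidden state, so cost grows linearly in seq_len."""
    return 2.0 * seq_len * d_model * state_dim


if __name__ == "__main__":
    d = 4096  # hypothetical model width
    for L in (2_048, 16_384, 131_072):
        ratio = attention_flops(L, d) / ssm_scan_flops(L, d)
        print(f"seq_len={L:>7}: attention / SSM-scan FLOP ratio ~ {ratio:,.0f}x")
```

Under these simplifications the ratio is simply seq_len / state_dim, so the gap widens linearly as sequences grow longer.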

Architecture

Zamba leverages a Mamba backbone, an SSM architecture built around an input-dependent linear dynamical system. The standout feature of Zamba is a shared global self-attention (GSA) block that is applied periodically between the Mamba layers. Because the block's weights are reused at every invocation, the design captures attention's benefits for in-context learning and retrieval at minimal additional parameter cost.

Mamba's dynamics are formulated as

$h_{t+1} = \exp(A \delta_t)\, h_t + B_t x_t$

$y_t = C_t h_{t+1}$

where $x_t$ is the input, $h_t$ the internal state, and $y_t$ the output. The parameters $\delta_t$, $B_t$, and $C_t$ are input-dependent, giving the recurrence a flexibility akin to the attention mechanism of transformers. The GSA block, invoked periodically, concatenates the current residual stream with the initial model inputs and processes the result through a single self-attention layer and MLP whose weights are shared across all invocations. This design keeps both memory and compute overhead low while restoring global context mixing.
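
The layer wiring can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the module names, dimensions, invocation period, and the naive per-token recurrence loop standing in for Zyphra's optimized Mamba kernels are all assumptions made for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToySelectiveSSM(nn.Module):
    """Toy stand-in for a Mamba block: an input-dependent linear recurrence
    h_{t+1} = exp(A * delta_t) * h_t + B_t * x_t,  y_t = C_t * h_{t+1}.
    Real Mamba uses a hardware-aware parallel scan; this loop is for clarity."""

    def __init__(self, d_model: int):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(d_model))       # negative for a decaying state
        self.proj_delta = nn.Linear(d_model, d_model)     # input-dependent step size
        self.proj_B = nn.Linear(d_model, d_model)         # input-dependent input gate
        self.proj_C = nn.Linear(d_model, d_model)         # input-dependent readout

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, d_model)
        h = torch.zeros_like(x[:, 0])
        outputs = []
        for t in range(x.shape[1]):
            xt = x[:, t]
            delta = F.softplus(self.proj_delta(xt))
            h = torch.exp(self.A * delta) * h + self.proj_B(xt) * xt
            outputs.append(self.proj_C(xt) * h)
        return torch.stack(outputs, dim=1)


class SharedAttentionBlock(nn.Module):
    """Single attention + MLP block whose weights are reused at every
    invocation point. Its input is the current residual stream concatenated
    with the original input embeddings, projected back to d_model.
    (Causal masking and normalization are omitted for brevity.)"""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.in_proj = nn.Linear(2 * d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, resid: torch.Tensor, emb: torch.Tensor) -> torch.Tensor:
        x = self.in_proj(torch.cat([resid, emb], dim=-1))
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = x + attn_out
        return x + self.mlp(x)


class ToyZambaStack(nn.Module):
    """Mamba backbone with ONE shared attention block applied every
    `share_every` layers: attention's benefits at minimal parameter cost."""

    def __init__(self, d_model: int, n_layers: int, share_every: int = 6):
        super().__init__()
        self.mamba_layers = nn.ModuleList(
            [ToySelectiveSSM(d_model) for _ in range(n_layers)]
        )
        self.shared_attn = SharedAttentionBlock(d_model)   # a single set of weights
        self.share_every = share_every

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        resid = emb
        for i, mamba in enumerate(self.mamba_layers):
            if i % self.share_every == 0:
                resid = resid + self.shared_attn(resid, emb)   # same weights each time
            resid = resid + mamba(resid)
        return resid
```

A quick smoke test is `ToyZambaStack(d_model=64, n_layers=12)(torch.randn(2, 16, 64))`. The key point of the design is that `SharedAttentionBlock` is constructed once, so adding more invocation points costs compute and activations but essentially no additional parameters.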

Training Process

Training was bifurcated into two phases:

  1. Phase 1 (Pretraining): Conducted on 1T tokens from open web datasets such as The Pile, RefinedWeb, and C4, with a slowly decaying learning rate to maintain a stable training regime.
  2. Annealing Phase: This phase employed rapid learning rate decay over high-quality and synthetic datasets. A blend of original pretraining data (60%) and new high-quality data (40%) facilitated improved tuning.

Zamba's phase-1 dataset involved only minimal filtering and deduplication. The annealing phase adopted a curriculum-learning approach, inspired by recent work showing that high-quality data can dramatically improve pretraining efficacy. Zamba's benchmark scores improved markedly during annealing, reinforcing the view that carefully curated, high-quality data can substantially boost LLM performance.
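
The two-phase learning-rate behavior described above can be sketched as follows; the warmup length, peak rate, decay shape, and decay constants are placeholders rather than the values reported in the paper.

```python
import math


def phase1_lr(step: int, total_steps: int, peak: float = 1.5e-4,
              warmup: int = 2_000, floor: float = 1.5e-5) -> float:
    """Phase 1 (pretraining): linear warmup followed by a slow cosine decay
    toward a non-zero floor. All rates here are placeholder values."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1.0 + math.cos(math.pi * progress))


def annealing_lr(step: int, total_steps: int, start: float = 1.5e-5,
                 decay_rate: float = 8.0) -> float:
    """Annealing phase: a much shorter run over the 60/40 blend of original
    and high-quality/synthetic data with a rapid (here exponential) decay.
    The starting rate and decay constant are placeholders."""
    return start * math.exp(-decay_rate * step / max(1, total_steps))


if __name__ == "__main__":
    for s in (0, 1_000, 50_000, 100_000):
        print(f"phase-1   step {s:>7}: lr = {phase1_lr(s, 100_000):.2e}")
    for s in (0, 2_500, 5_000, 10_000):
        print(f"annealing step {s:>7}: lr = {annealing_lr(s, 10_000):.2e}")
```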

Evaluation and Results

Zamba was benchmarked against leading open models such as Llama 2, Mistral, and Gemma across diverse language and reasoning tasks. Although it falls slightly short of these models, Zamba performs strongly given its much smaller training corpus of roughly 1T tokens (versus up to 15T for some competitors). Zero-shot evaluations show that the annealed Zamba model closely trails the leading models, outperforms Llama 2 on several benchmarks, and approaches the efficiency of top-tier models trained on closed datasets.

Zamba excels in inference and generation efficiency: its single shared attention block and efficient Mamba kernels give it lower forward-pass latency and a substantially smaller memory footprint than comparable transformers, making it an attractive model for long-sequence processing.
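
A rough KV-cache calculation illustrates why generation memory drops when attention is applied at only a few points in the stack rather than at every layer. The layer counts, head dimensions, and sequence length below are hypothetical and serve only to show the scaling; the SSM layers add only a small, sequence-length-independent state on top of this.

```python
def kv_cache_bytes(n_attn_points: int, seq_len: int, n_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache (keys + values, one sequence, fp16/bf16) for a model
    that applies full self-attention at `n_attn_points` places in its stack."""
    return 2 * n_attn_points * seq_len * n_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    seq_len = 32_768  # hypothetical generation length
    dense = kv_cache_bytes(n_attn_points=32, seq_len=seq_len)  # attention at every layer
    hybrid = kv_cache_bytes(n_attn_points=6, seq_len=seq_len)  # attention at a few points
    print(f"dense transformer   : {dense / 2**30:.2f} GiB of KV cache")
    print(f"SSM-attention hybrid: {hybrid / 2**30:.2f} GiB of KV cache")
```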

Implications and Future Work

The results from Zamba show that hybrid architectures blending SSMs and attention are viable alternatives to pure transformers, particularly for applications that require efficient inference and low memory usage. Zamba's success on a modest training budget (~$200k) with limited computational resources underscores that competitive LLM training is accessible beyond industry giants.

Future directions should explore:

  • Scaling the Zamba architecture beyond 7B parameters.
  • Extending annealing and pretraining datasets to enhance model robustness.
  • Investigating potential optimizations in the GSA block placement and its impact on long-range dependencies.

By releasing all training checkpoints, Zamba facilitates deeper research into learning dynamics and architectural impacts, fostering a more informed understanding of hybrid model training.

Conclusion

The Zamba model, introduced by Glorioso et al., stands out for its innovative use of a Mamba-based backbone combined with shared global self-attention, achieving competitive performance with notable efficiency in inference and memory usage. This model's release paves the way for more accessible, high-performance LLM development and offers a rich data source for academic and practical exploration into hybrid model architectures.
