Zamba: A Compact 7B SSM Hybrid Model (2405.16712v1)
Abstract: In this technical report, we present Zamba, a novel 7B SSM-transformer hybrid model that achieves competitive performance against leading open-weight models at a comparable scale. Zamba is trained on 1T tokens from openly available datasets and is the best non-transformer model at this scale. Zamba pioneers a unique architecture combining a Mamba backbone with a single shared attention module, thus obtaining the benefits of attention at minimal parameter cost. Due to its architecture, Zamba is significantly faster at inference than comparable transformer models and requires substantially less memory for generation of long sequences. Zamba is pretrained in two phases: the first phase is based on existing web datasets, while the second consists of annealing the model over high-quality instruct and synthetic datasets and is characterized by a rapid learning rate decay. We open-source the weights and all checkpoints for Zamba, across both the phase 1 and annealing phases.
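The core architectural idea in the abstract, a stack of Mamba blocks interleaved with a single attention module whose parameters are reused at every call site, can be sketched as follows. This is a minimal illustration and not the released Zamba implementation: the Mamba layer is stubbed with a gated MLP (a real version would use the external `mamba_ssm` package), and the model width, head count, layer count, and interleaving period are illustrative assumptions rather than the paper's actual hyperparameters.

```python
# Minimal sketch (not the official Zamba code) of a Mamba-style backbone with a
# single shared attention block whose parameters are reused every few layers.
import torch
import torch.nn as nn


class SSMBlockStub(nn.Module):
    """Placeholder for a Mamba block; a gated MLP stands in for the SSM."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        return x + self.out_proj(h * torch.sigmoid(gate))  # residual connection


class SharedAttention(nn.Module):
    """One self-attention module; its weights are reused at every interleave point."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out  # residual connection


class SharedAttentionHybrid(nn.Module):
    """SSM backbone with a single shared attention block applied every `attn_every` layers."""

    def __init__(self, d_model: int = 512, n_heads: int = 8,
                 n_layers: int = 12, attn_every: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList([SSMBlockStub(d_model) for _ in range(n_layers)])
        self.shared_attn = SharedAttention(d_model, n_heads)  # one copy of attention weights
        self.attn_every = attn_every

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for i, block in enumerate(self.blocks):
            if i % self.attn_every == 0:
                x = self.shared_attn(x)  # same parameters at every interleave point
            x = block(x)
        return x


if __name__ == "__main__":
    model = SharedAttentionHybrid()
    tokens = torch.randn(2, 16, 512)  # (batch, sequence, d_model)
    print(model(tokens).shape)        # torch.Size([2, 16, 512])
```

Because the attention parameters exist only once, the parameter overhead relative to a pure-SSM stack is a single attention block, which matches the abstract's claim of obtaining the benefits of attention at minimal parameter cost; only the (small) activation and KV costs grow with the number of call sites.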
Authors: Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, Beren Millidge