MambaByte: Token-free Selective State Space Model (2401.13660v3)
Abstract: Token-free language models learn directly from raw bytes and remove the inductive bias of subword tokenization. Operating on bytes, however, results in significantly longer sequences. In this setting, standard autoregressive Transformers scale poorly as the effective memory required grows with sequence length. The recent development of the Mamba state space model (SSM) offers an appealing alternative approach with a fixed-sized memory state and efficient decoding. We propose MambaByte, a token-free adaptation of the Mamba SSM trained autoregressively on byte sequences. In terms of modeling, we show MambaByte to be competitive with, and even to outperform, state-of-the-art subword Transformers on language modeling tasks while maintaining the benefits of token-free language models, such as robustness to noise. In terms of efficiency, we develop an adaptation of speculative decoding with tokenized drafting and byte-level verification. This results in a $2.6\times$ inference speedup over the standard MambaByte implementation, showing decoding efficiency similar to that of the subword Mamba. These findings establish the viability of SSMs in enabling token-free language modeling.
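The speculative-decoding adaptation described in the abstract (a subword model drafts a short continuation; the byte-level MambaByte verifies it) can be illustrated with a minimal greedy sketch. This is not the paper's implementation: the `draft_next_tokens`, `detokenize_to_bytes`, and `verify_argmax_bytes` interfaces are hypothetical placeholders standing in for the subword drafter, its detokenizer, and the byte-level verifier, and acceptance here uses simple argmax matching rather than any particular sampling-based rule.

```python
from typing import Callable, List

def speculative_decode_bytes(
    draft_next_tokens: Callable[[List[int], int], List[int]],  # hypothetical subword drafter
    detokenize_to_bytes: Callable[[List[int]], bytes],         # subword ids -> raw bytes
    verify_argmax_bytes: Callable[[List[int]], List[int]],     # byte-level model: argmax next byte at every position
    prompt_bytes: List[int],
    max_new_bytes: int = 256,
    draft_len: int = 4,
) -> List[int]:
    """Greedy speculative decoding sketch: a subword model drafts a few tokens,
    the byte-level model scores the drafted bytes in one parallel pass, and the
    longest matching byte prefix (plus one corrected byte) is accepted."""
    out = list(prompt_bytes)
    while len(out) - len(prompt_bytes) < max_new_bytes:
        # 1) Draft a few subword tokens and expand them to candidate bytes.
        draft_tokens = draft_next_tokens(out, draft_len)
        cand = list(detokenize_to_bytes(draft_tokens))
        if not cand:
            break
        # 2) Verify: the byte-level model's argmax prediction at each position
        #    of (context + candidate bytes); preds[i] predicts the byte after position i.
        preds = verify_argmax_bytes(out + cand)
        offset = len(out) - 1  # prediction index for the first candidate byte
        n_accept = 0
        for i, b in enumerate(cand):
            if preds[offset + i] == b:
                n_accept += 1
            else:
                break
        # 3) Keep the accepted prefix, then one byte from the verifier itself,
        #    so every iteration makes progress even if nothing was accepted.
        out.extend(cand[:n_accept])
        out.append(preds[offset + n_accept])
    return out[: len(prompt_bytes) + max_new_bytes]
```

The speedup comes from the verifier scoring all drafted bytes in a single parallel pass instead of one sequential step per byte; on accepted drafts, several bytes are emitted per verifier call.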