Linearizing Large Language Models (2405.06640v1)
Abstract: Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state that lowers inference cost. However, their original formulation suffers from poor scaling and underperforms compute-matched transformers. Recent linear models such as RWKV and Mamba have attempted to address these shortcomings by proposing novel time-mixing and gating architectures, but pre-training LLMs requires significant data and compute investments. Thus, the search for subquadratic architectures is limited by the availability of compute and quality pre-training datasets. As a cost-effective alternative to pre-training linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA). We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget. This allows us to leverage the strong pre-training data and performance of existing transformer LLMs, while requiring 5% of the training cost. We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models. Our code and models can be found at https://github.com/TRI-ML/linear_open_lm.
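To make the fixed-size recurrent state concrete, below is a minimal sketch of causal linear attention computed as a recurrence, the general mechanism that linearization targets. This is an illustrative assumption, not the paper's SUPRA recipe: the feature map, the sum-based normalizer, and the absence of positional handling are placeholder choices, and the released code in the repository above is authoritative.

```python
# Minimal sketch: causal linear attention as an RNN with a fixed-size state.
# Not the paper's exact SUPRA formulation; feature map and normalization
# here are simple placeholders chosen for clarity.
import torch
import torch.nn.functional as F

def linear_attention_recurrent(q, k, v, eps=1e-6):
    """Causal linear attention computed step by step.

    q, k, v: (seq_len, d) tensors for a single head.
    Returns: (seq_len, d) outputs, using only a d x d recurrent state.
    """
    seq_len, d = q.shape
    phi = lambda x: F.elu(x) + 1.0          # simple positive feature map (assumption)
    state = torch.zeros(d, d)               # running sum of outer(phi(k_t), v_t)
    normalizer = torch.zeros(d)             # running sum of phi(k_t)
    outputs = []
    for t in range(seq_len):
        qt, kt, vt = phi(q[t]), phi(k[t]), v[t]
        state = state + torch.outer(kt, vt)           # accumulate key-value memory
        normalizer = normalizer + kt                  # accumulate key mass
        out = (qt @ state) / (qt @ normalizer + eps)  # read out with normalization
        outputs.append(out)
    return torch.stack(outputs)

# Per-token compute and memory are constant in sequence length, unlike
# softmax attention, whose key-value cache grows with position t.
q, k, v = (torch.randn(16, 64) for _ in range(3))
y = linear_attention_recurrent(q, k, v)   # shape (16, 64)
```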
- GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.
- In-context language learning: Architectures and algorithms. arXiv preprint arXiv:2401.12973, 2024.
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- GPT-NeoX-20B: An open-source autoregressive language model, 2022. URL https://arxiv.org/abs/2204.06745.
- Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
- Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022.
- A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4599–4610, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.365. URL https://aclanthology.org/2021.naacl-main.365.
- Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.
- emozilla. Dynamically scaled RoPE further increases performance of long context LLaMA with zero fine-tuning, 2023. URL https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. Reddit post.
- A framework for few-shot language model evaluation, 12 2023. URL https://zenodo.org/records/10256836.
- RecurrentGemma. arXiv preprint arXiv:2404.07839, 2024.
- Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- OpenLM: a minimal but performative language modeling (lm) repository, 2023. URL https://github.com/mlfoundations/open_lm/. GitHub repository.
- Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
- Finetuning pretrained transformers into RNNs. arXiv preprint arXiv:2103.13076, 2021.
- Transformers are RNNs: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp. 5156–5165. PMLR, 2020.
- The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018. doi: 10.1162/tacl_a_00023. URL https://aclanthology.org/Q18-1023.
- Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023.
- Luna: Linear unified nested attention. Advances in Neural Information Processing Systems, 34:2441–2453, 2021.
- Jean Mercat. Higher order linear transformer. arXiv preprint arXiv:2010.14816, 2020.
- Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143, 2024.
- The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
- RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048, 2023a.
- YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023b.
- Random feature attention. arXiv preprint arXiv:2103.02143, 2021.
- Hyena hierarchy: Towards larger convolutional language models. arXiv preprint arXiv:2302.10866, 2023.
- Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5, 2023.
- The devil in linear transformer. arXiv preprint arXiv:2210.10340, 2022a.
- cosFormer: Rethinking softmax in attention. arXiv preprint arXiv:2202.08791, 2022b.
- Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models. arXiv preprint arXiv:2401.04658, 2024.
- SCROLLS: Standardized CompaRison over long language sequences. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 12007–12021, Abu Dhabi, United Arab Emirates, December 2022a. Association for Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.823.
- RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
- Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023.
- Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019, pp. 10–19, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450367196. doi: 10.1145/3315508.3329973. URL https://doi.org/10.1145/3315508.3329973.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
- Group normalization. In Proceedings of the European conference on computer vision (ECCV), pp. 3–19, 2018.
- Effective long-context scaling of foundation models, 2023.
- An attention free transformer. arXiv preprint arXiv:2105.14103, 2021.
- The hedgehog & the porcupine: Expressive linear attentions with softmax mimicry. arXiv preprint arXiv:2402.04347, 2024.