RWKV: Reinventing RNNs for the Transformer Era (2305.13048v2)
Abstract: Transformers have revolutionized almost all NLP tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the performance of Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows the model to be formulated as either a Transformer or an RNN, parallelizing computation during training while maintaining constant computational and memory complexity during inference. We scale our models to as many as 14 billion parameters, by far the largest dense RNN ever trained, and find that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work is a significant step towards reconciling the trade-off between computational efficiency and model performance in sequence-processing tasks.
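The dual Transformer/RNN formulation rests on replacing softmax attention with a decayed, key-weighted average of values that can be updated one token at a time. The sketch below illustrates that idea in NumPy under simplifying assumptions; the names `k`, `v`, `w`, `u` and the `wkv_recurrent` helper are ours for illustration, and the receptance gating and the numerically stable fused kernel of the released RWKV code are omitted.

```python
# Simplified sketch (not the official RWKV kernel): a linear-attention-style
# recurrence whose state is two vectors, so per-token inference cost and
# memory do not grow with sequence length.
import numpy as np

def wkv_recurrent(k, v, w, u):
    """k, v: (T, C) key/value sequences; w: (C,) positive per-channel decay;
    u: (C,) per-channel bonus weight for the current token. Returns (T, C)."""
    T, C = k.shape
    num = np.zeros(C)   # decayed sum of exp(k_i) * v_i over past tokens
    den = np.zeros(C)   # decayed sum of exp(k_i) over past tokens
    out = np.empty((T, C))
    decay = np.exp(-w)  # constant per-channel decay factor in (0, 1)
    for t in range(T):
        cur = np.exp(u + k[t])                     # weight of the current token
        out[t] = (num + cur * v[t]) / (den + cur)  # weighted average of values
        num = decay * num + np.exp(k[t]) * v[t]    # fold token t into the state
        den = decay * den + np.exp(k[t])
    return out
```

Because the same quantity can be evaluated across all positions in parallel, training proceeds like a Transformer, while at inference time only the running pair of sums needs to be carried forward.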