RWKV: Reinventing RNNs for the Transformer Era (2305.13048v2)
Abstract: Transformers have revolutionized almost all NLP tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the performance of Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows the model to be formulated as either a Transformer or an RNN, parallelizing computation during training while maintaining constant computational and memory complexity during inference. We scale our models to as many as 14 billion parameters, by far the largest dense RNN ever trained, and find that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work is a significant step towards reconciling the trade-off between computational efficiency and model performance in sequence-processing tasks.
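The dual Transformer/RNN formulation rests on replacing softmax attention with a decayed, key-weighted average of values that can be updated one token at a time. The sketch below illustrates that idea in NumPy under simplifying assumptions; the names `k`, `v`, `w`, `u` and the `wkv_recurrent` helper are ours for illustration, and the receptance gating and the numerically stable fused kernel of the released RWKV code are omitted.

```python
# Simplified sketch (not the official RWKV kernel): a linear-attention-style
# recurrence whose state is two vectors, so per-token inference cost and
# memory do not grow with sequence length.
import numpy as np

def wkv_recurrent(k, v, w, u):
    """k, v: (T, C) key/value sequences; w: (C,) positive per-channel decay;
    u: (C,) per-channel bonus weight for the current token. Returns (T, C)."""
    T, C = k.shape
    num = np.zeros(C)   # decayed sum of exp(k_i) * v_i over past tokens
    den = np.zeros(C)   # decayed sum of exp(k_i) over past tokens
    out = np.empty((T, C))
    decay = np.exp(-w)  # constant per-channel decay factor in (0, 1)
    for t in range(T):
        cur = np.exp(u + k[t])                     # weight of the current token
        out[t] = (num + cur * v[t]) / (den + cur)  # weighted average of values
        num = decay * num + np.exp(k[t]) * v[t]    # fold token t into the state
        den = decay * den + np.exp(k[t])
    return out
```

Because the same quantity can be evaluated across all positions in parallel, training proceeds like a Transformer, while at inference time only the running pair of sums needs to be carried forward.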