Retentive Network: A Successor to Transformer for Large Language Models (2307.08621v4)

Published 17 Jul 2023 in cs.CL and cs.LG

Abstract: In this work, we propose Retentive Network (RetNet) as a foundation architecture for LLMs, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for LLMs. Code will be available at https://aka.ms/retnet.

Summary

  • The paper introduces RetNet’s retention mechanism that integrates parallel, recurrent, and chunkwise recurrent paradigms, achieving up to 70% GPU memory savings and 8.4× faster decoding.
  • The paper establishes a theoretical link between recurrence and attention, merging the strengths of RNNs and Transformers for enhanced model performance.
  • The paper demonstrates practical benefits for large language models by reducing training memory, accelerating processing speed, and enabling scalable real-world applications.

Retentive Network: A Successor to Transformer for LLMs

The paper "Retentive Network: A Successor to Transformer for LLMs" introduces Retentive Network (RetNet), a novel architecture designed to address the limitations of the Transformer architecture in terms of training parallelism, inference cost, and performance, particularly for LLMs. RetNet is presented as a direct contender to Transformer, offering significant improvements in efficiency and scalability while maintaining, and sometimes exceeding, the performance of its predecessor.

Key Contributions

  1. Retention Mechanism: RetNet introduces a retention mechanism that supports three distinct computation paradigms: parallel, recurrent, and chunkwise recurrent representations. This flexibility allows the architecture to leverage parallelism for efficient training while utilizing recurrent mechanisms for inference, thereby reducing memory and computational costs.
  2. Theoretical Foundation: The paper establishes a theoretical connection between recurrence and attention mechanisms, paving the way for the introduction of the retention mechanism. This approach combines the strengths of recurrent neural networks (RNNs) and the attention mechanism, aiming to achieve the best of both worlds.
  3. Three Computation Paradigms:
    • Parallel Representation: This paradigm is utilized during training to fully leverage GPU parallelism.
    • Recurrent Representation: This supports $O(1)$ inference complexity, thereby reducing the inference cost significantly.
    • Chunkwise Recurrent Representation: This allows for efficient long-sequence modeling with linear complexity, encoding each chunk in parallel while recurrently summarizing the chunks (a minimal sketch of all three forms follows this list).
  4. Experimental Validation: Extensive experiments demonstrate that RetNet achieves favorable scaling results, efficient training parallelism, and low-cost inference. The results indicate that RetNet is a strong competitor to Transformer in terms of both performance and efficiency.
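
To make the retention duality concrete, the following is a minimal single-head NumPy sketch of the three computation paradigms. It omits the paper's xPos-style rotation of queries and keys, and the shapes, decay value, chunk size, and variable names are illustrative assumptions rather than the authors' implementation; the point is only that the parallel, recurrent, and chunkwise forms compute the same outputs.

```python
# Minimal single-head retention sketch (NumPy). The decay gamma, head size,
# and the omission of the paper's xPos-style rotation are simplifications.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                      # toy sequence length and head dimension
gamma = 0.9                      # per-head exponential decay
X = rng.standard_normal((T, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# 1) Parallel form (training): Retention(X) = (Q K^T ⊙ D) V,
#    where D[n, m] = gamma^(n - m) for n >= m and 0 otherwise.
idx = np.arange(T)
D = np.where(idx[:, None] >= idx[None, :],
             gamma ** (idx[:, None] - idx[None, :]), 0.0)
out_parallel = (Q @ K.T * D) @ V

# 2) Recurrent form (constant-cost decoding):
#    S_n = gamma * S_{n-1} + k_n^T v_n,   o_n = q_n S_n.
S = np.zeros((d, d))
out_recurrent = np.zeros((T, d))
for t in range(T):
    S = gamma * S + np.outer(K[t], V[t])
    out_recurrent[t] = Q[t] @ S

# 3) Chunkwise recurrent form (long sequences): parallel inside each chunk,
#    with a recurrent state carrying information across chunks.
B = 3                            # chunk size
S = np.zeros((d, d))
out_chunk = np.zeros((T, d))
for start in range(0, T, B):
    q, k, v = Q[start:start+B], K[start:start+B], V[start:start+B]
    L = len(q)
    j = np.arange(L)
    decay = np.where(j[:, None] >= j[None, :],
                     gamma ** (j[:, None] - j[None, :]), 0.0)
    inner = (q @ k.T * decay) @ v                 # within-chunk (parallel)
    cross = (q * gamma ** (j[:, None] + 1)) @ S   # contribution of earlier chunks
    out_chunk[start:start+B] = inner + cross
    # fold this chunk into the running state for the next chunk
    S = gamma ** L * S + (k * gamma ** (L - 1 - j)[:, None]).T @ v

# All three paradigms compute identical outputs.
assert np.allclose(out_parallel, out_recurrent)
assert np.allclose(out_parallel, out_chunk)
```

In this toy setting the three forms agree to floating-point precision, which is the property that lets RetNet train with the parallel form and decode with the recurrent one.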

Numerical Results and Claims

RetNet shows a substantial reduction in GPU memory usage during inference, saving 70% of memory compared to a Transformer with a key-value (KV) cache when processing sequences of 8k tokens, and it achieves an 8.4× improvement in decoding speed. During training, RetNet demonstrates a 25-50% reduction in memory consumption and a 7× speedup compared to standard Transformer models. Moreover, even when compared with FlashAttention-optimized Transformers, RetNet exhibits competitive or superior throughput and memory efficiency.

Implications and Future Directions

The implications of this research are significant both theoretically and practically:

  • Theoretical Implications: The dual-form representation reinforces the connection between recurrent models and attention mechanisms. This alignment could inspire future advances in hybrid architectures that leverage these principles to further optimize performance and efficiency.
  • Practical Implications: RetNet’s efficient training and inference paradigms make it highly suitable for deployment in real-world applications where resource constraints are a critical consideration. This could lead to more widespread adoption of LLMs in industry, particularly in scenarios requiring scalable and low-latency inference.

Speculative Outlook on AI Developments

Looking ahead, the introduction of RetNet could catalyze several developments within the AI field:

  • Scalability Enhancements: Further optimizations of RetNet could facilitate even larger models with billions to trillions of parameters, driving advancements in model capability and performance.
  • Multimodal Models: Since RetNet retains the advantageous properties of the Transformer architecture, it is well-positioned for integration into multimodal models that process and generate data across multiple formats, including text, images, and audio.
  • Edge Computing: The efficiency gains in RetNet could enable the deployment of powerful LLMs on edge devices, expanding the possibilities for AI applications in mobile and remote contexts.

In conclusion, the Retentive Network represents a promising advancement in the domain of LLMs, seamlessly bridging the gap between the advantages of Transformers and the efficiency of recurrent mechanisms. The architecture's robust performance and significant efficiency improvements highlight its potential as a successor to Transformers, setting the stage for future breakthroughs in AI technology.
