
Retentive Network: A Successor to Transformer for Large Language Models

(arXiv:2307.08621)
Published Jul 17, 2023 in cs.CL and cs.LG

Abstract

In this work, we propose Retentive Network (RetNet) as a foundation architecture for LLMs, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for LLMs. Code will be available at https://aka.ms/retnet.

RetNet combines training parallelism, efficient inference, and favorable scaling relative to Transformers; the reported inference comparisons use an 8k input length.

Overview

  • The Retentive Network (RetNet) architecture aims to achieve training parallelism, low-cost inference, and strong performance simultaneously for LLMs, a combination the Transformer does not offer, by introducing a retention mechanism that supports three computation paradigms: parallel, recurrent, and chunkwise recurrent representations.

  • Extensive experiments demonstrate that RetNet markedly reduces GPU memory usage and increases decoding speed during inference, and lowers memory consumption while accelerating training, compared to both standard and FlashAttention-optimized Transformers.

  • The RetNet architecture's dual-form representation of recurrence and attention opens the door to future hybrid architectures, and its efficiency makes it well suited to real-world deployment, further scaling, multimodal models, and edge computing.

Retentive Network: A Successor to Transformer for LLMs

The paper "Retentive Network: A Successor to Transformer for Large Language Models" introduces the Retentive Network (RetNet), a novel architecture designed to deliver training parallelism, low inference cost, and strong performance together, a combination the Transformer architecture does not achieve, particularly for LLMs. RetNet is presented as a direct contender to the Transformer, offering significant improvements in efficiency and scalability while maintaining, and sometimes exceeding, the performance of its predecessor.

Key Contributions

  1. Retention Mechanism: RetNet introduces a retention mechanism that supports three distinct computation paradigms: parallel, recurrent, and chunkwise recurrent representations. This flexibility allows the architecture to leverage parallelism for efficient training while utilizing recurrent mechanisms for inference, thereby reducing memory and computational costs.

  2. Theoretical Foundation: The paper establishes a theoretical connection between recurrence and attention mechanisms, paving the way for the introduction of the retention mechanism. This approach combines the strengths of recurrent neural networks (RNNs) and the attention mechanism, aiming to achieve the best of both worlds.

  3. Three Computation Paradigms (summarized in the worked equations after this list):
  • Parallel Representation: This paradigm is utilized during training to fully leverage GPU parallelism.
  • Recurrent Representation: This supports $O(1)$ inference complexity, thereby reducing the inference cost significantly.
  • Chunkwise Recurrent Representation: This allows for efficient long-sequence modeling with linear complexity.
  4. Experimental Validation: Extensive experiments demonstrate that RetNet achieves favorable scaling results, efficient training parallelism, and low-cost inference. The results indicate that RetNet is a strong competitor to Transformer in terms of both performance and efficiency.
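
For a single retention head, the three paradigms can be sketched as follows (a simplified summary of the paper's formulation; the full layer additionally applies an xPos-style rotation to the queries and keys and normalizes the output).

Parallel form, with a causal decay mask $D$:

$$\mathrm{Retention}(X) = (Q K^\top \odot D)\,V, \qquad D_{nm} = \begin{cases} \gamma^{\,n-m}, & n \ge m \\ 0, & n < m. \end{cases}$$

Recurrent form, with a fixed-size state $S_n$ updated once per token:

$$S_n = \gamma\, S_{n-1} + K_n^\top V_n, \qquad \mathrm{Retention}(X_n) = Q_n S_n.$$

Chunkwise recurrent form for the $i$-th chunk of length $B$: an inner-chunk term is computed in parallel while a cross-chunk term reads from the carried state $R_{i-1}$, where $\xi$ and $\zeta$ denote per-position powers of $\gamma$ inside the chunk:

$$\mathrm{Retention}(X_{[i]}) = (Q_{[i]} K_{[i]}^\top \odot D)\,V_{[i]} + (Q_{[i]} R_{i-1}) \odot \xi, \qquad R_i = K_{[i]}^\top (V_{[i]} \odot \zeta) + \gamma^{B} R_{i-1}.$$

The recurrent form makes the $O(1)$ inference cost explicit: the state $S_n$ has a fixed size that does not grow with the number of tokens already processed.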

Numerical Results and Claims

RetNet shows a substantial reduction in GPU memory usage during inference, saving 70% of memory compared to Transformer with key-value (KV) caches when processing sequences of 8k tokens. Additionally, RetNet achieves an 8.4× improvement in decoding speed. During training, RetNet demonstrates a 25-50% reduction in memory consumption and a 7× boost in speed compared to standard Transformer models. Moreover, even when compared with FlashAttention-optimized Transformers, RetNet exhibits competitive or superior throughput and memory efficiency.
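
To see why the memory gap widens with context length, the following back-of-the-envelope sketch (with hypothetical model dimensions, not the paper's exact configuration) contrasts a Transformer's KV cache, which grows linearly with the number of cached tokens, against a fixed-size recurrent state of the kind RetNet maintains:

```python
# Hypothetical model dimensions for illustration only (not the paper's exact setup).
layers, heads, head_dim, bytes_fp16 = 32, 32, 128, 2
hidden = heads * head_dim                      # 4096

def transformer_kv_cache_bytes(seq_len):
    # One K and one V vector of size `hidden` per token, per layer.
    return 2 * layers * seq_len * hidden * bytes_fp16

def retnet_state_bytes():
    # One head_dim x head_dim state matrix per head, per layer, independent of length.
    # (The real RetNet uses a larger value dimension; this is a simplified assumption.)
    return layers * heads * head_dim * head_dim * bytes_fp16

for seq_len in (1024, 8192, 65536):
    kv = transformer_kv_cache_bytes(seq_len) / 2**30
    state = retnet_state_bytes() / 2**30
    print(f"{seq_len:>6} tokens: KV cache ~ {kv:.2f} GiB, recurrent state ~ {state:.2f} GiB")
```

The absolute numbers depend entirely on the assumed dimensions; the point is the scaling: the KV cache grows with sequence length while the recurrent state does not, which is why the reported savings are largest at long contexts.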

Implications and Future Directions

The implications of this research are significant both theoretically and practically:

Theoretical Implications:

The dual-form representation reinforces the connection between recurrent models and attention mechanisms. This alignment could inspire future advances in hybrid architectures that leverage these principles to further optimize performance and efficiency.
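
As a concrete illustration of this dual form, the following minimal NumPy sketch checks that the parallel and recurrent computations of a single retention head agree; it uses a scalar decay and omits the xPos-style rotation, multi-head structure, group normalization, and gating of the full RetNet layer:

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: Retention(X) = (Q K^T * D) V, with causal decay mask D."""
    T = Q.shape[0]
    n = np.arange(T)[:, None]                      # query positions
    m = np.arange(T)[None, :]                      # key positions
    D = np.where(n >= m, gamma ** (n - m), 0.0)    # D[n, m] = gamma^(n-m) for n >= m
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n; output_n = Q_n S_n."""
    T, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))
    outputs = []
    for t in range(T):
        S = gamma * S + np.outer(K[t], V[t])       # constant-size state update
        outputs.append(Q[t] @ S)                   # O(1) per-token readout
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, d, d_v, gamma = 6, 4, 4, 0.9
Q, K, V = rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d_v))

out_parallel = retention_parallel(Q, K, V, gamma)
out_recurrent = retention_recurrent(Q, K, V, gamma)
print(np.allclose(out_parallel, out_recurrent))    # True: the two forms coincide
```

Running the script prints `True`, since both forms compute the same decay-weighted sum $\sum_{m \le n} \gamma^{\,n-m} (Q_n \cdot K_m)\, V_m$ at every position $n$.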

Practical Implications:

RetNet’s efficient training and inference paradigms make it highly suitable for deployment in real-world applications where resource constraints are a critical consideration. This could lead to more widespread adoption of LLMs in industry, particularly in scenarios requiring scalable and low-latency inference.

Speculative Outlook on AI Developments

Looking ahead, the introduction of RetNet could catalyze several developments within the AI field:

Scalability Enhancements:

Further optimizations in RetNet could facilitate even larger models with billions to trillions of parameters, driving advancements in model capability and performance.

Multimodal Models:

Since RetNet retains the advantageous properties of the Transformer architecture, it is well-positioned for integration into multimodal models that process and generate data across multiple formats, including text, images, and audio.

Edge Computing:

The efficiency gains in RetNet could enable the deployment of powerful language models on edge devices, expanding the possibilities for AI applications in mobile and remote contexts.

In conclusion, the Retentive Network represents a promising advancement in the domain of LLMs, combining the modeling strengths of Transformers with the inference efficiency of recurrent mechanisms. The architecture's robust performance and significant efficiency improvements highlight its potential as a successor to the Transformer, setting the stage for future breakthroughs in AI technology.
