Abstract

Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.

Figure: Hawk and Griffin models outperform Transformer baselines on longer sequences on the books evaluation set.

Overview

  • The paper introduces Hawk and Griffin, two novel architectures that combine the efficiency of RNNs with the performance of Transformer models, aimed at handling long sequences in NLP tasks more efficiently.

  • Hawk uses a gated linear recurrent unit, the Real-Gated Linear Recurrent Unit (RG-LRU), for efficient scaling, while Griffin combines the RG-LRU with local attention so it can model the immediate context within long sequences.

  • In evaluations, Hawk surpasses existing recurrent models such as Mamba, and Griffin matches or slightly exceeds the performance of top Transformer models while training on far fewer tokens, highlighting their efficiency and their ability to handle long-range dependencies.

  • These models offer a new direction for developing resource-efficient language models and prompt a reassessment of the reliance on global attention mechanisms in NLP.

Efficient Scaling of Language Models with Hawk and Griffin: Bridging RNNs and Local Attention

Introduction

The landscape of NLP has shifted decisively towards Transformer models, owing to their efficient use of modern hardware and their strong performance across a wide array of tasks. Despite these advantages, Transformers scale poorly with sequence length because global attention has quadratic complexity in the number of tokens. This paper introduces two architectures: Hawk, built around a gated linear recurrent unit named RG-LRU, and Griffin, a hybrid model that combines the RG-LRU with local attention. These models retain the efficiency of RNNs on long sequences while matching the performance of large Transformers, even when trained on significantly fewer tokens.
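To make the scaling argument concrete, here is a rough, back-of-the-envelope comparison of per-layer temporal-mixing cost as a function of sequence length. The sequence length and window size below are illustrative assumptions, not figures from the paper:

```python
# Rough per-layer cost of temporal mixing versus sequence length T
# (constant factors and model width are ignored; values are illustrative).
T, w = 8192, 1024            # assumed sequence length and local-attention window

global_attention = T * T     # O(T^2): every token attends to every other token
local_attention = T * w      # O(T*w): each token attends only within a window
linear_recurrence = T        # O(T): one step per token, fixed-size state

print(global_attention // local_attention)    # 8    -> 8x fewer interactions
print(global_attention // linear_recurrence)  # 8192 -> grows with T
```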

Model Architecture

The core of the work is the RG-LRU, a gated linear recurrent layer designed to process sequences efficiently. This layer lets the models scale efficiently, much like Transformers, while managing long sequences with a small, fixed-size state. The paper lays out the architecture of both Hawk and Griffin, with Griffin combining the strengths of local attention and the RG-LRU layer.

  • Hawk relies entirely on the RG-LRU layer for temporal mixing, and it scales efficiently to increasingly long sequences.
  • Griffin is a hybrid: it interleaves RG-LRU layers with local attention layers to better handle recent information. This design lets Griffin pair the memory efficiency of RNNs with local attention's strength at modeling the immediate context; a simplified sketch of the underlying recurrence follows below.
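To illustrate the kind of recurrence at the heart of both models, below is a minimal NumPy sketch of a gated diagonal linear recurrence in the spirit of the RG-LRU. The gate names, weight shapes, and sequential Python loop are simplifications for clarity; the paper embeds this layer inside residual blocks, uses its own parameterization of the decay, and computes the recurrence with a fused scan kernel rather than a per-step loop.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_recurrence(x, W_r, W_i, lam, c=8.0):
    """Simplified sketch of a gated diagonal linear recurrence (RG-LRU-style).

    x:   (T, D) input sequence (after the block's usual linear projection)
    W_r: (D, D) weights for the recurrence gate
    W_i: (D, D) weights for the input gate
    lam: (D,)   learnable parameter controlling the base per-channel decay
    """
    T, D = x.shape
    a_base = sigmoid(lam)              # base decay in (0, 1) per channel
    h = np.zeros(D)                    # fixed-size recurrent state
    ys = np.empty_like(x)
    for t in range(T):                 # sequential loop for clarity only
        r = sigmoid(x[t] @ W_r)        # recurrence gate
        i = sigmoid(x[t] @ W_i)        # input gate
        a = a_base ** (c * r)          # gated, input-dependent decay
        h = a * h + np.sqrt(1.0 - a**2) * (i * x[t])
        ys[t] = h
    return ys

# Illustrative usage with random weights (shapes only, not trained values).
rng = np.random.default_rng(0)
T, D = 16, 8
y = gated_linear_recurrence(rng.normal(size=(T, D)),
                            rng.normal(size=(D, D)) * 0.1,
                            rng.normal(size=(D, D)) * 0.1,
                            rng.normal(size=D))
print(y.shape)  # (16, 8)
```

The key property is that the state h has a fixed size regardless of sequence length, which is what gives the recurrent path its memory efficiency and fast inference relative to global attention's growing key-value cache.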

Evaluation and Performance

The evaluation of Hawk and Griffin spans held-out loss, hardware efficiency during training, and throughput and latency during inference. Notably, Hawk exceeds the reported performance of recurrent models such as Mamba on downstream tasks, even when trained on fewer tokens. Griffin, despite being trained on over six times fewer tokens, matches or slightly surpasses the performance of the widely used Llama-2 Transformer model.

One of the standout findings is the models' ability to extrapolate efficiently beyond the sequence lengths observed during training, underscoring their potential for tasks with long-range dependencies. This capability is particularly pronounced in Griffin, which balances the memory efficiency of RNNs with the contextual richness provided by local attention.

Implications and Future Directions

The implications of this work are twofold. Practically, Hawk and Griffin offer a path to more resource-efficient training and inference for LLMs, which is especially pertinent for long sequences. Theoretically, these architectures contribute to the ongoing discussion of the right balance between global and local processing mechanisms in sequence modeling.

Looking ahead, the scalability and efficiency demonstrated by Hawk and Griffin prompt a reconsideration of the prevailing reliance on global attention mechanisms, especially for tasks where sequence length poses a distinct challenge. Further exploration of hybrid models, as exemplified by Griffin, may yield even more efficient architectures capable of navigating the trade-offs between computational resources, sequence length, and performance.

Conclusion

In summary, this paper presents a critical advancement in the understanding and application of recurrent neural networks for efficient language modeling. Hawk and Griffin not only challenge the current Transformer-dominated paradigm by offering comparable performance but also illuminate a path forward for the development of models that can more adeptly manage long sequences. As the field of NLP continues to evolve, the exploration of such efficient, scalable architectures will undoubtedly play a pivotal role in shaping future research directions and applications.
