
HGRN2: Gated Linear RNNs with State Expansion

(arXiv:2404.07904)
Published Apr 11, 2024 in cs.CL

Abstract

Hierarchically gated linear RNN (HGRN, Qin et al. 2023) has demonstrated competitive training speed and performance in language modeling, while offering efficient inference. However, the recurrent state size of HGRN remains relatively small, which limits its expressiveness. To address this issue, inspired by linear attention, we introduce a simple outer-product-based state expansion mechanism so that the recurrent state size can be significantly enlarged without introducing any additional parameters. The linear attention form also allows for hardware-efficient training. Our extensive experiments verify the advantage of HGRN2 over HGRN1 in language modeling, image classification, and Long Range Arena. Our largest 3B HGRN2 model slightly outperforms Mamba and the LLaMa-architecture Transformer for language modeling in a controlled experiment setting, and performs competitively with many open-source 3B models in downstream evaluation while using far fewer total training tokens.

Figure: the HGRN2 architecture combines the HGRU2 token mixer with a GLU channel mixer, using recurrent computation.

Overview

  • HGRN2 introduces a significant improvement over its predecessor HGRN by incorporating an outer-product-based state expansion mechanism, enhancing expressiveness and efficiency.

  • This allows a substantial increase in recurrent state size without adding any parameters, drawing inspiration from linear attention models.

  • HGRN2 demonstrates superior performance on various benchmarks, including language modeling and image classification, outperforming its predecessor and showing competitiveness with state-of-the-art models.

  • The paper points to remaining headroom for linear RNN architectures that pair strong performance with computational efficiency, motivating further work on scalable RNN designs.

Enhancing Linear RNNs with State Expansion: The Introduction of HGRN2

Introduction to HGRN2

The Hierarchically Gated Linear RNN (HGRN) architecture has previously shown promise in language modeling, pairing competitive quality with efficient, linear-complexity inference. Its performance, however, has been constrained by a relatively small recurrent state. HGRN2 addresses this by significantly increasing the recurrent state size without adding parameters, using an outer-product-based state expansion mechanism inspired by linear attention, which improves both expressiveness and efficiency. HGRN2 shows consistent improvements over its predecessor across several benchmarks, including language modeling, image classification, and the Long Range Arena.
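
To make the mechanism concrete, here is a minimal sequential sketch of an outer-product expanded-state recurrence of the kind described above: the data-dependent forget gate decays each row of a matrix-valued state, (1 - f_t) and the input form a rank-1 update, and a query reads the state out in place of HGRN's output gate. Tensor names, shapes, and the exact read-out convention are illustrative rather than taken verbatim from the paper.

```python
import torch

def hgrn2_recurrence(q, f, i):
    """Sequential sketch of an outer-product expanded-state recurrence.

    q, f, i: (T, d) tensors -- query, forget gate in (0, 1), and input.
    Returns per-step outputs (T, d) and the final (d, d) state.
    Names, shapes, and the read-out convention are illustrative.
    """
    T, d = q.shape
    S = torch.zeros(d, d)  # expanded matrix-valued state (d x d instead of d)
    outputs = []
    for t in range(T):
        # decay each row of the state with the data-dependent forget gate,
        # then add a rank-1 outer-product update built from (1 - f_t) and i_t
        S = f[t].unsqueeze(1) * S + torch.outer(1.0 - f[t], i[t])
        # read the state out with a query, in place of HGRN's output gate
        outputs.append(S.T @ q[t])
    return torch.stack(outputs), S
```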

Motivation and Background

The fundamental challenge addressed by HGRN2 is the limited capacity of a fixed-size recurrent state. Two strategies matter for making better use of that state: data-dependent decays for selective information retention, and a larger recurrent state. HGRN made progress on data-dependent decays, but its small state size capped performance. State expansion is the key technique for overcoming this barrier, as demonstrated by contemporary models such as Mamba and gated linear attention (GLA). HGRN2 builds on these insights, expanding the recurrent state to raise model quality without sacrificing efficiency.
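
For contrast, a minimal sketch of the element-wise gated recurrence used by HGRN-style models (output gate omitted; names illustrative): the data-dependent forget gate interpolates between keeping the previous state and writing the new input, and the state holds only d values, which is the fixed-state-size limitation discussed above.

```python
import torch

def hgrn_style_recurrence(f, i):
    """Element-wise gated linear recurrence with data-dependent decay.

    f, i: (T, d) tensors -- forget gate in (0, 1) and input.
    The recurrent state has only d entries, i.e. the fixed (and small)
    state size discussed above; the output gate is omitted for brevity.
    """
    T, d = f.shape
    h = torch.zeros(d)
    states = []
    for t in range(T):
        h = f[t] * h + (1.0 - f[t]) * i[t]  # keep vs. write, per dimension
        states.append(h)
    return torch.stack(states)
```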

HGRN2: Key Innovations

HGRN2 introduces several significant improvements over HGRN1, detailed as follows:

  • State Expansion Through Outer Products: HGRN2 leverages a nonparametric outer-product-based mechanism to expand the recurrent state size effectively. This approach facilitates a substantial increase in state size without the need for additional parameters, thus maintaining parameter efficiency.
  • Efficient Training and Inference: Inspired by the linear attention form, HGRN2 admits a hardware-efficient training algorithm that accelerates computation without compromising scalability or performance (a chunk-wise sketch follows this list).
  • Robust Empirical Evaluation: Across extensive benchmarks, HGRN2 not only outperforms HGRN1 but is also competitive with strong baselines, including Mamba and a LLaMa-architecture Transformer, in language modeling.
  • Scalability and Efficiency: HGRN2 scales efficiently, as demonstrated in controlled experiments in large-scale settings, suggesting headroom for more demanding applications.
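
To illustrate the hardware-efficient training bullet above, the following chunk-wise sketch computes the same expanded-state recurrence in the linear-attention style: decay-adjusted queries and keys turn the intra-chunk work into masked matrix products, while a carried matrix state summarizes everything before the chunk. This is a simplified illustration consistent with the sequential sketch earlier, not the paper's actual kernel; a production implementation works in log space for numerical stability and tiles the computation for the GPU.

```python
import torch

def hgrn2_chunkwise(q, f, i, chunk_size=64):
    """Chunk-wise, linear-attention-style sketch of the same recurrence.

    q, f, i: (T, d) tensors -- query, forget gate in (0, 1), and input.
    Matches the sequential sketch above up to floating-point error, but
    processes the sequence chunk by chunk with matrix products.
    """
    T, d = q.shape
    k, v = 1.0 - f, i                 # "key" / "value" roles of the gates
    S = torch.zeros(d, d)             # state carried across chunk boundaries
    outputs = []
    for start in range(0, T, chunk_size):
        sl = slice(start, start + chunk_size)
        fc, qc, kc, vc = f[sl], q[sl], k[sl], v[sl]
        A = torch.cumprod(fc, dim=0)  # cumulative decay inside the chunk
        q_tilde = qc * A              # decay-adjusted queries
        k_tilde = kc / A              # decay-adjusted keys (can overflow for
                                      # long chunks; real kernels use log space)
        # contribution of everything before the chunk, via the carried state
        o_cross = q_tilde @ S
        # causal intra-chunk contribution, computed as masked matrix products
        o_intra = torch.tril(q_tilde @ k_tilde.T) @ vc
        outputs.append(o_cross + o_intra)
        # roll the carried state forward to the end of the chunk
        S = A[-1].unsqueeze(1) * S + (A[-1] * k_tilde).T @ vc
    return torch.cat(outputs, dim=0)
```

Feeding both sketches the same random q, i and a gate such as f = torch.sigmoid(torch.randn(T, d)) should yield matching outputs up to numerical error, which is a convenient sanity check for the chunk-wise form.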

Practical Implications and Theoretical Contributions

HGRN2’s introduction of state expansion via a simple outer product marks a meaningful shift in how linear RNNs gain capacity for language modeling and beyond, underscoring the potential of linear RNN architectures to deliver strong performance with computational efficiency. Practically, HGRN2 is most attractive in applications where inference speed and model scalability are critical. Theoretically, it offers a clean recipe for growing recurrent state capacity, providing a reference point for subsequent research in this domain.

Conclusion and Future Directions

HGRN2 marks a significant step forward in the evolution of RNNs, balancing the dual objectives of enhancing model expressiveness while maintaining efficiency. By addressing the limitations of its predecessor through state expansion, HGRN2 paves the way for more sophisticated and scalable RNN architectures. Future research will likely explore further optimizations in state expansion techniques and apply HGRN2’s principles to a broader range of applications, from natural language processing to complex multimodal tasks, opening up new frontiers in the field of generative AI.
