- The paper introduces the Linear Recurrent Unit (LRU), which modifies RNNs by linearizing recurrence to boost performance on long sequence tasks.
- It employs a diagonal complex parameterization of the recurrence and an exponential map for the eigenvalues, improving training stability and computational efficiency while mitigating vanishing and exploding gradients.
- Experimental findings suggest that the LRU architecture enables deep RNNs to achieve performance competitive with SSMs and Transformers while reducing training overhead.
Resurrecting Recurrent Neural Networks for Long Sequences
Recurrent Neural Networks (RNNs) have long been recognized for their theoretical capabilities in modeling sequential data. In practice, however, vanishing and exploding gradients have made them hard to scale and optimize over long sequences. Deep learning has consequently shifted towards alternative architectures such as Transformers, which train in parallel over the sequence and avoid propagating gradients through a recurrence. Transformers have their own cost: memory and compute scale quadratically with sequence length, which becomes prohibitive for long sequences.
More recently, deep State-Space Models (SSMs) have emerged as strong contenders. SSMs perform strongly on long sequence tasks and, like RNNs, run inference at a cost linear in sequence length, while still admitting parallelizable training. Yet despite their close resemblance to linear RNNs, it has remained unclear which of their ingredients actually drive the performance gains.
"Resurrecting Recurrent Neural Networks for Long Sequences" addresses this gap by applying a series of careful modifications to a vanilla deep RNN, arriving at an architecture called the Linear Recurrent Unit (LRU). The paper is guided by two questions: can deep RNNs match the performance of deep SSMs, and which design choices are needed to train them efficiently on long sequence tasks?
Key Contributions and Findings
- Linear Recurrence: The paper finds that dropping the standard nonlinear activation (tanh or ReLU) from the recurrence, making it purely linear, yields substantial performance improvements. Expressivity is not lost because nonlinearity is still present in the model through the position-wise MLP blocks interleaved between recurrent layers, which together with the linear recurrence can realize complex sequence-to-sequence maps (see the recurrence sketch after this list).
- Diagonalization and Efficiency: Representing the linear recurrence with a diagonal complex recurrent matrix reduces the state update to an elementwise multiply and enables large speedups (e.g. via parallel scans), making training on long sequences practical without sacrificing performance. The diagonal entries are initialized to match the eigenvalue spectrum of a dense Glorot-initialized recurrent matrix, which by the (strong) circular law of random matrix theory is approximately uniform on the unit disk of the complex plane (see the spectrum check after this list).
- Stable Exponential Parameterization: Parameterizing each eigenvalue through an exponential map of its log-magnitude and phase brings both stability and optimization benefits: the magnitude stays strictly below one by construction, so the recurrence cannot become unstable during training, and the unconstrained parameters are easier to optimize when eigenvalues sit close to the unit circle, as needed for long-range dependencies.
- Normalization of Hidden Activations: When eigenvalues are initialized close to the unit circle, the hidden state can blow up in the forward pass. Rescaling the input to each state dimension by a factor proportional to sqrt(1 - |lambda|^2) keeps activations bounded and enables efficient learning on tasks with long temporal dependencies.
- Modifications for Extremely Long Sequences: On tasks with very long sequences, restricting the eigenvalue phase at initialization to a small range proved effective. A small initial phase keeps the state from oscillating rapidly, biasing the layer towards slowly varying features that aggregate information over long spans, in line with the paper's analysis of how the initial phase shapes the signals the recurrence picks up (see the parameterization sketch after this list).
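To make the first two bullets concrete, here is a minimal recurrence sketch in plain NumPy. The shapes and names (`linear_diagonal_recurrence`, `lru_block`) are illustrative rather than taken from the paper's code; the point is that the recurrent update contains no tanh or ReLU, reduces to an elementwise multiply because the recurrent matrix is diagonal, and leaves all nonlinearity to the position-wise MLP that follows.

```python
import numpy as np

def linear_diagonal_recurrence(u, lam, B):
    """Run x_k = lam * x_{k-1} + B u_k with a *diagonal* complex lam.

    u:   (T, H_in) real input sequence
    lam: (N,) complex diagonal of the recurrent matrix
    B:   (N, H_in) complex input projection
    Returns the (T, N) complex state sequence.
    """
    T, N = u.shape[0], lam.shape[0]
    x = np.zeros(N, dtype=np.complex128)
    states = np.empty((T, N), dtype=np.complex128)
    for k in range(T):
        # No nonlinearity here: the recurrence is linear, and the diagonal
        # form reduces the matrix-vector product to an elementwise multiply.
        x = lam * x + B @ u[k]
        states[k] = x
    return states

def lru_block(u, lam, B, C, W1, W2):
    """One simplified block: linear recurrence -> real readout -> nonlinear MLP."""
    states = linear_diagonal_recurrence(u, lam, B)
    y = (states @ C.T).real            # project the complex state back to real features
    h = np.maximum(0.0, y @ W1.T)      # the nonlinearity lives in the MLP,
    return h @ W2.T                    # not in the recurrence itself
```

In the full architecture these blocks are stacked with residual connections and normalization, and because the scan is linear it can also be evaluated with a parallel (associative) scan instead of the sequential loop shown here.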
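The claim about Glorot initialization and the circular law in the diagonalization bullet is easy to check numerically. The snippet below is an illustrative spectrum check, not code from the paper: it samples a dense Glorot-style recurrent matrix and confirms that its eigenvalues lie roughly uniformly inside the unit disk, which is the distribution the diagonal initialization mimics directly.

```python
import numpy as np

N = 512
rng = np.random.default_rng(0)

# Glorot-style dense recurrent matrix: i.i.d. entries with variance 1/N
W = rng.standard_normal((N, N)) / np.sqrt(N)
eigs = np.linalg.eigvals(W)

# The circular law predicts a spectrum filling the unit disk roughly
# uniformly as N grows, so almost all moduli should lie below (and near) 1.
print("largest |eigenvalue|:", np.abs(eigs).max())
print("fraction with |eigenvalue| < 1:", np.mean(np.abs(eigs) < 1.0))
```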
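Finally, a parameterization sketch covering the last three bullets. The function names and default constants are illustrative assumptions, and the paper additionally treats the normalization factor itself as a trainable parameter initialized at the value computed below; the structure, however, follows the bullets: eigenvalues are stored as unconstrained log-magnitude and phase parameters, the exponential map keeps |lambda| < 1 by construction, and initialization draws magnitudes from a thin ring near the unit circle with a configurable maximum phase.

```python
import numpy as np

def init_lru_params(N, r_min=0.9, r_max=0.999, max_phase=2 * np.pi, rng=None):
    """Sample eigenvalues uniformly on the ring r_min <= |lambda| <= r_max,
    with phases uniform in [0, max_phase], stored as unconstrained log-parameters.
    For extremely long sequences, shrinking max_phase to a small fraction of pi
    corresponds to the small-phase initialization discussed above."""
    rng = np.random.default_rng(rng)
    u1, u2 = rng.uniform(size=N), rng.uniform(size=N)
    radius = np.sqrt(u1 * (r_max**2 - r_min**2) + r_min**2)  # uniform in area over the ring
    phase = np.maximum(u2 * max_phase, 1e-7)                 # avoid log(0) in the degenerate case
    nu_log = np.log(-np.log(radius))   # exp(-exp(nu_log)) recovers the radius
    theta_log = np.log(phase)          # exp(theta_log) recovers the phase
    return nu_log, theta_log

def eigenvalues(nu_log, theta_log):
    """Exponential map: |lambda| = exp(-exp(nu_log)) < 1 for any real nu_log,
    so the recurrence is stable by construction."""
    return np.exp(-np.exp(nu_log) + 1j * np.exp(theta_log))

def input_normalizer(nu_log, theta_log):
    """gamma = sqrt(1 - |lambda|^2): rescales the input so the hidden state stays
    bounded even when |lambda| is initialized very close to the unit circle."""
    lam = eigenvalues(nu_log, theta_log)
    return np.sqrt(1.0 - np.abs(lam) ** 2)
```

With these pieces, the state update from the recurrence sketch becomes x_k = lam * x_{k-1} + gamma * (B @ u_k), and since the trained quantities are unconstrained real vectors, gradient steps can never push an eigenvalue outside the unit disk.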
Implications and Future Directions
The findings suggest a path to revitalizing RNNs as a central architecture for sequence modeling, potentially rivaling both SSMs and Transformers in specific domains. The LRU demonstrates that with the right modifications, namely linear recurrences, stable parameterizations, and careful normalization, deep RNNs can excel on long sequence tasks without the computational overhead typical of Transformers.
From a theoretical perspective, the paper challenges the assumed necessity of nonlinearities inside recurrent layers and sharpens the question of which architectural components long-range sequence modeling actually requires. It also offers an alternative lens on the success of deep SSMs, attributing their performance less to specialized structure or theory such as HiPPO and more to basic choices of initialization, parameterization, and optimization.
As research progresses, further exploration into the integration of such insights with existing models can drive innovation in general-purpose architectures for time-series and sequence-based applications. These insights might propel advancements in areas like natural language processing, time-series forecasting, and beyond, where RNNs could reclaim a significant role armed with modern, performance-oriented refinements.