Transformers are Multi-State RNNs

arXiv:2401.06104
Published Jan 11, 2024 in cs.CL

Abstract

Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as infinite multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that pretrained transformers can be converted into $\textit{finite}$ multi-state RNNs by fixing the size of their hidden state. We observe that several existing transformer cache compression techniques can be framed as such conversion policies, and introduce a novel policy, TOVA, which is simpler than these policies. Our experiments on several long-range tasks indicate that TOVA outperforms all other baseline policies while being nearly on par with the full (infinite) model, in some cases using only $\frac{1}{8}$ of the original cache size. Our results indicate that transformer decoder LLMs often behave in practice as RNNs. They also lay out the option of mitigating one of their most painful computational bottlenecks - the size of their cache memory. We publicly release our code at https://github.com/schwartz-lab-NLP/TOVA.

Figure: Transformers compared to infinite and finite multi-state RNNs, highlighting dynamic vs. fixed-size multi-states.

Overview

  • Transformers, the dominant architecture in NLP, are usually seen as fundamentally different from stateful RNNs, yet this study argues they can be viewed as infinite multi-state RNNs (MSRNNs).

  • Decoder-only transformers, like RNNs, carry a state (the key/value cache) from one step to the next; fixing the size of that state turns them into finite MSRNNs.

  • TOVA, a new cache compression policy, uses attention scores to decide which tokens to keep in the state, sharply reducing cache memory while maintaining performance.

  • In practice, TOVA cuts inference memory by up to 88%, enabling larger batch sizes and better hardware utilization, and suggests that transformers often operate much like finite MSRNNs.

  • The work highlights the multi-state capacity of transformers and shows how their operational similarity to RNNs opens avenues for further optimization.

Overview of Transformers and RNNs

Transformers have become a staple in NLP, largely due to their ability to handle sequential data efficiently. Their architecture differs significantly from the previously dominant recurrent neural networks (RNNs), which process sequences by maintaining a state that summarizes previous inputs. However, this study puts forward an intriguing perspective: decoder-only transformers closely resemble a particular kind of RNN, the infinite multi-state RNN (MSRNN), whose state holds a separate entry for every past token rather than a single fixed-size vector.
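
To make this framing concrete, here is a minimal sketch (plain NumPy, not the authors' code) of decoder-only generation viewed as a multi-state RNN: the key/value cache is the state, and it gains one entry per token, which is what makes the vanilla transformer an infinite (unbounded) MSRNN. All names and dimensions below are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(query, keys, values):
    # Single-head attention of the current query over every cached (key, value) pair.
    scores = softmax(query @ keys.T / np.sqrt(query.shape[-1]))
    return scores @ values, scores

d = 16                               # toy hidden size
keys = np.empty((0, d))              # the "multi-state": one entry per past token
values = np.empty((0, d))

for step in range(100):
    x = np.random.randn(d)           # stand-in for the current token's hidden vector
    q, k, v = x, x, x                # real models apply separate learned projections
    keys = np.vstack([keys, k])      # the state grows by one entry per token ...
    values = np.vstack([values, v])  # ... i.e. an unbounded (infinite) MSRNN
    out, _ = attend(q, keys, values)

print(keys.shape)                    # (100, 16): one cached state per processed token
```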

New Insights into Transformer Architecture

The study posits that decoder-only transformers, which generate output auto-regressively, share the core principle of RNNs: they preserve a state from one step to the next. What sets transformers apart is that they can be seen as MSRNNs with an unlimited number of states. This perspective also allows the hidden state size to be fixed, turning the transformer into a finite MSRNN. The reframing connects transformer architecture with cache compression techniques already present in the field and opens the door to new, more efficient conversion policies.
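
Under this framing, converting a pretrained transformer into a finite MSRNN amounts to choosing an eviction policy that caps the number of cached (key, value) pairs. As a hedged illustration (the class and method names are invented for this sketch, not taken from the paper's repository), a sliding-window policy, one of the existing compression techniques the paper reinterprets, could look like this:

```python
import numpy as np

class WindowPolicy:
    """Finite-MSRNN conversion via a sliding window: keep only the newest states.

    Illustrative sketch, not the paper's released implementation.
    """

    def __init__(self, max_states: int):
        self.max_states = max_states

    def compress(self, keys: np.ndarray, values: np.ndarray, attn_scores: np.ndarray):
        # keys/values: (n, d) cached states; attn_scores is unused by this policy.
        if keys.shape[0] <= self.max_states:
            return keys, values
        return keys[-self.max_states:], values[-self.max_states:]
```

Other baseline policies from the literature fit the same interface; they differ only in which rows of the cached keys and values they discard at each step.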

Introducing TOVA for Transformer Compression

One such policy, Token Omission Via Attention (TOVA), simplifies existing approaches by using attention scores to decide which tokens to retain in the state: once the state is full, the token receiving the lowest attention at the current decoding step is dropped. The research showcases TOVA's effectiveness across several long-range tasks, where it performs comparably to transformers with a full (infinite) cache despite using a fraction of the original cache memory. This establishes TOVA as an efficient and potent method for converting transformers into finite MSRNNs, potentially reducing computational costs with minimal impact on performance.
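
A simplified, single-head sketch of that eviction rule follows; the function name and signature are invented here, and the paper's actual implementation works per layer and aggregates attention scores across heads.

```python
import numpy as np

def tova_step(keys, values, attn_scores, max_states):
    """Drop the least-attended cached token once the state exceeds max_states.

    keys, values: (n, d) cached states (the new token already appended).
    attn_scores:  (n,) attention weights of the current query over the cache.
    Simplified single-head sketch of a TOVA-style rule.
    """
    if keys.shape[0] <= max_states:
        return keys, values
    drop = int(np.argmin(attn_scores))             # least-attended cached token
    keep = np.delete(np.arange(keys.shape[0]), drop)
    return keys[keep], values[keep]
```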

Practical Implications and Benefits

The findings of this study have considerable practical implications. With TOVA, the memory consumption during inference for LLMs was reduced by up to 88%, which could significantly increase batch sizes and improve hardware utilization. While transformers were traditionally seen as distinct from RNNs, this study bridges the two, revealing that in practice, transformer decoder LLMs often function as finite MSRNNs. With this new understanding, developers and researchers in AI could optimize transformer models, making them more accessible and efficient.
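
To get a feel for the scale of the saving, here is a back-of-the-envelope estimate using assumed LLaMA-2-7B-like dimensions (32 layers, 32 heads, head dimension 128, fp16); the concrete figures are illustrative and not taken from the paper.

```python
layers, heads, head_dim = 32, 32, 128     # assumed LLaMA-2-7B-like shape
bytes_per_value = 2                       # fp16
per_token = 2 * layers * heads * head_dim * bytes_per_value   # keys and values

full_cache = per_token * 4096             # full 4096-token context
tova_cache = per_token * 512              # keeping only 1/8 of the tokens

print(f"per token:  {per_token / 2**10:.0f} KiB")    # 512 KiB
print(f"full cache: {full_cache / 2**30:.2f} GiB")   # 2.00 GiB
print(f"TOVA (1/8): {tova_cache / 2**30:.2f} GiB")   # 0.25 GiB, ~88% smaller
```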

The study concludes by emphasizing that while transformers are conceptualized as having an infinite multi-state capacity, in practice they often behave like RNNs with limited capacity, paving the way for further optimization and for analysis of how these models process and retain information across long sequences.
