
Linearizing Large Language Models

(2405.06640)
Published May 10, 2024 in cs.CL

Abstract

Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state that lowers inference cost. However, their original formulation suffers from poor scaling and underperforms compute-matched transformers. Recent linear models such as RWKV and Mamba have attempted to address these shortcomings by proposing novel time-mixing and gating architectures, but pre-training LLMs requires significant data and compute investments. Thus, the search for subquadratic architectures is limited by the availability of compute and quality pre-training datasets. As a cost-effective alternative to pre-training linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA). We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget. This allows us to leverage the strong pre-training data and performance of existing transformer LLMs, while requiring 5% of the training cost. We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models. Our code and models can be found at https://github.com/TRI-ML/linear_open_lm.

Pre-trained LLMs converted to RNNs with SUPRA outperform RWKV on language tasks, while still inheriting RNN limitations.

Overview

  • The SUPRA method, short for Scalable UPtraining for Recurrent Attention, transforms pre-trained transformers into RNNs to combine the advantages of both models: strong pre-training and efficient inference.

  • SUPRA introduces a hybrid training approach that modifies pre-trained transformers so they mimic RNN behavior at inference, replacing the softmax normalization with GroupNorm and adjusting how rotary positional encodings are applied.

  • The uptrained models show promising results on standard language benchmarks but struggle with in-context learning and long-context tasks, highlighting areas for future enhancement.

Understanding SUPRA: A New Approach to Linearize Pre-trained Transformers into RNNs

Overview of the Proposed Method

The newly introduced Scalable UPtraining for Recurrent Attention (SUPRA) method seeks a cost-effective way to transform pre-trained transformers into Recurrent Neural Networks (RNNs). The approach aims to combine the strengths of both architectures: the powerful pre-training of transformers and the cost-efficient inference of RNNs.

The Challenge with Linear Transformers

Conventional transformers train efficiently because attention can be computed in parallel across a sequence. At inference, however, their cost grows with context length, since attention compares each new token against all previous ones and the key-value cache keeps expanding. RNNs, by contrast, maintain a fixed-size hidden state and are, as a result, generally more memory-efficient.

Linear transformers were introduced to keep the parallel training of standard transformers while gaining the memory efficiency of a recurrent model, but in their original formulation they typically fall short of compute-matched conventional transformers on standard natural language processing benchmarks.
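
For background, linear attention replaces the softmax kernel with a feature map $\phi$ applied to the queries and keys, which lets attention be rewritten as a recurrence over a fixed-size state. This is the standard kernel-based formulation from the linear-attention literature, summarized here for context rather than quoted from the paper:

$$
y_t = \frac{\phi(q_t)^\top S_t}{\phi(q_t)^\top z_t}, \qquad
S_t = S_{t-1} + \phi(k_t)\, v_t^\top, \qquad
z_t = z_{t-1} + \phi(k_t)
$$

Because $S_t$ and $z_t$ have a fixed size, per-token inference cost stays constant regardless of context length, which is exactly the memory advantage discussed above.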

Enter SUPRA: A Hybrid Training Approach

SUPRA offers a middle ground through uptraining: continuing to train an existing model after modifying its architecture. The method starts from a well-established pre-trained transformer and adjusts it so that it can run as an RNN at inference time.

The Process

  1. Linearization Technique: Replace the softmax normalization found in standard attention with GroupNorm, a normalization applied to the attention outputs that keeps them well scaled without the softmax denominator (see the sketch after this list).
  2. Positional Encoding Adjustment: Use a rotary positional encoding scheme, which encodes positions relatively and avoids the problems that absolute positional encodings cause when the model is run recurrently.
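
Below is a minimal PyTorch sketch of the recurrent attention form these two steps aim at. It is illustrative only, not the authors' implementation: the choice of feature map, the head sizes, and all variable names are assumptions, and rotary position embeddings (applied to the queries and keys in practice) are omitted for brevity.

```python
# Toy recurrent linear attention with GroupNorm output normalization (PyTorch).
# Illustrative sketch only: feature map, sizes, and names are assumed, and
# rotary position embeddings are omitted for brevity.
import torch
import torch.nn.functional as F

def feature_map(x):
    # Simple positive elementwise kernel standing in for softmax's exponential.
    return F.elu(x) + 1.0

n_heads, d_head, seq_len = 4, 64, 16
q = torch.randn(seq_len, n_heads, d_head)
k = torch.randn(seq_len, n_heads, d_head)
v = torch.randn(seq_len, n_heads, d_head)

# One group per head, so each head's output is normalized independently;
# GroupNorm stands in for the softmax normalization described in step 1.
group_norm = torch.nn.GroupNorm(num_groups=n_heads, num_channels=n_heads * d_head)

state = torch.zeros(n_heads, d_head, d_head)   # fixed-size recurrent state S_t
outputs = []
for t in range(seq_len):
    phi_q, phi_k = feature_map(q[t]), feature_map(k[t])      # (n_heads, d_head)
    state = state + torch.einsum("hd,he->hde", phi_k, v[t])  # S_t = S_{t-1} + phi(k_t) v_t^T
    y_t = torch.einsum("hd,hde->he", phi_q, state)           # phi(q_t)^T S_t
    outputs.append(y_t.reshape(1, -1))

y = group_norm(torch.cat(outputs, dim=0))   # (seq_len, n_heads * d_head)
print(y.shape)                              # torch.Size([16, 256])
```

During training the same computation can be carried out in parallel over the whole sequence; the token-by-token loop above is what keeps inference memory constant.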

SUPRA sidesteps the main cost of other linear approaches, pre-training from scratch, by uptraining on only a small fraction of the original training tokens (the authors report roughly 5% of the training cost). This keeps the method cost-effective while maintaining competitive model performance.

Testing the Performance

The uptrained models were rigorously evaluated across several benchmarks:

  • Standard Language Benchmarks: SUPRA models displayed competitive performance against leading pretrained recurrent models using notably less data and compute resources.
  • Long-Context Tasks: Despite their promise, the linearized SUPRA models showed limitations on in-context learning and tasks requiring extended context, underscoring a gap that remains relative to conventional transformers.

The Implications and Future Prospects

SUPRA changes how large-scale models can be adapted for efficient inference without an enormous compute overhead. Practically, this could make recurrent models viable again for applications where inference cost and resource efficiency are critical.

On the Theoretical Side

SUPRA shows that there's a rich vein to explore in the hybrid modeling approach, potentially setting the stage for future research focusing on optimizing these hybrid architectures.

Looking Forward

While SUPRA demonstrates a promising approach, the models' struggle with long-context tasks suggests a need for further tweaking. Innovations such as more complex gating mechanisms and alternative normalization techniques could potentially bridge the observed performance gap.

Conclusion

SUPRA presents an intriguing prospect in the quest for efficient AI modeling, offering a new toolkit for those looking to harness the strengths of transformers and RNNs alike. With continued development, SUPRA or its derivatives might soon become a staple in reducing computational costs while sustaining high performance across a range of AI tasks.
