
Learning to (Learn at Test Time): RNNs with Expressive Hidden States

(2407.04620)
Published Jul 5, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden state. We propose a new class of sequence modeling layers with linear complexity and an expressive hidden state. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Both TTT-Linear and TTT-MLP match or exceed the baselines. Similar to Transformer, they can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. With preliminary systems optimization, TTT-Linear is already faster than Transformer at 8k context and matches Mamba in wall-clock time. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.

Figure: Sequence modeling layers viewed as hidden states that transition via test-time training (TTT) on a self-supervised loss.

Overview

  • The paper introduces a novel architecture called Test-Time Training (TTT) layers, which improve the expressiveness of Recurrent Neural Networks (RNNs) for sequence modeling by using a machine learning model as the hidden state.

  • The TTT layers update their hidden state with a step of self-supervised learning, enabling the model to keep adapting during test time; this is demonstrated through two key instantiations, TTT-Linear and TTT-MLP.

  • Empirical evaluations show that TTT layers outperform existing RNNs and match or exceed the performance of Transformer self-attention, particularly on long-context sequences, while retaining linear computational complexity.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

The paper presents a novel architecture for sequence modeling called Test-Time Training (TTT) layers, addressing the challenge of modeling long-context sequences efficiently. The primary innovation is to make the hidden state a machine learning model itself, so that the update rule becomes a step of self-supervised learning, while keeping complexity linear in sequence length.

Background and Motivation

Recurrent Neural Networks (RNNs) traditionally suffer from limitations in expressive power when dealing with long sequences, although they offer linear complexity. In contrast, Transformer's self-attention mechanism, despite its superior performance in long contexts, incurs quadratic computational complexity. The authors aim to strike a balance between these paradigms by enhancing the expressiveness of RNNs while maintaining their linear computational advantages.

TTT Layer Framework

The crux of the proposed method is to use a machine learning model itself as the hidden state of a sequence modeling layer. The update rule is a gradient step on a self-supervised loss, so the hidden state keeps learning even during test time. This design leads to two key instantiations, TTT-Linear and TTT-MLP, whose hidden states are a linear model and a two-layer multi-layer perceptron (MLP), respectively; the surrounding (outer-loop) parameters are trained end-to-end with standard neural network training.
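To make the mechanism concrete, here is a minimal NumPy sketch of the sequential (online) form of a TTT-Linear-style layer. The projection matrices `theta_K`, `theta_V`, `theta_Q`, the learning rate `eta`, and the squared-error reconstruction loss are illustrative stand-ins for the learned outer-loop components described in the paper, not the authors' exact implementation.

```python
import numpy as np

def ttt_linear_forward(xs, theta_K, theta_V, theta_Q, eta=0.1):
    """Sequential (online) form of a TTT-Linear-style layer.

    The hidden state is a linear model W (d x d). Each token triggers one
    gradient step on a self-supervised reconstruction loss, and the output
    is the updated model applied to a "query" view of the token.
    xs: (T, d) token embeddings; theta_*: (d, d) illustrative projections.
    """
    T, d = xs.shape
    W = np.zeros((d, d))                      # initial hidden state W_0
    outputs = []
    for x in xs:                              # process one token at a time
        k, v, q = theta_K @ x, theta_V @ x, theta_Q @ x
        # Self-supervised loss: l(W; x) = ||W k - v||^2
        grad = 2.0 * np.outer(W @ k - v, k)   # gradient of l w.r.t. W
        W = W - eta * grad                    # update rule: one gradient step
        outputs.append(W @ q)                 # output rule: z_t = W_t q_t
    return np.stack(outputs)                  # (T, d)
```

Because the state is a fixed-size model rather than a growing cache of past tokens, memory and per-token compute stay constant in sequence length.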

Theoretical Insights and Implementation

The paper explains that all sequence modeling layers, including self-attention and existing RNNs, can be expressed in a common framework: a hidden state, an update rule that folds each new token into that state, and an output rule that reads it out. The differences lie in how expressive the hidden state is and how it is updated. By using self-supervised learning to compress the historical context into the hidden state, TTT layers inherently train on the fly during test time. This reformulation exploits the expressive power of machine learning models to capture more sophisticated dependencies within long sequences.
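Written out, the shared framework and its TTT instantiation look roughly as follows (notation lightly simplified; the corrupted-input reconstruction loss shown is one of the self-supervised tasks the paper discusses):

```latex
% Generic sequence-modeling layer: a hidden state s_t, an update rule, and an output rule
s_t = \mathrm{update}(s_{t-1}, x_t), \qquad z_t = \mathrm{output}(s_t, x_t)

% TTT instantiation: the hidden state is the weights W_t of a model f,
% updated by one gradient step on a self-supervised loss \ell
W_t = W_{t-1} - \eta\,\nabla\ell(W_{t-1}; x_t), \qquad z_t = f(x_t; W_t),
\qquad \text{e.g. } \ell(W; x_t) = \lVert f(\tilde{x}_t; W) - x_t \rVert^2
```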

For hardware efficiency, the paper introduces mini-batch TTT and a dual form. Mini-batch TTT computes gradients for several tokens simultaneously, balancing parallelism against the quality of each update. The dual form improves efficiency further by avoiding the materialization of intermediate hidden states and recasting the computation as matrix-matrix operations that map well onto modern GPU architectures.
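As a rough illustration of the mini-batch idea (again a NumPy sketch with illustrative names, continuing the setup above, not the paper's GPU kernel): within each chunk of b tokens, every token's gradient is evaluated at the chunk's starting state, so the gradients can be computed in parallel and the per-token states recovered as cumulative sums. The dual form goes further and avoids materializing the per-token W matrices entirely.

```python
import numpy as np

def ttt_linear_minibatch(xs, theta_K, theta_V, theta_Q, eta=0.1, b=16):
    """Mini-batch TTT sketch: within each chunk of b tokens, all gradients are
    evaluated at the chunk's starting state W_base, so they can be computed in
    parallel; per-token states are then cumulative sums of those gradients.
    """
    T, d = xs.shape
    W_base = np.zeros((d, d))
    outputs = []
    for start in range(0, T, b):
        chunk = xs[start:start + b]                      # (b', d) tokens in this mini-batch
        K = chunk @ theta_K.T                            # (b', d) "key" views
        V = chunk @ theta_V.T                            # (b', d) reconstruction targets
        Q = chunk @ theta_Q.T                            # (b', d) "query" views
        errs = K @ W_base.T - V                          # residuals W_base k_t - v_t, in parallel
        grads = 2.0 * errs[:, :, None] * K[:, None, :]   # (b', d, d) per-token gradients
        W_t = W_base - eta * np.cumsum(grads, axis=0)    # W_t = W_base - eta * sum_{s<=t} grad_s
        outputs.append(np.einsum('tij,tj->ti', W_t, Q))  # z_t = W_t q_t
        W_base = W_t[-1]                                 # carry the last state into the next chunk
    return np.concatenate(outputs)                       # (T, d)
```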

Empirical Evaluations

Evaluations across context lengths from 2k to 32k tokens and model scales from 125M to 1.3B parameters show that TTT-Linear and TTT-MLP match or exceed a modern RNN baseline, Mamba, as well as a strong Transformer, in perplexity while remaining computationally efficient. For example, TTT-Linear reaches lower perplexity than Mamba while using fewer FLOPs, and, like the Transformer, it keeps improving as the context grows beyond 16k tokens, where Mamba stops improving.

Implications and Future Directions

The results imply that TTT layers can serve as practical replacements for traditional sequence modeling components in long-context applications, potentially within LLMs. By blurring the line between training and inference, the method suggests a new pathway for designing efficient sequence models.

Future research can delve into optimizing the outer-loop parameterization, extending systems optimization, and exploring more expressive forms of the inner model $f$. Given the promising results in long context handling, further scaling up model size and context length seems a viable direction. Additionally, the concept of multi-level learning to learn, wherein a self-attention layer itself becomes subject to TTT, represents an ambitious and intriguing frontier.

Conclusion

This paper presents a thoughtful and rigorous analysis of sequence modeling through the lens of test-time training, offering a conceptual and practical framework that blends the strengths of RNNs and Transformers. By iteratively developing and empirically testing TTT layers, the researchers make a strong case for their adoption in scenarios requiring efficient long-context understanding. The methodology, characterized by innovative use of self-supervised learning and hardware-conscious optimization, paves the way for future exploration in AI's capability to learn continuously and contextually.
