
The Illusion of State in State-Space Models

(2404.08819)
Published Apr 12, 2024 in cs.LG, cs.CC, cs.CL, and cs.FL

Abstract

State-space models (SSMs) have emerged as a potential alternative architecture for building LLMs compared to the previously ubiquitous transformer architecture. One theoretical weakness of transformers is that they cannot express certain kinds of sequential computation and state tracking (Merrill & Sabharwal, 2023), which SSMs are explicitly designed to address via their close architectural similarity to recurrent neural networks (RNNs). But do SSMs truly have an advantage (over transformers) in expressive power for state tracking? Surprisingly, the answer is no. Our analysis reveals that the expressive power of SSMs is limited very similarly to transformers: SSMs cannot express computation outside the complexity class $\mathsf{TC}^0$. In particular, this means they cannot solve simple state-tracking problems like permutation composition. It follows that SSMs are provably unable to accurately track chess moves with certain notation, evaluate code, or track entities in a long narrative. To supplement our formal analysis, we report experiments showing that Mamba-style SSMs indeed struggle with state tracking. Thus, despite its recurrent formulation, the "state" in an SSM is an illusion: SSMs have similar expressiveness limitations to non-recurrent models like transformers, which may fundamentally limit their ability to solve real-world state-tracking problems.

[Figure: Comparison of model layers needed for >90% accuracy on group multiplication, by sequence length and group.]

Overview

  • The paper disputes the claimed superiority of State-Space Models (SSMs) over transformers at state tracking, using both theoretical and empirical analysis.

  • It demonstrates that both SSMs and transformers fall within the same complexity class $\mathsf{TC}^0$, challenging the notion that SSMs can better track state for tasks like chess move tracking or code evaluation.

  • Experiments show that both SSMs and transformers struggle with permutation composition, a canonical inherently sequential state-tracking problem, whereas RNNs handle it easily.

  • It proposes minimal extensions to SSMs, such as nonlinear activation functions and input-dependent transition matrices, offering a potential path to state-tracking capabilities beyond $\mathsf{TC}^0$.


Introduction

State-Space Models (SSMs) have been proposed as a potential improvement over transformers, largely on the strength of purported advantages in state tracking and inherently sequential computation. The claim rests on the belief that SSMs, owing to their architectural resemblance to Recurrent Neural Networks (RNNs), are more expressive in settings that demand faithful state management, such as narrative comprehension, chess move tracking, and code evaluation. This paper examines that claim, critically evaluating whether SSMs in fact offer greater expressive power than transformers for state tracking.

Theoretical Analysis

Our theoretical investigation uses the circuit complexity class $\mathsf{TC}^0$ as a framework for assessing the expressive power of transformers and SSMs, focusing on their ability to track state. We extend existing results to show that SSMs, like transformers, are confined to $\mathsf{TC}^0$: they cannot express any computation outside that class. The consequence is concrete. Composing permutations (the word problem for $S_5$) is $\mathsf{NC}^1$-complete, so if an SSM could solve it, $\mathsf{TC}^0$ would equal $\mathsf{NC}^1$, which is widely believed to be false. Since permutation composition underlies state tracking in tasks such as chess move tracking and code evaluation, SSMs cannot be expected to track state in those settings either.
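To make the task concrete, here is a minimal Python sketch (our illustration, not the authors' code) of the word problem for $S_5$: given a sequence of permutations, output the running composition after each prefix. Any model that solves it must maintain a single evolving state.

```python
# Word problem for S_5: given a sequence of permutations of {0,...,4},
# output the running composition after each prefix. Tracking this
# running product is exactly the kind of state the paper studies.

def compose(p, q):
    """Apply permutation p after permutation q (tuples of indices)."""
    return tuple(p[q[i]] for i in range(len(q)))

def prefix_products(word):
    """Running composition of a sequence of permutations."""
    state = tuple(range(5))          # identity permutation
    outputs = []
    for g in word:
        state = compose(g, state)    # one sequential state update per token
        outputs.append(state)
    return outputs

word = [(1, 0, 2, 3, 4), (0, 2, 1, 3, 4), (4, 1, 2, 3, 0)]
print(prefix_products(word))
```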

Our analysis also covers linear SSMs and their close generalizations, and shows that, despite their recurrent formulation, they are no more capable than transformers of solving inherently sequential problems.
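The intuition behind the upper bound is that a linear recurrence is affine in the hidden state, and affine maps compose associatively, so the whole sequence collapses into a parallel scan of logarithmic depth. Below is a NumPy sketch of this equivalence under our own assumptions (fixed matrices $A$ and $B$, no gating); it illustrates the parallelizability argument, not the paper's proof.

```python
import numpy as np

# A linear SSM layer computes h_t = A h_{t-1} + B x_t. The affine maps
# h -> A h + c compose associatively, so the recurrence reduces to an
# associative scan (log-depth parallel computation) -- the intuition for
# why linear SSMs stay inside TC^0 despite their recurrent form.

rng = np.random.default_rng(0)
d, T = 4, 8
A = 0.3 * rng.normal(size=(d, d))   # fixed, input-independent transition
B = rng.normal(size=(d, d))
xs = rng.normal(size=(T, d))

# 1) Naive sequential evaluation.
h = np.zeros(d)
seq = []
for x in xs:
    h = A @ h + B @ x
    seq.append(h)

# 2) Scan view: represent each step as an affine map (M, c); composing
#    (M2, c2) after (M1, c1) gives (M2 @ M1, M2 @ c1 + c2), which is
#    associative and hence computable by a balanced tree in O(log T) depth.
def combine(later, earlier):
    (M2, c2), (M1, c1) = later, earlier
    return (M2 @ M1, M2 @ c1 + c2)

acc = (np.eye(d), np.zeros(d))
par = []
for step in [(A, B @ x) for x in xs]:   # written as a loop for clarity;
    acc = combine(step, acc)            # associativity permits a parallel scan
    par.append(acc[1])                  # h_0 = 0, so h_t is the offset term

print(np.allclose(seq, par))            # True: both views agree
```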

Empirical Analysis

Complementing the theory, we provide an empirical analysis using the word problem for the permutation group $S_5$ as a representative test case. The results match the theoretical predictions: despite their contrasting architectural designs, both SSMs and transformers struggle with the state tracking that permutation composition requires, needing more layers as sequences grow longer. RNNs, in stark contrast, compose permutations with a single layer.
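The RNN result is unsurprising once one notes that a one-layer RNN can emulate the finite automaton whose state is the current group element. The sketch below implements that automaton directly (a hand-built construction for intuition, not a trained network): the state is one of the 120 elements of $S_5$, updated by a multiplication-table lookup.

```python
from itertools import permutations

# Finite automaton for the S_5 word problem: the state is one of the
# 120 group elements, and each input symbol updates it via a lookup in
# the precomputed multiplication (Cayley) table. A one-layer RNN can
# realize this kind of automaton, which is why RNNs solve the task
# where SSMs and transformers struggle.

S5 = list(permutations(range(5)))               # all 120 elements of S_5
index = {g: i for i, g in enumerate(S5)}

def compose(p, q):
    return tuple(p[q[i]] for i in range(5))

# table[a][b] = index of (element a applied after element b)
table = [[index[compose(a, b)] for b in S5] for a in S5]

def run_automaton(word_indices):
    """Track the running product as a single state in {0, ..., 119}."""
    state = index[tuple(range(5))]              # start at the identity
    states = []
    for a in word_indices:
        state = table[a][state]                 # constant-size state update
        states.append(state)
    return states

print(run_automaton([1, 5, 17]))                # running-product states
```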

Proposed Extensions

In light of these limitations, we propose minimal extensions to SSMs aimed at closing the gap in expressive power for state tracking: adding nonlinear activations to the recurrence and allowing input-dependent transition matrices. In theory, these modifications take the models beyond $\mathsf{TC}^0$ and enable them to solve permutation composition. The extensions warrant further scrutiny, however, particularly regarding their cost in parallelism and their learning dynamics.
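To see why input-dependent, non-diagonal transitions help, consider a toy recurrence in which each token selects its own permutation matrix; the hidden state then carries the exact running product. This is our minimal sketch of the mechanism, not the paper's proposed architecture verbatim.

```python
import numpy as np

# Toy recurrence with input-dependent transitions: each token g selects
# the permutation matrix P(g), and
#     h_t = P(g_t) @ h_{t-1}
# keeps h_t equal to the matrix of the running product. Fixed or diagonal
# transitions cannot express this; input-dependent non-diagonal ones can.

def perm_matrix(p):
    """Permutation matrix P with P @ e_i = e_{p(i)}."""
    P = np.zeros((len(p), len(p)))
    for i, j in enumerate(p):
        P[j, i] = 1.0
    return P

word = [(1, 0, 2, 3, 4), (0, 2, 1, 3, 4)]       # tokens: elements of S_5
h = np.eye(5)                                   # state = identity permutation
for g in word:
    h = perm_matrix(g) @ h                      # transition chosen by input

print(h.argmax(axis=0))                         # [2 0 1 3 4]: the running product
```

Note the trade-off this illustrates: once transitions depend on the input, the scan elements become products of arbitrary (here, permutation) matrices, and iterated products of permutation matrices encode exactly the $\mathsf{NC}^1$-complete word problem. The added expressive power and the potential loss of $\mathsf{TC}^0$-style parallelism come from the same place.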

Conclusion and Future Directions

Our analysis dispels the illusion of statefulness in SSMs: like transformers, their state-tracking ability is confined to $\mathsf{TC}^0$. Despite their architectural differences, both model families exhibit the same expressiveness limitations, challenging the notion that SSMs could supplant transformers in tasks requiring intricate state management.

The findings motivate further research into SSM-like architectures that genuinely close the expressive-power gap for state tracking while retaining parallelizability and favorable learning dynamics. Future work could test practical implementations of the proposed extensions on real-world state-tracking problems, offering insight into architectures that balance expressive power against computational efficiency.
