
Repeat After Me: Transformers are Better than State Space Models at Copying

(2402.01032)
Published Feb 1, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

Transformers are the dominant architecture for sequence modeling, but there is growing interest in models that use a fixed-size latent state that does not depend on the sequence length, which we refer to as "generalized state space models" (GSSMs). In this paper we show that while GSSMs are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context. We start with a theoretical analysis of the simple task of string copying and prove that a two layer transformer can copy strings of exponential length while GSSMs are fundamentally limited by their fixed-size latent state. Empirically, we find that transformers outperform GSSMs in terms of efficiency and generalization on synthetic tasks that require copying the context. Finally, we evaluate pretrained LLMs and find that transformer models dramatically outperform state space models at copying and retrieving information from context. Taken together, these results suggest a fundamental gap between transformers and GSSMs on tasks of practical interest.

Figure: Comparison of GSSMs, the Hard-Alibi transformer, and other models on a modified n-gram task, illustrating memory efficiency.

Overview

  • The paper compares Transformers, the dominant architecture for sequence modeling, with Generalized State Space Models (GSSMs), which maintain a fixed-size latent state.

  • The paper presents a theoretical analysis showing Transformers can copy long sequences due to their storage and retrieval capacities, in contrast to the limitations of GSSMs.

  • Empirical studies with 160-million-parameter models show that Transformers train more efficiently and generalize better on synthetic context-copying tasks.

  • Evaluations of pretrained LLMs show that Transformer models outperform GSSMs on copying and information-retrieval tasks, despite comparable perplexity scores.

  • The paper concludes that while GSSMs may be more computationally efficient, Transformers retain a clear lead in capabilities essential to sequence modeling, such as copying and retrieval from context.

Introduction

Within the domain of sequence modeling, Transformers have set remarkable performance benchmarks across numerous tasks. A stream of research looking to move beyond Transformers has introduced Generalized State Space Models (GSSMs) as an alternative that promises gains in inference-time efficiency. This paper rigorously examines whether GSSMs can match Transformers on tasks that require copying from and retrieving information in the input context.

Theoretical Analysis

A central part of the paper is a theoretical analysis of string copying, a simple but paradigmatic task that serves as a litmus test for model capability. The authors constructively prove that a two-layer Transformer can copy strings exponentially longer than its size, capitalizing on its ability to store the input in context and retrieve it with attention. Conversely, GSSMs are fundamentally limited by their fixed-size latent state: a state that is small relative to the sequence length cannot distinguish all possible inputs, so exact copying of sufficiently long strings becomes impossible.
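To make the state-size limitation concrete, here is a minimal sketch of the counting argument in Python (an informal illustration, not the paper's formal proof): a model that copies perfectly must map every distinct input to a distinct internal state before it starts emitting the copy, so a fixed state of s bits caps exact copying at roughly s / log2(vocab size) tokens.

```python
import math

def min_state_bits_to_copy(seq_len: int, vocab_size: int) -> float:
    """Bits of state needed to distinguish every possible input of length seq_len.

    Perfect copying requires distinct inputs to reach distinct states before
    generation begins, i.e. at least log2(vocab_size ** seq_len) bits.
    """
    return seq_len * math.log2(vocab_size)

# A GSSM with a fixed latent state of, say, one million bits can therefore
# copy at most ~1e6 / log2(V) tokens, no matter how it is trained.
STATE_BITS = 1_000_000
VOCAB = 26
max_copyable = STATE_BITS / math.log2(VOCAB)
print(f"{STATE_BITS}-bit state -> at most ~{max_copyable:.0f} exactly copyable tokens")

# A transformer's effective memory (the KV cache) grows with the input length,
# so this counting bound does not limit how long a string it can copy.
```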

Empirical Validation

Building on the theoretical insights, the authors conduct empirical studies with models of approximately 160 million parameters. The outcomes clearly favor Transformers, which are not only more efficient to train but also generalize more robustly on synthetic tasks that require copying the context. These experiments also expose the storage-and-retrieval mechanism that Transformers employ, consistent with the authors' theoretical construction.
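As a rough illustration of how such a synthetic copy benchmark can be assembled, the sketch below generates (prompt, target) pairs and scores string-level (exact-match) accuracy; the `<COPY>` separator and the length range are illustrative assumptions, not necessarily the paper's exact protocol.

```python
import random
import string

def make_copy_example(min_len: int = 5, max_len: int = 50) -> tuple[str, str]:
    """Build one (prompt, target) pair: a random string followed by a copy marker."""
    n = random.randint(min_len, max_len)
    s = "".join(random.choices(string.ascii_lowercase, k=n))
    return f"{s}<COPY>", s

def string_level_accuracy(generate, examples) -> float:
    """Exact-match accuracy: the entire copied string must be reproduced."""
    correct = sum(generate(prompt) == target for prompt, target in examples)
    return correct / len(examples)

if __name__ == "__main__":
    examples = [make_copy_example() for _ in range(100)]
    # `generate` would wrap a trained transformer or GSSM decoder; the oracle
    # below just strips the marker, giving perfect accuracy as a sanity check.
    oracle = lambda prompt: prompt.removesuffix("<COPY>")
    print(string_level_accuracy(oracle, examples))
```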

Performance on Pre-trained Models

Extending the investigation to pretrained LLMs, the study evaluates the copying and information-retrieval abilities of large-scale models. Despite achieving perplexity comparable to, or even lower than, their Transformer counterparts, GSSMs consistently lag behind on tasks that require extensive access to the context. This performance gap underscores that architectural choices shape an LLM's capabilities in ways that training perplexity alone does not capture.
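Such a probe can be sketched as a simple prompting loop; the snippet below uses the standard Hugging Face `transformers` generation API, but the checkpoint name and prompt format are placeholders rather than the paper's exact evaluation setup.

```python
import random
import string

# Assumes the Hugging Face `transformers` package; swap in whichever
# pretrained transformer or GSSM checkpoint is being probed.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-410m"  # placeholder checkpoint

def probe_copying(n_chars: int = 30, n_trials: int = 20) -> float:
    """Fraction of random strings the model reproduces exactly when prompted."""
    tok = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    correct = 0
    for _ in range(n_trials):
        s = "".join(random.choices(string.ascii_lowercase, k=n_chars))
        prompt = f"Repeat the string exactly.\nString: {s}\nCopy: "
        ids = tok(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=2 * n_chars, do_sample=False)
        completion = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        correct += completion.strip().startswith(s)
    return correct / n_trials

if __name__ == "__main__":
    print(f"copy accuracy: {probe_copying():.2f}")
```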

Conclusions

The authors present compelling theoretical and empirical evidence that Transformers outperform GSSMs on tasks requiring rich interaction with the input context. While GSSMs offer computational efficiency that scales better with sequence length, they fall short on memorization and retrieval from context, capabilities in which Transformers excel. By delineating these differences, the paper contributes to a more nuanced understanding of the trade-offs among sequence modeling architectures.
