Hungry Hungry Hippos: Towards Language Modeling with State Space Models (2212.14052v3)

Published 28 Dec 2022 in cs.LG and cs.CL

Abstract: State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid LLMs to generate text 2.4$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention LLMs up to 2.7B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.

Citations (293)

Summary

  • The paper introduces the H3 layer that bridges SSMs and attention, enabling effective token comparison and improved language modeling performance.
  • The paper develops FlashConv, an FFT-based convolution algorithm that doubles performance on long sequences and achieves up to 2.4x faster text generation than Transformers.
  • Hybrid models combining H3 and attention outperform Transformers on OpenWebText and, scaled to 2.7 billion parameters on the Pile, achieve lower perplexity and stronger zero- and few-shot SuperGLUE results than Transformers.

Overview of "Hungry Hungry Hippos: Towards Language Modeling with State Space Models"

This paper explores the application of State Space Models (SSMs) to language modeling, specifically addressing the expressivity gap between SSMs and attention and the poor hardware utilization that has kept SSMs slower than Transformers in practice. The authors present two key contributions: a novel SSM-based layer called H3, and a hardware-efficient algorithm, FlashConv, that improves the computational performance of SSMs.

H3: Bridging the Gap Between SSMs and Attention

Using synthetic language modeling tasks, the authors identify specific deficiencies of SSMs relative to Transformers: existing SSMs struggle to recall earlier tokens in the sequence and to compare tokens across the sequence, two capabilities critical for language modeling. To address this, the H3 layer stacks two SSMs, one with a shift state matrix and one with a diagonal state matrix, and combines their outputs through multiplicative interactions that emulate the comparison mechanism of attention, as sketched below.
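The sketch below illustrates this structure in PyTorch. It is a minimal, single-head approximation, not the paper's implementation: the class and function names are hypothetical, and the shift and diagonal SSMs are stood in for by short causal depthwise convolutions (the convolutional view of an SSM with a truncated kernel). This keeps the example self-contained while preserving the Q * SSM_diag(SSM_shift(K) * V) pattern.

```python
# Minimal single-head sketch of the H3 layer's structure (hedged: the real layer
# uses multi-head projections and SSMs with shift/diagonal state matrices; here
# both SSMs are stand-ins implemented as causal depthwise convolutions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalDepthwiseConv(nn.Module):
    """Stand-in for an SSM: a causal depthwise 1D convolution over the sequence."""
    def __init__(self, d_model: int, kernel_size: int):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)                       # (B, D, L)
        x = F.pad(x, (self.kernel_size - 1, 0))     # left-pad so the conv is causal
        return self.conv(x).transpose(1, 2)         # back to (B, L, D)


class H3Sketch(nn.Module):
    """Q * SSM_diag(SSM_shift(K) * V): two SSMs plus multiplicative interactions."""
    def __init__(self, d_model: int, kernel_size: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.ssm_shift = CausalDepthwiseConv(d_model, kernel_size)  # recalls recent tokens
        self.ssm_diag = CausalDepthwiseConv(d_model, kernel_size)   # aggregates over the sequence
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, u):                           # u: (batch, seq_len, d_model)
        q, k, v = self.q_proj(u), self.k_proj(u), self.v_proj(u)
        kv = self.ssm_shift(k) * v                  # compare shifted keys with values
        return self.out_proj(q * self.ssm_diag(kv)) # gate the aggregated result with queries


# Quick shape check
x = torch.randn(2, 16, 64)
print(H3Sketch(64)(x).shape)  # torch.Size([2, 16, 64])
```

The multiplicative interactions are what let the layer compare a token against earlier ones: the shift SSM delays keys so they can be matched against current values, and the diagonal SSM accumulates those matches over the sequence before they are gated by the queries.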

The authors demonstrate that H3 matches attention on synthetic language tasks designed to probe in-context recall, and comes within 0.4 perplexity of Transformers on OpenWebText. Notably, a hybrid 125M-parameter model that retains just two attention layers surpasses Transformers by 1.0 perplexity on OpenWebText, a notable step forward in the applicability of SSMs to language modeling.

FlashConv: Addressing Computational Inefficiency

While SSMs scale nearly linearly in sequence length, their practical implementations have lagged behind Transformers due to poor hardware utilization. The authors propose FlashConv, which fuses a block FFT convolution algorithm into a single kernel to better exploit modern GPU architectures on sequences up to 8K, and adds a state-passing algorithm that uses the recurrent view of SSMs to scale to longer sequences.
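For context, SSM layers apply a long convolution to the input, which is typically computed with FFTs in O(L log L) time. The snippet below is a plain reference implementation of that FFT convolution, assuming one length-L kernel per channel; it is not FlashConv's fused block-FFT kernel or its state-passing scheme, only the baseline operation those techniques accelerate.

```python
# Reference FFT-based causal convolution, the core operation that FlashConv speeds up.
# Hedged sketch: function name and shapes are illustrative, not from the paper's code.
import torch


def fft_causal_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Causal convolution of input u (B, L, D) with per-channel kernels k (L, D)."""
    L = u.shape[1]
    n = 2 * L                                   # zero-pad to avoid circular wrap-around
    u_f = torch.fft.rfft(u, n=n, dim=1)         # (B, n//2+1, D)
    k_f = torch.fft.rfft(k, n=n, dim=0)         # (n//2+1, D)
    y = torch.fft.irfft(u_f * k_f.unsqueeze(0), n=n, dim=1)
    return y[:, :L]                             # keep only the causal part


u = torch.randn(2, 1024, 64)   # (batch, seq_len, channels)
k = torch.randn(1024, 64)      # one length-L kernel per channel
print(fft_causal_conv(u, k).shape)  # torch.Size([2, 1024, 64])
```

A naive implementation like this spends much of its time moving data for the FFTs rather than computing; FlashConv's contribution is reorganizing this computation (block FFTs fused into one kernel, plus state passing across chunks) so that it stays in fast GPU memory.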

FlashConv delivers substantial speedups: a 2x improvement over previous SSM implementations on the Long Range Arena benchmark and 2.4x faster text generation than Transformers for hybrid models. This efficiency made it feasible to train hybrid H3-attention models with up to 2.7 billion parameters on the Pile while retaining the speed benefits.

Implications and Future Directions

This research makes significant progress in enhancing SSMs for language modeling by addressing both the expressivity gap and the computational inefficiency. The expressivity gains from the H3 layer and the efficiency achieved with FlashConv suggest that SSMs could be pivotal in future NLP applications, especially where long-sequence processing and reduced computational cost are essential.

Looking forward, the work opens opportunities to further refine SSM architectures, potentially leading to models that combine the best traits of SSMs and attention for broader and more robust language understanding. Moreover, given the encouraging results at the 2.7B-parameter scale, future research might train even larger SSM-based models and evaluate them on diverse AI tasks beyond language.

In conclusion, the paper presents a substantial advance in language modeling paradigms, providing exciting prospects for future research in both the theoretical and applied dimensions of AI and LLMs.
