Hungry Hungry Hippos: Towards Language Modeling with State Space Models (2212.14052v3)

Published 28 Dec 2022 in cs.LG and cs.CL

Abstract: State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention language models up to 2.7B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.

Summary

The paper introduces the H3 layer that bridges SSMs and attention, enabling effective token comparison and improved language modeling performance.
The paper develops FlashConv, an FFT-based convolution algorithm that yields a 2x speedup on the long-range arena benchmark and lets hybrid language models generate text 2.4x faster than Transformers.
Hybrid models combining H3 and attention outperform Transformers by 1.0 PPL on OpenWebText at 125M parameters and scale to 2.7 billion parameters on the Pile, where they achieve lower perplexity than Transformers.

Overview of "Hungry Hungry Hippos: Towards Language Modeling with State Space Models"

This paper explores the application of State Space Models (SSMs) to language modeling, specifically addressing the expressivity gap and the hardware-efficiency gap between SSMs and attention-based models. The authors present two key contributions: a novel SSM-based layer called H3, and a hardware-efficient algorithm, FlashConv, that improves the computational performance of SSMs.

H3: Bridging the Gap Between SSMs and Attention

The paper identifies specific deficiencies of SSMs relative to Transformers on language modeling tasks. Using synthetic tasks, the authors find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To address this, they introduce the H3 layer, which combines two discrete SSMs with multiplicative interactions to emulate these capabilities of attention.
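As a rough illustration of "two SSMs with multiplicative interactions", the following NumPy sketch computes an output of the form Q * SSM_diag(SSM_shift(K) * V). This form, the one-step-delay `shift_ssm`, the exponential-decay `diag_ssm`, and the `decay` parameter are all simplifying assumptions made for this example; the summary above does not spell out the layer, and this is not the authors' implementation.

```python
import numpy as np

def shift_ssm(x):
    """Toy 'shift' SSM (assumed): the output at step t is the input from step t-1."""
    out = np.zeros_like(x)
    out[1:] = x[:-1]
    return out

def diag_ssm(x, decay):
    """Toy diagonal SSM (assumed): h_t = decay * h_{t-1} + x_t, output h_t."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def h3_sketch(q, k, v, decay=0.9):
    """Single-head sketch: q, k, v are (seq_len, dim) float arrays."""
    # First multiplicative interaction: pair each value with the previous key.
    kv = shift_ssm(k) * v
    # The diagonal SSM accumulates these products over the sequence (memory).
    memory = diag_ssm(kv, decay)
    # Second multiplicative interaction: the query gates what is read out.
    return q * memory

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
y = h3_sketch(q, k, v)
```

The shift SSM gives the layer access to a previous token, the diagonal SSM stores the key-value product across the sequence, and the final product with `q` plays the role of comparing the current token against what was stored. The sketch is causal: changing a later token leaves earlier outputs unchanged.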

The authors demonstrate that H3 matches attention on synthetic language tasks designed to simulate in-context learning, and comes within 0.4 PPL of Transformers on OpenWebText. Notably, a hybrid 125M-parameter model that combines H3 layers with two attention layers outperforms Transformers by 1.0 PPL on OpenWebText, a step forward for the applicability of SSMs to language modeling.

FlashConv: Addressing Computational Inefficiency

While SSMs scale nearly linearly in sequence length, in practice they have remained slower than Transformers because of poor hardware utilization. The authors propose FlashConv, an FFT-based convolution algorithm that uses a fused block FFT to exploit modern GPU architectures on sequences up to 8K, together with a state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences.
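To make the FFT-based convolution concrete, here is a minimal NumPy sketch showing that a causal convolution of an input `u` with a kernel `k` can be computed through the FFT and agrees with the direct quadratic-time sum. This is the plain textbook FFT convolution, not the fused block FFT that FlashConv uses on GPUs; the function names are made up for the example.

```python
import numpy as np

def causal_conv_direct(u, k):
    """Reference O(L^2) causal convolution: y[t] = sum_{s<=t} k[s] * u[t-s]."""
    L = len(u)
    y = np.zeros(L)
    for t in range(L):
        for s in range(t + 1):
            y[t] += k[s] * u[t - s]
    return y

def causal_conv_fft(u, k):
    """Same result in O(L log L) via the FFT."""
    L = len(u)
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    y = np.fft.irfft(np.fft.rfft(u, n=n) * np.fft.rfft(k, n=n), n=n)
    return y[:L]  # keep only the first L (causal) outputs

rng = np.random.default_rng(0)
u = rng.standard_normal(64)
k = rng.standard_normal(64)
max_abs_diff = np.max(np.abs(causal_conv_direct(u, k) - causal_conv_fft(u, k)))
```

The zero-padding to length `2 * L` is what keeps the wrap-around of the circular convolution from contaminating the first `L` outputs.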

FlashConv yields a 2x speedup on the long-range arena benchmark and lets hybrid language models generate text 2.4x faster than Transformers. This efficiency matters for training large-scale models: using FlashConv, the authors scale hybrid H3-attention models to 2.7 billion parameters on the Pile.
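The abstract also mentions a state passing algorithm that exploits the recurrent form of SSMs to handle longer sequences. The sketch below illustrates the general idea under simple assumptions: a diagonal SSM with parameters `a`, `b`, `c` (all hypothetical), a sequence split into chunks, a convolution inside each chunk, and a state vector carried from one chunk to the next. It uses `np.convolve` inside each chunk for brevity and is not the authors' GPU implementation.

```python
import numpy as np

def ssm_recurrent(u, a, b, c):
    """Reference recurrence: x_t = a * x_{t-1} + b * u_t, y_t = sum(c * x_t)."""
    x = np.zeros_like(a)
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        x = a * x + b * u_t
        y[t] = np.sum(c * x)
    return y

def ssm_state_passing(u, a, b, c, chunk):
    """Process u chunk by chunk, carrying the SSM state across chunk boundaries."""
    x = np.zeros_like(a)  # state entering the current chunk
    outputs = []
    for start in range(0, len(u), chunk):
        u_c = u[start:start + chunk]
        n = len(u_c)
        # powers[s] = a ** s elementwise, shape (n + 1, state_dim)
        powers = a[None, :] ** np.arange(n + 1)[:, None]
        # Convolution kernel of the SSM: k_s = sum_i c_i * a_i**s * b_i
        kernel = (powers[:n] * b * c).sum(axis=1)
        conv = np.convolve(u_c, kernel)[:n]  # contribution of this chunk's inputs
        carry = (powers[1:] * x * c).sum(axis=1)  # contribution of the incoming state
        outputs.append(conv + carry)
        # State leaving the chunk: a**n * x + sum_s a**(n-1-s) * b * u_s
        x = powers[n] * x + (powers[:n][::-1] * b * u_c[:, None]).sum(axis=0)
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
a = np.array([0.9, 0.5, -0.3])
b = np.array([1.0, 0.5, 2.0])
c = np.array([0.2, -1.0, 0.7])
u = rng.standard_normal(10)
y_full = ssm_recurrent(u, a, b, c)
y_chunked = ssm_state_passing(u, a, b, c, chunk=4)
```

Because each chunk only needs its own inputs plus a fixed-size state, the per-chunk convolution can stay within a length that is efficient on the hardware while the overall sequence grows.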

Implications and Future Directions

This research makes significant progress in adapting SSMs to language modeling by addressing both the expressivity gap and the hardware-efficiency gap. The capabilities added by the H3 layer and the efficiency achieved with FlashConv suggest that SSMs could play an important role in future NLP applications, especially where long-sequence processing and reduced computational cost are essential.

Looking forward, the work opens up opportunities to further refine SSM architectures, potentially leading to models that combine the best traits of SSMs and attention for broader and more robust language understanding. Given the encouraging initial results at 2.7 billion parameters, future research might scale SSM-based models further and examine their capabilities on tasks beyond language.

In conclusion, the paper presents a substantial advance for SSM-based language modeling and offers promising directions for future theoretical and applied research on language models.
