Hyena Hierarchy: Towards Larger Convolutional Language Models

Published 21 Feb 2023 in cs.LG and cs.CL | (2302.10866v3)

Abstract: Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on state-spaces and other implicit and explicit methods, matching attention-based models. We set a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K.

Summary

The paper presents Hyena, a convolution-based operator that replaces quadratic attention with implicitly parametrized long convolutions and data-controlled gating.
On recall and reasoning tasks it improves accuracy by more than 50 points over state-space and other implicit and explicit methods, and on WikiText103 and The Pile it reaches Transformer quality with 20% less training compute at sequence length 2K.
The operator scales to long sequences, running 2x faster than highly optimized attention at sequence length 8K and 100x faster at 64K.

Hyena Hierarchy: Towards Larger Convolutional Language Models

The paper introduces "Hyena," a convolutional operator designed as a subquadratic drop-in replacement for the attention mechanism in Transformers, with a focus on language modeling. Attention is effective, but its computational cost grows quadratically with sequence length. That cost becomes prohibitive for very long contexts, which motivates alternatives such as Hyena that scale more efficiently.

Motivation

The primary motivation behind Hyena is to break free from the quadratic scaling barrier in sequence length inherent in the attention mechanism. Attention-based models are increasingly challenged by applications requiring extensive context, such as processing large documents or gigapixel images, due to their computational and memory inefficiencies. Although various subquadratic attention approaches have been explored, they often either sacrifice accuracy when used standalone or require hybridization with dense attention layers to achieve comparable results to Transformers.

Hyena Architecture

Hyena is an operator built from long convolutions and data-controlled gating, applied in alternation so that it can serve as a replacement for the attention mechanism. It has two components:

Long Convolutions: Unlike conventional short, explicitly parametrized (FIR) filters, Hyena uses convolution filters as long as the input sequence. The filters are parametrized implicitly: a small neural network (typically a feed-forward network) generates the filter values, so the operator can capture dependencies across the whole sequence without incurring quadratic cost.
Data-Controlled Gating: Hyena multiplies the convolved signal elementwise by projections of the input, so the computation adapts to the data. This increases expressivity and lets the model handle a wider range of tasks.
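
The two ingredients above can be sketched together in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fft_causal_conv`, `implicit_filter`, `hyena_like_operator`), the sinusoidal positional features, the exponential decay window, and the random untrained weights are all hypothetical choices made for brevity. Only the overall pattern comes from the description above: a causal convolution with a sequence-length filter produced by a small network, followed by elementwise gating with a projection of the input.

```python
import numpy as np

def fft_causal_conv(u, h):
    """Causal convolution of u with h (both shaped (L, D)) via FFT."""
    L = u.shape[0]
    n = 2 * L  # zero-pad so the circular convolution equals the linear one
    U = np.fft.rfft(u, n=n, axis=0)
    H = np.fft.rfft(h, n=n, axis=0)
    return np.fft.irfft(U * H, n=n, axis=0)[:L]

def implicit_filter(L, D, rng, hidden=16, n_freq=4, decay=0.02):
    """Hypothetical filter generator: a tiny MLP maps positions to an (L, D) filter.

    The weights have shapes set by hidden, n_freq and D only, so the
    parameter count does not depend on the filter length L.
    """
    t = np.arange(L)[:, None] / L
    freqs = np.arange(1, n_freq + 1)[None, :]
    feats = np.concatenate(
        [t, np.sin(2 * np.pi * freqs * t), np.cos(2 * np.pi * freqs * t)], axis=1
    )  # (L, 1 + 2 * n_freq) positional features (an assumption)
    W1 = rng.normal(size=(feats.shape[1], hidden)) / np.sqrt(feats.shape[1])
    W2 = rng.normal(size=(hidden, D)) / np.sqrt(hidden)
    h = np.sin(feats @ W1) @ W2
    window = np.exp(-decay * np.arange(L))[:, None]  # assumed decay window
    return h * window

def hyena_like_operator(u, order=2, seed=0):
    """Alternate long convolutions and elementwise gates on an (L, D) input."""
    L, D = u.shape
    rng = np.random.default_rng(seed)
    # order + 1 pointwise linear projections: one value stream, `order` gates
    W = rng.normal(size=(order + 1, D, D)) / np.sqrt(D)
    z = u @ W[0]
    for i in range(1, order + 1):
        gate = u @ W[i]                      # data-controlled gate
        h = implicit_filter(L, D, rng)       # filter as long as the sequence
        z = gate * fft_causal_conv(z, h)     # convolve, then gate
    return z

if __name__ == "__main__":
    u = np.random.default_rng(1).normal(size=(32, 4))
    print(hyena_like_operator(u).shape)
```

Because the gates are pointwise in time and the convolution is causal, the output at position t depends only on inputs up to t, which is the property an autoregressive language model needs.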

Key Features

Hyena maintains several advantageous properties over traditional attention mechanisms:

Sublinear Parameter Scaling: Because the filters are generated by a small network rather than stored explicitly, the number of parameters is decoupled from sequence length, so the parameter budget can be spent elsewhere in the network.
Efficient Computational Complexity: The operator runs in roughly O(L log L) time for sequence length L, compared with the O(L^2) cost of attention.
Versatile Learning Capabilities: Despite being an attention-free architecture, Hyena demonstrates the capability to learn context at scale and generalize well across different domains, such as language modeling and vision tasks.
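
To make the complexity gap concrete, the short snippet below compares the leading-order terms L^2 and L log2 L at the sequence lengths mentioned in this summary. The helper `scaling_ratio` is a hypothetical name, and these are idealized operation counts with all constants ignored, so the ratios should not be read as predictions of measured speed-ups, which also depend on hardware and kernel implementations.

```python
import math

def scaling_ratio(seq_len):
    """Ratio of the quadratic term L^2 to the quasi-linear term L * log2(L)."""
    return seq_len ** 2 / (seq_len * math.log2(seq_len))

if __name__ == "__main__":
    for seq_len in (2048, 8192, 65536):
        print(f"L={seq_len:>6}: L^2 / (L log2 L) = {scaling_ratio(seq_len):.0f}")
```

The ratio simplifies to L / log2 L, so it keeps growing with sequence length; this is why the advantage of a quasi-linear operator shows up mainly at long contexts.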

Experimental Results

Hyena significantly narrows the performance gap with attention-based models across several tasks:

It improves accuracy by more than 50 points over state-space and other implicit and explicit subquadratic operators on recall and reasoning tasks, with sequences ranging from thousands to hundreds of thousands of tokens.
It sets a new state of the art for dense-attention-free architectures in language modeling on WikiText103 and The Pile, reaching Transformer quality with a 20% reduction in training compute at sequence length 2K.
For long sequence tasks, Hyena exhibits remarkable efficiency, demonstrating substantial speed-ups over optimized attention implementations, notably 2x faster at 8K sequence lengths and 100x faster at 64K sequence lengths.

Implications and Future Prospects

Hyena offers a promising direction for building efficient large-scale models that handle extended contexts. Its design principles may also apply beyond language, to other sequence modeling problems such as audio and video processing and biological signal processing.

Given its scalability and efficiency, future work could focus on further optimizing Hyena's convolutional operators for integration with specialized hardware and extending its applicability across even broader domains, including reinforcement learning and generative modeling of multimedia content.

Overall, Hyena is a compelling step toward convolutional language models that match, and may eventually surpass, their attention-based counterparts in both quality and computational efficiency.
