Sparse is Enough in Scaling Transformers

Published 24 Nov 2021 in cs.LG and cs.CL | (2111.12763v1)

Abstract: Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study sparse variants for all layers in the Transformer and propose Scaling Transformers, a family of next generation Transformer models that use sparse layers to scale efficiently and perform unbatched decoding much faster than the standard Transformer as we scale up the model size. Surprisingly, the sparse layers are enough to obtain the same perplexity as the standard Transformer with the same number of parameters. We also integrate with prior sparsity approaches to attention and enable fast inference on long sequences even with limited memory. This results in performance competitive to the state-of-the-art on long text summarization.


Authors (7)

Citations (95)


Summary

The paper demonstrates that sparse mechanisms applied to feedforward, QKV, and loss layers enable Transformer models to achieve performance comparable to dense architectures.
It employs dynamic sparsity frameworks and composite attention mechanisms that yield decoding speedups of up to 20x for large-scale models.
These findings challenge the traditional dense paradigm, suggesting that well-structured sparsity can scale models efficiently while reducing computational costs.

Sparse is Enough in Scaling Transformers: A Technical Overview

The paper "Sparse is Enough in Scaling Transformers" explores the integration of sparsity into Transformer architectures to enhance efficiency without sacrificing performance. The authors focus on crafting a family of models they dub "Scaling Transformers," which incorporate sparse layers in various components of the Transformer model to reduce computational overhead and increase decoding speed. This research notably challenges the prevailing paradigm that only dense Transformers can achieve state-of-the-art results, presenting evidence that sparse architectures can perform equally well while offering significant operational benefits.

Key Contributions

The core contribution of this paper is the demonstration that sparsity mechanisms can be effectively used across all key components of the Transformer architecture—particularly in the feedforward, QKV (query, key, value), and loss layers—yielding performance comparable to fully dense models. A nuanced methodology for sparsifying these components is introduced:

Sparse Feedforward Layers: The authors propose dynamic sparsity in which a controller selects the feedforward units that are active for each token. The discrete selection is trained with Gumbel-Softmax, and only the selected units' weights are used during inference.
Sparse QKV Layers: They develop a composite approach, a multiplicative layer followed by a convolutional layer, so that each attention head can still access any part of the embedding.
Sparse Loss Layers: Sparsification is extended to the final layer that computes the outputs, again using the proposed multiplicative layer.
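
The sparse feedforward idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the feedforward units are split into equal blocks and that a controller picks exactly one unit per block by argmax at inference time, and it omits the Gumbel-Softmax relaxation used during training. The names `select_units`, `sparse_ffn`, `controller`, and `block_size` are hypothetical.

```python
import random

def matvec(x, w):
    """Multiply row vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def select_units(x, controller, block_size):
    """Pick one active unit per block: argmax of controller scores inside each block."""
    scores = matvec(x, controller)
    active = []
    for start in range(0, len(scores), block_size):
        block = scores[start:start + block_size]
        active.append(start + max(range(len(block)), key=block.__getitem__))
    return active

def sparse_ffn(x, w_in, w_out, controller, block_size):
    """Feedforward pass that reads only the weights of the selected units."""
    y = [0.0] * len(w_out[0])
    for u in select_units(x, controller, block_size):
        # Hidden activation for unit u only: ReLU(x . w_in[:, u]).
        h = max(0.0, sum(x[i] * w_in[i][u] for i in range(len(x))))
        for j in range(len(y)):
            y[j] += h * w_out[u][j]
    return y

def dense_ffn_masked(x, w_in, w_out, active):
    """Reference: full dense pass, then zero every unit outside `active`."""
    keep = set(active)
    hidden = [max(0.0, h) if u in keep else 0.0
              for u, h in enumerate(matvec(x, w_in))]
    return matvec(hidden, w_out)

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d_model, d_ff, block_size = 4, 8, 4  # toy sizes, chosen only for illustration
w_in = rand_matrix(d_model, d_ff)
w_out = rand_matrix(d_ff, d_model)
controller = rand_matrix(d_model, d_ff)
x = [random.uniform(-1, 1) for _ in range(d_model)]

active = select_units(x, controller, block_size)
y_sparse = sparse_ffn(x, w_in, w_out, controller, block_size)
y_dense = dense_ffn_masked(x, w_in, w_out, active)
```

Here `y_sparse` and `y_dense` agree up to floating-point error, but `sparse_ffn` reads only `d_ff // block_size` columns of `w_in` and rows of `w_out`, which is where the inference saving would come from under these assumptions.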

Results

The experimental findings are substantial:

The sparse architecture speeds up decoding by over 2.6x for an 800M-parameter model and by up to 20x for a 17B-parameter model, measured against the dense baseline.
Sparse models match the perplexity of dense counterparts with the same number of parameters on the C4 dataset, supporting the claim that sparsity does not cost modeling quality.
The Scaling Transformer is extended into the "Terraformer" for long-sequence tasks, adding reversible layers for memory efficiency and sparse attention mechanisms from the Reformer model.
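
One way to see why sparsity helps unbatched decoding in particular is to count the feedforward weights read per token. The sketch below is a back-of-the-envelope estimate under stated assumptions, not a reproduction of the reported speedups: it assumes one active unit per block of `block_size` units, ignores the cost of the controller and of attention, and uses made-up layer sizes. The function name `ffn_weights_touched` is hypothetical.

```python
def ffn_weights_touched(d_model, d_ff, block_size=1):
    """Weights read by one token in a two-matrix feedforward layer.

    block_size=1 means every unit is active (the dense case); otherwise
    one unit per block is active, so d_ff // block_size units are used.
    """
    active_units = d_ff // block_size
    # Each active unit reads one column of the input projection
    # and one row of the output projection, both of length d_model.
    return 2 * d_model * active_units

# Hypothetical sizes, chosen only for illustration.
d_model, d_ff, block_size = 1024, 4096, 64
dense = ffn_weights_touched(d_model, d_ff)
sparse = ffn_weights_touched(d_model, d_ff, block_size)
print(f"dense: {dense}, sparse: {sparse}, ratio: {dense // sparse}x")
```

In this simplified model the saving for the feedforward layer alone equals the block size. The end-to-end decoding speedup would be smaller, because the controller, attention, and the remaining layers still take time.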

Implications and Future Directions

Practical Implications: The substantial reductions in computation and the faster decoding position sparse Transformers as a viable way to scale up LLMs without a commensurate increase in resource demand. Efficient inference is particularly appealing in environments with constrained computational or budgetary resources.

Theoretical Implications: The results challenge the traditional notion that denser models are inherently superior, suggesting that appropriately structured sparsity does not detract from model capability. This opens avenues for further exploration into the theoretical underpinnings of sparsity in large-scale models.

Future Developments: The work focuses on inference speed and does not deliver comparable gains in training efficiency, which leaves a clear avenue for future research. Combining sparsity with techniques like quantization could compound the benefits. Exploring the impact of sparsity across varied architectural configurations and tasks could show how broadly the approach applies.

Conclusion

This research makes a compelling case that sparse implementations can match the quality previously associated with dense architectures while decoding much faster, without the anticipated compromises in performance metrics. As computational demands for AI models grow, the findings from this paper emphasize that dense is not always necessary; sparse is enough. This work not only makes current models more efficient but also paves the way for more sustainable and accessible AI technology. Inviting the community to refine and extend these findings could foster broader adoption and innovation in Transformer-based architectures.
