Gated Linear Attention Transformers with Hardware-Efficient Training

(2312.06635)
Published Dec 11, 2023 in cs.LG and cs.CL

Abstract

Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention generally underperforms ordinary softmax attention. Moreover, current implementations of linear attention lack I/O-awareness and are thus slower than highly optimized implementations of softmax attention. This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability. The resulting implementation, dubbed FLASHLINEARATTENTION, is faster than FLASHATTENTION-2 (Dao, 2023) as a standalone layer even at short sequence lengths (e.g., 1K). We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates. When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention (GLA) Transformer is found to perform competitively against the LLaMA-architecture Transformer (Touvron et al., 2023) as well as recent linear-time-inference baselines such as RetNet (Sun et al., 2023a) and Mamba (Gu & Dao, 2023) on moderate-scale language modeling experiments. The GLA Transformer is especially effective at length generalization, enabling a model trained on 2K to generalize to 28K on PG19 without significant perplexity degradations. For training speed, the GLA Transformer has higher throughput than a similarly-sized Mamba model.
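To make the parallel/recurrent duality mentioned in the abstract concrete, the following is a minimal PyTorch sketch of simplified, unnormalized, single-head causal linear attention. The function names and shapes are illustrative assumptions rather than the paper's implementation; the point is only that the parallel (training) form and the recurrent form with a matrix-valued state produce identical outputs.

```python
import torch

def linear_attention_parallel(q, k, v):
    """Parallel (training) form: causal linear attention without softmax.
    q, k, v: (T, d). Returns outputs of shape (T, d)."""
    T = q.shape[0]
    scores = q @ k.t()                                    # (T, T) pairwise similarities
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    return scores.masked_fill(~mask, 0.0) @ v             # causal masking, then mix values

def linear_attention_recurrent(q, k, v):
    """Recurrent (inference) form with a 2D matrix-valued state S of shape (d, d)."""
    T, d = q.shape
    S = torch.zeros(d, d)
    outputs = []
    for t in range(T):
        S = S + torch.outer(k[t], v[t])   # state update: S_t = S_{t-1} + k_t v_t^T
        outputs.append(q[t] @ S)          # output: o_t = q_t S_t
    return torch.stack(outputs)

q, k, v = (torch.randn(16, 8) for _ in range(3))
assert torch.allclose(linear_attention_parallel(q, k, v),
                      linear_attention_recurrent(q, k, v), atol=1e-4)
```

The recurrent form is what gives linear attention its linear-time, constant-memory inference, while the parallel form is what makes training efficient on GPUs.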

Overview

  • The paper 'Gated Linear Attention Transformers with Hardware-Efficient Training' introduces Gated Linear Attention (GLA) Transformers, together with FlashLinearAttention, an I/O-aware, hardware-efficient algorithm for computing linear attention on modern GPUs.

  • GLA Transformers add a data-dependent gating mechanism that controls how the recurrent state decays over time while retaining a parallel training form, and they perform competitively on language modeling benchmarks against models such as LLaMA, RetNet, and Mamba.

  • The empirical results highlight GLA Transformers' superior training throughput and effective long-range dependency handling, opening avenues for scaling up, cross-modal applications, and further algorithmic optimizations.


The paper "Gated Linear Attention Transformers with Hardware-Efficient Training" presents advancements in Transformer architectures that leverage linear attention mechanisms to improve computational efficiency, particularly in hardware-limited environments. The proposed Gated Linear Attention (GLA) Transformer introduces a hardware-efficient algorithm for linear attention that strategically manages memory movement and parallelizability. This approach, known as FlashLinearAttention, is benchmarked against softmax attention-based Transformers and other linear attention variants, showing competitive performance on both training speed and model accuracy.

Contributions

The primary contributions of the paper include:

  1. FlashLinearAttention Algorithm: The paper introduces FlashLinearAttention, a novel linear attention algorithm optimized for hardware efficiency. It addresses the inefficiencies of prior linear attention methods by avoiding excessive memory movements and better utilizing parallel computation resources.
  2. Data-Dependent Gating Mechanism: The paper extends linear attention with data-dependent gates, creating Gated Linear Attention (GLA). This mechanism replaces the fixed, data-independent decay rate used in prior models with a more expressive, data-aware variant that improves model flexibility and performance.
  3. Empirical Benchmarking: Extensive experiments validate the GLA Transformer against existing models such as LLaMA, RetNet, and Mamba on a range of benchmarks. The results indicate that GLA Transformers match or exceed these baselines on language modeling tasks and exhibit strong length generalization.

Technical Details

FlashLinearAttention

FlashLinearAttention achieves hardware efficiency through two key strategies:

  1. Tiling and Memory Hierarchy Awareness: The algorithm breaks down computations into tiles that fit into fast, on-chip memory (SRAM), significantly reducing the reliance on slower global memory (HBM).
  2. Parallel and Sequential I/O Operations: Depending on memory constraints, it employs either a materialization approach, which stores intermediate chunk-level states in HBM to enable greater sequence-level parallelism, or a non-materialization approach, which recomputes those states to save memory at the cost of extra computation; a simplified chunkwise sketch follows this list.
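Below is a minimal PyTorch sketch of the chunkwise computation pattern described above, written for a single head with unnormalized linear attention. The chunk size, function name, and shapes are illustrative assumptions; the paper's actual kernels operate at a much lower level (e.g., Triton), and the materialization variant additionally stores the per-chunk states in HBM so that chunks can be processed in parallel.

```python
import torch

def chunkwise_linear_attention(q, k, v, chunk_size=64):
    """Chunkwise causal linear attention (simplified, single head).

    Inter-chunk contributions flow through a running matrix-valued state S
    (one matmul per chunk), while intra-chunk contributions use a small
    chunk_size x chunk_size masked attention matrix, the part intended to
    live in fast on-chip memory (SRAM).
    q, k, v: (T, d) with T divisible by chunk_size."""
    T, d = q.shape
    out = torch.empty_like(v)
    S = torch.zeros(d, d)                                 # running inter-chunk state
    mask = torch.tril(torch.ones(chunk_size, chunk_size, dtype=torch.bool))
    for start in range(0, T, chunk_size):
        sl = slice(start, start + chunk_size)
        q_c, k_c, v_c = q[sl], k[sl], v[sl]
        inter = q_c @ S                                   # contribution of all earlier chunks
        scores = (q_c @ k_c.t()).masked_fill(~mask, 0.0)
        out[sl] = inter + scores @ v_c                    # add causal intra-chunk contribution
        S = S + k_c.t() @ v_c                             # fold this chunk into the state
    return out

q, k, v = (torch.randn(256, 64) for _ in range(3))
# Matches the naive parallel form of causal linear attention.
ref = (q @ k.t()).masked_fill(~torch.tril(torch.ones(256, 256, dtype=torch.bool)), 0.0) @ v
assert torch.allclose(chunkwise_linear_attention(q, k, v), ref, atol=1e-3)
```

The explicit loop above corresponds to the sequential, non-materialization pattern: only a single running state S is kept, and the intra-chunk work stays small enough to fit on-chip.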

Gated Linear Attention (GLA)

GLA introduces a gating mechanism that dynamically adjusts based on the input data:

  1. Matrix-Valued Gates: Instead of using a fixed decay factor, GLA uses data-dependent gates calculated through a low-rank linear transformation followed by a sigmoid function. This allows finer control over the retention of information across time steps.
  2. Parallel Computation Form: The paper also derives a chunkwise parallel form for GLA, showing that efficient, hardware-friendly parallel computation remains possible despite the complexity added by the gates; a minimal recurrent-form sketch follows this list.
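To illustrate the gated update in its simplest (recurrent) form, here is a single-head PyTorch sketch. The projection sizes, the sigmoid temperature, and the module name are illustrative assumptions; the paper trains with the chunkwise parallel form rather than this explicit loop.

```python
import torch
import torch.nn as nn

class GatedLinearAttention(nn.Module):
    """Recurrent-form sketch of a single gated linear attention head.

    The data-dependent gate alpha_t comes from a low-rank projection followed
    by a sigmoid (raised to 1/tau to keep decay rates close to 1), and it
    decays each row of the matrix-valued state. All sizes are illustrative."""
    def __init__(self, d_model=256, d_key=128, d_value=256, rank=16, tau=16.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_key, bias=False)
        self.k_proj = nn.Linear(d_model, d_key, bias=False)
        self.v_proj = nn.Linear(d_model, d_value, bias=False)
        # low-rank gate: d_model -> rank -> d_key
        self.gate_proj = nn.Sequential(
            nn.Linear(d_model, rank, bias=False),
            nn.Linear(rank, d_key, bias=True),
        )
        self.tau = tau

    def forward(self, x):
        """x: (T, d_model); returns (T, d_value) via the constant-memory recurrence."""
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        alpha = torch.sigmoid(self.gate_proj(x)) ** (1.0 / self.tau)   # (T, d_key), entries in (0, 1)
        S = x.new_zeros(k.shape[-1], v.shape[-1])                      # matrix-valued state (d_key, d_value)
        outputs = []
        for t in range(x.shape[0]):
            S = alpha[t].unsqueeze(-1) * S + torch.outer(k[t], v[t])   # S_t = diag(alpha_t) S_{t-1} + k_t v_t^T
            outputs.append(q[t] @ S)                                   # o_t = q_t S_t
        return torch.stack(outputs)

gla = GatedLinearAttention()
y = gla(torch.randn(32, 256))   # (32, 256)
```

At inference time this recurrence gives linear-time decoding with a constant-size state; fixing alpha to a data-independent constant roughly recovers the fixed-decay recurrence used in models such as RetNet.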

Empirical Results

The empirical evaluation of the GLA Transformer encompasses several dimensions:

  1. Synthetic Tasks: The Multi-Query Associative Recall (MQAR) task shows that GLA outperforms scalar decay-based models like RetNet, validating the effectiveness of the data-dependent gating mechanism.
  2. Language Modeling: On language modeling benchmarks, GLA Transformers achieve competitive perplexity and downstream accuracy, closely matching strong baselines, including a LLaMA-architecture Transformer, at moderate scale.
  3. Training Efficiency: GLA Transformers offer superior training throughput compared to Mamba and traditional Transformers, particularly when leveraging the materialization strategy for handling longer sequences.

Future Directions

The findings in this paper point towards several future research avenues:

  1. Scaling Up: Given the promising empirical results at moderate scales, the next step involves scaling GLA Transformers to larger models and datasets to explore their potential at industry-relevant scales.
  2. Cross-Modal Applications: Extending GLA mechanisms to other domains, such as vision and audio, where long-range dependencies are critical, could further validate its versatility and efficiency.
  3. Further Optimization: Continued enhancements in hardware-aware algorithms, potentially integrating emerging memory technologies or specialized computation units, could further improve the efficiency and performance of GLA Transformers.

Conclusion

The paper offers a significant step forward in the development of efficient Transformer architectures by integrating gated mechanisms into linear attention and optimizing their implementation for modern hardware. The GLA Transformer, underpinned by the FlashLinearAttention algorithm, presents a compelling alternative to conventional models, balancing computational efficiency with modeling power. This work opens new pathways for deploying large-scale neural models in resource-constrained settings while maintaining strong performance.
