Abstract

Linear attention is an efficient attention mechanism that has recently emerged as a promising alternative to conventional softmax attention. With its ability to process tokens with linear computational complexity, linear attention, in theory, can handle sequences of unlimited length without sacrificing speed, i.e., maintaining a constant training speed for various sequence lengths with a fixed memory consumption. However, due to the issue with cumulative summation (cumsum), current linear attention algorithms cannot demonstrate their theoretical advantage in a causal setting. In this paper, we present Lightning Attention-2, the first linear attention implementation that enables linear attention to realize its theoretical computational benefits. To achieve this, we leverage the concept of tiling, separately handling the intra-block and inter-block components in the linear attention calculation. Specifically, we utilize the conventional attention computation mechanism for the intra-blocks and apply linear attention kernel tricks for the inter-blocks. The tiling technique is adopted throughout both the forward and backward procedures to take full advantage of the GPU hardware. We implement our algorithm in Triton to make it IO-aware and hardware-friendly. Various experiments are conducted on different model sizes and sequence lengths. Lightning Attention-2 retains consistent training and inference speed regardless of input sequence length and is significantly faster than other attention mechanisms. The source code is available at https://github.com/OpenNLPLab/lightning-attention.

Figure: Comparison of FlashAttention vs. Lightning Attention efficiency across different model sizes and sequence lengths.

Overview

  • The paper introduces 'Lightning Attention-2', a novel linear attention mechanism for LLMs that tackles the computational challenges of long input sequences.

  • It adopts a 'divide and conquer' approach to realize the theoretical benefits of linear attention while remaining efficient on GPU hardware.

  • Lightning Attention-2 implements a tiling technique and uses Triton for optimization to maintain speed and memory efficiency during training and inference.

  • Empirical testing shows that Lightning Attention-2 outperforms previous mechanisms like Lightning Attention-1 and FlashAttention-2 in terms of speed and memory usage, without compromising on accuracy.

  • This new mechanism allows for the scaling of LLMs to handle unlimited sequence lengths while optimizing computational resources.

Introduction

The Transformer architecture has become ubiquitous in the field of LLMs, seeing widespread adoption across a variety of applications. However, one of its core components, the softmax attention mechanism, poses computational challenges due to its quadratic complexity, particularly when handling very long input sequences. Addressing this bottleneck, researchers have explored linear attention as a promising alternative, illustrated by the sketch below.
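
The following is a minimal NumPy sketch of the generic linear attention "kernel trick" in its simplest, non-causal and unnormalized form, not the paper's exact formulation (practical linear attention variants typically add normalization or decay terms). By associativity, the output can be computed as Q(KᵀV) with a small d×d state instead of (QKᵀ)V with an n×n score matrix, which is where the linear complexity comes from.

```python
import numpy as np

n, d = 4096, 64                      # sequence length, head dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Left-product (Q K^T) V materializes an n x n matrix: O(n^2 d) time, O(n^2) memory.
out_left = (Q @ K.T) @ V

# Right-product Q (K^T V) only ever forms a d x d state: O(n d^2) time, O(d^2) memory.
out_right = Q @ (K.T @ V)

# By associativity, the two orderings are mathematically identical.
assert np.allclose(out_left, out_right)
```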

The Limitation of Linear Attention

Although linear attention offers an enticing theoretical advantage, namely the ability to process sequences of any length at constant training speed within a fixed memory budget, practical application reveals obstacles, and the theoretical gains have failed to translate into real-world performance benefits. Notably, in a causal setting, which is essential for tasks like language modeling, existing linear attention mechanisms struggle to demonstrate their advantage because the required cumulative summation (cumsum) must be computed sequentially, which maps poorly onto parallel GPU hardware.
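
The sketch below, again in a simplified unnormalized form rather than the paper's exact formulation, shows the cumsum issue: in the causal case, each output depends on a running sum of key-value outer products that must be updated token by token, while the mathematically equivalent parallel form needs an explicit causal mask that reintroduces the n×n matrix.

```python
import numpy as np

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Recurrent form: o_t = q_t @ (sum_{s <= t} k_s^T v_s).
# The running d x d state is a cumulative sum updated position by position.
kv = np.zeros((d, d))
out_recurrent = np.empty((n, d))
for t in range(n):
    kv += np.outer(K[t], V[t])       # sequential cumsum over outer products
    out_recurrent[t] = Q[t] @ kv

# Parallel form: the causal mask blocks the right-product trick,
# so the n x n score matrix comes back.
mask = np.tril(np.ones((n, n)))
out_masked = ((Q @ K.T) * mask) @ V

assert np.allclose(out_recurrent, out_masked)
```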

Lightning Attention-2

The paper presents 'Lightning Attention-2,' a new linear attention mechanism that claims to be the first to fully realize the theoretical computational benefits of linear attention. This is achieved through an approach the authors term 'divide and conquer': the attention computation is split into intra-block and inter-block components, with conventional attention computation applied within blocks and linear attention kernel tricks applied across blocks, allowing Lightning Attention-2 to make efficient use of GPU hardware.

The approach applies a tiling technique in both the forward and backward passes, making it IO-aware and hardware-friendly. Its practical implementation is facilitated by Triton, an intermediate language and compiler for optimizing neural network computations on GPUs. This design promises consistent training and inference speeds regardless of sequence length, without increasing memory consumption.
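
As a rough illustration of the intra-/inter-block split described above, the following NumPy sketch computes the forward pass block by block under the same simplifying assumptions as before (no decay factors or normalization); the actual implementation is written as Triton GPU kernels and tiles the backward pass in the same way.

```python
import numpy as np

def blockwise_causal_linear_attention(Q, K, V, block=64):
    """Forward pass of causal linear attention computed block by block.

    NumPy sketch of the intra-/inter-block split; the real kernel runs in
    Triton on GPU tiles, adds decay factors, and also implements the
    backward pass in the same tiled fashion.
    """
    n, d = Q.shape
    kv = np.zeros((d, d))            # inter-block state: sum of k^T v over past blocks
    out = np.empty_like(Q)
    for start in range(0, n, block):
        end = min(start + block, n)
        q, k, v = Q[start:end], K[start:end], V[start:end]
        # Inter-block part: contribution of all previous blocks via the kv state.
        out_inter = q @ kv
        # Intra-block part: conventional masked attention within the block.
        mask = np.tril(np.ones((end - start, end - start)))
        out_intra = ((q @ k.T) * mask) @ v
        out[start:end] = out_inter + out_intra
        # One state update per block instead of a per-token cumsum.
        kv += k.T @ v
    return out

# Sanity check against the fully masked reference computation.
n, d = 256, 32
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
reference = ((Q @ K.T) * np.tril(np.ones((n, n)))) @ V
assert np.allclose(blockwise_causal_linear_attention(Q, K, V), reference)
```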

Empirical Evaluation

Lightning Attention-2 has been rigorously tested across various model sizes and sequence lengths. The authors compared its performance against other methods, including its predecessor, Lightning Attention-1, and the FlashAttention-2 algorithm. The results were clear: Lightning Attention-2 was significantly faster and maintained consistent training speed as the sequence length increased. It also had a lower memory footprint than FlashAttention-2, while accuracy was not compromised.

In summary, Lightning Attention-2 represents a significant step forward in attention mechanism design for LLMs, offering sustainable scaling capabilities for increasingly large models and providing a pathway towards handling unlimited sequence lengths with greater efficiency.
