FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Published 11 Jul 2024 in cs.LG and cs.AI | (2407.08608v2)

Abstract: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for LLMs and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0× with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6× lower numerical error than a baseline FP8 attention.

Authors (6)

Citations (44)

Summary

The paper introduces an attention algorithm tuned for Hopper GPUs that uses producer-consumer asynchrony and pipelined GEMM-softmax to achieve a 1.5–2.0× speedup over FlashAttention-2 on H100 GPUs.
The paper employs a novel two-stage pipelining method that overlaps matrix multiplications and softmax computations to maximize hardware utilization.
The paper adapts the algorithm to FP8 low-precision GEMM, which doubles Tensor Core throughput relative to FP16, and achieves 2.6× lower numerical error than a baseline FP8 attention.

The paper presents FlashAttention-3, an attention algorithm that speeds up a core layer of large Transformer models by exploiting features of recent GPU hardware. It builds on FlashAttention and FlashAttention-2, the latter of which reaches only 35% utilization on the H100, and it specifically targets NVIDIA's Hopper architecture (H100).

The paper's main contribution is three techniques: producer-consumer asynchrony, pipelined GEMM-softmax operations, and hardware-accelerated low-precision GEMM in FP8. Together they yield a 1.5-2.0× speedup over FlashAttention-2 on H100 GPUs, with FP16 reaching up to 740 TFLOPs/s (75% utilization) and FP8 approaching 1.2 PFLOPs/s. In addition, FP8 FlashAttention-3 shows 2.6× lower numerical error than a baseline FP8 attention.
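
The 75% utilization figure quoted in the abstract can be sanity-checked with a line of arithmetic. The sketch below assumes dense Tensor Core peaks of roughly 989 TFLOPs/s (FP16) and 1979 TFLOPs/s (FP8) for an H100 SXM; those peak values come from NVIDIA's published specifications, not from this summary, and the names `PEAK_TFLOPS`, `REPORTED_TFLOPS`, and `utilization` are illustrative.

```python
# Assumed dense Tensor Core peaks for an H100 SXM, in TFLOPs/s.
# These are datasheet figures (rounded), not numbers taken from the summary.
PEAK_TFLOPS = {"fp16": 989.0, "fp8": 1979.0}

# Throughput figures quoted in the paper's abstract.
REPORTED_TFLOPS = {"fp16": 740.0, "fp8": 1200.0}


def utilization(achieved_tflops: float, peak_tflops: float) -> float:
    """Fraction of the peak throughput that was actually achieved."""
    return achieved_tflops / peak_tflops


for dtype, achieved in REPORTED_TFLOPS.items():
    share = utilization(achieved, PEAK_TFLOPS[dtype])
    print(f"{dtype}: {achieved:.0f} / {PEAK_TFLOPS[dtype]:.0f} TFLOPs/s = {share:.0%}")
```

Under these assumed peaks the FP16 figure works out to about 75%, in line with the abstract, while the FP8 figure corresponds to roughly 60% of a peak that is twice the FP16 one.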

Producer-Consumer Asynchrony: FlashAttention-3 assigns data loading and computation to separate warps (warp specialization) and software-pipelines them, so that asynchronous data movement through the TMA overlaps with Tensor Core computation. This hides memory latency and keeps the compute units busier.
GEMM-Softmax Pipelining: The algorithm uses a 2-stage pipeline that overlaps the matrix multiplications (GEMMs) with the softmax computation. Although the softmax depends on the output of the preceding GEMM, scheduling work from different blocks side by side lets one proceed while the other is in flight, reducing idle time on the GPU's compute units.
FP8 Low-Precision Computation: Adapting the algorithm to FP8 requires handling the layout constraints of FP8 matrix multiplication and limiting quantization error, which the paper addresses with block quantization and incoherent processing. These adaptations exploit FP8's doubled Tensor Core throughput relative to FP16 while keeping accuracy at an acceptable level.
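
The producer-consumer split in the first technique can be mimicked on a CPU, purely as an analogy. In the sketch below a Python thread stands in for the producer warp and a bounded `queue.Queue` stands in for the on-chip buffer that tiles are loaded into; nothing here uses the TMA or Tensor Cores, and the function names, tile size, and two-slot buffer are assumptions made for illustration. The consumer uses the online-softmax recurrence so that it can process K/V tiles as they arrive.

```python
import math
import queue
import random
import threading


def producer(k, v, block, buffer):
    """Stand-in for the producer warp: stream K/V tiles into a bounded buffer."""
    for start in range(0, len(k), block):
        # put() blocks while the buffer is full, like waiting for a free slot.
        buffer.put((k[start:start + block], v[start:start + block]))
    buffer.put(None)  # sentinel: no more tiles


def consumer(q_row, buffer, result):
    """Stand-in for a consumer warp: online softmax over tiles as they arrive."""
    scale = 1.0 / math.sqrt(len(q_row))
    running_max, running_sum, acc = -math.inf, 0.0, None
    while True:
        tile = buffer.get()  # blocks until the producer has delivered a tile
        if tile is None:
            break
        k_blk, v_blk = tile
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescales earlier tiles
        p = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(p)
        if acc is None:
            acc = [0.0] * len(v_blk[0])
        acc = [a * correction + sum(pj * row[c] for pj, row in zip(p, v_blk))
               for c, a in enumerate(acc)]
        running_max = new_max
    result.extend(a / running_sum for a in acc)


random.seed(0)
d, dv, n = 8, 4, 10
q_row = [random.gauss(0, 1) for _ in range(d)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(dv)] for _ in range(n)]

buffer = queue.Queue(maxsize=2)  # two slots: loads run at most two tiles ahead
result = []
workers = [threading.Thread(target=producer, args=(k, v, 4, buffer)),
           threading.Thread(target=consumer, args=(q_row, buffer, result))]
for w in workers:
    w.start()
for w in workers:
    w.join()

# Reference: ordinary softmax attention for the same query row.
s = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
weights = [math.exp(x - max(s)) for x in s]
expected = [sum(wt * row[c] for wt, row in zip(weights, v)) / sum(weights)
            for c in range(dv)]
max_diff = max(abs(a - b) for a, b in zip(result, expected))
print(f"max abs difference vs. reference: {max_diff:.2e}")
```

The bounded buffer is the point of the design: the producer can run ahead of the consumer by a fixed number of tiles, so loading and computing overlap without unbounded memory use.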
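
The 2-stage GEMM-softmax schedule can be illustrated as a reordered loop. The sketch below is a simplified ordering chosen for illustration, not necessarily the paper's exact schedule: the score GEMM for block j+1 is issued before the softmax of block j, so that on asynchronous hardware the two could be in flight together. In plain Python the statements simply run in that order, so the example only checks that the reordering leaves the result unchanged. The helper names and block size are assumptions.

```python
import math
import random


def matmul_t(a, b):
    """Return a @ b^T for row-major lists of lists."""
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in b] for row_a in a]


def reference_attention(q, k, v):
    """Plain softmax(Q K^T / sqrt(d)) V, computed all at once."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for scores in matmul_t(q, k):
        m = max(s * scale for s in scores)
        p = [math.exp(s * scale - m) for s in scores]
        total = sum(p)
        out.append([sum(pj * row[c] for pj, row in zip(p, v)) / total
                    for c in range(len(v[0]))])
    return out


def pipelined_attention(q, k, v, block=4):
    """Block-wise attention with the next score GEMM issued ahead of the softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_blocks = [k[i:i + block] for i in range(0, len(k), block)]
    v_blocks = [v[i:i + block] for i in range(0, len(v), block)]

    row_max = [-math.inf] * len(q)
    row_sum = [0.0] * len(q)
    acc = [[0.0] * len(v[0]) for _ in q]

    s_next = matmul_t(q, k_blocks[0])               # prologue: first score GEMM
    for j, v_blk in enumerate(v_blocks):
        s_cur = s_next
        if j + 1 < len(k_blocks):
            s_next = matmul_t(q, k_blocks[j + 1])   # score GEMM for the next block
        for r, scores in enumerate(s_cur):          # softmax for the current block
            new_max = max(row_max[r], max(s * scale for s in scores))
            correction = math.exp(row_max[r] - new_max)
            p = [math.exp(s * scale - new_max) for s in scores]
            row_sum[r] = row_sum[r] * correction + sum(p)
            # Second GEMM: rescale the running output, then add P @ V for this block.
            acc[r] = [a * correction + sum(pj * row[c] for pj, row in zip(p, v_blk))
                      for c, a in enumerate(acc[r])]
            row_max[r] = new_max
    return [[a / row_sum[r] for a in acc[r]] for r in range(len(q))]


random.seed(0)
q = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
k = [[random.gauss(0, 1) for _ in range(8)] for _ in range(10)]
v = [[random.gauss(0, 1) for _ in range(5)] for _ in range(10)]

ref = reference_attention(q, k, v)
out = pipelined_attention(q, k, v, block=4)
max_diff = max(abs(a - b) for ro, rr in zip(out, ref) for a, b in zip(ro, rr))
print(f"max abs difference vs. reference: {max_diff:.2e}")
```

The data dependency is respected because each softmax only reads scores that were produced one step earlier; what changes is that an independent GEMM is always available to fill the time the softmax takes.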
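
The two FP8 accuracy techniques can be sketched on toy data. The example below makes several assumptions that are not stated in this summary: block quantization is taken to mean one scale per block of rows, incoherent processing is modeled as multiplying Q and K by a random-sign Hadamard rotation, and `fake_fp8` is a crude e4m3-like rounding (3 explicit mantissa bits, clamp at 448) that ignores subnormals and underflow. It checks two properties that hold regardless of the number format: the rotation leaves Q K^T unchanged, and it spreads a planted outlier so the largest magnitude in the row shrinks. It does not attempt to reproduce the paper's 2.6× error figure.

```python
import math
import random

FP8_MAX = 448.0  # largest finite e4m3 value


def fake_fp8(x):
    """Crude e4m3-like rounding: clamp, then keep 4 significant bits."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_MAX, min(FP8_MAX, x))
    mant, exp = math.frexp(x)  # x = mant * 2**exp with 0.5 <= |mant| < 1
    return math.ldexp(round(mant * 16) / 16, exp)


def quantize_blocks(rows, block_rows):
    """One scale per block of rows; values are returned already dequantized."""
    out, scales = [], []
    for start in range(0, len(rows), block_rows):
        blk = rows[start:start + block_rows]
        amax = max(abs(x) for row in blk for x in row)
        scale = amax / FP8_MAX if amax > 0 else 1.0
        scales.append(scale)
        out.extend([fake_fp8(x / scale) * scale for x in row] for row in blk)
    return out, scales


def hadamard(vec):
    """Normalized fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    h = 1
    while h < len(out):
        for i in range(0, len(out), 2 * h):
            for j in range(i, i + h):
                out[j], out[j + h] = out[j] + out[j + h], out[j] - out[j + h]
        h *= 2
    norm = math.sqrt(len(out))
    return [x / norm for x in out]


def rotate(rows, signs):
    """Multiply every row by M = diag(signs) @ H, which is orthogonal."""
    return [hadamard([s * x for s, x in zip(signs, row)]) for row in rows]


def scores(q, k):
    return [[sum(a * b for a, b in zip(qr, kr)) for kr in k] for qr in q]


random.seed(0)
d, n = 16, 8
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
q[0][3] = 50.0  # planted outlier
signs = [random.choice((-1.0, 1.0)) for _ in range(d)]

q_rot, k_rot = rotate(q, signs), rotate(k, signs)
rotation_gap = max(abs(a - b)
                   for ra, rb in zip(scores(q, k), scores(q_rot, k_rot))
                   for a, b in zip(ra, rb))
peak_before = max(abs(x) for x in q[0])
peak_after = max(abs(x) for x in q_rot[0])

q_deq, q_scales = quantize_blocks(q, block_rows=4)

print(f"Q K^T change after rotation: {rotation_gap:.2e}")
print(f"largest |entry| in outlier row: {peak_before:.1f} -> {peak_after:.1f}")
print(f"per-block scales: {[f'{s:.4f}' for s in q_scales]}")
```

Because M is orthogonal, (QM)(KM)^T equals Q K^T exactly in real arithmetic, so the rotation costs nothing in accuracy before quantization. The per-block scales show the other half of the idea: only the block containing the outlier needs a large scale, while the remaining block keeps a scale matched to its own range.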

The implications of FlashAttention-3 are both conceptual and practical. Conceptually, combining asynchrony with low-precision arithmetic is a concrete example of hardware-software co-design, and it challenges the assumption that attention kernels must be synchronous and high-precision. Practically, faster attention benefits applications that require long-context processing, such as document retrieval, multimedia interaction, and navigation of large codebases.

Looking ahead, there's potential for broader applicability of these algorithmic innovations beyond Transformers, particularly in efficiency-driven domains such as image and video processing. The paper's approach provides insights for future refinement of GPU-based computations, especially as precision demands and hardware capabilities evolve.

In conclusion, FlashAttention-3 is a notable advance in efficient and accurate attention, and it shows the benefit of aligning algorithm design with hardware developments. Researchers working on large-scale learning and hardware acceleration are likely to find these methods useful when designing optimized, scalable models.
