A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library (2312.11918v1)
Abstract: We provide an optimized implementation of the forward pass of FlashAttention-2, a popular memory-aware scaled dot-product attention algorithm, as a custom fused CUDA kernel targeting NVIDIA Hopper architecture and written using the open-source CUTLASS library. In doing so, we explain the challenges and techniques involved in fusing online-softmax with back-to-back GEMM kernels, utilizing the Hopper-specific Tensor Memory Accelerator (TMA) and Warpgroup Matrix-Multiply-Accumulate (WGMMA) instructions, defining and transforming CUTLASS Layouts and Tensors, overlapping copy and GEMM operations, and choosing optimal tile sizes for the Q, K, and V attention matrices while balancing register pressure and shared memory utilization. In head-to-head benchmarks on a single H100 PCIe GPU for some common choices of hyperparameters, we observe 20-50% higher FLOPs/s over a version of FlashAttention-2 optimized for last-generation NVIDIA Ampere architecture.
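The abstract compresses several kernel-level techniques (TMA copies, WGMMA tiles, CuTe layouts) that only make sense inside the fused kernel, but the online-softmax bookkeeping around which the fusion is organized can be sketched independently. Below is a minimal scalar C++ reference of the FlashAttention-2 forward-pass update for a single query row, assuming an illustrative K/V tile size `Bk`; it shows the running max, running sum, accumulator rescaling, and deferred normalization, and is not the fused Hopper kernel itself.

```cpp
// Scalar reference for the FlashAttention-2 forward pass over one query row.
// Per K/V tile: update the running max m and softmax denominator l, rescale
// the unnormalized output accumulator, and defer the division by l to the end.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// q: [d], K/V: [N][d], o: [d]. Bk is an illustrative K/V tile size.
void attention_row_online(const std::vector<float>& q,
                          const std::vector<std::vector<float>>& K,
                          const std::vector<std::vector<float>>& V,
                          std::vector<float>& o, int Bk) {
    const int N = (int)K.size();
    const int d = (int)q.size();
    const float scale = 1.0f / std::sqrt((float)d);

    float m = -INFINITY;   // running row maximum
    float l = 0.0f;        // running softmax denominator
    o.assign(d, 0.0f);     // unnormalized output accumulator

    for (int j0 = 0; j0 < N; j0 += Bk) {                 // loop over K/V tiles
        const int j1 = std::min(j0 + Bk, N);

        // First GEMM of the tile: s_j = (q . k_j) * scale, plus the tile max.
        std::vector<float> s(j1 - j0);
        float m_tile = -INFINITY;
        for (int j = j0; j < j1; ++j) {
            float dot = 0.0f;
            for (int k = 0; k < d; ++k) dot += q[k] * K[j][k];
            s[j - j0] = dot * scale;
            m_tile = std::max(m_tile, s[j - j0]);
        }

        // Online-softmax update: fold the tile into the running statistics.
        const float m_new = std::max(m, m_tile);
        const float correction = std::exp(m - m_new);    // rescales old l and o
        l *= correction;
        for (int k = 0; k < d; ++k) o[k] *= correction;

        // Second GEMM of the tile: accumulate exp(s_j - m_new) * v_j.
        for (int j = j0; j < j1; ++j) {
            const float p = std::exp(s[j - j0] - m_new);
            l += p;
            for (int k = 0; k < d; ++k) o[k] += p * V[j][k];
        }
        m = m_new;
    }

    // Deferred normalization: divide by the softmax denominator once.
    for (int k = 0; k < d; ++k) o[k] /= l;
}

int main() {
    const int d = 4, N = 8, Bk = 3;                      // tiny, illustrative sizes
    std::vector<float> q(d, 0.5f), o;
    std::vector<std::vector<float>> K(N, std::vector<float>(d)),
                                    V(N, std::vector<float>(d));
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < d; ++k) { K[j][k] = 0.1f * (j + k); V[j][k] = (float)j; }
    attention_row_online(q, K, V, o, Bk);
    for (float x : o) std::printf("%f ", x);             // equals softmax(qK^T/sqrt(d)) V
    std::printf("\n");
}
```

Roughly, each tile iteration of this loop corresponds to the two back-to-back GEMMs the abstract mentions, with the correction factor applied to the output accumulator held in registers; the scalar version only illustrates the numerics, not the TMA/WGMMA scheduling.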
- Developing CUDA Kernels for Accelerated Matrix Multiplication on NVIDIA Hopper Architecture using the CUTLASS Library. Colfax Research. 2023. https://research.colfax-intl.com/nvidia-hopper-gemm-cutlass/.
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. Tri Dao. July 17, 2023. https://arxiv.org/abs/2307.08691.
- FlashAttention — Fast and memory-efficient exact attention. https://github.com/Dao-AILab/flash-attention.
- FlashAttention adoption. https://github.com/Dao-AILab/flash-attention/blob/main/usage.md.
- Setting New Records at Data Center Scale Using NVIDIA H100 GPUs and NVIDIA Quantum-2 InfiniBand. Ashraf Eassa and Sukru Burc Eryilmaz. November 8, 2023. https://developer.nvidia.com/blog/setting-new-records-at-data-center-scale-using-nvidia-h100-gpus-and-quantum-2-infiniband/.
- CUTLASS — CUDA Templates for Linear Algebra Subroutines. https://github.com/NVIDIA/cutlass.
- CuTe Layouts. https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/01_layout.md.
- CuTe Tensors. https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/03_tensor.md.
- CuTe’s support for Matrix Multiply-Accumulate instructions. https://github.com/NVIDIA/cutlass/blob/main/media/docs/cute/0t_mma_atom.md.
- Efficient GEMM in CUDA. https://github.com/NVIDIA/cutlass/blob/main/media/docs/efficient_gemm.md.
- NVIDIA H100 Tensor Core GPU Datasheet. https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet.
- NVIDIA Hopper Tuning Guide. https://docs.nvidia.com/cuda/hopper-tuning-guide/index.html.
- Parallel Thread Execution ISA Version 8.2. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html.
- Online normalizer calculation for softmax. Maxim Milakov and Natalia Gimelshein. July 28, 2018. https://arxiv.org/abs/1805.02867.
- TensorRT-LLM 0.5.0. https://github.com/NVIDIA/TensorRT-LLM/tree/release/0.5.0.
- Using Shared Memory in CUDA C/C++. Mark Harris. January 28, 2013. https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/.
- Faster Parallel Reductions on Kepler. Justin Luitjens. February 13, 2014. https://developer.nvidia.com/blog/faster-parallel-reductions-kepler/.
- How Nvidia’s CUDA Monopoly In Machine Learning Is Breaking - OpenAI Triton And PyTorch 2.0. Dylan Patel. January 16, 2023. https://www.semianalysis.com/p/nvidiaopenaitritonpytorch.