LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

Published 26 Jun 2024 in cs.DC | (2406.18485v1)

Abstract: Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mechanism, which combines both head-parallel and context-parallel techniques to break the scalability constraints while maintaining efficiency. We introduce Double-Ring-Attention and analyze the performance of device placement strategies to further speed up training. We implement LoongTrain with the hybrid ZeRO and Selective Checkpoint++ techniques. Experiment results show that LoongTrain outperforms state-of-the-art baselines, i.e., DeepSpeed-Ulysses and Megatron Context Parallelism, in both end-to-end training speed and scalability, and improves Model FLOPs Utilization (MFU) by up to 2.88x.


Authors (14)



Summary

The paper introduces LoongTrain, a system built around the 2D-Attention mechanism, which combines head and context parallelism to scale long-sequence LLM training efficiently on GPU clusters.
To improve communication efficiency, LoongTrain employs Double-Ring-Attention, which partitions the GPUs of each context parallel group into multiple inner rings that communicate concurrently.
Evaluation results show that LoongTrain outperforms existing sequence parallelism methods, achieving up to 2.88x higher MFU when training 7B models with long sequences.

The paper introduces LoongTrain, a system designed for efficient training of LLMs with long sequences on large-scale Graphics Processing Unit (GPU) clusters. It addresses limitations of the two existing sequence parallelism approaches: head parallelism, which has limited scalability, and context parallelism, which has poor communication efficiency. LoongTrain's core innovation is the 2D-Attention mechanism, which combines head-parallel and context-parallel techniques to overcome scalability constraints while maintaining efficiency.

The limitations of existing sequence parallelism approaches are:

Head parallelism's scalability is inherently limited by the number of attention heads.
Context parallelism suffers from communication inefficiencies due to peer-to-peer communication, leading to low intra-node and inter-node bandwidth utilization.

LoongTrain proposes a hybrid approach to overcome these limitations.

2D-Attention Mechanism

The 2D-Attention mechanism parallelizes attention across both the head and context dimensions. It distributes the query (Q), key (K), and value (V) tensors across GPUs along the head dimension and partitions them into chunks along the context dimension. The $d_{sp}$ GPUs of a sequence parallel group are organized into a $d_{hp} \times d_{cp}$ grid, where:

$d_{sp} = d_{hp} \times d_{cp}$

$d_{hp}$ is the head parallel size
$d_{cp}$ is the context parallel size

In multi-head attention, the input tensors $Q$, $K$, and $V$ are divided along the sequence dimension, where each segment is shaped as $(H, S/d_{sp}, D/H)$.

$H$ is the number of attention heads
$S$ is the sequence length
$D$ is the hidden dimension size

The 2D-Attention computation involves three steps:

A SeqAlltoAll communication operation distributes the $Q$, $K$, and $V$ tensors based on the head dimension across $d_{hp}$ GPUs and re-partitions them along the sequence dimension across $d_{cp}$ GPUs.
Each context parallel group independently performs Double-Ring-Attention, resulting in an output tensor of shape $(H/d_{hp}, S/d_{cp}, D/H)$.
Another SeqAlltoAll operation consolidates the attention outputs across the head dimension and re-partitions the sequence dimension, transforming the output tensor to $(H, S/d_{sp}, D/H)$.
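The per-GPU shapes implied by these three steps can be sketched as follows. This is a minimal illustration that assumes the input segment shape $(H, S/d_{sp}, D/H)$ stated above and that each SeqAlltoAll trades a split over heads for a split over the sequence; the function name `attention_shapes` and the dictionary keys are hypothetical, not taken from the paper.

```python
def attention_shapes(H, S, D, d_hp, d_cp):
    """Per-GPU (heads, tokens, head_dim) shapes at each stage of 2D-Attention.

    Hypothetical helper for illustration; it only does the shape arithmetic.
    """
    d_sp = d_hp * d_cp
    if H % d_hp or S % d_sp or D % H:
        raise ValueError("need H divisible by d_hp, S by d_sp, and D by H")
    head_dim = D // H
    local = (H, S // d_sp, head_dim)               # all heads, 1/d_sp of the sequence
    regrouped = (H // d_hp, S // d_cp, head_dim)   # 1/d_hp of the heads, 1/d_cp of the sequence
    return {
        "input": local,
        "after_first_alltoall": regrouped,
        "ring_attention_output": regrouped,        # attention keeps the local shape
        "after_second_alltoall": local,
    }
```

For example, with $H = 32$, $S = 131072$, $D = 4096$ on an $8 \times 8$ grid, each GPU starts with a `(32, 2048, 128)` segment and holds `(4, 16384, 128)` during attention. The number of elements per GPU is unchanged; only the axis being split changes.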

To address the constraint of limited KV heads in Grouped Query Attention (GQA), LoongTrain uses KV replication. In the forward pass, the input KV tensors are shaped as $(H_{kv}, S/d_{sp}, D/H)$. To align the number of KV heads with the head-parallel size, 2D-Attention replicates KV tensors, resulting in the shape of $(\hat{H}_{kv}, S/d_{sp}, D/H)$, where $d_{hp} \le \hat{H}_{kv} \le H$.
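One way to picture KV replication is as a mapping from head-parallel ranks to the original KV heads. The sketch below assumes a simple rule that the summary does not spell out: KV heads are left alone when their count is divisible by the head-parallel size, and otherwise each head is repeated just enough times to give every head-parallel rank one head. Both function names are hypothetical.

```python
def replicate_kv_heads(h_kv, d_hp):
    """Return the original KV head index stored in each slot after replication.

    Assumed rule (not from the paper): replicate only when h_kv is not
    divisible by d_hp, and then by the factor d_hp // h_kv.
    """
    if h_kv % d_hp == 0:
        factor = 1                      # enough KV heads already, no replication
    elif d_hp % h_kv == 0:
        factor = d_hp // h_kv           # repeat each head to reach d_hp heads
    else:
        raise ValueError("this sketch needs h_kv and d_hp to divide one another")
    return [head for head in range(h_kv) for _ in range(factor)]


def kv_heads_per_rank(h_kv, d_hp):
    """Split the replicated KV heads evenly across the d_hp head-parallel ranks."""
    slots = replicate_kv_heads(h_kv, d_hp)
    per_rank = len(slots) // d_hp
    return [slots[r * per_rank:(r + 1) * per_rank] for r in range(d_hp)]
```

Under this assumption, 8 KV heads with a head-parallel size of 16 are replicated twice, so ranks 0 and 1 both hold KV head 0, ranks 2 and 3 hold KV head 1, and so on. With a head-parallel size of 4, no replication is needed and each rank holds two distinct KV heads.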

Double-Ring-Attention

To fully utilize available Network Interface Cards (NICs) for inter-node communication, the paper proposes Double-Ring-Attention, which partitions the $d_{cp}$ GPUs into multiple inner rings. The GPUs within each context parallel group form several inner rings, while the inner rings collectively form an outer ring. Assuming each inner ring consists of $w$ GPUs, a context parallel process group would have $d_{cp}/w$ concurrent inner rings.
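The ring layout can be sketched by slicing one context parallel group into inner rings of a fixed size. The helper names below are hypothetical, and the outer-ring pairing (the k-th member of one inner ring exchanges data with the k-th member of the next) is an assumption made for illustration rather than a detail confirmed by the summary.

```python
def inner_rings(cp_ranks, ring_size):
    """Split one context parallel group into consecutive inner rings."""
    if len(cp_ranks) % ring_size:
        raise ValueError("context parallel size must be divisible by the inner ring size")
    return [cp_ranks[i:i + ring_size] for i in range(0, len(cp_ranks), ring_size)]


def outer_rings(rings):
    """Assumed pairing: members at the same position of each inner ring form an outer ring."""
    return [list(members) for members in zip(*rings)]


def ring_neighbors(ring, rank):
    """Return (previous, next) peers of `rank` inside a ring."""
    i = ring.index(rank)
    return ring[i - 1], ring[(i + 1) % len(ring)]
```

For a context parallel group of 8 GPUs and an inner ring size of 4, `inner_rings(list(range(8)), 4)` yields two inner rings, `[0, 1, 2, 3]` and `[4, 5, 6, 7]`, matching the count of context parallel size divided by inner ring size.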

Device Placement Strategies

The paper discusses two device allocation strategies: head-first placement and context-first placement. Head-first placement prioritizes collocating GPUs of the same head parallel group on the same node, leveraging NVLink for SeqAlltoAll operations. Context-first placement prioritizes collocating GPUs of the same context parallel group on the same node, reducing inter-node traffic during Double-Ring-Attention.
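The difference between the two placements can be made concrete by mapping grid coordinates to global ranks and counting how many nodes each group touches. This is a sketch under two assumptions: nodes are filled with consecutive global ranks, and each placement simply makes its preferred group contiguous in rank order. The function names are hypothetical.

```python
def build_groups(d_hp, d_cp, placement):
    """Return (head_parallel_groups, context_parallel_groups) as lists of global ranks."""
    def rank(hp, cp):
        if placement == "head-first":
            return cp * d_hp + hp       # a head parallel group occupies consecutive ranks
        if placement == "context-first":
            return hp * d_cp + cp       # a context parallel group occupies consecutive ranks
        raise ValueError(f"unknown placement: {placement!r}")

    hp_groups = [[rank(hp, cp) for hp in range(d_hp)] for cp in range(d_cp)]
    cp_groups = [[rank(hp, cp) for cp in range(d_cp)] for hp in range(d_hp)]
    return hp_groups, cp_groups


def max_nodes_spanned(groups, gpus_per_node):
    """Largest number of distinct nodes touched by any single group."""
    return max(len({r // gpus_per_node for r in group}) for group in groups)
```

With an 8 by 8 grid on nodes of 8 GPUs, head-first placement keeps every head parallel group on one node while each context parallel group spans all 8 nodes; context-first placement reverses this. That trade-off is what decides whether SeqAlltoAll or Double-Ring-Attention traffic stays on NVLink.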

Performance Analysis

The paper provides a performance analysis of 2D-Attention, including scalability, computation, peer-to-peer communication, SeqAlltoAll communication, and memory usage. The analysis considers factors such as sequence length, head and context parallelism degrees, inner ring size, Multi-Head Attention (MHA) vs. GQA, and device placement strategies. The goal is to minimize the communication time that cannot be overlapped with computation, which is formulated as:

$\min \; T_{\text{SeqAlltoAll}} + \left(T^{\text{fwd}}_{\text{inner\_ring}} + T^{\text{bwd}}_{\text{inner\_ring}}\right) \times \frac{d_{cp}}{w}$

$T_{\text{SeqAlltoAll}}$ represents the SeqAlltoAll communication time
$T^{\text{fwd}}_{\text{inner\_ring}}$ and $T^{\text{bwd}}_{\text{inner\_ring}}$ represent the forward and backward execution time per inner ring
$d_{cp}$ is the context parallel size
$w$ is the inner ring size
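A small cost model shows how such an objective could be used to choose the inner ring size. It assumes the objective is the SeqAlltoAll time plus the per-inner-ring forward and backward time multiplied by the number of inner rings (context parallel size divided by inner ring size). The function names and the timing dictionary format are hypothetical; real timings would come from profiling.

```python
def non_overlapped_time(t_alltoall, t_fwd_inner, t_bwd_inner, d_cp, w):
    """Assumed objective: SeqAlltoAll time plus per-ring time times the number of inner rings."""
    if d_cp % w:
        raise ValueError("d_cp must be divisible by the inner ring size w")
    return t_alltoall + (t_fwd_inner + t_bwd_inner) * (d_cp // w)


def best_inner_ring_size(t_alltoall, d_cp, timings):
    """Pick the inner ring size with the lowest modelled time.

    `timings` maps each candidate w to a (forward, backward) per-inner-ring
    time pair, since the per-ring time itself depends on w.
    """
    return min(
        timings,
        key=lambda w: non_overlapped_time(t_alltoall, *timings[w], d_cp, w),
    )
```

Because a larger inner ring takes longer per ring but leaves fewer rings to traverse, the minimum usually lies at an intermediate size, which is why the per-ring timings are passed in per candidate rather than as a single constant.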

End-to-End System Implementation

The paper describes the end-to-end system implementation of LoongTrain, which relies on two techniques: hybrid Zero Redundancy Optimizer (ZeRO) and Selective Checkpoint++. The hybrid ZeRO approach shards model states across both the Data Parallelism (DP) and sequence parallelism dimensions, reducing redundant memory usage. Selective Checkpoint++ adds attention modules to a whitelist. During the forward pass, the modified checkpoint function saves the outputs of these modules. During the backward pass, the checkpoint function retrieves the stored outputs instead of recomputing them and continues building the computation graph.
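The whitelist idea can be illustrated with a toy checkpoint wrapper in plain Python. This is a conceptual sketch only: it models modules as ordinary functions and counts how often each one runs, whereas a real implementation operates on a framework's autograd graph. The class and method names are hypothetical.

```python
class SelectiveCheckpoint:
    """Toy checkpointed block: whitelisted module outputs are saved, others recomputed."""

    def __init__(self, modules, whitelist):
        self.modules = modules                    # ordered list of (name, function)
        self.whitelist = set(whitelist)
        self.saved = {}                           # outputs kept from the forward pass
        self.calls = {name: 0 for name, _ in modules}

    def _run(self, name, fn, x):
        self.calls[name] += 1
        return fn(x)

    def forward(self, x):
        self.saved.clear()
        for name, fn in self.modules:
            x = self._run(name, fn, x)
            if name in self.whitelist:
                self.saved[name] = x              # keep only whitelisted outputs
        return x

    def recompute(self, x):
        """Replay the block for the backward pass, reusing saved outputs."""
        for name, fn in self.modules:
            if name in self.saved:
                x = self.saved[name]              # skip the expensive module
            else:
                x = self._run(name, fn, x)
        return x
```

With attention on the whitelist, the replay during the backward pass reruns the cheap modules but never reruns attention, trading the memory of one stored output for the cost of recomputing the most expensive part of the block.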

Evaluation Results

The paper presents experimental results comparing LoongTrain with DeepSpeed-Ulysses and Megatron Context Parallelism. LoongTrain outperforms both baselines in end-to-end training speed and scalability, improving Model FLOPs Utilization (MFU) by up to 2.88x. The evaluation covers training 7B-MHA and 7B-GQA models on 64 GPUs with various sequence lengths and configurations. The results highlight the benefits of 2D-Attention, Double-Ring-Attention, and Selective Checkpoint++.
