Abstract

Recent advances in LLMs have brought immense value to the world, with their superior capabilities stemming from the massive number of parameters they utilize. However, even the GPUs with the highest memory capacity, currently peaking at 80 GB, are far from sufficient to accommodate these vast parameters and their associated optimizer states during stochastic gradient descent-based optimization. One approach to hosting such huge models is to aggregate device memory from many GPUs, but this introduces prohibitive costs for most academic researchers, who typically cannot afford many high-end GPU servers. In this paper, we focus on fine-tuning huge models on a single, even low-end, GPU in a commodity server, which is accessible to most AI researchers. In such a scenario, the state-of-the-art work ZeRO-Infinity suffers from two severe issues: 1) low GPU utilization due to inefficient swapping, and 2) limited trainable model size due to CPU memory capacity. The underlying reason is that ZeRO-Infinity is optimized for running on high-end GPU servers. To this end, we present Fuyou, a low-cost training framework that enables efficient fine-tuning of 100B-scale models on a low-end server with a low-end GPU and limited CPU memory. The key idea is to add SSD-CPU communication as an optimization dimension and to carefully co-optimize computation and data swapping in a systematic way to maximize GPU utilization. The experimental results show that 1) Fuyou is able to fine-tune GPT-3 175B on a consumer RTX 4090 GPU with high GPU utilization, while ZeRO-Infinity fails to fine-tune it; and 2) when training the smaller GPT-3 13B model, Fuyou achieves 156 TFLOPS on an RTX 4090 GPU while ZeRO-Infinity achieves only 45 TFLOPS.

Figure: GPU throughput for fine-tuning GPT-3 models on an A100-80GB GPU, as reported in the Fuyou paper.

Overview

  • Introduces Fuyou, a low-cost training framework enabling efficient fine-tuning of 100-billion-parameter models on commodity GPUs.

  • Co-optimizes GPU computation with CPU-SSD communication to maximize training efficiency without affecting model convergence.

  • Employs a GPU-CPU-SSD pipeline and automatic activation swapping to manage the memory hierarchy, enabling the training of larger models.

  • Fuyou achieves significant performance improvements over existing methods, demonstrating its potential to democratize LLM fine-tuning.

Accelerating Large Language Model Fine-tuning on Commodity Hardware with Fuyou

Introduction

Training and fine-tuning LLMs such as GPT-3 have historically been resource-intensive tasks reserved for high-end GPU servers, making them an elusive endeavor for most academic researchers with limited budgets. This paper introduces Fuyou, a novel low-cost training framework that makes it feasible to efficiently fine-tune 100-billion-parameter models on a single, even low-end, GPU in a commodity server. The paper lays out the challenges in existing fine-tuning methods, focusing in particular on the issues ZeRO-Infinity faces in a commodity-server environment. By taking a systematic approach that maximizes GPU utilization through the use of SSD-CPU communication as an additional optimization dimension, Fuyou aims to democratize the fine-tuning of enormous models.

Key Innovations

The study presents three significant contributions to the field:

  • Synchronous Out-of-core CPU Optimizer Overlapped with Backward Propagation: This innovation significantly increases GPU utilization by ensuring that the GPU is not left idle while optimizer states, offloaded to the CPU and SSDs, are updated. The approach taps into otherwise idle CPU resources during GPU computation, maximizing training efficiency without compromising the model's convergence rate (a minimal sketch of the overlap appears after this list).
  • GPU-CPU-SSD Fully-Pipelined Activation Swapping Mechanism: Addressing the limits imposed by GPU and CPU memory capacity, this mechanism extends the memory hierarchy to NVMe SSDs, allowing substantially larger models to be fine-tuned. The full pipeline swaps activations between GPU memory, CPU memory, and NVMe SSDs, so that trainable model size is no longer bounded by GPU and CPU memory alone (see the second sketch after this list).
  • Automatic Activation Swapping Management: Recognizing that not all activations need to be swapped, and that the volume of swapped data significantly affects iteration time, Fuyou introduces an automated mechanism to determine the optimal volume of activations to swap to SSD. This method balances epoch time against GPU, CPU, and SSD bandwidth utilization, further refining the efficiency of training (see the third sketch after this list).
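The paper describes the optimizer overlap at the systems level rather than in code. Below is a minimal, PyTorch-flavored sketch of the idea under stated assumptions: a per-parameter Adam state held in pinned CPU memory (Fuyou would further spill it to SSD), a single worker thread standing in for the out-of-core CPU optimizer engine, and a backward hook that launches each parameter's update as soon as its gradient is produced. All names are illustrative and not Fuyou's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

import torch

# A single worker thread stands in for Fuyou's out-of-core CPU optimizer engine.
cpu_workers = ThreadPoolExecutor(max_workers=1)


def make_cpu_state(param):
    # Master weights and Adam state live in pinned host memory;
    # Fuyou additionally spills this state to NVMe SSDs.
    shape, dtype = param.shape, param.dtype
    return {
        "param":   param.detach().to("cpu").pin_memory(),
        "grad":    torch.zeros(shape, dtype=dtype, pin_memory=True),
        "exp_avg": torch.zeros(shape, dtype=dtype, pin_memory=True),
        "exp_sq":  torch.zeros(shape, dtype=dtype, pin_memory=True),
        "step":    0,
    }


def cpu_adam_step(state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    # A plain Adam update executed on the CPU while the GPU keeps back-propagating.
    state["step"] += 1
    t = state["step"]
    g, m, v, p = state["grad"], state["exp_avg"], state["exp_sq"], state["param"]
    m.mul_(beta1).add_(g, alpha=1 - beta1)
    v.mul_(beta2).addcmul_(g, g, value=1 - beta2)
    denom = (v / (1 - beta2 ** t)).sqrt_().add_(eps)
    p.addcdiv_(m, denom, value=-lr / (1 - beta1 ** t))


def attach_overlapped_optimizer(model):
    for param in model.parameters():
        state = make_cpu_state(param)

        def hook(grad, state=state):
            # The gradient for this parameter is ready: copy it to pinned host
            # memory and hand the Adam step to the CPU worker, while the GPU
            # continues back-propagating through earlier layers.
            state["grad"].copy_(grad, non_blocking=True)
            copied = torch.cuda.Event()
            copied.record()

            def update():
                copied.synchronize()  # wait for the D2H copy inside the worker only
                cpu_adam_step(state)

            cpu_workers.submit(update)
            return grad

        param.register_hook(hook)
        # Note: the updated CPU master weights must be streamed back to the GPU
        # before the next forward pass; that transfer is omitted in this sketch.
```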
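In the same spirit, the sketch below spills saved activations GPU → pinned CPU → NVMe file and reloads them on demand during backward, using PyTorch's `saved_tensors_hooks`. The `spill_dir` path, the synchronous file I/O, and the class name are assumptions for illustration; Fuyou's actual engine fully pipelines the device-to-host copy and the SSD write with GPU compute.

```python
import os
import uuid

import torch
from torch.autograd.graph import saved_tensors_hooks


class SsdActivationSwapper(saved_tensors_hooks):
    """Spill activations saved for backward to an NVMe-backed directory,
    staging them through pinned CPU memory, and reload them on demand."""

    def __init__(self, spill_dir="/mnt/nvme/activations"):
        os.makedirs(spill_dir, exist_ok=True)

        def pack(tensor):
            if not tensor.is_cuda:  # only spill GPU-resident activations
                return tensor
            # Stage the activation in pinned host memory, then persist it to SSD.
            host = torch.empty(tensor.shape, dtype=tensor.dtype, pin_memory=True)
            host.copy_(tensor, non_blocking=True)
            torch.cuda.current_stream().synchronize()  # ensure the copy has landed
            path = os.path.join(spill_dir, f"{uuid.uuid4().hex}.pt")
            torch.save(host, path)
            return (path, tensor.device)

        def unpack(packed):
            if isinstance(packed, torch.Tensor):
                return packed
            # Backward pass: read the activation back from SSD and move it to the GPU.
            path, device = packed
            host = torch.load(path)
            os.remove(path)
            return host.to(device, non_blocking=True)

        super().__init__(pack, unpack)


# Hypothetical usage: activations saved during this forward pass are spilled to
# SSD and fetched back lazily during backward.
#   with SsdActivationSwapper("/mnt/nvme/activations"):
#       loss = model(batch).mean()
#   loss.backward()
```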
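The automatic swap manager is described as a cost-model decision; the toy planner below only illustrates the shape of that trade-off. All inputs (per-iteration compute time, free GPU memory, sustained swap bandwidth) are hypothetical quantities that the real system profiles at runtime, and the real policy also splits traffic between CPU memory and SSD.

```python
from dataclasses import dataclass


@dataclass
class SwapPlan:
    keep_on_gpu: float  # bytes of activations kept resident on the GPU
    swap_out: float     # bytes staged out to CPU/SSD each iteration
    hidden: bool        # True if the swap traffic fits under one iteration of compute


def plan_activation_swap(act_bytes, free_gpu_bytes, iter_compute_s, swap_bw_gb_per_s):
    """Swap only what cannot stay on the GPU, and report whether that traffic
    can be hidden behind GPU compute (i.e., whether swapping adds extra time)."""
    swap_out = max(0.0, act_bytes - free_gpu_bytes)
    hideable = iter_compute_s * swap_bw_gb_per_s * 1e9  # bytes movable "for free"
    return SwapPlan(keep_on_gpu=act_bytes - swap_out,
                    swap_out=swap_out,
                    hidden=swap_out <= hideable)


# Hypothetical numbers: 60 GB of activations, 10 GB free on the GPU,
# 2 s of compute per iteration, 6 GB/s of sustained swap bandwidth.
print(plan_activation_swap(60e9, 10e9, 2.0, 6.0))
```

With these made-up numbers the planner reports that 50 GB must be swapped but only about 12 GB can be hidden behind compute, i.e., swapping would lengthen the iteration; Fuyou's manager instead picks a swap volume that keeps GPU, CPU, and SSD bandwidth utilization in balance.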

Experimental Results and Implications

Fuyou's effectiveness is underscored by its ability to fine-tune models such as GPT-3 175B on consumer-grade hardware like the RTX 4090 GPU with high GPU utilization, outperforming existing solutions such as ZeRO-Infinity, which fails to fine-tune at this scale. Notably, when training the smaller GPT-3 13B model, Fuyou achieved up to 156 TFLOPS on an RTX 4090 GPU, significantly outpacing ZeRO-Infinity's 45 TFLOPS.

These results underline the potential of Fuyou to bring the fine-tuning of LLMs within reach of a broader segment of the AI research community, including those limited by budget constraints. By leveraging affordable, widely available hardware to achieve previously unattainable performance levels, Fuyou paves the way for more inclusive research endeavors and innovation in LLM applications.

Conclusion and Future Directions

The introduction of Fuyou represents a significant leap forward in the ability to fine-tune large-scale models efficiently on commodity hardware. Looking ahead, there are numerous avenues for future research and development, including further optimization of the data-swapping mechanism, exploration of Fuyou's scalability across different types of commodity hardware, and application of its approach to domains beyond LLMs. By continuing to lower the barriers to entry for training sophisticated AI models, we can expect accelerated innovation and broader participation in AI research and development.
