ScaleFold: Reducing AlphaFold Initial Training Time to 10 Hours (2404.11068v1)

Published 17 Apr 2024 in cs.LG, cs.AI, cs.DC, and q-bio.QM

Abstract: AlphaFold2 has been hailed as a breakthrough in protein folding. It can rapidly predict protein structures with lab-grade accuracy. However, its implementation does not include the necessary training code. OpenFold is the first trainable public reimplementation of AlphaFold. The AlphaFold training procedure is prohibitively time-consuming and gets diminishing benefits from scaling to more compute resources. In this work, we conducted a comprehensive analysis of the AlphaFold training procedure based on OpenFold and identified that inefficient communications and overhead-dominated computations were the key factors that prevented the AlphaFold training from scaling effectively. We introduced ScaleFold, a systematic training method that incorporates optimizations specifically for these factors. ScaleFold successfully scaled the AlphaFold training to 2080 NVIDIA H100 GPUs with high resource utilization. In the MLPerf HPC v3.0 benchmark, ScaleFold finished the OpenFold benchmark in 7.51 minutes, over a $6\times$ speedup compared with the baseline. For training the AlphaFold model from scratch, ScaleFold completed the pretraining in 10 hours, a significant improvement over the seven days required by the original AlphaFold pretraining baseline.

References (11)
  1. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv (2022), 2022–11.
  2. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 6557 (2021), 871–876.
  3. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016).
  4. FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours. (2022).
  5. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
  6. MLPerf™ HPC: A holistic benchmark suite for scientific machine learning on HPC systems. In 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC). IEEE, 33–45.
  7. Highly accurate protein structure prediction with AlphaFold. Nature 596, 7873 (2021), 583–589.
  8. Gradient coding: Avoiding stragglers in distributed learning. In International Conference on Machine Learning. PMLR, 3368–3376.
  9. DeepSpeed team and OpenFold team. 2023. DS4Sci_EvoformerAttention: eliminating memory explosion problems for scaling Evoformer-centric structural biology models. https://deepspeed4science.ai/2023/09/18/model-showcase-openfold/
  10. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 10–19.
  11. Attention is all you need. Advances in neural information processing systems 30 (2017).
Authors (9)
  1. Feiwen Zhu (5 papers)
  2. Arkadiusz Nowaczynski (1 paper)
  3. Rundong Li (11 papers)
  4. Jie Xin (3 papers)
  5. Yifei Song (8 papers)
  6. Michal Marcinkiewicz (6 papers)
  7. Sukru Burc Eryilmaz (5 papers)
  8. Jun Yang (357 papers)
  9. Michael Andersch (5 papers)
Citations (3)

Summary

  • The paper introduces ScaleFold, a method that reduces AlphaFold pretraining from seven days to just 10 hours by enhancing communication efficiency and reducing computation overheads.
  • It employs a non-blocking data pipeline, CUDA Graphs, and custom Triton-based kernel optimizations to achieve over a 6x speedup on distributed NVIDIA H100 GPU systems.
  • The findings promise accelerated protein structure prediction, paving the way for rapid advancements in biological research and drug discovery.

Enhancing AlphaFold Training with ScaleFold: Acceleration and Scalability on NVIDIA H100 GPUs

Introduction

This work introduces ScaleFold, a systematic method for training the AlphaFold model that significantly reduces initial training time while scaling effectively to larger amounts of compute. AlphaFold, known for its methodological advance in protein structure prediction, has traditionally suffered from long training times and poor returns from scaling to more computational resources. ScaleFold addresses these challenges with a set of targeted optimizations that markedly improve the existing training procedure.

Core Challenges and ScaleFold Solutions

Identified Challenges

Through detailed profiling, the paper identifies two predominant barriers to efficient AlphaFold training: communication inefficiencies and computation overheads. Both become dominant in multi-GPU distributed training and obstruct effective resource scaling. Specifically, the analysis highlights communication stalls caused by blocking data pipelines and intermittent CPU-side slowdowns, alongside computation dominated by the overhead of launching many small GPU kernels.
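
To make this kind of diagnosis concrete, a short profiling pass of the sort sketched below can surface both symptoms. This is a generic recipe, not the paper's own methodology; the module, tensor shapes, and step count are placeholders.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical stand-in for one transformer-style block; the model, shapes,
# and step count are placeholders -- the point is the profiling recipe.
block = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.GELU()).cuda()
x = torch.randn(128, 256, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):
        y = block(x)
    torch.cuda.synchronize()

# Ops whose CPU time rivals their CUDA time are launch-overhead dominated;
# long idle gaps on the CUDA timeline usually point at data-pipeline or
# CPU-side stalls rather than GPU compute.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```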

ScaleFold Optimizations

ScaleFold proposes methods that enhance both communication efficiency and computational speed:

  • Non-blocking Data Pipeline: A non-blocking pipeline lets batches that finish preprocessing early proceed to training while slower ones are still being prepared, so uneven batch preparation times no longer stall the whole step.
  • Optimized Computation with CUDA Graphs: Capturing the training step with CUDA Graphs removes per-kernel CPU launch overhead, so CPU-side slowdowns no longer hold up GPU execution (a minimal capture/replay sketch follows this list).
  • Advanced Kernel Optimizations: Custom kernels for multi-head attention and layer normalization, written in the OpenAI Triton language, address inefficiencies in memory utilization and processing speed (a simplified layer-normalization kernel is sketched below).
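
To illustrate the CUDA Graphs idea (this is not the authors' training code), the sketch below follows the standard PyTorch whole-step capture/replay pattern with a small stand-in model; the model, batch shapes, and optimizer are placeholders.

```python
import torch

# Placeholder model/optimizer/shapes standing in for an AlphaFold training
# step; the capture/replay pattern follows the PyTorch CUDA Graphs docs.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 256)
).cuda()
loss_fn = torch.nn.functional.mse_loss
opt = torch.optim.SGD(model.parameters(), lr=0.1)

static_input = torch.randn(64, 256, device="cuda")
static_target = torch.randn(64, 256, device="cuda")

# Warm up on a side stream before capture so allocations and state settle.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        loss_fn(model(static_input), static_target).backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training step; replaying the graph launches all of its
# kernels with a single CPU call, removing per-kernel launch overhead.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = loss_fn(model(static_input), static_target)
    static_loss.backward()
    opt.step()

for _ in range(10):
    # Copy new data into the captured static tensors, then replay the step.
    static_input.copy_(torch.randn(64, 256, device="cuda"))
    static_target.copy_(torch.randn(64, 256, device="cuda"))
    g.replay()
```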

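In the same spirit, the following is a simplified, generic layer-normalization forward kernel in Triton, roughly following the standard Triton tutorial pattern; it is not the paper's optimized fused kernel, and the wrapper, shapes, and block-size choice are illustrative.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _layer_norm_fwd(X, Y, W, B, stride, N, eps, BLOCK_N: tl.constexpr):
    # One program instance normalizes one row of the (M, N) input in a
    # single fused pass, avoiding several small separate kernels.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_N)
    mask = cols < N
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    mean = tl.sum(x, axis=0) / N
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / N
    rstd = 1.0 / tl.sqrt(var + eps)
    w = tl.load(W + cols, mask=mask, other=1.0)
    b = tl.load(B + cols, mask=mask, other=0.0)
    y = (x - mean) * rstd * w + b
    tl.store(Y + row * stride + cols, y, mask=mask)

def layer_norm(x, weight, bias, eps=1e-5):
    # x: (M, N); each row is normalized independently.
    M, N = x.shape
    y = torch.empty_like(x)
    BLOCK_N = triton.next_power_of_2(N)
    _layer_norm_fwd[(M,)](x, y, weight, bias, x.stride(0), N, eps, BLOCK_N=BLOCK_N)
    return y

out = layer_norm(
    torch.randn(1024, 256, device="cuda"),
    torch.ones(256, device="cuda"),
    torch.zeros(256, device="cuda"),
)
```
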
Empirical Evaluation and Results

Performance Benchmarks

ScaleFold was empirically tested on NVIDIA H100 GPUs against existing implementations such as OpenFold and FastFold. The results show a substantial reduction in per-step training time: in the MLPerf HPC v3.0 OpenFold benchmark, ScaleFold finished in 7.51 minutes on 2080 NVIDIA H100 GPUs, over a 6x speedup relative to the baseline.

Training Efficiency

For a comprehensive assessment, ScaleFold was also used to train the AlphaFold model from scratch. It completed pretraining in just 10 hours, a dramatic improvement over the seven days required by the original pretraining baseline. In terms of scalability, ScaleFold trained efficiently across 2080 NVIDIA H100 GPUs, whereas prior approaches struggled to scale beyond 512 GPUs.

Theoretical and Practical Implications

Theoretical Insights

The paper offers significant insights into the challenges of scaling deep learning models in high-performance computing environments, especially for complex tasks like protein folding prediction. It uncovers the disproportionate impact of inefficient communications and computational overheads on scaling efficiency.

Practical Relevance

Practically, ScaleFold paves the way for more rapid advancements in protein structure prediction and other similar biocomputational tasks, potentially accelerating drug discovery and other biological research requiring protein structure analysis.

Future Directions

The introduction of ScaleFold invites future studies to explore further optimizations in data handling and algorithmic efficiency for other complex models. Additionally, extending these techniques to other domains of computational biology could catalyze advancements across multiple areas of health and disease research.

Conclusion

ScaleFold emerges as a robust solution that not only enhances the training efficiency of AlphaFold models but also contributes broadly to the computational biology field by enabling rapid, scalable, and efficient computation capabilities. Its development marks a significant step forward in utilizing AI-driven methodologies for scientific discovery in protein folding and beyond.