ScaleFold: Reducing AlphaFold Initial Training Time to 10 Hours (2404.11068v1)
Abstract: AlphaFold2 has been hailed as a breakthrough in protein folding. It can rapidly predict protein structures with lab-grade accuracy. However, its implementation does not include the necessary training code. OpenFold is the first trainable public reimplementation of AlphaFold. The AlphaFold training procedure is prohibitively time-consuming and yields diminishing returns when scaled to more compute resources. In this work, we conducted a comprehensive analysis of the AlphaFold training procedure based on OpenFold and identified inefficient communication and overhead-dominated computations as the key factors preventing AlphaFold training from scaling effectively. We introduced ScaleFold, a systematic training method that incorporates optimizations targeting these factors. ScaleFold successfully scaled AlphaFold training to 2080 NVIDIA H100 GPUs with high resource utilization. In the MLPerf HPC v3.0 benchmark, ScaleFold finished the OpenFold benchmark in 7.51 minutes, a speedup of over $6\times$ over the baseline. For training the AlphaFold model from scratch, ScaleFold completed the pretraining in 10 hours, a significant improvement over the seven days required by the original AlphaFold pretraining baseline.
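The abstract attributes poor scaling to inefficient communication and overhead-dominated computation. As a minimal, hypothetical sketch (not the paper's code), the snippet below shows how such factors can be surfaced for a distributed PyTorch training step with the standard `torch.profiler`: long NCCL all-reduce rows relative to GPU kernel time suggest a communication-bound step, while many tiny kernels separated by gaps suggest launch-overhead-dominated computation. The `ToyEvoformerBlock` module and the launch command are illustrative assumptions, not OpenFold's actual model or scripts.

```python
# Hypothetical profiling sketch for a DDP training step (illustrative only).
# Launch with: torchrun --nproc_per_node=<num_gpus> profile_step.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.profiler import ProfilerActivity, profile


class ToyEvoformerBlock(nn.Module):
    """Stand-in for a small attention block; not OpenFold's Evoformer."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        return x + self.mlp(attn_out)


def main():
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = DDP(ToyEvoformerBlock().to(f"cuda:{rank}"), device_ids=[rank])
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    x = torch.randn(4, 128, 256, device=f"cuda:{rank}")

    # Profile a few steps; inspect the table for ncclAllReduce time vs. compute
    # kernel time, and for large CPU-side launch overhead on small kernels.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(5):
            opt.zero_grad(set_to_none=True)
            loss = model(x).square().mean()
            loss.backward()  # DDP overlaps gradient all-reduce with backward
            opt.step()

    if rank == 0:
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

This is only one way to expose the bottlenecks the paper describes; ScaleFold's own optimizations (for communication and overhead-dominated kernels) are detailed in the full text rather than reproduced here.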
- Gustaf Ahdritz et al. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv (2022).
- Minkyung Baek et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 6557 (2021), 871–876.
- Tianqi Chen et al. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174 (2016).
- Shenggan Cheng et al. FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours. arXiv preprint (2022).
- Tri Dao et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems 35 (2022), 16344–16359.
- Steven Farrell et al. MLPerf™ HPC: A holistic benchmark suite for scientific machine learning on HPC systems. In 2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC). IEEE, 33–45.
- John Jumper et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 7873 (2021), 583–589.
- Rashish Tandon et al. Gradient coding: Avoiding stragglers in distributed learning. In International Conference on Machine Learning. PMLR, 3368–3376.
- DeepSpeed team and OpenFold team. 2023. DS4Sci_EvoformerAttention: eliminating memory explosion problems for scaling Evoformer-centric structural biology models. https://deepspeed4science.ai/2023/09/18/model-showcase-openfold/
- Philippe Tillet et al. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages. 10–19.
- Ashish Vaswani et al. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
Authors: Feiwen Zhu, Arkadiusz Nowaczynski, Rundong Li, Jie Xin, Yifei Song, Michal Marcinkiewicz, Sukru Burc Eryilmaz, Jun Yang, Michael Andersch