Asynchronous Local-SGD Training for Language Modeling (2401.09135v2)
Abstract: Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization in which each device performs more than one SGD update per communication round. This work presents an empirical study of asynchronous Local-SGD for training LLMs; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation of how worker hardware heterogeneity, model size, the number of workers, and the choice of optimizer affect learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We propose a novel method that uses a delayed Nesterov momentum update and adjusts each worker's number of local training steps based on its computation speed. Evaluated with models of up to 150M parameters on the C4 dataset, this approach matches the performance of synchronous Local-SGD in terms of perplexity per update step, and significantly surpasses it in terms of wall-clock time.
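Below is a minimal, self-contained sketch (NumPy only, on a toy least-squares task standing in for language-model training) of the two ideas the abstract describes: an asynchronous outer loop in which each worker contributes a pseudo-gradient as soon as it finishes, a delayed Nesterov momentum update on the global parameters, and speed-proportional local step counts. All names (DelayedNesterovServer, local_sgd, the period/lr/beta values) and the exact form of the delayed update are illustrative assumptions for this sketch, not the paper's implementation.

```python
# Sketch: asynchronous Local-SGD with (a) a delayed Nesterov outer update and
# (b) speed-proportional local steps. Assumed form; the paper's exact update may differ.
import heapq
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem in place of language-model training.
X = rng.normal(size=(4096, 32))
w_true = rng.normal(size=32)
y = X @ w_true + 0.1 * rng.normal(size=4096)

def local_sgd(theta_start, steps, lr=0.05, batch=64):
    """Run `steps` SGD updates from `theta_start`; return the pseudo-gradient
    theta_start - theta_end (the negative of the local parameter change)."""
    w = theta_start.copy()
    for _ in range(steps):
        idx = rng.integers(0, X.shape[0], size=batch)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch
        w -= lr * grad
    return theta_start - w

class DelayedNesterovServer:
    """Outer optimizer (assumed form): every incoming pseudo-gradient moves the
    parameters by a momentum-free 1/period share, while the Nesterov momentum
    is refreshed and applied only once per `period` contributions."""
    def __init__(self, theta, lr=0.7, beta=0.9, period=4):
        self.theta, self.lr, self.beta, self.period = theta, lr, beta, period
        self.m = np.zeros_like(theta)
        self.buffer = np.zeros_like(theta)
        self.count = 0

    def step(self, pseudo_grad):
        self.buffer += pseudo_grad
        self.count += 1
        update = pseudo_grad / self.period          # momentum-free partial step
        if self.count == self.period:               # periodic momentum refresh
            self.m = self.beta * self.m + self.buffer / self.period
            update = update + self.beta * self.m    # Nesterov-style lookahead term
            self.buffer[:] = 0.0
            self.count = 0
        self.theta -= self.lr * update

# Heterogeneous workers: faster workers get proportionally more local steps,
# so every worker's round takes roughly the same wall-clock time.
speeds = np.array([1.0, 0.8, 0.5, 0.25])            # local steps per unit time
base_steps = 32
local_steps = np.maximum(1, (base_steps * speeds / speeds.max()).astype(int))

server = DelayedNesterovServer(np.zeros(32), period=len(speeds))
snapshots = {k: server.theta.copy() for k in range(len(speeds))}
events = [(local_steps[k] / speeds[k], k) for k in range(len(speeds))]
heapq.heapify(events)                                # (finish_time, worker) event queue

for _ in range(200):                                 # 200 asynchronous server updates
    t, k = heapq.heappop(events)
    # Worker k trained from the (possibly stale) snapshot it grabbed at dispatch;
    # its pseudo-gradient is applied to the *current* global parameters.
    server.step(local_sgd(snapshots[k], local_steps[k]))
    snapshots[k] = server.theta.copy()               # worker restarts from fresh params
    heapq.heappush(events, (t + local_steps[k] / speeds[k], k))

print("final loss:", float(np.mean((X @ server.theta - y) ** 2)))
```

The event heap simply simulates wall-clock ordering of worker completions, and the per-worker snapshots model gradient staleness: each pseudo-gradient is computed against the parameters the worker started from, not the ones the server currently holds.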