
Asynchronous Local-SGD Training for Language Modeling

(arXiv:2401.09135)
Published Jan 17, 2024 in cs.LG and cs.CL

Abstract

Local stochastic gradient descent (Local-SGD), also referred to as federated averaging, is an approach to distributed optimization where each device performs more than one SGD update per communication. This work presents an empirical study of asynchronous Local-SGD for training language models; that is, each worker updates the global parameters as soon as it has finished its SGD steps. We conduct a comprehensive investigation by examining how worker hardware heterogeneity, model size, number of workers, and optimizer could impact the learning performance. We find that with naive implementations, asynchronous Local-SGD takes more iterations to converge than its synchronous counterpart despite updating the (global) model parameters more frequently. We identify momentum acceleration on the global parameters when worker gradients are stale as a key challenge. We propose a novel method that utilizes a delayed Nesterov momentum update and adjusts the workers' local training steps based on their computation speed. This approach, evaluated with models up to 150M parameters on the C4 dataset, matches the performance of synchronous Local-SGD in terms of perplexity per update step, and significantly surpasses it in terms of wall clock time.

Figure: Comparison of language models trained with synchronous and asynchronous Local-SGD using heterogeneous workers on a large model.

Overview

  • The paper discusses the inefficiencies in traditional synchronous updates for training LLMs across multiple devices.

  • Local-SGD allows for reduced communication bottlenecks by enabling local gradient updates before synchronization, and its asynchronous variant further minimizes idle time.

  • Asynchronous Local-SGD faces challenges with stale gradients and momentum; the paper proposes Delayed Nesterov (DN) momentum update and Dynamic Local Updates (DyLU) to mitigate these issues.

  • Extensive experiments show that DN and DyLU achieve learning effectiveness and time efficiency comparable to, or better than, synchronous Local-SGD.

  • The research highlights the scalability and robustness of the proposed techniques, suggesting improvements in distributed learning.

Introduction to Asynchronous Local-SGD

LLMs have become crucial to advances in machine learning, particularly in natural language processing. Training such models typically involves many devices working in tandem with synchronous updates, which is inefficient: fast devices sit idle waiting for slower ones, and every step pays the cost of communication latency between distributed devices.

Understanding Local-SGD and Its Asynchronous Variant

Local Stochastic Gradient Descent (Local-SGD) mitigates the communication bottleneck in distributed training by allowing each device to perform several gradient steps locally before synchronizing. Asynchronous Local-SGD goes further: each device updates the global model parameters as soon as it completes its local steps, avoiding the idle time of the synchronous method. However, naïve implementations of asynchronous Local-SGD can take more iterations to converge than the synchronous variant, despite updating the global parameters more frequently.
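To make the contrast concrete, the sketch below simulates both schemes on a toy quadratic objective in a single process. It is a minimal illustration under simplifying assumptions, not the paper's training setup: the `local_sgd_steps` helper, the toy loss, and the shuffled arrival order standing in for heterogeneous worker speeds are all hypothetical.

```python
# Minimal sketch: synchronous vs. asynchronous Local-SGD on a toy quadratic
# loss ||theta - TARGET||^2 (illustrative assumption, not the paper's setup).
import numpy as np

rng = np.random.default_rng(0)
TARGET = rng.normal(size=8)          # optimum of the toy loss

def grad(theta):
    """Gradient of the toy loss, with a little noise to mimic SGD."""
    return 2.0 * (theta - TARGET) + 0.1 * rng.normal(size=theta.shape)

def local_sgd_steps(theta_global, num_steps, lr=0.05):
    """Worker: copy the global parameters, take `num_steps` local SGD steps,
    and return the parameter delta (the 'pseudo-gradient')."""
    theta = theta_global.copy()
    for _ in range(num_steps):
        theta -= lr * grad(theta)
    return theta - theta_global

# Synchronous Local-SGD: wait for all workers, then average their deltas.
theta_sync = np.zeros(8)
for _ in range(20):                  # 20 communication rounds
    deltas = [local_sgd_steps(theta_sync, num_steps=4) for _ in range(4)]
    theta_sync += np.mean(deltas, axis=0)

# Asynchronous Local-SGD: apply each worker's delta as soon as it arrives.
theta_async = np.zeros(8)
for _ in range(20):
    # Workers snapshot the parameters when they start ...
    snapshots = [theta_async.copy() for _ in range(4)]
    # ... but deltas are applied one by one as workers finish, so later
    # deltas were computed from parameters that are now stale.
    for worker in rng.permutation(4):
        theta_async += local_sgd_steps(snapshots[worker], num_steps=4)

print("sync  loss:", float(np.sum((theta_sync - TARGET) ** 2)))
print("async loss:", float(np.sum((theta_async - TARGET) ** 2)))
```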

Momentum and Heterogeneity in Asynchronous Training

The study reveals a key issue in asynchronous Local-SGD: stale gradients interact poorly with momentum. A gradient is stale when a worker computes its update from an older version of the model, which is unavoidable under asynchrony. The problem becomes acute with momentum, which accelerates training by combining past gradients with the current one; the study explores the intricacies of this interaction. To address it, the researchers propose two techniques: the Delayed Nesterov (DN) momentum update and Dynamic Local Updates (DyLU), designed to stabilize and improve asynchronous Local-SGD for language models.
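The sketch below illustrates the two remedies as summarized above; it is a simplified reading, not the authors' exact algorithm. The class and function names, the `delay` and momentum hyperparameters, and the worker speed figures are all illustrative assumptions.

```python
# Simplified sketch of the two ideas described above (not the paper's code):
# (1) Delayed Nesterov (DN): each incoming pseudo-gradient is applied with a
#     plain, momentum-free step and accumulated in a buffer; the momentum
#     term is refreshed and applied only once every `delay` contributions,
#     so stale gradients do not repeatedly re-accelerate the momentum.
# (2) Dynamic Local Updates (DyLU): each worker's number of local steps is
#     scaled by its measured speed, so fast and slow devices finish together.
import numpy as np

class DelayedNesterovServer:
    def __init__(self, theta, outer_lr=0.1, beta=0.9, delay=4):
        self.theta = theta
        self.outer_lr = outer_lr
        self.beta = beta                      # momentum coefficient
        self.delay = delay                    # deltas per momentum refresh
        self.momentum = np.zeros_like(theta)
        self.buffer = np.zeros_like(theta)
        self.count = 0

    def apply(self, delta):
        """Apply one worker's pseudo-gradient as soon as it arrives."""
        # Plain (momentum-free) step so the update is not held back ...
        self.theta += self.outer_lr * delta
        # ... while the delta is also buffered for the delayed momentum.
        self.buffer += delta
        self.count += 1
        if self.count % self.delay == 0:
            # Delayed momentum refresh on the averaged buffer: the momentum
            # is updated and applied only once per `delay` contributions.
            self.momentum = self.beta * self.momentum + self.buffer / self.delay
            self.theta += self.outer_lr * self.beta * self.momentum
            self.buffer[:] = 0.0

def dynamic_local_steps(base_steps, worker_speeds):
    """DyLU: assign each worker local steps proportional to its speed
    (steps/sec), so all workers finish a round in roughly the same time."""
    fastest = max(worker_speeds)
    return [max(1, round(base_steps * s / fastest)) for s in worker_speeds]

# Example: speeds of 100, 50, 25 steps/sec with an 80-step budget for the
# fastest device give proportionally fewer steps to the slower devices.
print(dynamic_local_steps(80, [100, 50, 25]))   # -> [80, 40, 20]
```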

Experimenting with Novel Techniques

Extensive experiments demonstrate that DN and DyLU match or even surpass synchronous Local-SGD in learning effectiveness and time efficiency, showing promise for these methods. The experiments also examine how the techniques cope with heterogeneity in device capabilities and with varying numbers of workers and model sizes, indicating their robustness and scalability.

Concluding Thoughts

In conclusion, asynchronous Local-SGD presents an attractive alternative for efficiently training LLMs across distributed systems. The paper contributes to this burgeoning domain by addressing key challenges and proposing viable solutions that have been empirically validated. The research opens doors to further enhancements in distributed learning, aiming for greater scalability and reduced training time without compromising the quality of language models.
