
Don't Use Large Mini-Batches, Use Local SGD (1808.07217v6)

Published 22 Aug 2018 in cs.LG and stat.ML

Abstract: Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose a post-local SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of local SGD variants.

Citations (416)

Summary

  • The paper demonstrates that local SGD overcomes generalization issues of large mini-batches by promoting flatter minima.
  • It introduces a post-local SGD strategy that combines early mini-batch SGD with local updates to boost performance and efficiency.
  • Experiments on CIFAR-10/100 and ImageNet validate local SGD’s scalability and communication efficiency in distributed training.

Overview of "Don't Use Large Mini-Batches, Use Local SGD"

The paper "Don't Use Large Mini-Batches, Use Local SGD" by Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, and Martin Jaggi presents an investigation into the training of deep neural networks using local stochastic gradient descent (SGD) as an alternative to large mini-batch SGD. The motivation for this work arises from the observed deterioration in generalization performance when using very large mini-batches in distributed training environments. The authors propose local SGD as a more effective and communication-efficient approach to distributed training that can achieve comparable results to traditional small mini-batch SGD while also offering improved scalability.

Key Contributions

  1. Challenges with Large Mini-Batches: The paper identifies the generalization problems that arise with large mini-batches. Larger batches improve efficiency and parallelism, but they tend to drive models toward sharper minima, which hurts generalization.
  2. Local SGD as a Solution: Local SGD (also referred to as local-update SGD or federated averaging) is proposed as a solution to these challenges. Instead of synchronizing model updates across all workers after each batch, local SGD allows for multiple local updates before communication. This process effectively injects controlled noise into the optimization process, which can enhance exploration of the solution space and lead to flatter minima.
  3. Post-Local SGD Strategy: A significant contribution is the introduction of post-local SGD, where standard mini-batch SGD is used in the initial training phase, and local SGD is employed thereafter. This hybrid approach marries the fast convergence properties of standard mini-batch SGD with the generalization benefits of local SGD.
  4. Empirical Results: The paper includes extensive experiments on standard benchmarks like CIFAR-10/100 and ImageNet, demonstrating that local SGD not only matches but often surpasses the performance of large mini-batch SGD in terms of test accuracy and communication efficiency. Notably, post-local SGD closes the generalization gap observed in large mini-batch training.
  5. Theoretical Insight and Future Directions: While this paper focuses on empirical results, it provides a foundation for future theoretical work on the convergence properties of local SGD. It speculates that noise in the gradient descent dynamics plays a role in finding flatter minima.
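
The difference between mini-batch SGD, local SGD, and post-local SGD described in the list above can be sketched in a few lines. This is a minimal single-process simulation under stated assumptions, not the authors' implementation: the scalar least-squares problem, the function names (`make_shards`, `grad`, `post_local_sgd`), and all parameter values are hypothetical and chosen only for illustration. Workers are simulated as a list of model replicas, and "communication" is a plain average of those replicas.

```python
import random


def make_shards(num_workers, shard_size, true_w=3.0, seed=0):
    """Toy data (assumption): each worker holds (x, y) pairs with y = true_w * x."""
    rng = random.Random(seed)
    shards = []
    for _ in range(num_workers):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(shard_size)]
        shards.append([(x, true_w * x) for x in xs])
    return shards


def grad(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over a mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def post_local_sgd(shards, steps, switch_step, local_steps,
                   lr=0.5, batch_size=4, seed=1):
    """Simulate post-local SGD on a scalar model.

    Phase 1 (t <= switch_step): replicas are averaged after every step.
    For plain SGD this equals synchronous mini-batch SGD with a global
    batch of len(shards) * batch_size samples.
    Phase 2 (t > switch_step): replicas are averaged only once every
    `local_steps` steps, which is local SGD.
    Returns the final averaged weight and the number of averaging rounds.
    """
    rng = random.Random(seed)
    replicas = [0.0] * len(shards)  # one model copy per simulated worker
    comm_rounds = 0
    for t in range(1, steps + 1):
        # Every worker takes one SGD step on a batch from its own shard.
        for k, shard in enumerate(shards):
            batch = rng.sample(shard, batch_size)
            replicas[k] -= lr * grad(replicas[k], batch)
        # Decide whether this step ends with a synchronization.
        if t <= switch_step:
            sync = True
        else:
            sync = (t - switch_step) % local_steps == 0
        if sync:
            mean_w = sum(replicas) / len(replicas)
            replicas = [mean_w] * len(replicas)
            comm_rounds += 1
    # A final average stands in for the last synchronization.
    return sum(replicas) / len(replicas), comm_rounds


if __name__ == "__main__":
    shards = make_shards(num_workers=4, shard_size=32)
    w, rounds = post_local_sgd(shards, steps=200, switch_step=100, local_steps=4)
    print(f"learned w = {w:.4f}, averaging rounds = {rounds}")
```

In this sketch, setting `switch_step=0` gives local SGD from the start, and setting `switch_step=steps` gives synchronous mini-batch SGD throughout. With 200 steps, a switch at step 100, and 4 local steps, the replicas are averaged 125 times instead of 200, which illustrates the communication saving; the toy problem is too simple to show the generalization effect reported in the paper.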

Implications and Future Directions

The implications of this work are considerable for both the practical and theoretical landscape of distributed deep learning. Practically, the adoption of local SGD can lead to more efficient use of distributed resources without compromising model performance on unseen data. Theoretically, this work prompts further investigation into the dynamics of noise in optimization processes and its effect on generalization, especially in non-convex landscapes. Future research might focus on refining adaptation strategies for the number of local updates and developing learning rate schedules specifically catered to local SGD.

In summary, the authors challenge the conventional inclination towards ever-increasing mini-batch sizes by providing a viable alternative in the form of local SGD. This paper paves the way for broader adoption and further investigation into adaptive, noise-augmented training methods in machine learning.
