Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes (1711.04325v1)

Published 12 Nov 2017 in cs.DC, cs.CV, and cs.LG

Abstract: We demonstrate that training ResNet-50 on ImageNet for 90 epochs can be achieved in 15 minutes with 1024 Tesla P100 GPUs. This was made possible by using a large minibatch size of 32k. To maintain accuracy with this large minibatch size, we employed several techniques such as RMSprop warm-up, batch normalization without moving averages, and a slow-start learning rate schedule. This paper also describes the details of the hardware and software of the system used to achieve the above performance.

Citations (309)


Summary

The paper trains ResNet-50 on ImageNet for 90 epochs in 15 minutes by using a minibatch size of 32,768 across 1024 Tesla P100 GPUs.
It relies on RMSprop warm-up, a slow-start learning rate schedule, and batch normalization without moving averages to maintain accuracy at this minibatch size.
With 1024 GPUs, the study reports scaling efficiencies of 70% relative to a single-GPU baseline and 80% relative to a single-node baseline, along with a top-1 validation accuracy of 74.94%.

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

The paper "Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes" by Akiba, Suzuki, and Fukuda of Preferred Networks, Inc. demonstrates that ResNet-50 can be trained on ImageNet for 90 epochs in 15 minutes using 1024 Tesla P100 GPUs. The result depends on an extremely large minibatch size of 32,768, combined with several training techniques that preserve accuracy and a system design that keeps communication overhead low.

Core Contributions and Methodological Innovations

The authors address the dual challenge of sustaining accuracy with large minibatches and engineering a robust system design. They enhance algorithmic performance through refined training protocols, including:

RMSprop Warm-up: Training begins with RMSprop and then transitions to Stochastic Gradient Descent (SGD), which mitigates the optimization difficulty at the start of training.
Slow-Start Learning Rate Schedule: The learning rate schedule uses a longer initial phase and a lower initial learning rate, which smooths convergence early in training.
Batch Normalization Without Moving Averages: With such a large minibatch size, the moving averages normally kept by batch normalization are found to be inadequate. The authors instead compute these statistics across all workers just before validation.
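The idea behind the RMSprop warm-up can be sketched as a single update rule that blends an RMSprop-style direction with a plain momentum SGD direction and shifts the weight toward SGD as training progresses. This is a minimal illustration, not the authors' exact update rule: the linear ramp in `sgd_weight`, the five-epoch transition length, the hyperparameter values, and all function names are assumptions made for this example.

```python
import numpy as np

def sgd_weight(epoch, transition_epochs=5.0):
    """Hypothetical linear ramp: 0.0 means pure RMSprop, 1.0 means pure SGD."""
    return float(np.clip(epoch / transition_epochs, 0.0, 1.0))

def init_state(theta):
    """Optimizer state: running mean of squared gradients and momentum buffer."""
    return {"ms": np.zeros_like(theta), "v": np.zeros_like(theta)}

def hybrid_step(theta, grad, state, epoch,
                lr=0.1, momentum=0.9, decay=0.99, eps=1e-8):
    """One parameter update that blends RMSprop and momentum SGD directions."""
    w = sgd_weight(epoch)
    # RMSprop part: rescale the gradient by a running RMS of past gradients.
    state["ms"] = decay * state["ms"] + (1.0 - decay) * grad ** 2
    rms_direction = grad / (np.sqrt(state["ms"]) + eps)
    # Blend: early epochs follow RMSprop, later epochs follow the raw gradient.
    direction = w * grad + (1.0 - w) * rms_direction
    # Shared momentum buffer, so the hand-over between the two is continuous.
    state["v"] = momentum * state["v"] - lr * direction
    return theta + state["v"]

# Toy usage on f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([3.0, -2.0])
state = init_state(theta)
theta_early = hybrid_step(theta, theta.copy(), state, epoch=0.0)    # RMSprop-like
theta_late = hybrid_step(theta, theta.copy(), init_state(theta), epoch=10.0)  # SGD
```

Sharing one momentum buffer across both regimes is a design choice of this sketch; it avoids a discontinuity in the update when the blend weight changes.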

On the systems side, the authors rely on a specific combination of software and hardware components:

Software Architecture: The setup uses Chainer with ChainerMN for distributed deep learning, and relies on NCCL and Open MPI for communication. Notably, half-precision floats are used for communication, which reduces overhead without significantly affecting accuracy.
Hardware Configuration: The experiments run on an in-house cluster, MN-1, comprising 128 nodes with eight NVIDIA Tesla P100 GPUs each (1024 GPUs in total), interconnected via Mellanox InfiniBand FDR.
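The effect of communicating in half precision can be illustrated with a small NumPy simulation of a gradient all-reduce. This sketch does not use ChainerMN, NCCL, or MPI; the function name `allreduce_mean_fp16`, the eight simulated workers, and the synthetic gradient values are all assumptions chosen for illustration.

```python
import numpy as np

def allreduce_mean_fp16(worker_grads):
    """Simulate averaging gradients across workers with a float16 payload."""
    # Each worker casts its float32 gradient to float16 before "sending" it.
    payloads = [g.astype(np.float16) for g in worker_grads]
    # For simplicity the sum is accumulated in float32 here; a real
    # all-reduce may accumulate in a different precision.
    total = np.sum([p.astype(np.float32) for p in payloads], axis=0)
    return total / len(worker_grads)

rng = np.random.default_rng(0)
# Eight simulated workers, each holding a small synthetic float32 gradient.
grads = [(rng.standard_normal(1000) * 1e-2).astype(np.float32) for _ in range(8)]

exact = np.mean(grads, axis=0)          # reference average in float32
approx = allreduce_mean_fp16(grads)     # average after float16 transport
max_abs_err = float(np.max(np.abs(exact - approx)))

bytes_fp32 = grads[0].nbytes
bytes_fp16 = grads[0].astype(np.float16).nbytes
print(f"payload per worker: {bytes_fp32} B (fp32) vs {bytes_fp16} B (fp16)")
print(f"max abs error of averaged gradient: {max_abs_err:.2e}")
```

The payload per worker is halved, while the rounding error stays small relative to the gradient magnitudes used here. Gradients with a much wider dynamic range could overflow or underflow in float16, which this toy example does not explore.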

Experimental Validation and Results

The empirical results affirm the proposed configuration's efficiency and effectiveness:

Training Time: The complete 90-epoch run takes approximately 898 seconds (just under 15 minutes), corresponding to scaling efficiencies of 70% and 80% relative to single-GPU and single-node baselines, respectively.
Accuracy Achievement: The paper reports a top-1 single-crop validation accuracy of 74.94%, close to baseline figures from previous studies. This supports the claim that increasing the minibatch size need not reduce accuracy, provided the training procedure is adapted accordingly.
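The reported figures can be turned into rough throughput numbers with some simple arithmetic. The sketch below assumes the standard ImageNet-1k training set of 1,281,167 images, treats all 898 seconds as training time (ignoring validation and startup), and assumes the final partial minibatch of each epoch is kept. The implied single-GPU figure further assumes that scaling efficiency is the ratio of per-GPU throughput at scale to single-GPU throughput. None of these derived values are reported in the summary above.

```python
TRAIN_IMAGES = 1_281_167   # ImageNet-1k training set size (assumed dataset split)
EPOCHS = 90
TOTAL_SECONDS = 898
NUM_GPUS = 1024
GLOBAL_BATCH = 32_768
EFFICIENCY_VS_SINGLE_GPU = 0.70

# Minibatch handled by each GPU per iteration.
per_gpu_batch = GLOBAL_BATCH // NUM_GPUS

# Iterations per epoch, rounding up to include the last partial minibatch.
iters_per_epoch = -(-TRAIN_IMAGES // GLOBAL_BATCH)
total_iters = iters_per_epoch * EPOCHS
seconds_per_iter = TOTAL_SECONDS / total_iters

# Aggregate and per-GPU throughput in images per second.
cluster_throughput = TRAIN_IMAGES * EPOCHS / TOTAL_SECONDS
per_gpu_throughput = cluster_throughput / NUM_GPUS

# Single-GPU throughput implied by the 70% efficiency figure (an inference).
implied_single_gpu = per_gpu_throughput / EFFICIENCY_VS_SINGLE_GPU

print(f"per-GPU minibatch:        {per_gpu_batch}")
print(f"iterations per epoch:     {iters_per_epoch}")
print(f"seconds per iteration:    {seconds_per_iter:.3f}")
print(f"cluster throughput:       {cluster_throughput:,.0f} images/s")
print(f"per-GPU throughput:       {per_gpu_throughput:.1f} images/s")
print(f"implied single-GPU rate:  {implied_single_gpu:.1f} images/s")
```

Under these assumptions each iteration has a budget of roughly a quarter of a second for the forward pass, backward pass, and gradient all-reduce across 1024 GPUs, which is why communication overhead matters so much at this scale.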

Implications and Future Directions

The research has direct implications for the scalability of neural network training. By showing that training with extremely large minibatch sizes can be both fast and accurate, it offers a way to shorten model development cycles without compromising accuracy. In practice, this can speed up experimentation and make training on larger models or datasets more feasible.

Theoretically, it challenges prevailing assumptions regarding the adverse effects of large minibatch sizes on convergence, opening ground for further exploration into adaptive methodologies that balance computational efficiency with accuracy maintenance.

Looking ahead, this work may encourage further study of how hardware advances and algorithmic techniques interact, and could guide the development of frameworks or strategies that make fuller use of modern computational resources. It also sets a benchmark for efficiency in distributed neural network training.


Related Papers

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (2017)
Image Classification at Supercomputer Scale (2018)
Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes (2018)
ImageNet Training in Minutes (2017)
PowerAI DDL (2017)