Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks

Published 29 Nov 2018 in cs.LG, cs.CV, and stat.ML | (1811.12019v5)

Abstract: Large-scale distributed training of deep neural networks suffers from the generalization gap caused by the increase in the effective mini-batch size. Previous approaches try to solve this problem by varying the learning rate and batch size over epochs and layers, or some ad hoc modification of the batch normalization. We propose an alternative approach using a second-order optimization method that shows similar generalization capability to first-order methods, but converges faster and can handle larger mini-batches. To test our method on a benchmark where highly optimized first-order methods are available as references, we train ResNet-50 on ImageNet. We converged to 75% Top-1 validation accuracy in 35 epochs for mini-batch sizes under 16,384, and achieved 75% even with a mini-batch size of 131,072, which took only 978 iterations.

Summary

The paper introduces a distributed second-order optimization method based on K-FAC that maintains model generalization when training with large mini-batches.
It leverages distributed data- and model-parallel strategies with mixed precision and symmetry-aware communication for efficient computation.
Experiments on ResNet-50 with ImageNet show 75% Top-1 validation accuracy in 35 epochs for mini-batch sizes under 16,384, and the same accuracy with a mini-batch size of 131,072 in 978 iterations.

In the paper titled "Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks," the authors propose a distributed optimization method for training deep neural networks on large-scale systems. Its core is a second-order technique, Kronecker-Factored Approximate Curvature (K-FAC), which the authors report generalizes comparably to first-order methods such as stochastic gradient descent (SGD) while converging faster and handling larger mini-batches.
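For orientation, the standard K-FAC approximation for a single fully connected layer can be written as below. This follows the usual textbook formulation of K-FAC rather than the paper's exact notation: the symbols $W_\ell$ (weights), $a_{\ell-1}$ (input activations), $g_\ell$ (gradient of the loss with respect to the layer's pre-activation outputs), $\eta$ (learning rate), and $\lambda$ (damping) are assumptions introduced here, and the damping is shown in a simplified form. Convolutional layers need a further approximation over spatial positions that is not shown.

```latex
% Kronecker-factored approximation of the Fisher block for layer \ell
F_\ell \;\approx\; A_{\ell-1} \otimes G_\ell,
\qquad
A_{\ell-1} = \mathbb{E}\!\left[ a_{\ell-1} a_{\ell-1}^{\top} \right],
\qquad
G_\ell = \mathbb{E}\!\left[ g_\ell g_\ell^{\top} \right]

% Preconditioned update: the Kronecker structure means only the two
% small factors are inverted, never the full Fisher block
W_\ell \;\leftarrow\; W_\ell \;-\; \eta\,
\left( G_\ell + \sqrt{\lambda}\, I \right)^{-1}
\nabla_{W_\ell} \mathcal{L}\,
\left( A_{\ell-1} + \sqrt{\lambda}\, I \right)^{-1}
```

Both factors are symmetric matrices whose sizes are set by the layer's input and output dimensions, which is what makes the inversion affordable compared with inverting the full Fisher matrix.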

Overview

The paper investigates how to overcome the generalization gap that often appears when the mini-batch size is increased during distributed training of deep neural networks. Previous approaches address this problem by varying the learning rate and batch size over epochs and layers, or by ad hoc modifications of batch normalization. The authors propose K-FAC as a more principled alternative that maintains generalization while converging faster with larger mini-batches. Their distributed implementation combines data-parallel and model-parallel strategies with mixed-precision computation and symmetry-aware communication.
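The idea behind symmetry-aware communication can be sketched in a few lines of plain Python. Because each Kronecker factor is a symmetric matrix, only its upper triangle (including the diagonal) needs to be exchanged between workers. The helper names `pack_upper` and `unpack_upper` and the toy activations are hypothetical, and this is an illustration of the principle, not the authors' implementation, which would operate on GPU tensors inside a collective operation.

```python
def pack_upper(matrix):
    """Flatten the upper triangle (diagonal included) of a square symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


def unpack_upper(packed, n):
    """Rebuild the full n x n symmetric matrix from its packed upper triangle."""
    full = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i, n):
            full[i][j] = packed[k]
            full[j][i] = packed[k]  # mirror across the diagonal
            k += 1
    return full


# Toy Kronecker factor A = (1/B) * sum_b a_b a_b^T from hypothetical activations.
activations = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
n = 3
factor = [
    [sum(a[i] * a[j] for a in activations) / len(activations) for j in range(n)]
    for i in range(n)
]

packed = pack_upper(factor)        # what a worker would send
restored = unpack_upper(packed, n)  # what a receiver would reconstruct

print(len(packed), "values sent instead of", n * n)
```

For an n x n factor this sends n(n+1)/2 values instead of n^2, so the saving approaches one half as the factors grow.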

Numerical Results

The paper reports results for the ResNet-50 architecture on the ImageNet dataset. The authors reach 75% Top-1 validation accuracy within 35 epochs for mini-batch sizes up to 16,384. They reach the same accuracy with a mini-batch size of 131,072 in 978 iterations. These results indicate that K-FAC can train effectively at very large batch sizes, a regime that is difficult for conventional SGD approaches.
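To put the 978-iteration figure in context, the short calculation below converts iterations to epochs. It assumes the standard ILSVRC-2012 training set size of 1,281,167 images, which is not stated in this summary; the function name `epochs_from_iterations` is hypothetical.

```python
# Assumption: ILSVRC-2012 (ImageNet-1k) training set size, not given in the summary.
IMAGENET_TRAIN_IMAGES = 1_281_167


def epochs_from_iterations(iterations, batch_size, dataset_size=IMAGENET_TRAIN_IMAGES):
    """Number of passes over the dataset implied by an iteration count."""
    return iterations * batch_size / dataset_size


samples = 978 * 131_072  # total training samples processed
epochs = epochs_from_iterations(978, 131_072)

print(samples)           # 128188416
print(round(epochs, 2))  # roughly 100 epochs
```

Under that assumption, the 131,072 run corresponds to roughly 100 passes over the data, so the 35-epoch figure applies only to the smaller mini-batch sizes.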

Implications

The study has both practical and theoretical implications. Practically, it shows a route to faster training at very large mini-batch sizes, which matters when deploying AI applications at scale. Theoretically, it invites further exploration of second-order methods as viable alternatives for deep learning optimization. Because K-FAC appears to benefit from the more stable per-batch statistics that large mini-batches provide, it could also influence the convergence strategies adopted by machine learning frameworks.

Future Directions

Moving forward, further refinement of K-FAC's operations could improve computational efficiency and scalability. The potential to approximate the Fisher information matrix more aggressively without compromising accuracy suggests additional avenues for research. These innovations may lead to enhancements in optimizer design, yielding deeper insights into the convergence characteristics of second-order methods versus highly optimized first-order methods.

In conclusion, while second-order optimizers like K-FAC are not yet widely adopted, this paper lays groundwork that demonstrates their potential for large-scale distributed training. Future work can build on these results, especially in scenarios that demand rapid model prototyping and deployment.
