Image Classification at Supercomputer Scale

Published 16 Nov 2018 in cs.LG, cs.DC, and stat.ML | (1811.06992v2)

Abstract: Deep learning is extremely computationally intensive, and hardware vendors have responded by building faster accelerators in large clusters. Training deep learning models at petaFLOPS scale requires overcoming both algorithmic and systems software challenges. In this paper, we discuss three systems-related optimizations: (1) distributed batch normalization to control per-replica batch sizes, (2) input pipeline optimizations to sustain model throughput, and (3) 2-D torus all-reduce to speed up gradient summation. We combine these optimizations to train ResNet-50 on ImageNet to 76.3% accuracy in 2.2 minutes on a 1024-chip TPU v3 Pod with a training throughput of over 1.05 million images/second and no accuracy drop.

Abstract PDF Upgrade to Chat

Authors (5)

Citations (117)

View on Semantic Scholar

Summary

The paper introduces systems optimizations including distributed batch normalization to maintain validation accuracy across numerous replicas.
Input pipeline enhancements like sharding, caching, and parallel parsing significantly boost data throughput for high-speed TPU configurations.
A novel 2-D torus all-reduce approach reduces latency in gradient summation, enabling a throughput of over 1.05M images per second on a TPU v3 Pod.

Image Classification at Supercomputer Scale

The paper "Image Classification at Supercomputer Scale" presents a comprehensive examination of systems optimizations required to accelerate deep learning tasks to petaFLOPS scale on high-performance hardware, specifically leveraging Google's TPU v3 Pods. Focusing on training the ResNet-50 model on the ImageNet dataset, the authors outline strategies to overcome the algorithmic and systems software challenges inherent in distributed deep learning. These optimizations include distributed batch normalization, input pipeline enhancements, and a novel 2-D torus all-reduce approach for gradient summation.

Key Technical Contributions

The authors introduce a range of technical contributions that aim to optimize the efficiency and scalability of large-scale training tasks:

Distributed Batch Normalization: This technique addresses the challenge of maintaining high validation accuracy when scaling batch size across numerous replicas. By using distributed reduction across replica subsets to compute batch normalization statistics, the effective batch size is controlled independently of global batch size, improving convergence for small per-replica batch sizes.
Input Pipeline Optimization: To prevent bottlenecks between CPU and accelerator computations, the paper discusses four key optimizations: sharding and caching of datasets, prefetching data to pipeline input, parallel parsing, and fused JPEG decode with cropping. These strategies collectively ensure sustained model throughput during training.
2-D Torus All-Reduce: The research improves upon conventional ring-based all-reduce methods by introducing a 2-D mesh algorithm that enhances scalability and reduces latency on TPU Pods. This variant proves beneficial for synchronized gradient summation across extensive chip arrays.

Experiments and Results

The authors meticulously demonstrate the impact of each optimization through controlled experiments. For example, varying the batch size for batch normalization highlights how distributed methods prevent notable accuracy drops, achieving 76.3% with effective batch normalization sizes. Similarly, input pipeline optimizations, including parallel data parsing, significantly bolster data processing rates, crucial for managing the throughput demands of high-speed TPU configurations. Finally, using a 1024-chip TPU v3 Pod, they achieved the noteworthy feat of training ResNet-50 with a throughput of over 1.05 million images per second in just 2.2 minutes, without an accuracy drop.

Implications and Future Directions

The systems optimizations outlined in this paper allow for dramatic reductions in training time for large-scale image classification tasks, enabling substantial improvements in efficiency for research and practical applications of deep learning. These methods are not constrained to TPUs and may be applicable across other architectures, potentially influencing future developments in AI infrastructure.

Looking ahead, the authors suggest exploring mixed data and model parallelism strategies to further optimize training processes. This direction may unlock additional efficiencies, particularly when handling models that exceed the memory capacity of individual devices. As AI models continue to grow in complexity and size, these advancements are critical for ensuring scalability and practicality in deploying deep learning solutions across diverse domains.

Markdown Report Issue