Abstract: Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems --- BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN's usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN's computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.
The paper proposes Group Normalization as a robust alternative to Batch Normalization that lowers ResNet-50 validation error by 10.6% relative to Batch Normalization on ImageNet at a batch size of 2.
It computes statistics within groups of channels rather than across the batch dimension, ensuring stable performance even under the small batch sizes imposed by memory constraints.
Empirical results on COCO and Kinetics demonstrate that Group Normalization enhances object detection, segmentation, and video classification accuracy.
Group Normalization: A Robust Alternative to Batch Normalization
Group Normalization (GN), as presented by Yuxin Wu and Kaiming He from Facebook AI Research, addresses critical challenges posed by Batch Normalization (BN) in the context of deep learning, particularly in scenarios with small batch sizes. GN provides a compelling alternative to BN by normalizing features within groups of channels independently of the batch size, thus offering stable performance across a wide range of batch sizes.
Key Issues with Batch Normalization
BN has contributed substantially to the optimization and generalization of deep networks, but it introduces significant limitations, especially when working with small batch sizes. The primary challenge is BN's dependence on a sufficiently large batch to accurately estimate the batch statistics (mean and variance). As the batch size shrinks, these estimates become noisier and error rates rise. This dependency also restricts BN's applicability in tasks requiring high-resolution inputs, where memory constraints force small batches.
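A toy illustration of this dependence (an assumption made for this summary, not an experiment from the paper): because activations within an image are correlated, the per-channel batch mean that BN relies on fluctuates more and more as the batch size N shrinks.

```python
import numpy as np

# Toy model (not from the paper): each image contributes a correlated activation
# pattern (a per-image offset plus noise), so BN's batch statistics get noisier
# as the number of images N in the batch decreases.
rng = np.random.default_rng(0)

def batch_mean_spread(n_images, trials=200, hw=56):
    means = []
    for _ in range(trials):
        offsets = rng.normal(scale=1.0, size=(n_images, 1, 1))        # per-image shift
        x = offsets + rng.normal(scale=0.5, size=(n_images, hw, hw))  # one channel, NHW
        means.append(x.mean())                                        # BN-style batch mean
    return np.std(means)

for n in (32, 8, 2):
    print(f"N={n:2d}  std of batch-mean estimate: {batch_mean_spread(n):.3f}")
# The spread grows roughly as 1/sqrt(N), mirroring why BN's error rises at small N.
```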
Group Normalization as an Alternative
To circumvent these limitations, GN divides channels into groups and computes the normalization statistics within each group of channels rather than across the batch dimension. This approach allows GN to perform consistently regardless of the batch size, as group-wise normalization does not rely on batch statistics. GN's formulation entails:
$$\mu_i = \frac{1}{m}\sum_{k \in \mathcal{S}_i} x_k, \qquad \sigma_i = \sqrt{\frac{1}{m}\sum_{k \in \mathcal{S}_i} \left(x_k - \mu_i\right)^2 + \epsilon}$$
where $\mathcal{S}_i$ is the set of pixels over which the mean and variance are computed and $m$ is its size; for GN, $\mathcal{S}_i$ spans the $C/G$ channels of one group (together with the spatial locations) within a single sample. This framework generalizes to other normalizations: GN becomes LN when $G=1$ and IN when $G=C$.
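The abstract notes that GN can be implemented in a few lines of code in modern libraries. A minimal NumPy sketch of the formulation above (not the authors' code; G=32 and eps=1e-5 are common defaults assumed here) for an NCHW tensor:

```python
import numpy as np

# Group Normalization sketch: statistics are computed per sample and per group,
# over the group's C//G channels and all spatial positions. The batch axis is
# never reduced, so the computation is independent of the batch size.
def group_norm(x, gamma, beta, G=32, eps=1e-5):
    N, C, H, W = x.shape
    assert C % G == 0, "C must be divisible by G"
    x = x.reshape(N, G, C // G, H, W)
    mu = x.mean(axis=(2, 3, 4), keepdims=True)   # per-sample, per-group mean
    var = x.var(axis=(2, 3, 4), keepdims=True)   # per-sample, per-group variance
    x = (x - mu) / np.sqrt(var + eps)
    x = x.reshape(N, C, H, W)
    # gamma and beta are the usual learned per-channel scale and shift.
    return x * gamma.reshape(1, C, 1, 1) + beta.reshape(1, C, 1, 1)

# G=1 recovers Layer Norm (one group spans all channels);
# G=C recovers Instance Norm (each channel is its own group).
x = np.random.randn(2, 64, 8, 8).astype(np.float32)
y = group_norm(x, gamma=np.ones(64), beta=np.zeros(64), G=32)
```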
Empirical Evaluations
ImageNet Classification: The ResNet-50 model with GN was evaluated on the ImageNet dataset, demonstrating much greater robustness to small batch sizes than BN. Specifically, with a batch size of 2, GN achieved a 10.6% lower validation error than BN (24.1% versus 34.7%).
Object Detection and Segmentation in COCO: GN was comprehensively tested with the Mask R-CNN framework. With a batch size of 1 image/GPU, GN consistently outperformed the frozen BN (BN*) method, demonstrating improvements in both bounding box detection and segmentation average precision (AP). The results highlight GN's efficacy in dealing with high-resolution features inherently requiring smaller batch sizes.
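For context, "frozen BN" (BN*) means the pre-trained BN layers are kept fixed during fine-tuning, each acting as a constant per-channel linear transform. A hedged PyTorch sketch of this common practice (illustrative, not the authors' code):

```python
import torch.nn as nn

# Freeze BN layers so they normalize with their stored running statistics and
# keep their affine parameters fixed, i.e. behave as fixed linear transforms.
def freeze_bn(model: nn.Module) -> nn.Module:
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()                        # use running mean/var, not batch statistics
            for p in m.parameters():
                p.requires_grad_(False)     # keep gamma/beta fixed during fine-tuning
    return model
```

Note that a later call to model.train() switches BN layers back to batch statistics, so the freeze is typically re-applied (or train() overridden) before each training phase.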
Video Classification in Kinetics: Using Inflated 3D (I3D) networks, GN was compared with BN for video classification. GN maintained stable accuracy across different batch sizes and allowed the model to benefit from longer temporal input lengths without the trade-offs seen with BN. For example, GN improved the top-1 classification accuracy by 1.7% when increasing the frame input from 32 to 64 frames while maintaining a small batch size of 4.
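GN carries over to spatiotemporal features without modification: for a video tensor the per-group statistics simply also reduce over the temporal axis. A sketch under the same assumptions as the NumPy example above, for an (N, C, T, H, W) layout:

```python
import numpy as np

# Assumed extension of the earlier NumPy sketch to 5D video tensors: the group
# statistics are computed per sample and per group over channels, time, and space.
def group_norm_3d(x, gamma, beta, G=32, eps=1e-5):
    N, C, T, H, W = x.shape
    x = x.reshape(N, G, C // G, T, H, W)
    mu = x.mean(axis=(2, 3, 4, 5), keepdims=True)
    var = x.var(axis=(2, 3, 4, 5), keepdims=True)
    x = ((x - mu) / np.sqrt(var + eps)).reshape(N, C, T, H, W)
    return x * gamma.reshape(1, C, 1, 1, 1) + beta.reshape(1, C, 1, 1, 1)
```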
Implications and Future Directions
The most notable implication of GN is its potential to enable larger-capacity models and more flexible architectures unrestricted by the memory and batch size constraints imposed by BN. This opens opportunities for more sophisticated designs and broader applicability, particularly in tasks requiring high-resolution inputs or small batches due to computational limitations.
Theoretical extensions to other domains of deep learning, particularly those involving sequential and generative models, are warranted given GN's structural similarities to LN and IN. Further research could validate GN's performance in reinforcement learning tasks where BN is currently instrumental.
Overall, Group Normalization presents a robust, scalable alternative to Batch Normalization, addressing essential limitations and offering enhanced performance stability across diverse deep learning tasks.