Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks

Published 24 Feb 2020 in cs.LG, cs.CV, and stat.ML | (2002.10444v3)

Abstract: Batch normalization dramatically increases the largest trainable depth of residual networks, and this benefit has been crucial to the empirical success of deep residual networks on a wide range of benchmarks. We show that this key benefit arises because, at initialization, batch normalization downscales the residual branch relative to the skip connection, by a normalizing factor on the order of the square root of the network depth. This ensures that, early in training, the function computed by normalized residual blocks in deep networks is close to the identity function (on average). We use this insight to develop a simple initialization scheme that can train deep residual networks without normalization. We also provide a detailed empirical study of residual networks, which clarifies that, although batch normalized networks can be trained with larger learning rates, this effect is only beneficial in specific compute regimes, and has minimal benefits when the batch size is small.

Summary

The paper demonstrates that Batch Normalization biases residual blocks towards the identity function at initialization by downscaling residual branches.
A novel initialization scheme called SkipInit is introduced and shown to enable training deep residual networks without normalization layers.
The study analyzes learning dynamics, finding that while BN supports larger learning rates, its benefits are more pronounced at large batch sizes.

Summary of "Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks"

The paper "Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks" by Soham De and Samuel L. Smith presents a detailed investigation into the utility of batch normalization, particularly in the context of deep residual neural networks. The study proposes that batch normalization enables the efficient training of deep residual networks by biasing the residual blocks towards the identity function at initialization. This phenomenon contributes significantly to the favorable gradient properties observed in these networks.

Key Contributions and Analysis

Influence of Batch Normalization: The authors show that, at initialization, batch normalization downscales the hidden activations on the residual branch relative to the skip connection. The normalizing factor is on the order of the square root of the network depth, so that early in training the function computed by each normalized residual block in a deep network is close to the identity (on average). This keeps signal propagation well behaved and gradient norms manageable throughout the network.
Empirical Validation through SkipInit: The authors propose and validate a simple initialization scheme, termed "SkipInit." It makes one small change: a learnable scalar multiplier, initialized to zero, is placed on each residual branch. Networks using this scheme can be trained at large depth without normalization layers, which supports the paper's central claim about why batch normalization helps.
Exploring Learning Dynamics: The paper also studies how batch normalization affects training across a range of compute regimes. Batch normalized networks can be trained with larger learning rates, but this is only beneficial in specific regimes: the advantage appears mainly at large batch sizes, where gradient noise is less of a constraint, and is minimal when the batch size is small.
Beyond Traditional Norms: A detailed empirical study complements the theoretical findings. The authors qualify the claim, made in earlier work, that the main benefit of batch normalization is its tolerance of large learning rates, and argue instead that its key benefit is the increase in the largest trainable depth of residual networks.
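The downscaling argument in the first contribution can be illustrated with a toy simulation. The sketch below is not the authors' code: it assumes a residual branch that is a fresh random linear map with N(0, 1/width) weights, and it stands in for batch normalization by standardizing the branch output across its features. All function names are hypothetical. Under these assumptions, the variance on the skip path roughly doubles per block without normalization, while with normalization it grows by about one per block, so the unit-variance branch is smaller than the skip path by a factor of roughly the square root of the block index.

```python
import math
import random


def variance(v):
    """Population variance of a list of floats."""
    mean = sum(v) / len(v)
    return sum((a - mean) ** 2 for a in v) / len(v)


def standardize(v):
    """Shift and scale v to zero mean and unit variance (toy stand-in for batch norm)."""
    mean = sum(v) / len(v)
    std = math.sqrt(variance(v))
    return [(a - mean) / std for a in v]


def random_linear(x, rng):
    """Toy residual branch: a fresh random linear map with N(0, 1/width) weights."""
    std = 1.0 / math.sqrt(len(x))
    return [sum(rng.gauss(0.0, std) * xj for xj in x) for _ in range(len(x))]


def skip_path_variances(depth, width, normalized, seed=0):
    """Return the skip-path variance after each of `depth` residual blocks."""
    rng = random.Random(seed)
    x = standardize([rng.gauss(0.0, 1.0) for _ in range(width)])
    history = [variance(x)]
    for _ in range(depth):
        branch = random_linear(x, rng)
        if normalized:
            # Normalization forces the branch back to unit variance,
            # no matter how large the skip path has become.
            branch = standardize(branch)
        x = [a + b for a, b in zip(x, branch)]
        history.append(variance(x))
    return history


if __name__ == "__main__":
    plain = skip_path_variances(depth=16, width=128, normalized=False)
    normed = skip_path_variances(depth=16, width=128, normalized=True)
    for block in (1, 4, 16):
        # Ratio of branch std (1.0) to skip-path std at the input of this block.
        ratio = 1.0 / math.sqrt(normed[block - 1])
        print(block, plain[block], normed[block], ratio)
```

The printed ratio is the quantity of interest: it shrinks as the block index grows, which is the sense in which deeper normalized blocks sit closer to the identity.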

Implications and Future Directions

This work critically informs the ongoing discourse on optimization methods and architecture design, particularly in the domain of deep learning. Practically, it emphasizes the possibility of training extremely deep residual networks without reliance on normalization layers, provided the residual branches are suitably initialized. This could lead to simplifications in network architectures and potentially reduce computational overhead during training.
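As a rough illustration of what "suitably initialized" can mean, here is a minimal sketch of a residual block with a SkipInit-style scalar. It is not the authors' implementation: the class name, the single linear-plus-ReLU branch, and the He-style weight scale are assumptions chosen to keep the example short. The point it shows is that with the scalar set to zero, a stack of such blocks computes exactly the identity at initialization, while the gradient with respect to the scalar is generally nonzero, so training can move it away from zero.

```python
import math
import random


class SkipInitBlock:
    """Toy residual block computing y = x + alpha * f(x), with alpha starting at 0."""

    def __init__(self, width, rng):
        # He-style scale for a ReLU branch (an assumption of this sketch).
        std = math.sqrt(2.0 / width)
        self.weights = [
            [rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)
        ]
        self.alpha = 0.0  # learnable scalar on the residual branch

    def branch(self, x):
        """Residual branch f(x): one linear layer followed by ReLU."""
        return [
            max(0.0, sum(w * xj for w, xj in zip(row, x))) for row in self.weights
        ]

    def forward(self, x):
        """Skip connection plus the scaled residual branch."""
        return [xi + self.alpha * bi for xi, bi in zip(x, self.branch(x))]

    def alpha_grad(self, x, grad_out):
        """dL/dalpha for upstream gradient grad_out: sum_i grad_out[i] * f(x)[i]."""
        return sum(g * b for g, b in zip(grad_out, self.branch(x)))


if __name__ == "__main__":
    rng = random.Random(0)
    blocks = [SkipInitBlock(16, rng) for _ in range(20)]
    x = [rng.gauss(0.0, 1.0) for _ in range(16)]

    h = x
    for block in blocks:
        h = block.forward(h)

    print("identity at init:", h == x)
    print("gradient w.r.t. alpha in first block:", blocks[0].alpha_grad(x, [1.0] * 16))
```

A full training setup would update `alpha` alongside the weights; that part is omitted here.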

Theoretically, the finding that normalization biases residual blocks towards the identity opens new avenues for reconsidering initialization schemes across neural architectures. It also calls for a reevaluation of the role normalization layers play in shaping optimization landscapes and generalization.

Further research could explore whether these findings carry over to other architectures, such as transformers, which also combine residual connections with normalization layers. Such work might refine strategies for tuning and architectural design.

In conclusion, the paper advances the understanding of why batch normalization helps in residual networks, and offers a basis for both immediate application and further research. Its main practical takeaway is that a simple initialization strategy can recover much of the benefit that normalization provides for training deep residual networks.
