How to Start Training: The Effect of Initialization and Architecture

Published 5 Mar 2018 in stat.ML and cs.LG | (1803.01719v3)

Abstract: We identify and study two common failure modes for early training in deep ReLU nets. For each we give a rigorous proof of when it occurs and how to avoid it, for fully connected and residual architectures. The first failure mode, exploding/vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly weighting the residual modules. We prove that the second failure mode, exponentially large variance of activation length, never occurs in residual nets once the first failure mode is avoided. In contrast, for fully connected nets, we prove that this failure mode can happen and is avoided by keeping constant the sum of the reciprocals of layer widths. We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allows much deeper networks to be trained.

Summary

The paper identifies two critical failure modes in deep ReLU networks and proposes initialization and architectural strategies to mitigate them.
The authors show that proper weight initialization with variance 2/fan-in prevents exploding or vanishing activations in fully connected and convolutional networks, while residual networks benefit from scaled residual blocks.
Empirical validation confirms that following these guidelines enables deeper networks to train more efficiently, offering actionable insights for overcoming early training obstacles.

This paper by Boris Hanin and David Rolnick rigorously addresses critical aspects of neural network training, focusing specifically on initialization and architectural choices. The work identifies two prevalent failure modes during the early training of deep ReLU networks and provides theoretical and empirical insights on how to mitigate these issues across multiple architectures, including fully connected networks, convolutional networks, and residual networks.

Failure Modes in Deep Learning

The paper delineates two primary failure modes:

FM1 (Exploding or Vanishing Mean Activation Length): This mode occurs when the mean length scale of activations in the final layer increases or decreases exponentially with depth.
FM2 (Exponential Growth of Activation Length Variance): Here, the variance of the activation length grows exponentially with depth. The paper shows that FM1 is controlled by the initialization, whereas FM2 depends on the architecture.
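
To make FM1 concrete, the following is a minimal sketch, not the authors' code, that estimates the mean activation length of a bias-free fully connected ReLU net at initialization. It assumes Gaussian weights, a constant width, and the normalized squared length `M_d = ||act_d||^2 / width` as the quantity of interest; the function names and the default width, depth, and trial count are illustrative choices, not values from the paper.

```python
import math
import random


def relu_layer(x, n_out, weight_std, rng):
    """Apply one bias-free fully connected ReLU layer with Gaussian weights."""
    out = []
    for _ in range(n_out):
        pre = sum(rng.gauss(0.0, weight_std) * xi for xi in x)
        out.append(max(pre, 0.0))
    return out


def mean_activation_length(var_numerator, width=32, depth=10, trials=100, seed=0):
    """Average M_d = ||act_d||^2 / width over random initializations.

    Weights are drawn with variance var_numerator / fan_in (assumed Gaussian).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [1.0] * width  # fixed input, normalized so that M_0 = 1
        for _ in range(depth):
            x = relu_layer(x, width, math.sqrt(var_numerator / width), rng)
        total += sum(v * v for v in x) / width
    return total / trials


# Same seed for both runs, so the two nets differ only in weight scale.
m_he = mean_activation_length(2.0)    # variance 2 / fan-in
m_half = mean_activation_length(1.0)  # variance 1 / fan-in
print("variance 2/fan-in:", m_he)
print("variance 1/fan-in:", m_half)
```

With variance 2/fan-in the estimate should stay on the order of the input length, while variance 1/fan-in halves the squared length at every layer, so after 10 layers it is smaller by a factor of 2^10. Because both runs share a seed and ReLU is positively homogeneous, that ratio holds exactly here up to floating-point rounding.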

Key Results and Contributions

FM1 Avoidance: The authors demonstrate that FM1 can be circumvented by proper weight initialization. For fully connected and convolutional networks, initialization from a symmetric distribution with variance $2/\text{fan-in}$ is recommended. For residual networks, the emphasis is on scaling the residual blocks appropriately.
FM2 Avoidance: For fully connected networks, FM2 is an architectural problem: it is avoided by keeping the sum of the reciprocals of the layer widths constant as depth increases, for example by widening the layers of deeper networks. Residual networks need no such constraint, since FM2 never occurs in them once FM1 is avoided.
Empirical Validation: Empirical studies affirm the theoretical predictions. Networks initialized according to the paper's guidelines begin training more effectively, especially as network depth increases. The empirical results also highlight the inadequacies of several popular initializations that do not comply with the proposed variance requirements.
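
The width criterion for FM2 is easy to check for a given fully connected architecture. The sketch below computes the sum of reciprocals of the hidden layer widths and, for a constant-width net, the smallest width that keeps this sum under a chosen value. The helper names and the `budget` parameter are hypothetical; the source does not specify a numerical threshold, so any budget used here is an assumption.

```python
import math


def reciprocal_width_sum(hidden_widths):
    """Return the sum of 1 / n_j over the hidden layer widths n_j."""
    return sum(1.0 / n for n in hidden_widths)


def widths_for_budget(depth, budget):
    """Smallest constant width keeping the reciprocal sum at or below `budget`.

    For constant width n the sum equals depth / n, so n >= depth / budget.
    `budget` is an illustrative knob, not a value taken from the paper.
    """
    return math.ceil(depth / budget)


narrow = [10] * 50   # 50 hidden layers of width 10
wide = [100] * 50    # 50 hidden layers of width 100
print(reciprocal_width_sum(narrow))  # about 5.0
print(reciprocal_width_sum(wide))    # about 0.5
print(widths_for_budget(200, 0.5))   # width needed at depth 200 for the same sum as `wide`
```

At fixed depth, the narrow net has a sum ten times larger than the wide one, so by the paper's criterion it is the one at risk of FM2. Holding the sum fixed while quadrupling the depth requires quadrupling a constant width.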

Implications and Future Directions

The theoretical framework and results have practical consequences for designing and initializing deep networks. By showing how to avoid FM1 and FM2, the paper makes it possible to train much deeper networks than popular initializations that fail its criteria allow. The distinctions it draws between fully connected, convolutional, and residual networks also help explain why certain architectures, like ResNets, are empirically robust.

Looking forward, future work might analyze activation functions other than ReLU and other network configurations, such as recurrent networks. These findings could also influence initialization practices in novel architectures and complex tasks.

Conclusion

This paper contributes a comprehensive analysis of how initialization and architecture influence the ease of training deep neural networks. Its rigorous identification of two failure modes, together with proofs of when they occur and how to avoid them, highlights crucial considerations for the deep learning community. Through theoretical guarantees and empirical validation, Hanin and Rolnick provide a structured approach to overcoming early training obstacles, enriching the toolkit available for training deep neural networks.
