- The paper introduces LSUV initialization, a two-step procedure that first pre-initializes weights with orthonormal matrices and then normalizes each layer's output variance to one.
- Experimental validation on MNIST, CIFAR-10/100, and ImageNet demonstrates faster convergence and competitive accuracy compared to traditional methods.
- LSUV's robustness across activation functions and network architectures makes it a practical alternative to more complex initialization schemes.
Analysis of "All you need is a good init"
The paper "All you need is a good init" by Dmytro Mishkin and Jiri Matas introduces the Layer-sequential unit-variance (LSUV) initialization, a method for initializing weights in deep neural networks. This approach aims to enhance the efficacy of training very deep convolutional neural networks (CNNs) by ensuring stability through appropriate initialization, particularly addressing challenges in networks deeper than traditional configurations.
Key Contributions
LSUV initialization is a straightforward, two-step procedure (a code sketch follows the list):
- Weights of each convolution or inner-product layer are pre-initialized with orthonormal matrices.
- Layers are then processed sequentially, rescaling the weights until the empirical variance of each layer's output, measured on a mini-batch of data, equals one.
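To make the two steps concrete, here is a minimal PyTorch sketch of the procedure. It is not the authors' reference implementation; names such as `lsuv_init`, `data_batch`, `tol`, and `max_iters` are illustrative, and the loop assumes a simple feed-forward model whose convolution and linear layers are visited in forward order.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def lsuv_init(model, data_batch, tol=0.1, max_iters=10):
    """Minimal LSUV sketch: orthonormal pre-init, then sequential variance scaling."""
    # Step 1: pre-initialize every conv / inner-product layer with an orthonormal matrix.
    target_layers = [m for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]
    for layer in target_layers:
        nn.init.orthogonal_(layer.weight)
        if layer.bias is not None:
            nn.init.zeros_(layer.bias)

    # Step 2: visit layers in order, rescaling weights until the empirical
    # variance of each layer's output (on the given mini-batch) is close to one.
    captured = {}

    def capture(module, inputs, output):
        captured["out"] = output

    for layer in target_layers:
        handle = layer.register_forward_hook(capture)
        for _ in range(max_iters):
            model(data_batch)                   # forward pass through the whole net
            var = captured["out"].var().item()  # variance of this layer's output blob
            if abs(var - 1.0) < tol:
                break
            layer.weight /= var ** 0.5          # scale weights so the variance moves toward 1
        handle.remove()
    return model
```

In this sketch the normalization is applied to the pre-activation output of each convolution or inner-product layer, and the whole procedure runs once, before training begins.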
The motivation stems from the difficulty of training thin, deep networks with standard initialization techniques. LSUV offers a practical remedy: training converges faster while accuracy remains comparable to, or better than, more elaborate schemes such as FitNets and Highway Networks.
Experimental Validation
The authors validate LSUV across multiple architectures, including GoogLeNet, CaffeNet, and FitNets, reporting state-of-the-art or near state-of-the-art performance on MNIST, CIFAR-10/100, and ImageNet. Notably, LSUV-initialized networks converge at rates comparable to networks trained with batch normalization, while the initialization itself is performed only once and therefore adds no per-iteration overhead.
Numerical Results and Claims
The paper makes strong claims supported by numerical results:
- On CIFAR-10 and CIFAR-100, LSUV-initialized networks achieve 93.94% and 70.04% accuracy, respectively, rivaling more resource-intensive methods.
- On MNIST, LSUV outperforms both orthonormal initialization and the hints-based approach; the error rate drops to 0.48%, a clear edge over these baselines.
A point of interest is the performance across various activation functions, including ReLU, VLReLU (very leaky ReLU), tanh, and maxout. LSUV consistently provides robust initializations for all of them, which is significant given the differing statistics these nonlinearities induce; a usage sketch follows.
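As an illustration of this activation-agnostic behaviour, the sketch below reuses the hypothetical `lsuv_init` function from above and applies the same initialization to otherwise identical networks that differ only in their nonlinearity. Because the rescaling factor comes from the empirically measured output variance rather than an activation-specific analytic gain (as in Glorot or He initialization), no per-activation tuning is needed; maxout is omitted here because it is not a standard PyTorch module, and `LeakyReLU(0.33)` stands in for VLReLU.

```python
import torch
import torch.nn as nn

def make_net(activation):
    # A small, hypothetical CIFAR-style CNN; only the nonlinearity varies.
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1), activation,
        nn.Conv2d(32, 64, kernel_size=3, padding=1), activation,
        nn.Flatten(),
        nn.Linear(64 * 32 * 32, 10),
    )

batch = torch.randn(64, 3, 32, 32)  # a representative mini-batch of inputs
for act in (nn.ReLU(), nn.Tanh(), nn.LeakyReLU(0.33)):
    net = lsuv_init(make_net(act), batch)  # same call regardless of the activation
```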
Implications and Future Directions
The implications of the proposed LSUV method are both practical and theoretical:
- Practical Implications: By simplifying initialization, LSUV enables efficient training of deep networks and reduces the need for extensive hyperparameter searches or hand-tuned initialization schemes.
- Theoretical Implications: The consistent performance across different architectures and activation functions suggests a broader applicability and reliability of the approach, potentially influencing subsequent research on weight initialization strategies.
Looking forward, this research opens avenues for further exploration in weight initialization for specialized architectures, such as residual networks and those employed in specific domains like NLP or genomics. The LSUV method's compatibility with various activation functions also encourages experimentation in novel network designs incorporating unconventional nonlinearities.
Conclusion
The LSUV initialization presents a compelling balance between simplicity and performance, allowing efficient training of deep CNNs with consistent outcomes across multiple trials and configurations. While the paper refrains from claiming revolutionary breakthroughs, it offers a robust and accessible tool that addresses key challenges in training very deep networks, making it a valuable contribution to the field of deep learning.