Fixup Initialization: Residual Learning Without Normalization (1901.09321v2)

Published 27 Jan 2019 in cs.LG, cs.CV, and stat.ML

Abstract: Normalization layers are a staple in state-of-the-art deep neural network architectures. They are widely believed to stabilize training, enable higher learning rate, accelerate convergence and improve generalization, though the reason for their effectiveness is still an active research topic. In this work, we challenge the commonly-held beliefs by showing that none of the perceived benefits is unique to normalization. Specifically, we propose fixed-update initialization (Fixup), an initialization motivated by solving the exploding and vanishing gradient problem at the beginning of training via properly rescaling a standard initialization. We find training residual networks with Fixup to be as stable as training with normalization -- even for networks with 10,000 layers. Furthermore, with proper regularization, Fixup enables residual networks without normalization to achieve state-of-the-art performance in image classification and machine translation.

Citations (334)

Summary

  • The paper introduces Fixup initialization to enable stable training of deep residual networks by carefully rescaling weights to prevent exploding gradients.
  • Empirical results on CIFAR-10 and ImageNet benchmarks demonstrate that Fixup achieves competitive performance versus traditional normalization layers.
  • Fixup simplifies network design by eliminating normalization, potentially reducing computational overhead and enhancing model interpretability.

Analysis of Fixup Initialization: Residual Learning Without Normalization

The paper "Fixup Initialization: Residual Learning Without Normalization" presents a rigorous examination of alternative methods for training deep residual networks without relying on normalization layers, which have conventionally been integral to the architectures of state-of-the-art deep learning models. The authors challenge the prevailing understanding that normalization techniques, such as Batch Normalization (BatchNorm), are indispensable for stabilizing training and achieving optimal performance.

Problem Formulation and Methodology

Residual networks, or ResNets, use skip connections that make very deep networks trainable, and normalization layers have traditionally been thought to play a crucial role in keeping that training stable and efficient. The authors ask whether deep residual networks can be trained effectively without normalization, and whether such networks can perform on par with their normalized counterparts. They address both questions by introducing Fixup initialization, which counters the exploding and vanishing gradients that arise at the beginning of training by carefully rescaling a standard initialization of the residual branches.

The paper argues formally that the perceived benefits of normalization are not unique to it but come from controlling the gradient dynamics at the start of training. Specifically, the authors derive a lower bound on the gradient norm of a ResNet at initialization, showing that standard initialization alone causes gradients to explode as depth grows, which explains why normalization has appeared indispensable. Fixup removes this failure mode by rescaling the residual branches at initialization, allowing stable training of very deep networks, up to 10,000 layers, without normalization.
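
To make the rescaling concrete, the sketch below shows a normalization-free residual block in the spirit of Fixup, written in PyTorch. The class name, layer layout, and hyperparameters are illustrative assumptions rather than the authors' reference implementation; it assumes a basic block with two 3x3 convolutions (m = 2) embedded in a network with L residual branches.

```python
import torch
import torch.nn as nn


class FixupBasicBlock(nn.Module):
    """Minimal sketch of a normalization-free residual block in the spirit of
    Fixup. Names, layer layout, and hyperparameters are illustrative, not the
    authors' reference implementation."""

    def __init__(self, channels: int, num_residual_branches: int):
        super().__init__()
        # Scalar biases (init 0) and a scalar multiplier (init 1) stand in for
        # the affine shift/scale that BatchNorm would otherwise provide.
        self.bias1a = nn.Parameter(torch.zeros(1))
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bias1b = nn.Parameter(torch.zeros(1))
        self.bias2a = nn.Parameter(torch.zeros(1))
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.scale = nn.Parameter(torch.ones(1))
        self.bias2b = nn.Parameter(torch.zeros(1))
        self.relu = nn.ReLU(inplace=True)

        # Fixup-style initialization for a branch with m = 2 weight layers:
        # He-initialize the first convolution, rescale it by L^(-1/(2m-2))
        # where L is the number of residual branches, and zero-initialize the
        # last convolution so the branch contributes nothing at initialization.
        m = 2
        nn.init.kaiming_normal_(self.conv1.weight, mode="fan_out", nonlinearity="relu")
        with torch.no_grad():
            self.conv1.weight.mul_(num_residual_branches ** (-1.0 / (2 * m - 2)))
        nn.init.zeros_(self.conv2.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv1(x + self.bias1a)
        out = self.relu(out + self.bias1b)
        out = self.conv2(out + self.bias2a)
        out = out * self.scale + self.bias2b
        return self.relu(out + x)
```

The scalar biases and multiplier take over the role of BatchNorm's affine parameters, while the L^(-1/(2m-2)) rescaling and the zero-initialized final convolution keep each residual branch's contribution small enough at initialization to avoid exploding gradients.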

Empirical Evaluation

Fixup's efficacy is evaluated on image classification, notably the CIFAR-10 and ImageNet datasets, as well as on machine translation tasks using the Transformer model.

1. Image Classification:

  • For CIFAR-10, the Fixup method applied to ResNet-110 showed competitive performance compared to models employing BatchNorm, with Fixup achieving a test error rate of 7.24% versus 6.61% with BatchNorm.
  • On the ImageNet dataset, Fixup was applied to ResNet-50 and ResNet-101. Fixup matched the convergence speed and training stability of normalized networks; BatchNorm still yielded lower test error, which the authors attribute to its implicit regularization, but adding stronger explicit regularization largely closed the gap (see the sketch after this list).

2. Machine Translation:

  • On the IWSLT German-English and WMT English-German translation tasks, Fixup replaced LayerNorm in the Transformer and matched or exceeded the LayerNorm baselines, achieving a BLEU score of 34.5 on IWSLT German-English.
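
The "stronger regularization" noted for the image-classification experiments refers in the paper to techniques such as Mixup. The snippet below is a minimal sketch of a Mixup training step in PyTorch; the function name and the mixing coefficient `alpha` are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np
import torch
import torch.nn.functional as F


def mixup_step(model, x, y, alpha=0.7):
    """One Mixup training step: blend pairs of inputs and mix their losses.

    `alpha` controls the Beta distribution used to sample the mixing weight;
    the value here is illustrative, not the paper's prescribed setting.
    """
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(x.size(0), device=x.device)
    mixed_x = lam * x + (1.0 - lam) * x[index]  # convex combination of inputs
    logits = model(mixed_x)
    # Mix the two cross-entropy losses with the same weight as the inputs.
    loss = lam * F.cross_entropy(logits, y) + (1.0 - lam) * F.cross_entropy(logits, y[index])
    return loss
```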

Implications and Future Directions

Replacing normalization with Fixup marks a notable shift in how deep neural networks can be designed and optimized. The findings suggest that deep networks can be simplified by eliminating normalization layers while retaining comparable performance, reducing computational overhead during training and allowing faster iteration on model architectures.

Removing normalization layers also has implications for the interpretability and theoretical understanding of neural networks. Fixup shows how initialization alone can shape early training dynamics, opening avenues for research into alternative stabilization techniques and more efficient deep learning practice.

Future research might examine the interplay between Fixup and different activation functions, extend the technique to other deep learning domains, and investigate additional architectures that could benefit from similar initialization strategies.

In conclusion, Fixup initialization offers a compelling alternative to traditional normalization: a carefully rescaled standard initialization plus a handful of scalar biases and multipliers maintains network trainability and performance while simplifying the overall architecture.
