- The paper presents weight normalization, a reparameterization that decouples weight vector length from direction to accelerate convergence in deep networks.
- It demonstrates improved training efficiency and stability across various tasks, achieving state-of-the-art results on CIFAR-10 without data augmentation and improving generative and reinforcement learning models.
- The method incurs minimal computational overhead and integrates easily with existing architectures, making it effective even in complex models like LSTMs and VAEs.
An Overview of Weight Normalization: Accelerating Training of Deep Neural Networks
In the paper "Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks," Salimans and Kingma of OpenAI introduce weight normalization, a reparameterization of the weight vectors in a neural network that decouples their length from their direction. This improves the conditioning of the optimization problem and speeds up the convergence of stochastic gradient descent (SGD). Although inspired by batch normalization, the method introduces no dependencies among the examples in a minibatch, which makes it applicable to recurrent models such as LSTMs and to noise-sensitive settings such as reinforcement learning and generative modelling.
Methodology and Key Concepts
Weight Normalization
Weight normalization reparameterizes each weight vector w in terms of a vector v and a scalar g:
$$\mathbf{w} = \frac{g}{\|\mathbf{v}\|}\,\mathbf{v}$$
Here, $\|\mathbf{v}\|$ denotes the Euclidean norm of $\mathbf{v}$. Under this reparameterization, the norm of $\mathbf{w}$ is exactly $g$, independently of the direction parameter $\mathbf{v}$. The direct benefit is better conditioning of the gradient, which speeds up convergence of the optimization. Unlike approaches that renormalize the weights in a separate step after each optimization update, the constraint here is built into the parameterization itself, so standard gradient-based optimizers can be applied directly to $\mathbf{v}$ and $g$.
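A minimal NumPy sketch of the forward computation (the function name `weight_norm_forward` and the vector shapes are illustrative, not from the paper): the output depends on $\mathbf{v}$ only through its direction, while $g$ sets the scale.

```python
import numpy as np

def weight_norm_forward(v, g, x):
    """Compute y = w . x with w = (g / ||v||) * v.

    v : (d,) direction parameter, g : scalar scale parameter,
    x : (d,) input vector. The effective weight w has norm exactly g,
    regardless of the magnitude of v.
    """
    w = g * v / np.linalg.norm(v)
    return w @ x

# Example: rescaling v leaves the output unchanged, since only its direction matters.
rng = np.random.default_rng(0)
v, x, g = rng.normal(size=5), rng.normal(size=5), 2.0
print(np.isclose(weight_norm_forward(v, g, x),
                 weight_norm_forward(10.0 * v, g, x)))  # True
```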
Gradient Computation
Training under the new parameterization involves computing gradients with respect to v and g:
$$\nabla_g L = \frac{\nabla_{\mathbf{w}} L \cdot \mathbf{v}}{\|\mathbf{v}\|}, \qquad \nabla_{\mathbf{v}} L = \frac{g}{\|\mathbf{v}\|}\,\nabla_{\mathbf{w}} L \;-\; \frac{g\,\nabla_g L}{\|\mathbf{v}\|^{2}}\,\mathbf{v}$$
Here $\nabla_{\mathbf{w}} L$ is the gradient with respect to the effective weight $\mathbf{w}$, obtained by ordinary backpropagation, so the modification adds negligible computational overhead and is easily integrated into standard neural network software.
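The sketch below (the helper name `weight_norm_grads` is ours for illustration) maps a given $\nabla_{\mathbf{w}} L$ onto the $(\mathbf{v}, g)$ parameters using the formulas above; the final check reflects the paper's observation that $\nabla_{\mathbf{v}} L$ is orthogonal to $\mathbf{v}$.

```python
import numpy as np

def weight_norm_grads(v, g, grad_w):
    """Map the gradient w.r.t. the effective weight w onto (v, g).

    Implements  grad_g = (grad_w . v) / ||v||
                grad_v = (g / ||v||) * grad_w - (g * grad_g / ||v||^2) * v
    """
    norm_v = np.linalg.norm(v)
    grad_g = grad_w @ v / norm_v
    grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v**2) * v
    return grad_v, grad_g

# Sanity check: the gradient w.r.t. v is orthogonal to v.
rng = np.random.default_rng(1)
v, grad_w = rng.normal(size=4), rng.normal(size=4)
grad_v, grad_g = weight_norm_grads(v, 2.0, grad_w)
print(np.isclose(grad_v @ v, 0.0))  # True
```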
Experimental Validation
The efficacy of weight normalization was tested across various applications, demonstrating consistent performance improvements and faster convergence rates.
Supervised Classification on CIFAR-10
Using a modified ConvPool-CNN-C architecture, the authors compared weight normalization, batch normalization, mean-only batch normalization, and the standard parameterization. Batch normalization made faster initial progress, but weight normalization reached comparable final accuracy (approximately 8.5% test error) at a lower computational cost. Combining weight normalization with mean-only batch normalization brought the test error down to 7.31%, outperforming the batch normalization baseline (8.05%) and setting the state of the art for CIFAR-10 without data augmentation at the time; a sketch of this combination follows.
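A hedged sketch of that combination for a fully connected layer (the function name `wn_mean_only_bn` and the argument shapes are our illustrative assumptions; a real implementation would also track running means for use at test time). Each output unit is computed with a weight-normalized weight vector, the minibatch mean of the pre-activations is subtracted without dividing by the variance, and the bias is added after centering.

```python
import numpy as np

def wn_mean_only_bn(X, V, g, b):
    """Weight-normalized linear layer followed by mean-only batch normalization.

    X : (batch, d_in) minibatch, V : (d_in, d_out) direction vectors,
    g : (d_out,) scales, b : (d_out,) biases.
    """
    W = g * V / np.linalg.norm(V, axis=0)            # effective weights, column norms equal g
    t = X @ W                                        # weight-normalized pre-activations
    t_centered = t - t.mean(axis=0, keepdims=True)   # subtract minibatch mean only (no variance division)
    return t_centered + b                            # add bias after centering

# Example: with zero bias, each output unit has zero minibatch mean after centering.
rng = np.random.default_rng(2)
X, V = rng.normal(size=(8, 3)), rng.normal(size=(3, 2))
out = wn_mean_only_bn(X, V, g=np.ones(2), b=np.zeros(2))
print(np.allclose(out.mean(axis=0), 0.0))  # True
```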
Generative Modelling: Convolutional VAE and DRAW
The paper also evaluated weight normalization with convolutional VAEs on MNIST and CIFAR-10 datasets and the DRAW model for MNIST. Both models exhibited more stable training and better optimization convergence with weight normalization. In particular, the DRAW model showed significant convergence speedup without modification to initialization or learning rate, showcasing the practical applicability of weight normalization even in complex, recurrent network structures.
Reinforcement Learning: DQN
For reinforcement learning, applying weight normalization to the DQN framework improved performance on Atari games such as Space Invaders and Enduro. Whereas the noise introduced by batch normalization can destabilize training in this setting, weight normalization consistently improved learning efficiency and stability.
Implications and Future Directions
The implications of weight normalization are substantial for both theoretical and practical advancements. It provides a simple yet powerful tool for improving network optimization across diverse applications, including cases where traditional methods like batch normalization are less effective.
Future research might explore further refinement of initialization methods tailored for weight normalization, as well as extensions to more complex network architectures and training paradigms. Given its minimal computational overhead and easy integration, weight normalization could become a valuable standard in developing future deep learning models.
Conclusion
The introduction of weight normalization represents a significant step forward in the optimization of deep neural networks. By effectively reparameterizing weight vectors, this method enhances convergence speeds and overall performance across supervised, generative, and reinforcement learning tasks. As neural network architectures continue to evolve, weight normalization stands out as a robust and versatile technique for advancing the field of deep learning.