Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks (1811.08888v3)

Published 21 Nov 2018 in cs.LG, cs.AI, math.OC, and stat.ML

Abstract: We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centering around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our theoretical results shed light on understanding the optimization for deep learning, and pave the way for studying the optimization dynamics of training modern deep neural networks.

Citations (442)

View on Semantic Scholar

Summary

The paper demonstrates that with Gaussian initialization, both GD and SGD converge to global minima in deep ReLU networks.
It reveals that the required number of hidden nodes per layer grows polynomially with training data, supporting effective over-parameterization.
The study highlights that favorable local curvature after initialization ensures zero training error within polynomially bounded iterations.

Overview of Stochastic Gradient Descent and Deep ReLU Networks

This paper investigates the theoretical aspects of training over-parameterized deep ReLU neural networks using gradient-based methods, specifically Gradient Descent (GD) and Stochastic Gradient Descent (SGD). The authors focus on binary classification and demonstrate that these algorithms can achieve global minima for the training loss, underlining the efficacy of over-parameterization and random initialization.

Key Findings

The study presents several groundbreaking insights into the dynamics and convergence properties of GD and SGD:

Initialization and Convergence:
- Utilizing Gaussian random initialization ensures the networks begin optimally in a favorable region of the parameter space.
- Both GD and SGD can find global minima with appropriate initialization and over-parameterization. The results cover a broad range of loss functions, moving beyond the traditional settings of least squares and cross-entropy.
Over-parameterization:
- The theory confirms that the number of hidden nodes needed per layer is polynomial in terms of training data size and separation margin. This supports the empirical observations that larger networks enhance convergence without explicit regularization.
Curvature Properties:
- The paper elucidates that the empirical loss function exhibits favorable local curvature properties within a small perturbation region after initialization. This leads to the global convergence of GD and SGD.
Quantitative Results:
- With hidden node numbers at $\tilde \Omega(\text{poly}(n,\phi^{-1},L))$ , zero training error can be achieved.
- The number of iterations required is also polynomially bounded relative to data characteristics.

Implications

Practical Implications

The results provide a rigorous foundation for the practical success of deep ReLU networks. Understanding that GD and SGD can achieve global minima has significant implications for neural architecture design and initialization strategies.

Network Design: The insights encourage designing networks with redundancy (over-parameterization) to exploit the dynamics of SGD efficiently.
Initialization Techniques: Reinforces the importance of random (Gaussian) initialization in practical applications, aligning with theoretical guarantees.

Theoretical Implications

From a theoretical perspective, the paper paves the way for more nuanced explorations of optimization dynamics in deep networks. It challenges future work to refine the dependence on network depth $L$ and address the trade-offs between network size and convergence speed.

Future Directions

Given the recursive and expansive nature of AI research, several future paths can be envisioned:

Refinement of Depth Dependence: Further work could aim to reduce the current polynomial dependence on depth to improve practical tractability.
Extension to Other Architectures: Investigating whether similar theoretical guarantees hold for architectures beyond fully-connected ReLU networks, such as convolutional or recurrent networks, is a promising area.
Dynamic Adaptation: Research could explore adaptive versions of SGD that dynamically adjust learning rates based on the curvature properties.

In summary, this paper makes a substantial contribution to both the understanding of deep learning optimization and practical efficacy by bridging theory and empirical successes. It provides a robust framework for analyzing and improving the convergence of deep neural networks using traditional gradient-based methods.