A Convergence Theory for Deep Learning via Over-Parameterization (1811.03962v5)

Published 9 Nov 2018 in cs.LG, cs.DS, cs.NE, math.OC, and stat.ML

Abstract: Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).



Summary

The paper establishes that gradient descent reaches an ε-error global minimum of the training objective of an over-parameterized network in polynomial time, assuming only non-degenerate inputs and sufficient width.
It shows that, near the random initialization, the objective is almost convex and semi-smooth, and that both GD and SGD exploit these two properties to make progress despite the non-convex landscape.
The study establishes an equivalence with the neural tangent kernel (NTK) at finite, polynomially large widths and extends the results to architectures such as CNNs and ResNets, giving theoretical support for over-parameterization.

Analyzing the Convergence Theory for Deep Learning Through Over-Parameterization

The paper, "A Convergence Theory for Deep Learning via Over-Parameterization," addresses a fundamental question in the optimization of deep neural networks (DNNs): why first-order methods such as gradient descent (GD) and stochastic gradient descent (SGD) find global minima of the training objective efficiently. The question matters because DNNs succeed empirically even though their optimization landscapes are non-convex. The authors give theoretical guarantees under two assumptions, non-degenerate input data and sufficiently large network width, thereby formalizing the widely held heuristic that over-parameterization makes training easier.

Key Contributions and Formal Results

The authors prove a series of theorems showing that, when the network width $m$ is polynomial in the number of layers $L$ and the number of samples $n$, both GD and SGD find an $\epsilon$-error global minimum in polynomial time. The main results are:

Gradient Descent Convergence: For randomly initialized networks, GD with an appropriately chosen learning rate finds an $\epsilon$-error solution for the $\ell_2$ regression loss in $O(\text{poly}(n, L) \cdot \log(1/\epsilon))$ iterations, i.e., at a linear convergence rate.
Stochastic Gradient Descent Convergence: SGD with a suitable mini-batch size and learning rate achieves a comparable guarantee with high probability, again with an iteration count polynomial in $n$ and $L$ and logarithmic in $1/\epsilon$.
Generalization to Loss Functions and Architectures: The results extend beyond the $\ell_2$ regression loss to general Lipschitz-smooth loss functions, and to other architectures, including convolutional neural networks (CNNs) and residual neural networks (ResNets).
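As a rough illustration of the GD result, the following sketch trains a wide one-hidden-layer ReLU network on random data with full-batch gradient descent. Everything specific here is an assumption made for the example, not the paper's construction: the single hidden layer, the $1/\sqrt{m}$ output scaling with fixed random signs, the synthetic data, and the values of `m`, `lr`, and `steps`.

```python
import numpy as np

def loss_and_grad(W, a, X, y):
    """Squared loss 0.5 * sum_i (f(x_i) - y_i)^2 and its gradient w.r.t. W.

    Toy model (an assumption of this sketch): f(x) = a . relu(W x) / sqrt(m),
    with the output signs `a` held fixed and only W trained.
    """
    m = W.shape[0]
    pre = X @ W.T                                  # (n, m) pre-activations
    resid = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y
    gate = (pre > 0).astype(X.dtype)               # ReLU activation pattern
    grad = ((resid[:, None] * gate) * a[None, :]).T @ X / np.sqrt(m)
    return 0.5 * np.sum(resid ** 2), grad

def run_gd(n=20, d=10, m=2000, lr=0.2, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
    y = rng.standard_normal(n)                     # arbitrary regression targets
    W = rng.standard_normal((m, d))                # random Gaussian initialization
    a = rng.choice([-1.0, 1.0], size=m)
    losses = []
    for _ in range(steps):
        loss, grad = loss_and_grad(W, a, X, y)
        losses.append(loss)
        W -= lr * grad                             # plain full-batch GD step
    return np.array(losses)

losses = run_gd()
print(f"initial loss {losses[0]:.3e}, final loss {losses[-1]:.3e}")
```

Plotting `losses` on a logarithmic axis is one way to check whether the decay looks roughly geometric, which is what a linear convergence rate would predict for this toy setup.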

Analytical Framework

Almost Convexity and Semi-Smoothness

The foundation of this study lies in the characterization of the optimization landscape near the random initialization. The authors derive two crucial properties:

Almost Convexity: Within a large neighborhood of the initialization, the squared gradient norm of the objective is bounded below by a multiple of the objective value itself: $\|\nabla F(W)\|_F^2 \geq \Omega(F(W))$, up to factors that depend polynomially on the problem parameters. Any point in this region with a vanishing gradient therefore has a vanishing objective value, which rules out spurious local minima and saddle points there.
Semi-Smoothness: The objective function $F(W)$ satisfies a condition slightly weaker than Lipschitz smoothness. Specifically, for weights $W$ and a perturbation $\Delta$, $F(W + \Delta)$ is upper-bounded by the first-order expansion $F(W) + \langle \nabla F(W), \Delta \rangle$ plus error terms that are linear and quadratic in $\|\Delta\|$. When these error terms are small, the first-order Taylor expansion reliably predicts the decrease in $F(W)$ at each training step.
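To see how these two properties can combine into a linear convergence rate, consider the following simplified calculation for one GD step $W \mapsto W - \eta \nabla F(W)$. The constants $\mu$, $G$, $a$, $b$ are hypothetical placeholders, a single norm is used throughout, and a matching upper bound on the gradient is assumed; the paper's actual statements carry explicit polynomial factors and distinguish Frobenius and spectral norms.

```latex
% Assumptions of this sketch (constants \mu, G, a, b > 0 are placeholders):
%   gradient bounds:  \mu F(W) \le \|\nabla F(W)\|^2 \le G\, F(W)
%   semi-smoothness:  F(W+\Delta) \le F(W) + \langle \nabla F(W), \Delta \rangle
%                       + a \sqrt{F(W)}\, \|\Delta\| + b\, \|\Delta\|^2
\begin{align*}
F\bigl(W - \eta \nabla F(W)\bigr)
  &\le F(W) - \eta \|\nabla F(W)\|^2
       + a \eta \sqrt{F(W)}\, \|\nabla F(W)\|
       + b \eta^2 \|\nabla F(W)\|^2 \\
  &\le \bigl(1 - \eta \mu + a \eta \sqrt{G} + b \eta^2 G\bigr)\, F(W) \\
  &\le \bigl(1 - \tfrac{\eta \mu}{4}\bigr)\, F(W)
  \qquad \text{if } a \sqrt{G} \le \tfrac{\mu}{2}
  \text{ and } \eta \le \tfrac{\mu}{4 b G}.
\end{align*}
```

Iterating this bound gives $F(W_T) \le (1 - \eta\mu/4)^T F(W_0)$, so on the order of $\log(1/\epsilon) / (\eta \mu)$ steps reduce the objective by a factor $\epsilon$, provided the iterates stay inside the neighborhood where both properties hold.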

Neural Tangent Kernel (NTK) Equivalence

The equivalence to the Neural Tangent Kernel (NTK) is another cornerstone of this work. NTK theory posits that, for over-parameterized networks, the optimization dynamics are well approximated by a linear model obtained from the first-order Taylor expansion of the network around its initialization. The authors strengthen this connection by showing that the equivalence holds not only in the infinite-width limit but also at finite, polynomially large widths.
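The linearization idea can be probed numerically. The sketch below compares a toy one-hidden-layer ReLU network with its first-order Taylor expansion around the initialization after a single normalized gradient step, for several widths. The model, the synthetic data, the step `radius`, and the chosen `widths` are all assumptions of this example rather than anything taken from the paper.

```python
import numpy as np

def outputs(W, a, X):
    # Toy network (an assumption): f(x) = a . relu(W x) / sqrt(m)
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])

def linearized_outputs(W, W0, a, X):
    # First-order Taylor expansion of f around W0. Because ReLU is piecewise
    # linear, this equals the network with its activation pattern frozen at W0.
    gate0 = (X @ W0.T > 0).astype(X.dtype)
    return (gate0 * (X @ W.T)) @ a / np.sqrt(W0.shape[0])

def linearization_gap(m, n=20, d=10, radius=1.0, seed=0):
    """Relative error of the linear model after a gradient step of norm `radius`."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.standard_normal(n)
    W0 = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    # Gradient of 0.5 * ||f(X) - y||^2 with respect to W at the initialization.
    pre = X @ W0.T
    resid = outputs(W0, a, X) - y
    grad = ((resid[:, None] * (pre > 0)) * a[None, :]).T @ X / np.sqrt(m)
    W = W0 - radius * grad / np.linalg.norm(grad)  # step of fixed Frobenius norm
    moved = np.linalg.norm(outputs(W, a, X) - outputs(W0, a, X))
    gap = np.linalg.norm(outputs(W, a, X) - linearized_outputs(W, W0, a, X))
    return gap / moved

widths = [64, 1024, 16384]
rel = [linearization_gap(m) for m in widths]
for m, r in zip(widths, rel):
    print(f"m = {m:6d}  relative linearization error = {r:.2e}")
```

If the linear approximation improves with width, as the NTK picture suggests, the printed relative error should shrink as `m` grows.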

Numerical Verification and Empirical Observations

The theory is corroborated by empirical measurements of gradient norms and objective values during the training of various network architectures on standard datasets such as CIFAR-10 and CIFAR-100. The measurements indicate that moving in the gradient direction is enough to decrease the objective significantly, which is consistent with the landscape properties that the theory predicts.
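A much smaller version of this kind of measurement can be sketched on synthetic data. The code below runs mini-batch SGD on a toy one-hidden-layer ReLU network and records the full-batch objective $F(W)$ together with the squared gradient norm at every step, so that their ratio can be inspected. It does not reproduce the paper's CIFAR experiments; the model, the data, the rescaling of the mini-batch gradient by `n / batch`, and all hyperparameters are assumptions of this example.

```python
import numpy as np

def loss_and_grad(W, a, X, y):
    # Toy model (an assumption): f(x) = a . relu(W x) / sqrt(m), squared loss.
    m = W.shape[0]
    pre = X @ W.T
    resid = np.maximum(pre, 0.0) @ a / np.sqrt(m) - y
    grad = ((resid[:, None] * (pre > 0)) * a[None, :]).T @ X / np.sqrt(m)
    return 0.5 * np.sum(resid ** 2), grad

def track_sgd(n=20, d=10, m=2000, batch=10, lr=0.05, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.standard_normal(n)
    W = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)
    history = []
    for _ in range(steps):
        # Record full-batch objective and squared gradient norm before updating.
        loss, grad = loss_and_grad(W, a, X, y)
        history.append((loss, np.sum(grad ** 2)))
        # Mini-batch step, rescaled so it estimates the full-batch gradient.
        idx = rng.choice(n, size=batch, replace=False)
        _, batch_grad = loss_and_grad(W, a, X[idx], y[idx])
        W -= lr * (n / batch) * batch_grad
    return np.array(history)

history = track_sgd()
ratio = history[:, 1] / history[:, 0]   # ||grad F||^2 / F at each step
print(f"loss: {history[0, 0]:.3e} -> {history[-1, 0]:.3e}")
print(f"ratio range: [{ratio.min():.3e}, {ratio.max():.3e}]")
```

A ratio that stays bounded away from zero along the trajectory is what a gradient lower bound of the form $\|\nabla F(W)\|^2 \geq \Omega(F(W))$ would lead one to expect in this toy setting.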

Implications and Future Directions

The implications of this study are manifold:

Theoretical Justification for Over-Parameterization: The results lend rigorous support to the empirical practice of using very wide networks to facilitate training.
Extension to Structured Data: While the current results assume non-degenerate inputs, extending these guarantees to structured or correlated data distributions remains an open question.
Further Generalization: There is potential future work in extending these results to more complex architectures and other types of loss functions, further bridging the gap between theory and practice in deep learning optimization.

In conclusion, this paper substantially advances the theoretical understanding of how over-parameterized DNNs converge during training. By showing that the optimization landscape is almost convex and semi-smooth near initialization, and by proving an equivalence to the NTK at finite, polynomially large widths, the authors give a rigorous account of why first-order methods succeed in training such networks.