Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

Published 26 Sep 2019 in cs.LG, math.OC, and stat.ML | (1909.12292v4)

Abstract: Recent theoretical work has guaranteed that overparameterized networks trained by gradient descent achieve arbitrarily low training error, and sometimes even low test error. The required width, however, is always polynomial in at least one of the sample size $n$, the (inverse) target error $1/\epsilon$, and the (inverse) failure probability $1/\delta$. This work shows that $\widetilde{\Theta}(1/\epsilon)$ iterations of gradient descent with $\widetilde{\Omega}(1/\epsilon^2)$ training examples on two-layer ReLU networks of any width exceeding $\mathrm{polylog}(n,1/\epsilon,1/\delta)$ suffice to achieve a test misclassification error of $\epsilon$. We also prove that stochastic gradient descent can achieve $\epsilon$ test error with polylogarithmic width and $\widetilde{\Theta}(1/\epsilon)$ samples. The analysis relies upon the separation margin of the limiting kernel, which is guaranteed positive, can distinguish between true labels and random labels, and can give a tight sample-complexity analysis in the infinite-width setting.

Citations (172)

Summary

The paper shows that gradient descent on a two-layer ReLU network can achieve arbitrarily small test misclassification error with a width that is only polylogarithmic in the sample size $n$, the inverse target error $1/\epsilon$, and the inverse failure probability $1/\delta$, rather than polynomial as in prior analyses.
It establishes that, up to logarithmic factors, $1/\epsilon^2$ training examples and $1/\epsilon$ iterations suffice for gradient descent to reach test error $\epsilon$.
The work uses martingale techniques to extend the analysis to stochastic gradient descent, showing that polylogarithmic width also suffices in the online setting with roughly $1/\epsilon$ samples.

Polylogarithmic Width in Gradient Descent on Shallow ReLU Networks

The paper "Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks" by Ziwei Ji and Matus Telgarsky studies how wide a shallow ReLU network must be for gradient descent to both optimize and generalize. It asks what width, sample size, and number of iterations suffice to reach an arbitrarily low test error.

Summary of the Paper

Prior analyses showed that overparameterized networks trained by gradient descent can attain low training error, and sometimes low test error, but only when the network width is polynomial in at least one of the sample size $n$, the inverse target error $1/\epsilon$, or the inverse failure probability $1/\delta$. Ji and Telgarsky prove that polylogarithmic width suffices for gradient descent to reach a test misclassification error of $\epsilon$ on two-layer ReLU networks, a substantial reduction from the width requirements of earlier work.
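The setting described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the common NTK-style parameterization (Gaussian hidden weights, fixed random sign outer weights, a $1/\sqrt{m}$ output scaling) and the logistic loss, and the toy data, the function names (`make_toy_data`, `predict`, `risk_and_grad`, `train_gd`), the width, the step size, and the iteration count are all hypothetical choices made for the example.

```python
import numpy as np


def make_toy_data(n, d, margin=0.2, seed=0):
    """Hypothetical toy set: unit-norm points labeled by the sign of x[0]."""
    rng = np.random.default_rng(seed)
    rows = []
    while len(rows) < n:
        x = rng.standard_normal(d)
        x /= np.linalg.norm(x)
        if abs(x[0]) >= margin:  # keep only points away from the boundary
            rows.append(x)
    X = np.stack(rows)
    return X, np.sign(X[:, 0])


def predict(W, a, X):
    """f(x) = (1/sqrt(m)) * sum_s a_s * relu(<w_s, x>)."""
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(W.shape[0])


def risk_and_grad(W, a, X, y):
    """Empirical logistic risk and its gradient with respect to W only."""
    n, m = X.shape[0], W.shape[0]
    pre = X @ W.T                                 # (n, m) pre-activations
    margins = y * (np.maximum(pre, 0.0) @ a) / np.sqrt(m)
    risk = np.mean(np.logaddexp(0.0, -margins))   # mean of ln(1 + e^{-z})
    dloss = -np.exp(-np.logaddexp(0.0, margins))  # -1 / (1 + e^{z}), stable
    coef = (dloss * y)[:, None] * (pre > 0) * a[None, :] / np.sqrt(m)
    return risk, coef.T @ X / n                   # gradient has shape (m, d)


def train_gd(X, y, width=64, steps=200, lr=1.0, seed=0):
    """Full-batch gradient descent on the hidden layer (assumed setup)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((width, X.shape[1]))  # trained hidden weights
    a = rng.choice([-1.0, 1.0], size=width)       # fixed outer signs
    history = []
    for _ in range(steps):
        risk, grad = risk_and_grad(W, a, X, y)
        history.append(risk)                      # risk before this step
        W -= lr * grad
    return W, a, history
```

Calling `train_gd(*make_toy_data(200, 5))` returns the trained weights, the fixed outer signs, and the per-iteration empirical risk; comparing `history[0]` with `history[-1]` shows how far the risk fell, and `predict` can be applied to fresh samples to estimate test error at a given width.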

Key Contributions

Reduced Width Requirement: The authors demonstrate that with gradient descent, a two-layer ReLU network can achieve classification error $\epsilon$ with a width that is polylogarithmic in $n$, $1/\delta$, and $1/\epsilon$. This is a substantial reduction from prior polynomial width dependencies. The bounds depend on the separation margin of the limiting kernel rather than on the sample size alone.
Test Error Analysis: The work proves that gradient descent can achieve test error $\epsilon$ using, up to logarithmic factors, $1/\epsilon^2$ samples and $1/\epsilon$ iterations. Stochastic gradient descent reaches the same test error with roughly $1/\epsilon$ samples, also at polylogarithmic width.
Empirical Risk Minimization: For gradient descent on the empirical risk, the analysis shows that optimization and generalization both go through in the polylogarithmic-width regime, with roughly $1/\epsilon$ iterations.
Martingale Techniques: The paper employs martingale concentration arguments to extend the analysis to stochastic gradient descent, proving that polylogarithmic width suffices for online learning with roughly $1/\epsilon$ samples.
Separation Margin: The paper studies the separation margin of the limiting kernel, which captures the difficulty of a classification problem. According to the abstract, this margin is guaranteed positive, distinguishes true labels from random labels, and yields a tight sample-complexity analysis in the infinite-width setting.
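One way to formalize such a margin for the infinite-width limit of a two-layer ReLU network is sketched below. The notation is not defined in this summary and is assumed here: $\mu_N$ denotes the standard Gaussian measure on $\mathbb{R}^d$, $(x_i, y_i)$ are the training examples with $y_i \in \{-1, +1\}$, and $\bar v$ is a hypothetical weight map assigning a direction to each possible hidden weight $z$. The paper should be consulted for the exact statement of its assumption.

$$
\exists\, \gamma > 0,\ \bar v : \mathbb{R}^d \to \mathbb{R}^d \ \text{with}\ \|\bar v(z)\|_2 \le 1 \ \text{for all } z,
\quad \text{such that} \quad
y_i \int \big\langle \bar v(z),\, x_i \big\rangle \, \mathbb{1}\big[\langle z, x_i \rangle > 0\big] \, d\mu_N(z) \;\ge\; \gamma
\quad \text{for all } 1 \le i \le n .
$$

Read this way, $\gamma$ is the margin achieved by a bounded linear predictor over the infinite-width features $x \mapsto x \, \mathbb{1}[\langle z, x \rangle > 0]$: a larger $\gamma$ means an easier problem, and the width, sample, and iteration requirements are then stated in terms of $1/\gamma$.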

Implications and Future Directions

The paper has implications for both theory and practice. On the theoretical side, it gives a refined understanding of gradient descent on shallow neural networks, considerably lowering the width that the analysis requires to reach a desired test error. On the practical side, it suggests that moderately sized networks may already be covered by theoretical guarantees, narrowing the gap between empirical success and theoretical backing.

Future work could extend these results to deeper networks, convolutional architectures, and activation functions other than ReLU. Investigating the separation margin on real-world datasets could also show how well the bounds describe non-synthetic data. It remains open whether similar polylogarithmic width guarantees hold for other tasks, such as regression.

In summary, Ji and Telgarsky's paper proves that the large width requirements of earlier analyses can be reduced to polylogarithmic ones, addressing an important discrepancy between theory and empirical evidence in deep learning.