On Learning Over-parameterized Neural Networks: A Functional Approximation Perspective (1905.10826v2)

Published 26 May 2019 in cs.LG and stat.ML

Abstract: We consider training over-parameterized two-layer neural networks with Rectified Linear Unit (ReLU) activations using the gradient descent (GD) method. Inspired by a recent line of work, we study the evolution of network prediction errors across GD iterations, which can be neatly described in matrix form. When the network is sufficiently over-parameterized, these matrices individually approximate an integral operator which is determined by the feature vector distribution $\rho$ only. Consequently, the GD method can be viewed as approximately applying the powers of this integral operator to the underlying/target function $f^*$ that generates the responses/labels. We show that if $f^*$ admits a low-rank approximation with respect to the eigenspaces of this integral operator, then the empirical risk decreases to this low-rank approximation error at a linear rate which is determined by $f^*$ and $\rho$ only, i.e., the rate is independent of the sample size $n$. Furthermore, if $f^*$ has zero low-rank approximation error, then, as long as the width of the neural network is $\Omega(n\log n)$, the empirical risk decreases to $\Theta(1/\sqrt{n})$. To the best of our knowledge, this is the first result showing the sufficiency of nearly-linear network over-parameterization. We provide an application of our general results to the setting where $\rho$ is the uniform distribution on the sphere and $f^*$ is a polynomial. Throughout this paper, we consider the scenario where the input dimension $d$ is fixed.
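
To make the setup concrete, here is a minimal NumPy sketch (not the authors' code) of the regime the abstract describes: plain gradient descent on a two-layer ReLU network of width $m$ on the order of $n\log n$, with features drawn uniformly from the unit sphere and a polynomial target $f^*$. The specific target, the step size, and the convention of fixing random output signs while training only the hidden-layer weights are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch: GD on an over-parameterized two-layer ReLU network
#   f(x) = (1/sqrt(m)) * sum_r a_r * relu(<w_r, x>),
# with output signs a_r fixed and only the hidden weights W trained.
# Target f*, width m ~ n log n, and step size are assumed for illustration.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 200                        # fixed input dimension, sample size
m = int(n * np.log(n))               # width on the order of n log n

# Features drawn uniformly from the unit sphere (normalized Gaussians)
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Illustrative polynomial target f*
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2

W = rng.standard_normal((m, d))      # trainable hidden-layer weights
a = rng.choice([-1.0, 1.0], size=m)  # fixed random output signs

lr = 1.0
for t in range(501):
    pre = X @ W.T                                # pre-activations, shape (n, m)
    u = np.maximum(pre, 0.0) @ a / np.sqrt(m)    # network predictions
    err = u - y                                  # prediction errors at iteration t
    # Gradient of the empirical risk (1/2n) * sum_i err_i^2 w.r.t. W
    active = (pre > 0).astype(float)             # ReLU derivative
    grad = (active * (err[:, None] * a[None, :])).T @ X / (n * np.sqrt(m))
    W -= lr * grad
    if t % 100 == 0:
        print(f"iter {t:4d}  empirical risk {0.5 * np.mean(err ** 2):.6f}")
```

In this over-parameterized regime the random-feature kernel driving the error recursion stays nearly fixed during training and concentrates around an integral operator determined by $\rho$ alone, which is why the abstract can read GD as approximately applying powers of that operator to $f^*$.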

Citations (52)

