
Linearized two-layers neural networks in high dimension (1904.12191v3)

Published 27 Apr 2019 in math.ST, cs.LG, and stat.TH

Abstract: We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: the random features model of Rahimi-Recht (RF); the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n=\infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime in which $N=\infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + \delta} \le N \le d^{\ell+1-\delta}$ for small $\delta > 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples is $d^{\ell + \delta} \le n \le d^{\ell+1-\delta}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.

Authors (4)
  1. Behrooz Ghorbani (18 papers)
  2. Song Mei (56 papers)
  3. Theodor Misiakiewicz (24 papers)
  4. Andrea Montanari (165 papers)
Citations (233)

Summary

  • The paper shows that linearized RF and NTK models correspond to fitting degree-ℓ and degree-(ℓ+1) polynomials, respectively, under precise neuron count conditions.
  • It examines two distinct regimes—approximation-limited and sample size-limited—demonstrating how parameters like neuron number and sample size govern function complexity.
  • The analysis highlights limitations of kernel methods in high dimensions, offering key insights into model expressivity and the practical design of neural network architectures.

Linearized Two-Layers Neural Networks in High Dimension

The paper under consideration explores the behavior of two classes of models for learning an unknown function over the $d$-dimensional sphere: the random features (RF) model and the neural tangent kernel (NTK) model. These models provide linearized approximations of two-layer neural networks with significant implications for our understanding of high-dimensional function approximation.
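
Concretely, writing $w_1,\dots,w_N$ for first-layer weights drawn at random and kept fixed, the two linearized classes take the following form (a paraphrase of the paper's setup, up to its normalization conventions):

$$\mathcal{F}_{\mathrm{RF}} = \Big\{ f(x) = \sum_{j=1}^{N} a_j\, \sigma(\langle w_j, x \rangle) \,:\, a_j \in \mathbb{R} \Big\}, \qquad \mathcal{F}_{\mathrm{NT}} = \Big\{ f(x) = \sum_{j=1}^{N} \langle a_j, x \rangle\, \sigma'(\langle w_j, x \rangle) \,:\, a_j \in \mathbb{R}^d \Big\}.$$

RF corresponds to training only the second-layer coefficients, while NT arises from a first-order Taylor expansion of the network in its first-layer weights around the random initialization, which is why each neuron contributes $d$ trainable directions instead of one.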

Overview

The authors analyze two learning regimes characterized by overparametrization: the approximation-limited regime, where $n=\infty$ and the number of neurons $N$ and dimension $d$ are large but finite; and the sample size-limited regime, where $N=\infty$ and both $n$ and $d$ are large but finite. They show that, in these regimes, the RF and NTK models effectively approximate low-degree polynomials of the input features.
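
For a concrete sense of these scalings (the numbers below are an illustrative instance chosen here, not taken from the paper), take $d = 100$ and $\ell = 1$; for small $\delta$, the approximation-limited window reads

$$d^{\ell+\delta} \le N \le d^{\ell+1-\delta} \quad\Longrightarrow\quad 10^{2} \lesssim N \lesssim 10^{4},$$

so with a few hundred to a few thousand neurons, RF behaves like linear regression on the raw features, while NT can additionally capture the quadratic part of $f_\star$.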

Key Results

  1. Approximation Error in RF Models:
    • For $d^{\ell + \delta} \le N \le d^{\ell+1-\delta}$ with $\delta > 0$, the RF model is equivalent to fitting a polynomial of degree $\ell$ in the raw features.
    • This implies that the choice of the number of neurons $N$ is pivotal in determining the complexity of the functions the model can approximate effectively.
  2. Approximation Error in NTK Models:
    • Under the same conditions as for RF ($d^{\ell + \delta} \le N \le d^{\ell+1-\delta}$), the NTK model fits a degree-$(\ell+1)$ polynomial.
    • This highlights its greater expressivity than RF at the same neuron count: the NT linearization Taylor-expands the network in its first-layer weights, so each neuron contributes $d$ trainable directions rather than a single coefficient.
  3. Generalization Error of Kernel Methods:
    • Under the condition $d^{\ell + \delta} \le n \le d^{\ell+1-\delta}$, kernel methods with rotationally invariant kernels can fit at most a degree-$\ell$ polynomial in the raw features; Kernel Ridge Regression (KRR) attains this bound, with optimal prediction error at vanishing ridge regularization.
    • Importantly, the complexity of the functions that can be learned is governed by how the sample size $n$ compares with powers of $d$, illuminating why kernel methods may fail to generalize in intrinsically high-dimensional problems (a toy numerical sketch of this comparison follows the list).
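
The following is a minimal, hypothetical sketch (not code from the paper) of how one might probe result 3 numerically: kernel ridge regression with a rotationally invariant kernel $K(x, x') = h(\langle x, x' \rangle / d)$ on inputs of norm $\sqrt{d}$, with $n \approx d^2$ samples, compared on a degree-2 versus a degree-3 polynomial target. The kernel choice $h(t) = e^{t}$ and all constants are assumptions made for illustration; at this small scale the asymptotic prediction is only suggestive.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test, lam = 30, 900, 2000, 1e-6   # n = d^2: inside the "degree-2" sample-size window

def sphere(m, d):
    """m points uniform on the sphere of radius sqrt(d) in R^d."""
    x = rng.standard_normal((m, d))
    return np.sqrt(d) * x / np.linalg.norm(x, axis=1, keepdims=True)

def krr_relative_error(f_star):
    """Fit KRR with the rotationally invariant kernel K(x, x') = exp(<x, x'>/d)."""
    X, Xt = sphere(n, d), sphere(n_test, d)
    y = f_star(X)
    K = np.exp(X @ X.T / d)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    pred = np.exp(Xt @ X.T / d) @ alpha
    resid = f_star(Xt) - pred
    return np.mean(resid ** 2) / np.mean(f_star(Xt) ** 2)

deg2 = lambda X: X[:, 0] * X[:, 1]             # degree-2 polynomial target
deg3 = lambda X: X[:, 0] * X[:, 1] * X[:, 2]   # degree-3 polynomial target
print("relative test error, degree-2 target:", krr_relative_error(deg2))
print("relative test error, degree-3 target:", krr_relative_error(deg3))
```

The same harness could be reused with explicit RF or NT features in place of the kernel to probe the approximation-limited results.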

Practical and Theoretical Implications

  1. On the Performance of Linearizations:
    • While neural networks boast universal approximation capabilities in theory, their linearizations (RF and NTK) are bound to polynomial approximations whose degree is determined by $N$.
    • This delineation prompts critical evaluation of these methods in practice, as data modalities and neuron counts heavily dictate achievable performance.
  2. Limitations and Capabilities:
    • RF and NTK each dominate in different regimes, yet both remain constrained to low-degree polynomial approximation and thus struggle with non-polynomial targets, a fundamental limitation for tasks requiring nuanced, high-complexity function classes.
  3. Kernel Methods:
    • Kernel methods face clear constraints in high-dimensional problems, elucidating why traditional non-parametric approaches may struggle in modern machine learning contexts where data dimensionality is intrinsically high.

Future Directions

The results presented open several avenues for further exploration. Prominent among them is understanding the practical computation of these linear models in high dimensions while maintaining efficiency. Moreover, investigating the integration of neural architectures beyond simple two-layer constructs into NTK frameworks could yield insights into overcoming the polynomial limitation observed.

The work brings mathematical rigor to the study of linearized approximations and their limits, fostering a nuanced understanding of neural network behavior in the over-parameterized regime. As models evolve, these findings will remain a reference point for scrutinizing the theoretical underpinnings of AI models in theory and application.