
Linearized two-layers neural networks in high dimension (1904.12191v3)

Published 27 Apr 2019 in math.ST, cs.LG, and stat.TH

Abstract: We consider the problem of learning an unknown function $f_{\star}$ on the $d$-dimensional sphere with respect to the square loss, given i.i.d. samples $\{(y_i,{\boldsymbol x}_i)\}_{i\le n}$ where ${\boldsymbol x}_i$ is a feature vector uniformly distributed on the sphere and $y_i=f_{\star}({\boldsymbol x}_i)+\varepsilon_i$. We study two popular classes of models that can be regarded as linearizations of two-layers neural networks around a random initialization: the random features model of Rahimi-Recht (RF); the neural tangent kernel model of Jacot-Gabriel-Hongler (NT). Both these approaches can also be regarded as randomized approximations of kernel ridge regression (with respect to different kernels), and enjoy universal approximation properties when the number of neurons $N$ diverges, for a fixed dimension $d$. We consider two specific regimes: the approximation-limited regime, in which $n=\infty$ while $d$ and $N$ are large but finite; and the sample size-limited regime in which $N=\infty$ while $d$ and $n$ are large but finite. In the first regime we prove that if $d^{\ell + \delta} \le N \le d^{\ell+1-\delta}$ for small $\delta > 0$, then RF effectively fits a degree-$\ell$ polynomial in the raw features, and NT fits a degree-$(\ell+1)$ polynomial. In the second regime, both RF and NT reduce to kernel methods with rotationally invariant kernels. We prove that, if the number of samples is $d^{\ell + \delta} \le n \le d^{\ell+1-\delta}$, then kernel methods can fit at most a degree-$\ell$ polynomial in the raw features. This lower bound is achieved by kernel ridge regression. Optimal prediction error is achieved for vanishing ridge regularization.

Authors (4)
  1. Behrooz Ghorbani (18 papers)
  2. Song Mei (56 papers)
  3. Theodor Misiakiewicz (24 papers)
  4. Andrea Montanari (165 papers)
Citations (233)

Summary

  • The paper shows that linearized RF and NTK models correspond to fitting degree-ℓ and degree-(ℓ+1) polynomials, respectively, under precise neuron count conditions.
  • It examines two distinct regimes—approximation-limited and sample size-limited—demonstrating how parameters like neuron number and sample size govern function complexity.
  • The analysis highlights limitations of kernel methods in high dimensions, offering key insights into model expressivity and the practical design of neural network architectures.

Linearized Two-Layers Neural Networks in High Dimension

The paper under consideration explores the behavior of two classes of models for learning an unknown function over the $d$-dimensional sphere: the random features (RF) model and the neural tangent kernel (NTK) model. These models provide linearized approximations of two-layer neural networks with significant implications for our understanding of high-dimensional function approximation.
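
Concretely, writing $w_1,\dots,w_N$ for first-layer weights drawn at random and kept fixed, the two linearized classes take the following form (a paraphrase of the paper's setup, up to its normalization conventions):

$$\mathcal{F}_{\mathrm{RF}} = \Big\{ f(x) = \sum_{j=1}^{N} a_j\, \sigma(\langle w_j, x \rangle) \,:\, a_j \in \mathbb{R} \Big\}, \qquad \mathcal{F}_{\mathrm{NT}} = \Big\{ f(x) = \sum_{j=1}^{N} \langle a_j, x \rangle\, \sigma'(\langle w_j, x \rangle) \,:\, a_j \in \mathbb{R}^d \Big\}.$$

RF corresponds to training only the second-layer coefficients, while NT arises from a first-order Taylor expansion of the network in its first-layer weights around the random initialization, which is why each neuron contributes $d$ trainable directions instead of one.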

Overview

The authors analyze two learning regimes characterized by overparametrization: the approximation-limited regime, where $n=\infty$ and the number of neurons $N$ and dimension $d$ are large but finite; and the sample size-limited regime, where $N=\infty$ and both $n$ and $d$ are large but finite. They show that, in these regimes, the RF and NTK models effectively approximate low-degree polynomials of the input features.
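
For a concrete sense of these scalings (the numbers below are an illustrative instance chosen here, not taken from the paper), take $d = 100$ and $\ell = 1$; for small $\delta$, the approximation-limited window reads

$$d^{\ell+\delta} \le N \le d^{\ell+1-\delta} \quad\Longrightarrow\quad 10^{2} \lesssim N \lesssim 10^{4},$$

so with a few hundred to a few thousand neurons, RF behaves like linear regression on the raw features, while NT can additionally capture the quadratic part of $f_\star$.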

Key Results

  1. Approximation Error in RF Models:
    • For $d^{\ell + \delta} \le N \le d^{\ell+1-\delta}$ with $\delta > 0$, the RF model is equivalent to fitting a polynomial of degree $\ell$ in the raw features.
    • This implies that the choice of the number of neurons $N$ is pivotal in determining the complexity of the functions the model can approximate effectively.
  2. Approximation Error in NTK Models:
    • Under the same conditions as for RF ($d^{\ell + \delta} \le N \le d^{\ell+1-\delta}$), the NTK model fits a degree-$(\ell+1)$ polynomial.
    • This highlights its greater expressivity than RF at the same neuron count: the NT linearization Taylor-expands the network in its first-layer weights, so each neuron contributes $d$ trainable directions rather than a single coefficient.
  3. Generalization Error of Kernel Methods:
    • Under the condition $d^{\ell + \delta} \le n \le d^{\ell+1-\delta}$, kernel methods with rotationally invariant kernels can fit at most a degree-$\ell$ polynomial in the raw features; Kernel Ridge Regression (KRR) attains this bound, with optimal prediction error at vanishing ridge regularization.
    • Importantly, the complexity of the functions that can be learned is governed by how the sample size $n$ compares with powers of $d$, illuminating why kernel methods may fail to generalize in intrinsically high-dimensional problems (a toy numerical sketch of this comparison follows the list).
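
The following is a minimal, hypothetical sketch (not code from the paper) of how one might probe result 3 numerically: kernel ridge regression with a rotationally invariant kernel $K(x, x') = h(\langle x, x' \rangle / d)$ on inputs of norm $\sqrt{d}$, with $n \approx d^2$ samples, compared on a degree-2 versus a degree-3 polynomial target. The kernel choice $h(t) = e^{t}$ and all constants are assumptions made for illustration; at this small scale the asymptotic prediction is only suggestive.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test, lam = 30, 900, 2000, 1e-6   # n = d^2: inside the "degree-2" sample-size window

def sphere(m, d):
    """m points uniform on the sphere of radius sqrt(d) in R^d."""
    x = rng.standard_normal((m, d))
    return np.sqrt(d) * x / np.linalg.norm(x, axis=1, keepdims=True)

def krr_relative_error(f_star):
    """Fit KRR with the rotationally invariant kernel K(x, x') = exp(<x, x'>/d)."""
    X, Xt = sphere(n, d), sphere(n_test, d)
    y = f_star(X)
    K = np.exp(X @ X.T / d)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), y)
    pred = np.exp(Xt @ X.T / d) @ alpha
    resid = f_star(Xt) - pred
    return np.mean(resid ** 2) / np.mean(f_star(Xt) ** 2)

deg2 = lambda X: X[:, 0] * X[:, 1]             # degree-2 polynomial target
deg3 = lambda X: X[:, 0] * X[:, 1] * X[:, 2]   # degree-3 polynomial target
print("relative test error, degree-2 target:", krr_relative_error(deg2))
print("relative test error, degree-3 target:", krr_relative_error(deg3))
```

The same harness could be reused with explicit RF or NT features in place of the kernel to probe the approximation-limited results.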

Practical and Theoretical Implications

  1. On the Performance of Linearizations:
    • While neural networks boast universal approximation capabilities in theory, their linearizations (RF and NTK) are bound to polynomial approximations whose degree is determined by $N$.
    • This delineation prompts critical evaluation of these methods in practice, as data modalities and neuron counts heavily dictate achievable performance.
  2. Limitations and Capabilities:
    • RF and NTK each dominate in different regimes, yet both remain constrained to low-degree polynomial approximation and thus struggle with non-polynomial targets, a fundamental limitation for tasks requiring nuanced, high-complexity function classes.
  3. Kernel Methods:
    • Kernel methods face clear constraints in high-dimensional problems, elucidating why traditional non-parametric approaches may struggle in modern machine learning contexts where data dimensionality is intrinsically high.

Future Directions

The results presented open several avenues for further exploration. Prominent among them is understanding the practical computation of these linear models in high dimensions while maintaining efficiency. Moreover, investigating the integration of neural architectures beyond simple two-layer constructs into NTK frameworks could yield insights into overcoming the polynomial limitation observed.

The work brings mathematical rigor to the study of linearized approximations and their limits, fostering a nuanced understanding of neural network behavior in the over-parameterized regime. As models evolve, these findings will remain a reference point for scrutinizing the theoretical underpinnings of AI models in theory and application.