- The paper shows that linearized RF and NTK models correspond to fitting degree-ℓ and degree-(ℓ+1) polynomials, respectively, under precise neuron count conditions.
- It examines two distinct regimes, approximation-limited and sample-size-limited, and shows how parameters such as the number of neurons and the sample size govern the complexity of functions that can be learned.
- The analysis highlights limitations of kernel methods in high dimensions, offering key insights into model expressivity and the practical design of neural network architectures.
Linearized Two-Layers Neural Networks in High Dimension
The paper under consideration explores the behavior of two classes of models for learning an unknown function on the sphere in ℝ^d: the random features (RF) model and the neural tangent kernel (NTK) model. Both are linearized approximations of two-layer neural networks, and their analysis carries significant implications for our understanding of high-dimensional function approximation.
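To make the two linearizations concrete, here is a minimal NumPy sketch (not the authors' code; names such as `rf_features` and `ntk_features` are illustrative) of the feature maps behind the two models for a two-layer network f(x) = Σᵢ aᵢ σ(⟨wᵢ, x⟩) with random first-layer weights. The RF model trains only the second layer, so it is linear in the N features σ(⟨wᵢ, x⟩); the NTK model is the first-order Taylor expansion in the first-layer weights, so it is linear in the N·d features xⱼ σ′(⟨wᵢ, x⟩).

```python
# Minimal sketch of the RF and NTK feature maps for a two-layer network
# f(x) = sum_i a_i * sigma(<w_i, x>); illustrative only, not the paper's code.
import numpy as np

def sample_sphere(m, d, rng, radius=1.0):
    """Draw m points uniformly from the sphere of given radius in R^d."""
    Z = rng.standard_normal((m, d))
    return radius * Z / np.linalg.norm(Z, axis=1, keepdims=True)

def rf_features(X, W, sigma=np.tanh):
    """RF model: only the second layer is trained, so the model is linear
    in the N features sigma(<w_i, x>)."""
    return sigma(X @ W.T)                                       # shape (n, N)

def ntk_features(X, W, sigma_prime=lambda z: 1.0 / np.cosh(z) ** 2):
    """NTK model: first-order Taylor expansion in the first-layer weights,
    giving the N*d features x_j * sigma'(<w_i, x>)."""
    G = sigma_prime(X @ W.T)                                    # shape (n, N)
    return (G[:, :, None] * X[:, None, :]).reshape(len(X), -1)  # (n, N*d)

rng = np.random.default_rng(0)
d, N, n = 20, 100, 500
W = sample_sphere(N, d, rng)                      # random first-layer weights
X = sample_sphere(n, d, rng, radius=np.sqrt(d))   # inputs on the sphere of radius sqrt(d)
print(rf_features(X, W).shape, ntk_features(X, W).shape)  # (500, 100) (500, 2000)
```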
Overview
The authors analyze two learning regimes: the approximation-limited regime, where the sample size n = ∞ while the number of neurons N and the dimension d are large but finite; and the sample-size-limited regime, where N = ∞ while both n and d are large but finite. They show that, in these regimes, the RF and NTK models effectively fit low-degree polynomial approximations of the target function, with the attainable degree set by how N (or n) scales with d.
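Schematically (with notation adapted for this summary: f⋆ is the target function, P_{>k} denotes its L²-projection onto the components of polynomial degree greater than k, and high-probability qualifiers over the random first-layer weights as well as terms vanishing as d grows are omitted), the approximation-limited results take roughly the following form:

```latex
% Schematic form of the approximation-limited results (notation adapted).
\[
  d^{\,\ell+\delta} \le N \le d^{\,\ell+1-\delta}
  \quad\Longrightarrow\quad
  \begin{cases}
    \displaystyle\inf_{f \in \mathcal{F}_{\mathrm{RF}}} \|f_\star - f\|_{L^2}^2
      \;\approx\; \|P_{>\ell}\, f_\star\|_{L^2}^2, \\[6pt]
    \displaystyle\inf_{f \in \mathcal{F}_{\mathrm{NTK}}} \|f_\star - f\|_{L^2}^2
      \;\approx\; \|P_{>\ell+1}\, f_\star\|_{L^2}^2.
  \end{cases}
\]
```

In words: the models approximate the degree-ℓ (respectively degree-(ℓ+1)) polynomial part of the target and essentially nothing more.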
Key Results
- Approximation Error in RF Models:
- For d^{ℓ+δ} ≤ N ≤ d^{ℓ+1−δ} with δ > 0 fixed, the RF model is equivalent to fitting the best polynomial of degree ℓ in the raw features: its approximation error is governed by the components of the target of degree greater than ℓ.
- This implies that the choice of neuron number N is pivotal in determining the complexity of functions the model can approximate effectively.
- Approximation Error in NTK Models:
- Under the same neuron-count conditions (d^{ℓ+δ} ≤ N ≤ d^{ℓ+1−δ}), the NTK model fits a polynomial of degree ℓ+1.
- This highlights its greater expressivity than RF at the same neuron count: its first-order Taylor expansion in the network weights supplies on the order of N·d features rather than N.
- Generalization Error of Kernel Methods:
- Under the analogous sample-size conditions d^{ℓ+δ} ≤ n ≤ d^{ℓ+1−δ}, Kernel Ridge Regression (KRR) with a rotationally invariant kernel generalizes exactly as well as fitting the best degree-ℓ polynomial.
- Importantly, since the captured degree ℓ scales like log n / log d, the function complexity the method can recover grows only slowly with the sample size, illuminating why kernel methods may fail to generalize on genuinely high-dimensional targets (see the sketch following this list).
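As a concrete illustration of the kernel result, here is a minimal sketch under assumed choices (an exponential inner-product kernel, a synthetic target with a linear plus a degree-3 component, and a small ridge penalty; this is not one of the paper's experiments). With d ≪ n ≪ d², KRR with a rotationally invariant kernel is expected to perform comparably to plain least squares on degree-≤1 polynomial features.

```python
# Minimal sketch (assumed setup, not the paper's experiments): kernel ridge
# regression with a rotationally invariant kernel vs. least squares on
# degree-<=1 polynomial features, in the regime d << n << d^2.
import numpy as np

def sphere(m, d, rng):
    """m points uniform on the sphere of radius sqrt(d) in R^d."""
    Z = rng.standard_normal((m, d))
    return np.sqrt(d) * Z / np.linalg.norm(Z, axis=1, keepdims=True)

def kernel(X, Z):
    """Rotationally invariant (inner-product) kernel K(x, z) = exp(<x, z>/d)."""
    return np.exp(X @ Z.T / X.shape[1])

def f_star(X):
    """Target: a degree-1 part plus a small degree-3 part."""
    d = X.shape[1]
    return X[:, 0] + 0.5 * X[:, 0] * X[:, 1] * X[:, 2] / np.sqrt(d)

rng = np.random.default_rng(1)
d, n, n_test, lam = 50, 1000, 500, 1e-3          # d << n << d^2 = 2500

X, X_test = sphere(n, d, rng), sphere(n_test, d, rng)
y, y_test = f_star(X), f_star(X_test)

# Kernel ridge regression: alpha = (K + lam * n * I)^{-1} y.
alpha = np.linalg.solve(kernel(X, X) + lam * n * np.eye(n), y)
pred_krr = kernel(X_test, X) @ alpha

# Baseline: least squares on degree-<=1 features (constant + linear terms).
Phi = np.hstack([np.ones((n, 1)), X])
Phi_test = np.hstack([np.ones((n_test, 1)), X_test])
beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
pred_lin = Phi_test @ beta

mse = lambda p: np.mean((p - y_test) ** 2)
# In this regime both predictors are expected to miss the degree-3 component,
# so their test errors should be comparable.
print(f"KRR test MSE:          {mse(pred_krr):.4f}")
print(f"degree-1 fit test MSE: {mse(pred_lin):.4f}")
```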
Practical and Theoretical Implications
- On the Performance of Linearizations:
- While neural networks enjoy universal approximation guarantees in principle, their linearizations (RF and NTK) are limited to low-degree polynomial approximations whose degree is determined by N relative to d.
- In practice, then, achievable performance depends heavily on the input dimension and the neuron count, which calls for a critical evaluation of when these linearized methods are appropriate.
- Limitations and Capabilities:
- Each of the RF and NTK models can outperform the other in particular regimes, yet both remain constrained when the target has substantial components beyond the attainable polynomial degree, a fundamental limitation for tasks requiring nuanced, high-complexity structure.
- Kernel Methods:
- Kernel methods face clear constraints when applied to high-dimensional problems, which helps explain why traditional non-parametric approaches may struggle in modern machine learning contexts where data dimensionality is intrinsically high.
Future Directions
The results presented open several avenues for further exploration. Prominent among them is understanding how these linearized models can be computed efficiently in high dimensions. Investigating how architectures beyond simple two-layer networks behave under NTK-style analysis could also yield insight into whether deeper models escape the polynomial limitation observed here.
The work brings mathematical rigor to the question of what linearized approximations can and cannot achieve, fostering a nuanced understanding of neural network behavior in the over-parameterized regime. As models evolve, these findings will remain a useful reference point for scrutinizing the theoretical underpinnings of modern AI systems.