Geometric compression of invariant manifolds in neural nets
(arXiv:2007.11471)

Abstract
We study how neural networks compress uninformative input space in models where data lie in $d$ dimensions, but whose labels vary only within a linear manifold of dimension $d_\parallel < d$. We show that for a one-hidden-layer network initialized with infinitesimal weights (i.e. in the feature learning regime) trained with gradient descent, the first layer of weights evolves to become nearly insensitive to the $d_\perp = d - d_\parallel$ uninformative directions. These are effectively compressed by a factor $\lambda \sim \sqrt{p}$, where $p$ is the size of the training set. We quantify the benefit of such a compression on the test error $\epsilon$. For large initialization of the weights (the lazy training regime), no compression occurs, and for regular boundaries separating labels we find that $\epsilon \sim p^{-\beta}$, with $\beta_\text{Lazy} = d/(3d-2)$. Compression improves the learning curves so that $\beta_\text{Feature} = (2d-1)/(3d-2)$ if $d_\parallel = 1$ and $\beta_\text{Feature} = (d + d_\perp/2)/(3d-2)$ if $d_\parallel > 1$. We test these predictions for a stripe model where boundaries are parallel interfaces ($d_\parallel = 1$) as well as for a cylindrical boundary ($d_\parallel = 2$). Next we show that compression shapes the evolution of the Neural Tangent Kernel (NTK) during training, so that its top eigenvectors become more informative and display a larger projection on the labels. Consequently, kernel learning with the frozen NTK at the end of training outperforms the initial NTK. We confirm these predictions both for a one-hidden-layer fully connected (FC) network trained on the stripe model and for a 16-layer CNN trained on MNIST, for which we also find $\beta_\text{Feature} > \beta_\text{Lazy}$.
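For concreteness, the predicted exponents for the stripe model ($d_\parallel = 1$) in $d = 2$ are $\beta_\text{Lazy} = 2/4 = 0.5$ and $\beta_\text{Feature} = 3/4 = 0.75$. The sketch below is not the paper's code; it only illustrates the setup under stated assumptions: data are Gaussian in $d$ dimensions, labels depend on the first coordinate alone through parallel ("stripe") boundaries, and a one-hidden-layer ReLU network is trained by full-batch gradient descent from a small initialization scale `alpha` as a proxy for the feature-learning regime. The dimension, boundary positions, hinge loss, learning rate, and step count are all illustrative choices, and the final printed ratio is a simple probe of how much less sensitive the first-layer weights are to the $d_\perp$ uninformative directions.

```python
# Illustrative sketch only (not the paper's code): stripe-model data and a
# one-hidden-layer ReLU network trained by full-batch gradient descent.
# Assumed/hypothetical choices: d = 5, one informative coordinate x[0],
# stripe boundaries at x[0] = +/- 0.5, hinge loss, and a small weight
# scale `alpha` as a proxy for the feature-learning regime.
import numpy as np

rng = np.random.default_rng(0)
d, p, h = 5, 1024, 128              # input dim, training points, hidden units
alpha = 1e-3                        # small initialization -> feature learning

X = rng.standard_normal((p, d))
y = np.where(np.abs(X[:, 0]) < 0.5, 1.0, -1.0)   # label depends on x[0] only

W = alpha * rng.standard_normal((h, d)) / np.sqrt(d)   # first-layer weights
a = alpha * rng.standard_normal(h) / np.sqrt(h)        # output weights

def forward(X, W, a):
    pre = X @ W.T                   # (p, h) pre-activations
    act = np.maximum(pre, 0.0)      # ReLU
    return act @ a, pre, act

lr = 0.05
for step in range(20000):
    f, pre, act = forward(X, W, a)
    mask = (y * f < 1.0).astype(float)       # hinge loss max(0, 1 - y f)
    g = -(y * mask) / p                      # dL/df for each sample
    a -= lr * (act.T @ g)
    W -= lr * (((g[:, None] * a) * (pre > 0.0)).T @ X)

# Sensitivity of the learned first layer to the informative direction x[0]
# versus the d - 1 uninformative ones: a large ratio indicates compression.
ratio = np.abs(W[:, 0]).mean() / np.abs(W[:, 1:]).mean()
print(f"informative / uninformative first-layer weight magnitude: {ratio:.2f}")
```

In the lazy regime one would instead keep the weights near a large initialization (or linearize the network around it), in which case no such anisotropy of the first-layer weights, and hence no compression of the uninformative directions, is expected.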