A simple connection from loss flatness to compressed neural representations (2310.01770v4)
Abstract: Sharpness, a geometric measure in the parameter space that reflects the flatness of the loss landscape, has long been studied for its potential connections to neural network behavior. While sharpness is often associated with generalization, recent work highlights inconsistencies in this relationship, leaving its true significance unclear. In this paper, we investigate how sharpness influences the local geometric features of neural representations in feature space, offering a new perspective on its role. We introduce this problem and study three measures of compression: the Local Volumetric Ratio (LVR), based on volume compression; the Maximum Local Sensitivity (MLS), based on sensitivity to input changes; and the Local Dimensionality, based on how uniform the sensitivity is across different directions. We show that LVR and MLS correlate with the flatness of the loss around local minima, and that this correlation is predicted by a relatively simple mathematical relationship: a flatter loss corresponds to a lower upper bound on the compression metrics of neural representations. Our work builds upon the linear stability insight by Ma and Ying, deriving inequalities between various compression metrics and quantities involving sharpness. Our inequalities readily extend to reparametrization-invariant sharpness as well. Through empirical experiments on various feedforward, convolutional, and transformer architectures, we find that our inequalities predict a consistently positive correlation between local representation compression and sharpness.
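The three compression measures named in the abstract can all be read off the singular values of the input-to-representation Jacobian. The sketch below is an illustrative interpretation, not the paper's exact definitions: it takes MLS as the largest singular value, the local volumetric ratio as the product of singular values (the local volume-scaling factor), and Local Dimensionality as the participation ratio of the squared singular values.

```python
# Illustrative sketch of singular-value-based compression metrics.
# These formulas are plausible readings of the abstract, not the
# paper's verbatim definitions.
import numpy as np


def local_metrics(J):
    """Compute illustrative compression metrics from a local Jacobian J
    (shape: representation_dim x input_dim)."""
    s = np.linalg.svd(J, compute_uv=False)       # singular values of J
    mls = s.max()                                # max local sensitivity
    lvr = np.prod(s)                             # local volume scaling
    s2 = s ** 2
    local_dim = s2.sum() ** 2 / (s2 ** 2).sum()  # participation ratio
    return mls, lvr, local_dim


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    J = rng.standard_normal((8, 4)) * 0.5        # a toy local Jacobian
    mls, lvr, dim = local_metrics(J)
    print(f"MLS={mls:.3f}  LVR={lvr:.3e}  LocalDim={dim:.2f}")
```

Under this reading, a contractive (compressed) local map has small singular values, driving both MLS and the volumetric ratio down, while Local Dimensionality ranges from 1 (all sensitivity in one direction) to the full rank of J (isotropic sensitivity).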
- A modern look at the relationship between sharpness and generalization. arXiv preprint arXiv:2302.07011, 2023.
- Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
- Nearest class-center simplification through intermediate layers. In Topological, Algebraic and Geometric Learning Workshops 2022, pp. 37–47. PMLR, 2022.
- Reverse engineering self-supervised learning. arXiv preprint arXiv:2305.15614, 2023.
- Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM Journal on Mathematics of Data Science, 2(3):631–657, 2020.
- Implicit regularization for deep neural networks driven by an Ornstein–Uhlenbeck-like process, 2020. URL http://arxiv.org/abs/1904.09080.
- On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.
- Separability and geometry of object manifolds in deep neural networks. Nature Communications, 11(1):746, 2020.
- Flat minima generalize for low-rank matrix recovery, 2023.
- Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pp. 1019–1028. PMLR, 2017.
- Large margin deep networks for classification. Advances in Neural Information Processing Systems, 31, 2018.
- Gradient-based learning drives robust representations in recurrent neural networks by balancing compression and expansion. Nature Machine Intelligence, 4(6):564–573, 2022.
- Orthogonal representations for robust context-dependent task performance in brains and neural networks. Neuron, 110(7):1258–1270, 2022.
- Comparative generalization bounds for deep neural networks. Transactions on Machine Learning Research, 2023.
- A theory of multineuronal dimensionality, dynamics and measurement. bioRxiv, pp. 214262, 2017.
- The inductive bias of flatness regularization for deep matrix factorization, 2023.
- Landscape and training regimes in deep learning. Physics Reports, 924:1–18, 2021. ISSN 0370-1573. doi: 10.1016/j.physrep.2021.04.001. URL https://www.sciencedirect.com/science/article/pii/S0370157321001290.
- Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2018. URL http://arxiv.org/abs/1711.04623.
- Neural collapse: A review on modelling principles and generalization. arXiv preprint arXiv:2206.04041, 2022.
- What happens after SGD reaches zero loss? –a mathematical framework, 2022. URL http://arxiv.org/abs/2110.06914.
- Optimal degrees of synaptic connectivity. Neuron, 93(5):1153–1164, 2017.
- On linear stability of SGD and input-smoothness of neural networks, 2021. URL http://arxiv.org/abs/2105.13462.
- Implicit bias of the step size in linear diagonal neural networks. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 16270–16295. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/nacson22a.html.
- Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
- Feature learning in deep classifiers through intermediate neural collapse. In International Conference on Machine Learning, pp. 28729–28745. PMLR, 2023.
- Representational drift as a result of implicit regularization. bioRxiv, 2023. URL https://www.biorxiv.org/content/10.1101/2023.05.04.539512v3.
- Dimensionality compression and expansion in deep neural networks. arXiv preprint arXiv:1906.00443, 2019.
- A scale-dependent measure of system dimensionality. Patterns, 3(8), 2022.
- Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
- Very deep convolutional networks for large-scale image recognition, 2015.
- Deep learning and the information bottleneck principle, 2015. URL http://arxiv.org/abs/1503.02406.
- Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization, 2023.
- How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://papers.nips.cc/paper_files/paper/2018/hash/6651526b6fb8f29a00507de6a49ce30f-Abstract.html.
- A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. arXiv preprint arXiv:2002.03495, 2021. URL http://arxiv.org/abs/2002.03495.
- Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions. Physical Review Letters, 130(23):237101, 2023. doi: 10.1103/PhysRevLett.130.237101. URL https://link.aps.org/doi/10.1103/PhysRevLett.130.237101.
- The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. arXiv preprint arXiv:1803.00195, 2019. URL http://arxiv.org/abs/1803.00195.
- A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.