A simple connection from loss flatness to compressed representations in neural networks (2310.01770v3)

Published 3 Oct 2023 in cs.LG and cs.AI

Abstract: The generalization capacity of deep neural networks has been studied in a variety of ways, including at least two distinct categories of approaches: one based on the shape of the loss landscape in parameter space, and the other based on the structure of the representation manifold in feature space (that is, in the space of unit activities). Although these two approaches are related, they are rarely studied together explicitly. Here, we present an analysis that bridges this gap. We show that in the final phase of learning in deep neural networks, the compression of the manifold of neural representations correlates with the flatness of the loss around the minima explored by SGD. This correlation is predicted by a relatively simple mathematical relationship: a flatter loss corresponds to a lower upper bound on the compression metrics of neural representations. Our work builds upon the linear stability insight by Ma and Ying, deriving inequalities between various compression metrics and quantities involving sharpness. Empirically, our derived inequality predicts a consistently positive correlation between representation compression and loss sharpness in multiple experimental settings. Overall, we advance a dual perspective on generalization in neural networks in both parameter and feature space.
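As a rough illustration of the two quantities the abstract relates, the sketch below computes (a) a sharpness proxy, the top Hessian eigenvalue of the loss estimated by power iteration on Hessian-vector products, and (b) a representation-compression proxy, the participation ratio (effective dimensionality) of hidden-layer activations. This is a minimal, hypothetical example: the toy MLP, random data, and the particular choice of metrics are assumptions for illustration, not the paper's actual setup or code.

```python
# Hypothetical sketch (not the paper's code): estimate a loss-sharpness proxy
# (top Hessian eigenvalue via power iteration on Hessian-vector products) and a
# representation-compression proxy (participation ratio of hidden activations)
# for a toy MLP on random data.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for a trained network and its data (assumptions for illustration).
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
x, y = torch.randn(256, 20), torch.randint(0, 10, (256,))
criterion = nn.CrossEntropyLoss()

def top_hessian_eigenvalue(model, loss_fn, n_iters=20):
    """Power iteration on the Hessian of the loss w.r.t. the parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(n_iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        # Hessian-vector product H v via a second backward pass.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        # Rayleigh quotient v^T H v with the normalized v.
        eig = sum((hvi * vi).sum() for hvi, vi in zip(hv, v))
        v = [hvi.detach() for hvi in hv]
    return eig.item()

def participation_ratio(acts):
    """Effective dimensionality of activations: (sum λ_i)^2 / sum λ_i^2."""
    acts = acts - acts.mean(dim=0, keepdim=True)
    cov = acts.T @ acts / (acts.shape[0] - 1)
    eigs = torch.linalg.eigvalsh(cov).clamp(min=0)
    return (eigs.sum() ** 2 / (eigs ** 2).sum()).item()

sharpness = top_hessian_eigenvalue(model, lambda: criterion(model(x), y))
with torch.no_grad():
    hidden = model[1](model[0](x))  # penultimate-layer activations
print(f"top Hessian eigenvalue ~ {sharpness:.3f}")
print(f"participation ratio of hidden layer ~ {participation_ratio(hidden):.2f}")
```

Tracking two such numbers across late-training checkpoints is one way to probe the sharpness/compression relationship the abstract describes; the specific sharpness measure and compression metrics used in the paper may differ from these proxies.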

References (37)
  1. A modern look at the relationship between sharpness and generalization. arXiv preprint arXiv:2302.07011, 2023.
  2. Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  3. Nearest class-center simplification through intermediate layers. In Topological, Algebraic and Geometric Learning Workshops 2022, pp. 37–47. PMLR, 2022.
  4. Reverse engineering self-supervised learning. arXiv preprint arXiv:2305.15614, 2023.
  5. Analysis of the generalization error: Empirical risk minimization over deep artificial neural networks overcomes the curse of dimensionality in the numerical approximation of Black–Scholes partial differential equations. SIAM Journal on Mathematics of Data Science, 2(3):631–657, 2020.
  6. Implicit regularization for deep neural networks driven by an Ornstein–Uhlenbeck-like process, 2020. URL http://arxiv.org/abs/1904.09080.
  7. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.
  8. Separability and geometry of object manifolds in deep neural networks. Nature Communications, 11(1):746, 2020.
  9. Flat minima generalize for low-rank matrix recovery, 2023.
  10. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pp. 1019–1028. PMLR, 2017.
  11. Large margin deep networks for classification. Advances in Neural Information Processing Systems, 31, 2018.
  12. Gradient-based learning drives robust representations in recurrent neural networks by balancing compression and expansion. Nature Machine Intelligence, 4(6):564–573, 2022.
  13. Orthogonal representations for robust context-dependent task performance in brains and neural networks. Neuron, 110(7):1258–1270, 2022.
  14. Comparative generalization bounds for deep neural networks. Transactions on Machine Learning Research, 2023.
  15. A theory of multineuronal dimensionality, dynamics and measurement. bioRxiv, 214262, 2017.
  16. The inductive bias of flatness regularization for deep matrix factorization, 2023.
  17. Landscape and training regimes in deep learning. Physics Reports, 924:1–18, 2021. ISSN 0370-1573. doi: 10.1016/j.physrep.2021.04.001. URL https://www.sciencedirect.com/science/article/pii/S0370157321001290.
  18. Three factors influencing minima in SGD, 2018. URL http://arxiv.org/abs/1711.04623.
  19. Neural collapse: A review on modelling principles and generalization. arXiv preprint arXiv:2206.04041, 2022.
  20. What happens after SGD reaches zero loss? A mathematical framework, 2022. URL http://arxiv.org/abs/2110.06914.
  21. Optimal degrees of synaptic connectivity. Neuron, 93(5):1153–1164, 2017.
  22. On linear stability of SGD and input-smoothness of neural networks, 2021. URL http://arxiv.org/abs/2105.13462.
  23. Implicit bias of the step size in linear diagonal neural networks. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 16270–16295. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/nacson22a.html.
  24. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
  25. Feature learning in deep classifiers through intermediate neural collapse. In International Conference on Machine Learning, pp. 28729–28745. PMLR, 2023.
  26. Representational drift as a result of implicit regularization, 2023. URL https://www.biorxiv.org/content/10.1101/2023.05.04.539512v3.
  27. Dimensionality compression and expansion in deep neural networks. arXiv preprint arXiv:1906.00443, 2019.
  28. A scale-dependent measure of system dimensionality. Patterns, 3(8), 2022.
  29. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
  30. Very deep convolutional networks for large-scale image recognition, 2015.
  31. Deep learning and the information bottleneck principle, 2015. URL http://arxiv.org/abs/1503.02406.
  32. Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization, 2023.
  33. How SGD Selects the Global Minima in Over-parameterized Learning: A Dynamical Stability Perspective. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://papers.nips.cc/paper_files/paper/2018/hash/6651526b6fb8f29a00507de6a49ce30f-Abstract.html.
  34. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima, 2021. URL http://arxiv.org/abs/2002.03495.
  35. Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions. Physical Review Letters, 130(23):237101, 2023. doi: 10.1103/PhysRevLett.130.237101. URL https://link.aps.org/doi/10.1103/PhysRevLett.130.237101.
  36. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects, 2019. URL http://arxiv.org/abs/1803.00195.
  37. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.
