Hidden Minima in Two-Layer ReLU Networks (2312.16819v2)

Published 28 Dec 2023 in cs.LG, math.OC, and stat.ML

Abstract: The optimization problem associated with fitting two-layer ReLU networks having $d$ inputs, $k$ neurons, and labels generated by a target network is considered. Two types of infinite families of spurious minima, giving one minimum per $d$, were recently found. The loss at minima belonging to the first type converges to zero as $d$ increases. In the second type, the loss remains bounded away from zero. That being so, how may one avoid minima belonging to the latter type? Fortunately, such minima are never detected by standard optimization methods. Motivated by questions concerning the nature of this phenomenon, we develop methods to study distinctive analytic properties of hidden minima. By existing analyses, the Hessian spectra of both types agree modulo $O(d^{-1/2})$ terms -- not promising. Thus, rather, our investigation proceeds by studying curves along which the loss is minimized or maximized, generally referred to as tangency arcs. We prove that apparently far-removed group representation-theoretic considerations, concerning the arrangement of subspaces invariant under the action of subgroups of $S_d$, the symmetry group over $d$ symbols, relative to ones fixed by the action, yield a precise description of all finitely many admissible types of tangency arcs. The general results applied to the loss function reveal that arcs emanating from hidden minima differ, characteristically, by their structure and symmetry, precisely on account of the $O(d^{-1/2})$ eigenvalue terms absent in previous work, indicating in particular the subtlety of the analysis. The theoretical results, stated and proved for o-minimal structures, show that the set comprising all tangency arcs is topologically tame enough to enable a numerical construction of tangency arcs and so to compare how minima of both types are positioned relative to adjacent critical points.
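
For concreteness, the following is a minimal numerical sketch of the setting the abstract describes, under assumptions common in the related teacher-student literature but not stated here: standard Gaussian inputs, a student f_W(x) = sum_i relu(w_i^T x) fit to a target of the same form, and orthonormal target weights (V = I_d). Under Gaussian inputs the squared-error population loss has a closed form via the degree-one arc-cosine kernel, which the sketch uses to evaluate the loss and a finite-difference gradient at candidate points; the function names and the particular target are illustrative choices, not the paper's.

```python
import numpy as np

def relu_corr(u, v):
    """E[relu(u.x) * relu(v.x)] for x ~ N(0, I_d): degree-one arc-cosine kernel."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    cos_t = np.clip(u @ v / (nu * nv), -1.0, 1.0)
    theta = np.arccos(cos_t)
    return nu * nv * (np.sin(theta) + (np.pi - theta) * cos_t) / (2.0 * np.pi)

def population_loss(W, V):
    """0.5 * E[(sum_i relu(w_i.x) - sum_j relu(v_j.x))^2]; W, V are (k, d) weight matrices."""
    def gram_sum(A, B):
        return sum(relu_corr(a, b) for a in A for b in B)
    return 0.5 * (gram_sum(W, W) - 2.0 * gram_sum(W, V) + gram_sum(V, V))

def grad_norm(W, V, eps=1e-6):
    """Central finite-difference gradient norm: a crude check for (approximate) criticality."""
    g = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (population_loss(Wp, V) - population_loss(Wm, V)) / (2.0 * eps)
    return np.linalg.norm(g)

if __name__ == "__main__":
    d = k = 6
    V = np.eye(d)                                   # orthonormal target (illustrative assumption)
    print(population_loss(V, V), grad_norm(V, V))   # global minimum: loss 0, gradient ~0
    W = np.random.default_rng(0).standard_normal((k, d)) / np.sqrt(d)
    print(population_loss(W, V), grad_norm(W, V))   # generic point: positive loss and gradient
```

In this setting, a candidate minimum of the second type would register a near-zero gradient norm while its loss stays bounded away from zero as d grows; the tangency arcs studied in the paper require a finer, symmetry-aware analysis than this sketch provides.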
