Convergence Analysis for Learning Orthonormal Deep Linear Neural Networks (2311.14658v2)
Abstract: Enforcing orthonormality or isometry on the weight matrices has been shown to enhance the training of deep neural networks by mitigating exploding/vanishing gradients and increasing the robustness of the learned networks. However, despite its practical benefits, the theoretical analysis of orthonormality in neural networks remains limited; for example, it is unclear how orthonormality affects the convergence of the training process. In this letter, we aim to bridge this gap by providing a convergence analysis for training orthonormal deep linear neural networks. Specifically, we show that Riemannian gradient descent with an appropriate initialization converges at a linear rate when training orthonormal deep linear neural networks with a class of loss functions. Unlike existing works that enforce orthonormal weight matrices on all layers, our approach drops this requirement for one layer, which is crucial for establishing the convergence guarantee. Our results also shed light on how increasing the number of hidden layers affects the convergence speed. Experimental results validate our theoretical analysis.
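
To make the training procedure described in the abstract concrete, below is a minimal sketch (not the authors' code) of Riemannian gradient descent for a deep linear network f(x) = W_L ⋯ W_1 x, in which the hidden layers are kept orthonormal via a tangent-space projection and a QR retraction, while the final layer is updated with an ordinary gradient step, mirroring the single unconstrained layer mentioned above. The squared loss, the QR-based random initialization, and all dimensions and step sizes are illustrative assumptions rather than the paper's exact setting.

```python
# Minimal sketch: Riemannian gradient descent for a deep linear network
# f(x) = W_L ... W_1 x with orthonormal (square orthogonal) hidden layers
# and an unconstrained final layer. All hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, m, n, L, lr, steps = 8, 4, 100, 4, 0.05, 500

X = rng.standard_normal((d, n))   # inputs (d features, n samples)
Y = rng.standard_normal((m, n))   # targets

# Orthonormal init for the L-1 hidden layers (QR of a Gaussian); free last layer.
Ws = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(L - 1)]
Ws.append(0.1 * rng.standard_normal((m, d)))

def qr_retract(W):
    """Map a matrix back onto the orthogonal group via thin QR."""
    Q, R = np.linalg.qr(W)
    return Q * np.sign(np.diag(R))  # fix column signs for uniqueness

for _ in range(steps):
    # Forward pass: cache the partial products W_j ... W_1 X.
    acts = [X]
    for W in Ws:
        acts.append(W @ acts[-1])
    resid = acts[-1] - Y                      # gradient of 0.5 * ||.||_F^2
    # Backward pass: Euclidean gradient of each layer.
    grads = [None] * L
    back = resid
    for j in range(L - 1, -1, -1):
        grads[j] = back @ acts[j].T
        back = Ws[j].T @ back
    # Last layer: plain gradient step (no orthonormality constraint).
    Ws[-1] -= lr * grads[-1]
    # Hidden layers: project onto the tangent space, step, then retract.
    for j in range(L - 1):
        G, W = grads[j], Ws[j]
        sym = 0.5 * (W.T @ G + G.T @ W)
        riem_grad = G - W @ sym               # Riemannian gradient (embedded metric)
        Ws[j] = qr_retract(W - lr * riem_grad)

out = X
for W in Ws:
    out = W @ out
print("final loss:", 0.5 * np.linalg.norm(out - Y) ** 2)
```

In such a sketch, the linear rate claimed in the abstract would appear as a geometric decay of the training loss; the precise rate and the initialization condition under which it holds are what the paper's analysis characterizes.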