Deep Learning without Poor Local Minima (1605.07110v3)

Published 23 May 2016 in stat.ML, cs.LG, and math.OC

Abstract: In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumption, we first prove the following statements for the squared loss function of deep linear neural networks with any depth and any widths: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers). Moreover, for deep nonlinear neural networks, we prove the same four statements via a reduction to a deep linear model under the independence assumption adopted from recent work. As a result, we present an instance, for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima). Furthermore, the mathematically proven existence of bad saddle points for deeper models would suggest a possible open problem. We note that even though we have advanced the theoretical foundations of deep learning and non-convex optimization, there is still a gap between theory and practice.

Authors (1)
  1. Kenji Kawaguchi (147 papers)
Citations (895)

Summary

  • The paper proves that every local minimum of the squared loss of a deep linear network is a global minimum, and extends this guarantee to deep nonlinear networks under an independence assumption, reducing concerns over suboptimal training outcomes.
  • It classifies critical points, showing that every critical point that is not a global minimum is a saddle point, including "bad" saddle points whose Hessian has no negative eigenvalue.
  • The study obtains the nonlinear results via a reduction to deep linear models, suggesting pathways for more robust and efficient optimization methods.

Deep Learning without Poor Local Minima: An Examination

Abstract Analysis: This paper, authored by Kenji Kawaguchi of the Massachusetts Institute of Technology, presents substantial theoretical advancements in deep learning optimization. It proves a conjecture published in 1989 and partially addresses an open problem from the Conference on Learning Theory (COLT) 2015, refining our understanding of the loss surfaces of deep linear and nonlinear networks. Kawaguchi's work extends beyond the original conjecture, establishing conditions under which deep models avoid poor local minima.

Core Contributions:

  1. Conjecture Confirmation and Expansion:
    • Non-convexity and Non-concavity: The paper rigorously demonstrates that the squared loss function of deep linear networks is non-convex and non-concave (the objective is written out in the sketch after this list).
    • Local and Global Minima Relationship: A crucial insight is that every local minimum in these networks is also a global minimum, mitigating concerns around getting stuck in suboptimal solutions during training.
    • Critical Point Classification: Critical points that are not global minima are necessarily saddle points. Moreover, networks with more than three layers admit "bad" saddle points, i.e., saddle points at which the Hessian has no negative eigenvalue, whereas three-layer networks have none.
    • Generalization to Nonlinear Networks: Strikingly, these properties extend to deep nonlinear networks using a reduction technique predicated on the independence assumption adapted from recent work.
  2. Mathematical Foundation: The paper provides precise statements and complete proofs, such as the characterization of saddle points via Hessian eigenvalues, and extends these results to nonlinear models under the stated independence assumption.
  3. Practical Implications: The paper posits that identifying or escaping saddle points through second-order optimization methods or modified gradient descent techniques should, in principle, alleviate some of the difficulty traditionally associated with training deep models.
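For reference, the squared-loss objective analyzed in the linear case can be written as below. The notation is a generic rendering (weight matrices W_1, …, W_{H+1}, data matrix X, targets Y) and may differ in minor details from the paper's.

```latex
% Squared loss of a deep linear network with H hidden layers.
% Notation is a generic rendering and may differ from the paper's.
\[
  L(W_1, \dots, W_{H+1})
  \;=\; \tfrac{1}{2}\,\bigl\lVert W_{H+1} W_{H} \cdots W_1 X - Y \bigr\rVert_F^2 .
\]
```

All four statements from the abstract (non-convexity and non-concavity, every local minimum being global, the saddle-point classification of the remaining critical points, and the depth-dependent existence of bad saddle points) are proved for this objective.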

Theoretical and Practical Implications:

  • Deep Linear Networks:
    • Proofs and Properties: The theorems and lemmas presented offer a comprehensive understanding of the loss surface for deep linear networks, establishing that training these models, in theory, does not suffer from poor local minima.
    • Saddle Points: A significant takeaway is the characterization of saddle points, in particular the conditions under which "bad" saddle points appear, drawing a clear distinction between shallow (one-hidden-layer) and deeper networks; a toy numerical illustration follows this list.
  • Deep Nonlinear Networks:
    • Reduction to Linear Models: By leveraging results from linear models, the paper addresses an open problem from the Conference on Learning Theory (COLT) 2015, substantially reducing the reliance on several unrealistic assumptions prevalent in earlier work.
    • Training Efficiency: Showing that the loss function of nonlinear models can be reduced to that of linear models under certain conditions implies potential pathways for efficient training strategies.
  • Future Directions:
    • Algorithm Development: The identification of bad saddle points and the attributes of critical points provide fertile ground for developing advanced optimization algorithms capable of navigating the complex parameter spaces of deep models.
    • Gap Bridging: A discrepancy remains between theoretical advancement and practical application. Bridging it involves relaxing the remaining assumptions (such as the independence assumption used for nonlinear networks) needed for the theoretical guarantees to hold, thereby enhancing the robustness and efficiency of training procedures in real-world scenarios.
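The toy check below illustrates the depth-dependent saddle-point behavior with scalar weights, so the "network" is just a product of numbers. It is an illustrative sketch, not the paper's construction: the loss instance (x = y = 1), the all-zero critical point, and the finite-difference Hessian are choices made here for demonstration.

```python
# A toy numerical check of the saddle-point statements, using scalar
# weights so the "network" is just a product of numbers. Illustrative
# sketch only: the data (x = y = 1), the all-zero critical point, and
# the finite-difference Hessian are choices made here, not the paper's.
import numpy as np

def loss(w):
    """Squared loss (1/2) * (w_H * ... * w_1 * x - y)^2 with x = y = 1."""
    return 0.5 * (np.prod(w) - 1.0) ** 2

def hessian(f, w, eps=1e-4):
    """Central-difference approximation of the Hessian of f at w."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i], np.eye(n)[j]
            H[i, j] = (
                f(w + eps * ei + eps * ej)
                - f(w + eps * ei - eps * ej)
                - f(w - eps * ei + eps * ej)
                + f(w - eps * ei - eps * ej)
            ) / (4 * eps**2)
    return H

# 2 factors ~ a three-layer ("shallow") network, 3 factors ~ more than three layers.
for n_factors in (2, 3):
    w0 = np.zeros(n_factors)          # the origin is a critical point of this loss
    eigs = np.linalg.eigvalsh(hessian(loss, w0))
    print(f"{n_factors} weight factors: Hessian eigenvalues at 0 = {np.round(eigs, 4)}")

# Expected (up to numerical error):
#   2 weight factors: eigenvalues -1 and 1  -> saddle with a negative eigenvalue
#   3 weight factors: all eigenvalues 0     -> "bad" saddle (no negative eigenvalue)
```

With two weight factors (a three-layer network), the Hessian at the origin has a negative eigenvalue, so the saddle is detectable from second-order information; with three factors (more than three layers), the Hessian vanishes there even though the loss decreases along the direction w1 = w2 = w3 = t for small t > 0, so the origin is a saddle with no negative Hessian eigenvalue, matching the paper's statement that bad saddle points appear only in the deeper case.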

Conclusion: The significance of Kenji Kawaguchi's work lies in its profound theoretical contributions to understanding the optimization landscape of deep learning models. By proving that deep networks inherently avoid poor local minima under certain conditions and extending these findings to nonlinear models, the paper lays a robust foundation for future research and methods geared toward more effective and efficient training protocols. This work not only progresses theoretical discourse but also holds the potential to inspire innovative strategies in practical machine learning applications.