An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis (1703.00560v2)

Published 2 Mar 2017 in cs.LG

Abstract: In this paper, we explore theoretical properties of training a two-layered ReLU network $g(\mathbf{x}; \mathbf{w}) = \sum_{j=1}^K \sigma(\mathbf{w}_j^T\mathbf{x})$ with centered $d$-dimensional spherical Gaussian input $\mathbf{x}$ ($\sigma$=ReLU). We train our network with gradient descent on $\mathbf{w}$ to mimic the output of a teacher network with the same architecture and fixed parameters $\mathbf{w}^*$. We show that its population gradient has an analytical formula, leading to interesting theoretical analysis of critical points and convergence behaviors. First, we prove that critical points outside the hyperplane spanned by the teacher parameters ("out-of-plane") are not isolated and form manifolds, and characterize in-plane critical-point-free regions for two ReLU case. On the other hand, convergence to $\mathbf{w}^*$ for one ReLU node is guaranteed with at least $(1-\epsilon)/2$ probability, if weights are initialized randomly with standard deviation upper-bounded by $O(\epsilon/\sqrt{d})$, consistent with empirical practice. For network with many ReLU nodes, we prove that an infinitesimal perturbation of weight initialization results in convergence towards $\mathbf{w}^*$ (or its permutation), a phenomenon known as spontaneous symmetric-breaking (SSB) in physics. We assume no independence of ReLU activations. Simulation verifies our findings.

Citations (214)

Summary

The paper introduces an analytical formula for the population gradient that clarifies convergence conditions and aids in critical point analysis.
It demonstrates that critical points outside the principal hyperplane form manifolds, while identifying critical-point-free regions in two-node networks.
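The research proves two convergence results: a single ReLU node initialized with small random weights recovers the teacher weights with probability at least $(1-\epsilon)/2$, and for many ReLU nodes an infinitesimal perturbation of the initialization leads to convergence toward the teacher weights (or a permutation of them), a case of spontaneous symmetry breaking.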

Analytical Formula of Population Gradient for Two-Layer ReLU Networks: A Rigorous Approach to Convergence and Critical Point Analysis

Neural network training is a non-convex optimization problem, which makes its loss landscape difficult to analyze. The paper "An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis" studies a tractable instance of this problem: a two-layered ReLU network trained to mimic a teacher network of the same architecture. Using an analytical formula for the population gradient, it characterizes the critical points and convergence behavior of this setting.

Key Contributions

The paper derives an analytical formula for the population gradient of a two-layered ReLU network trained with gradient descent. The network is a sum of $K$ ReLU nodes, $g(\mathbf{x}; \mathbf{w}) = \sum_{j=1}^K \sigma(\mathbf{w}_j^T\mathbf{x})$, and it is optimized to match the output of a "teacher" network with the same architecture and fixed parameters $\mathbf{w}^*$. The central result is that the population gradient has a closed-form expression when the input follows a zero-mean spherical Gaussian distribution.
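To make the single-node case concrete, the sketch below evaluates a closed form for the per-sample population gradient and compares it with a Monte Carlo estimate over standard Gaussian inputs. The expression used, $\frac{1}{2}(\mathbf{w}-\mathbf{w}^*) + \frac{1}{2\pi}\left(\theta\,\mathbf{w}^* - \frac{\|\mathbf{w}^*\|}{\|\mathbf{w}\|}\sin\theta\,\mathbf{w}\right)$ with $\theta$ the angle between $\mathbf{w}$ and $\mathbf{w}^*$, is our reading of the one-ReLU result for the loss $\frac{1}{2}(\sigma(\mathbf{w}^T\mathbf{x}) - \sigma(\mathbf{w}^{*T}\mathbf{x}))^2$, normalized per sample. The function names, sample count, and seed are illustrative assumptions, not taken from the paper.

```python
import math
import random


def population_gradient(w, w_star):
    """Closed-form per-sample population gradient for one ReLU node.

    Assumes x ~ N(0, I) and w != 0; this is a reading of the paper's
    single-node formula, not code published by the authors.
    """
    norm_w = math.sqrt(sum(v * v for v in w))
    norm_s = math.sqrt(sum(v * v for v in w_star))
    cos_t = sum(a * b for a, b in zip(w, w_star)) / (norm_w * norm_s)
    theta = math.acos(max(-1.0, min(1.0, cos_t)))  # clamp against round-off
    scale = norm_s / norm_w * math.sin(theta)
    return [0.5 * (a - b) + (theta * b - scale * a) / (2.0 * math.pi)
            for a, b in zip(w, w_star)]


def monte_carlo_gradient(w, w_star, n_samples=100_000, seed=0):
    """Empirical mean of the per-sample gradient over Gaussian inputs."""
    rng = random.Random(seed)
    d = len(w)
    acc = [0.0] * d
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        student_pre = sum(a * b for a, b in zip(w, x))
        if student_pre <= 0.0:
            continue  # student ReLU inactive: this sample contributes zero
        teacher_out = max(0.0, sum(a * b for a, b in zip(w_star, x)))
        residual = student_pre - teacher_out
        for i in range(d):
            acc[i] += residual * x[i]
    return [a / n_samples for a in acc]
```

For orthogonal unit vectors, `population_gradient([1.0, 0.0], [0.0, 1.0])` evaluates to $(\frac{1}{2} - \frac{1}{2\pi},\, -\frac{1}{4})$, roughly (0.3408, -0.25), and `monte_carlo_gradient` with the same arguments should agree up to sampling noise. At `w == w_star` the closed form returns the zero vector, as expected at the teacher solution.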

Significantly, the authors provide rigorous proofs of several properties and behaviors:

Critical Points Formation: Critical points outside the principal hyperplane spanned by the teacher's parameters ("out-of-plane") are not isolated; they form manifolds. Non-isolated critical points of this kind might help explain the flat minima often encountered during neural network training.
Critical-Point-Free Regions: For the case of two ReLU nodes, the paper characterizes regions within the principal hyperplane ("in-plane") that contain no critical points.
Convergence Analysis: The work presents convergence analysis for networks with one ReLU node and with many ReLU nodes. For one ReLU node, convergence to the teacher's weights $\mathbf{w}^*$ is guaranteed with probability at least $(1-\epsilon)/2$ when weights are initialized randomly with standard deviation upper-bounded by $O(\epsilon/\sqrt{d})$. For networks with many ReLU nodes, the authors prove that an infinitesimal perturbation of the weight initialization results in convergence towards $\mathbf{w}^*$ (or a permutation of it), a phenomenon known in physics as spontaneous symmetric-breaking (SSB).
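The one-node convergence claim can be illustrated with a small simulation. The sketch below runs gradient descent on a closed-form per-sample population gradient (our reading of the single-node formula, with $\theta$ the angle between $\mathbf{w}$ and $\mathbf{w}^*$) and estimates how often a small Gaussian initialization lands in the ball $\|\mathbf{w}-\mathbf{w}^*\| < \|\mathbf{w}^*\|$, which is the region we understand the convergence guarantee to rest on. The step size, step count, dimensions, and all function names are assumptions chosen for illustration.

```python
import math
import random


def population_gradient(w, w_star):
    """Closed-form per-sample population gradient for one ReLU node
    (x ~ N(0, I), w != 0); a reading of the paper's formula."""
    norm_w = math.sqrt(sum(v * v for v in w))
    norm_s = math.sqrt(sum(v * v for v in w_star))
    cos_t = sum(a * b for a, b in zip(w, w_star)) / (norm_w * norm_s)
    theta = math.acos(max(-1.0, min(1.0, cos_t)))  # clamp against round-off
    scale = norm_s / norm_w * math.sin(theta)
    return [0.5 * (a - b) + (theta * b - scale * a) / (2.0 * math.pi)
            for a, b in zip(w, w_star)]


def in_omega(w, w_star):
    """True when ||w - w*|| < ||w*|| (hypothetical helper name)."""
    dist2 = sum((a - b) ** 2 for a, b in zip(w, w_star))
    return dist2 < sum(b * b for b in w_star)


def gradient_descent(w0, w_star, lr=0.1, steps=2000):
    """Plain gradient descent on the population gradient."""
    w = list(w0)
    for _ in range(steps):
        g = population_gradient(w, w_star)
        w = [a - lr * b for a, b in zip(w, g)]
    return w


def fraction_in_omega(w_star, std, n_trials=20_000, seed=0):
    """Fraction of N(0, std^2 I) initializations that fall inside the ball."""
    rng = random.Random(seed)
    d = len(w_star)
    hits = sum(
        in_omega([rng.gauss(0.0, std) for _ in range(d)], w_star)
        for _ in range(n_trials)
    )
    return hits / n_trials
```

With `w_star = [1.0, 0.0, 0.0]`, starting `gradient_descent` from `[0.05, 0.05, -0.02]` (which lies inside the ball) should return a point very close to `w_star`. At the exactly antiparallel point `[-0.05, 0.0, 0.0]` the formula reduces to $\mathbf{w}/2$, so the update shrinks the weights toward the origin instead. For a small standard deviation such as `0.01 / math.sqrt(d)`, `fraction_in_omega` should come out slightly below one half, which matches the intuition behind the $(1-\epsilon)/2$ bound: near the origin, roughly half of all directions point into the ball.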

Implications and Future Research

The implications of this research are multi-faceted.

In Practice: The analytical formula for the population gradient makes explicit the conditions under which this class of networks converges. The single-node result, for example, is consistent with the empirical practice of small random initialization, and analysis of this kind may inform initialization strategies and the design of new optimization algorithms.
Theoretical Insight: From a theoretical standpoint, the paper challenges prevailing assumptions about the isolated nature of critical points, offering a nuanced perspective that these may exist on manifolds. This has ramifications for how researchers understand and model the energy landscape of neural networks.
Future Work: The research presents several avenues for further exploration. An immediate follow-up could involve extending the analytical framework to accommodate multi-layered ReLU networks. Additionally, exploring how various input distributions influence the geometry of critical points stands as a promising line of inquiry. Moreover, the generalization to scenarios where activation functions beyond ReLU are employed could yield broader insights applicable across different architectures and learning paradigms.

In conclusion, the analytical treatment of the population gradient provided in this paper opens a pathway to deeper understanding of neural network optimization. By focusing on convergence and critical points, the study lays crucial groundwork that could facilitate the development of more robust and efficient training methodologies in the domain of deep learning.