
The Principles of Deep Learning Theory

(2106.10165)
Published Jun 18, 2021 in cs.LG, cs.AI, hep-th, and stat.ML

Abstract

This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.

Figure: Depiction of neurons and connections in an example multilayer perceptron.

Overview

  • Roberts and Yaida explore deep learning through an effective theory approach, drawing parallels to statistical mechanics and quantum field theory to understand neural network behaviors.

  • The authors introduce key quantities such as the depth-to-width ratio, kernel evolution, and non-Gaussian fluctuations, offering insight into how these factors influence network behavior and training efficiency.

  • Their findings include practical guidelines for hyperparameter tuning to prevent issues like exploding or vanishing gradients, and they suggest that their theoretical framework can extend to architectures beyond multilayer perceptrons.

The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks

In their monograph, Roberts and Yaida delve into the theory behind deep learning by examining the mathematical underpinnings and effective theories that govern neural networks. This essay provides a brief overview of their approach and key findings, which are crucial for both theoretical understanding and practical application in the field of machine learning.

Overview

The authors present a theory that prioritizes understanding neural networks through an effective theory framework, drawing parallels to methodologies in statistical mechanics and quantum field theory. Their primary focus is on deep multilayer perceptrons (MLPs), since these networks serve as a minimal model for the more complex architectures used in real-world applications. The goal is to bridge the gap between empirical success in deep learning and theoretical understanding.

Key Concepts and Methodology

Effective Theory and Deep Networks

Inspired by statistical physics, Roberts and Yaida use the idea of effective theories to describe neural networks. The essence of this approach is to focus on large-scale behaviors and emergent phenomena rather than on individual components. For instance, they use large-width expansions ($1/n$ expansions), treating the inverse width $1/n$ as a small parameter, to simplify the otherwise intractable interactions among neurons.
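
To make the role of the width concrete, here is a minimal NumPy sketch (our own illustration, not code from the book; the function name empirical_kernel and the specific widths are arbitrary). For a single random tanh layer acting on a fixed input, the width-averaged second moment of the activations fluctuates around its ensemble mean with a spread that shrinks roughly like $1/\sqrt{n}$, which is precisely what makes $1/n$ a useful expansion parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_kernel(n, x, n_trials=2000, c_w=1.0):
    """Per-draw width-averaged second moment (1/n) * sum_i phi(z_i)^2 of a random tanh layer."""
    d = x.shape[0]
    samples = []
    for _ in range(n_trials):
        W = rng.normal(0.0, np.sqrt(c_w / d), size=(n, d))  # weight variance scaled by 1/fan-in
        z = W @ x                                           # preactivations, shape (n,)
        samples.append(np.mean(np.tanh(z) ** 2))            # average over the n neurons
    return np.array(samples)

x = rng.normal(size=16)  # one fixed input vector
for n in [16, 64, 256, 1024]:
    k = empirical_kernel(n, x)
    print(f"width {n:5d}: mean {k.mean():.4f}, std {k.std():.4f}, std * sqrt(n) {k.std() * np.sqrt(n):.3f}")
# The last column staying roughly constant is the 1/sqrt(n) suppression of
# fluctuations that underlies the 1/n expansion.
```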

Kernel and Criticality

A major highlight of the monograph is their analysis of the kernel, the two-point correlator of preactivations, during forward signal propagation. The kernel evolves as a function of depth through recursive layer-to-layer equations. Specifically, the authors:

  • Establish the conditions under which the kernel grows, shrinks, or stays stable.
  • Introduce the concept of criticality, where careful tuning of the initialization hyperparameters prevents the kernel from diverging or vanishing as it moves through layers (a numerical sketch of this recursion follows below).
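
To make the recursion concrete for a single input, here is a rough Monte Carlo sketch (our own construction, not the authors' code; sample counts and depths are arbitrary) of the layer-to-layer update $K^{(\ell+1)} = C_b + C_W \, \mathbb{E}_{z \sim \mathcal{N}(0, K^{(\ell)})}[\phi(z)^2]$ for a ReLU activation. Since $\mathbb{E}[\mathrm{ReLU}(z)^2] = K/2$, the choice $(C_b, C_W) = (0, 2)$ is the critical tuning that leaves the kernel fixed, while smaller or larger $C_W$ makes it shrink or grow geometrically with depth.

```python
import numpy as np

rng = np.random.default_rng(1)

def kernel_step(k, phi, c_b=0.0, c_w=1.0, n_samples=200_000):
    """One step of the kernel recursion K -> C_b + C_W * E[phi(z)^2], z ~ N(0, K)."""
    z = rng.normal(0.0, np.sqrt(k), size=n_samples)
    return c_b + c_w * np.mean(phi(z) ** 2)

relu = lambda z: np.maximum(z, 0.0)

for c_w in [1.5, 2.0, 2.5]:
    k = 1.0                                  # kernel value fed into the first layer
    for _ in range(10):                      # propagate through 10 layers
        k = kernel_step(k, relu, c_b=0.0, c_w=c_w)
    print(f"C_W = {c_w}: kernel after 10 layers ≈ {k:.3f}")
# Expected pattern: C_W < 2 decays toward zero, C_W = 2 stays near its initial
# value, and C_W > 2 grows, i.e. the vanishing/exploding behavior that
# criticality avoids.
```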

Non-Gaussian Distributions and Fluctuations

Finite-width networks accrue corrections beyond the infinite-width, Gaussian description. These deviations from Gaussianity are not merely a nuisance: they are what enable capabilities such as learning representations of the input data. The corrections, typically captured by higher-order connected correlators, are governed by the depth-to-width ratio of the network.
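
One way to see these corrections directly is a Monte Carlo estimate of the connected four-point correlator (the excess kurtosis) of a single output preactivation over the ensemble of random initializations. The sketch below is our own construction, assuming a critically initialized ReLU MLP with arbitrary width, depth, and sample counts; the correlator would vanish for an exactly Gaussian (infinite-width) network, and at finite width it shrinks as the width grows at fixed depth.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_output(n, depth, x, c_w=2.0):
    """One random ReLU MLP at critical initialization; returns a single output preactivation."""
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(c_w / h.shape[0]), size=(n, h.shape[0]))
        h = np.maximum(W @ h, 0.0)
    w_out = rng.normal(0.0, np.sqrt(c_w / n), size=n)
    return w_out @ h

def excess_kurtosis(s):
    """Fourth cumulant normalized by the variance squared; zero for an exact Gaussian."""
    m2 = np.mean(s ** 2)
    return np.mean(s ** 4) / m2 ** 2 - 3.0

x = rng.normal(size=32)
depth, n_draws = 6, 10_000
for n in [8, 32, 128]:
    outs = np.array([sample_output(n, depth, x) for _ in range(n_draws)])
    print(f"width {n:4d}, depth {depth}: excess kurtosis ≈ {excess_kurtosis(outs):.3f}")
# The estimate is noisy, but the trend (larger deviations from Gaussianity for
# narrower networks at fixed depth) tracks the depth-to-width ratio.
```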

Key Findings

Depth-to-Width Ratio

One of the critical parameters identified is the depth-to-width ratio ($L/n$). The authors show how this ratio controls the validity of the perturbative analysis and the size of the non-Gaussian fluctuations. When the depth becomes comparable to or exceeds the width, interactions between neurons produce rich but strongly coupled statistical behavior that the perturbative description can no longer capture.

Second-Layer and Deeper-Layer Analysis

Roberts and Yaida outline the transition from the exactly Gaussian statistics of first-layer preactivations to increasingly non-Gaussian statistics in subsequent layers. They derive recursion relations for deep networks, shedding light on how activation functions affect signal propagation and how fluctuations accumulate through layers.
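
A rough empirical counterpart to these recursions (again our own illustration, with arbitrary width and depth) is to track the width-averaged kernel at every layer across an ensemble of random ReLU networks at critical initialization: the ensemble mean stays roughly constant, while the ensemble spread grows layer by layer, which is the sense in which finite-width fluctuations accumulate with depth.

```python
import numpy as np

rng = np.random.default_rng(3)

def kernel_trajectory(n, depth, x, c_w=2.0):
    """Width-averaged second moment of the preactivations at every layer of one random ReLU net."""
    h, traj = x, []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(c_w / h.shape[0]), size=(n, h.shape[0]))
        z = W @ h
        traj.append(np.mean(z ** 2))   # this layer's empirical kernel
        h = np.maximum(z, 0.0)
    return traj

x = rng.normal(size=64)
n, depth, n_nets = 64, 12, 2000
trajs = np.array([kernel_trajectory(n, depth, x) for _ in range(n_nets)])
for layer in [1, 4, 8, 12]:
    k = trajs[:, layer - 1]
    print(f"layer {layer:2d}: mean kernel {k.mean():.3f}, relative spread {k.std() / k.mean():.3f}")
# At criticality the mean stays put while the relative spread across the
# ensemble grows with depth: fluctuations build up layer by layer.
```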

Universal Properties

The study introduces the concept of universality, traditionally used in physics to denote systems that exhibit the same large-scale behavior despite differences in microscopic details. The authors show that different activation functions can exhibit similar macroscopic behaviors under critical conditions, offering a unified perspective on neural network dynamics.
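
As a small illustration of this idea (our own sketch, using Gauss-Hermite quadrature for the Gaussian expectation), consider tanh and sin: both satisfy $\phi(0) = 0$ with $\phi'(0) = 1$, share the critical initialization $(C_b, C_W) = (0, 1)$, and, iterating the kernel recursion at criticality, both kernels decay only slowly, roughly like one over the depth up to an activation-dependent coefficient, rather than exponentially.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

pts, wts = hermegauss(80)             # quadrature nodes/weights for the weight exp(-x^2 / 2)
wts = wts / np.sqrt(2 * np.pi)        # normalize so the weights average over a standard normal

def kernel_step(k, phi, c_b=0.0, c_w=1.0):
    """K -> C_b + C_W * E[phi(z)^2] with z ~ N(0, K), via Gauss-Hermite quadrature."""
    return c_b + c_w * np.sum(wts * phi(np.sqrt(k) * pts) ** 2)

for name, phi in [("tanh", np.tanh), ("sin", np.sin)]:
    k, checkpoints = 0.5, {}
    for layer in range(1, 201):
        k = kernel_step(k, phi)       # critical (C_b, C_W) = (0, 1) for both activations
        if layer in (1, 10, 100, 200):
            checkpoints[layer] = round(k, 4)
    print(name, checkpoints)
# Both activations show the same slow, power-law-like decay toward K* = 0,
# the shared macroscopic behavior that defines a universality class.
```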

Practical Implications

Roberts and Yaida’s approach provides meaningful insights into initialization strategies, making the training of deep networks more efficient and stable. They outline specific criteria for tuning initialization hyperparameters, based on the characteristics of the chosen activation functions. This tuning is essential to avoid issues such as exploding or vanishing gradients, which can hinder effective training.
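
A hedged, minimal recipe in NumPy (our own sketch, not the authors' code, with arbitrary width and depth) for this kind of activation-dependent tuning: draw weights with variance $C_W / \text{fan-in}$ and biases with variance $C_b$, using the critical pairs $(C_b, C_W) = (0, 2)$ for ReLU and $(0, 1)$ for tanh. The forward pass below tracks the typical preactivation scale through fifty layers: critical choices keep the signal propagating (preserved for ReLU, only slowly decaying for tanh), while off-critical choices make it collapse or blow up exponentially.

```python
import numpy as np

rng = np.random.default_rng(4)

def final_preactivation_scale(phi, c_w, c_b=0.0, width=512, depth=50):
    """Root-mean-square preactivation at the last layer of a deep random MLP."""
    h = rng.normal(size=width)                        # a generic order-one input
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(c_w / width), size=(width, width))
        b = rng.normal(0.0, np.sqrt(c_b), size=width) if c_b > 0 else 0.0  # biases off when C_b = 0
        z = W @ h + b
        h = phi(z)
    return np.sqrt(np.mean(z ** 2))

relu = lambda z: np.maximum(z, 0.0)
print(f"ReLU, C_W = 2.0 (critical):     {final_preactivation_scale(relu, 2.0):.2e}")
print(f"ReLU, C_W = 1.0 (off-critical): {final_preactivation_scale(relu, 1.0):.2e}")
print(f"ReLU, C_W = 3.0 (off-critical): {final_preactivation_scale(relu, 3.0):.2e}")
print(f"tanh, C_W = 1.0 (critical):     {final_preactivation_scale(np.tanh, 1.0):.2e}")
print(f"tanh, C_W = 0.5 (off-critical): {final_preactivation_scale(np.tanh, 0.5):.2e}")
# Critical tuning keeps the signal at a usable scale; off-critical tuning drives
# it to vanish (or, for an unbounded activation with C_W too large, to explode).
```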

Future Developments

Enhanced Training Algorithms

With a more grounded understanding of the theoretical properties of deep learning, new algorithms can be developed to exploit these insights for more efficient training processes. For example, gradient-based training techniques can be refined using the kernel and fluctuation characteristics described in their research.

Extension to Other Architectures

While the book focuses on MLPs, the effective theory approach can be extended to other architectures like convolutional networks (CNNs) and transformers. This extension could further unify the theory of deep learning across different types of architectures.

Conclusion

Roberts and Yaida offer a meticulous framework for understanding the foundational principles of deep learning through effective theory. By focusing on the mathematical and physical analogies, they provide a pathway to rigorously analyze neural network behavior. Their work stands to bridge the gap between empirical success and theoretical understanding, ultimately guiding the development of better and more efficient deep learning algorithms.
