Tensor Programs I: Wide Feedforward or Recurrent Neural Networks of Any Architecture are Gaussian Processes

Published 28 Oct 2019 in cs.NE, cond-mat.dis-nn, cs.LG, math-ph, and math.MP | (1910.12478v3)

Abstract: Wide neural networks with random weights and biases are Gaussian processes, as originally observed by Neal (1995) and more recently by Lee et al. (2018) and Matthews et al. (2018) for deep fully-connected networks, as well as by Novak et al. (2019) and Garriga-Alonso et al. (2019) for deep convolutional networks. We show that this Neural Network-Gaussian Process correspondence surprisingly extends to all modern feedforward or recurrent neural networks composed of multilayer perceptron, RNNs (e.g. LSTMs, GRUs), (nD or graph) convolution, pooling, skip connection, attention, batch normalization, and/or layer normalization. More generally, we introduce a language for expressing neural network computations, and our result encompasses all such expressible neural networks. This work serves as a tutorial on the tensor programs technique formulated in Yang (2019) and elucidates the Gaussian Process results obtained there. We provide open-source implementations of the Gaussian Process kernels of simple RNN, GRU, transformer, and batchnorm+ReLU network at github.com/thegregyang/GP4A.


Citations (183)


Summary

The paper demonstrates that wide neural networks of any architecture, expressed through Tensor Programs, converge to Gaussian processes.
It rigorously develops analytical proofs and kernel formulations for models ranging from feedforward networks to recurrent architectures like LSTMs and GRUs.
Empirical tests confirm that neural networks with random parameters yield Gaussian output distributions, reinforcing the theoretical GP correspondence.

An Analysis of Tensor Programs and Their Gaussian Process Correspondences

The paper examines the correspondence between neural network architectures and Gaussian processes (GPs). It builds on the observation, first made by Neal (1995), that wide neural networks with random weights behave like GPs. The paper extends this result by showing that the correspondence holds across a broad range of modern architectures, including multilayer perceptrons, recurrent neural networks (RNNs) such as LSTMs and GRUs, convolution, pooling, skip connections, attention, and batch or layer normalization.

The work introduces Tensor Programs, a general language for expressing neural network computations. Any network expressible in this language is covered by a single convergence result: as the network widths tend to infinity, the outputs of the randomly initialized network converge in distribution to a GP. The framework also accommodates GPs with variable-dimensional outputs, such as those of a recurrent network run on sequences of different lengths. The paper proves these results rigorously and concludes that the NN-GP correspondence holds for standard architectures.
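For orientation, the simplest instance of this correspondence is the deep fully-connected case treated by Lee et al. (2018) and Matthews et al. (2018). The recursion below is a sketch in our own notation, not the paper's: it assumes input dimension d, weights drawn i.i.d. from N(0, sigma_w^2 / fan-in), biases from N(0, sigma_b^2), and a nonlinearity phi. K^l denotes the limiting covariance of the layer-l preactivations.

```latex
\begin{align}
K^{1}(x, x') &= \sigma_w^2 \, \frac{x^\top x'}{d} + \sigma_b^2, \\
K^{l+1}(x, x') &= \sigma_w^2 \,
  \mathbb{E}_{(u, v) \sim \mathcal{N}(0, \Sigma^{l})}
  \bigl[ \phi(u)\, \phi(v) \bigr] + \sigma_b^2,
\qquad
\Sigma^{l} =
\begin{pmatrix}
K^{l}(x, x) & K^{l}(x, x') \\
K^{l}(x', x) & K^{l}(x', x')
\end{pmatrix}.
\end{align}
```

The kernel at each layer is a deterministic function of the kernel at the previous layer. The paper's contribution is to show that a limit of this kind exists for every architecture expressible as a Tensor Program, not only for this layered case.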

Key Contributions and Results

Wide-ranging Applicability: The Tensor Programs language is a small set of rules and operations sufficient to express the computations of a wide variety of neural network architectures. As a result, standard architectures can be represented within one framework and analyzed for GP convergence with a single theorem.
Gaussian Process Interpretation: By characterizing the deterministic covariance kernel that arises in the infinite-width limit, the work explains why wide neural networks behave like Gaussian processes. The paper gives explicit expressions for computing these kernels, which apply even to architectures with recurrent updates or normalization layers.
Empirical Verification: The paper substantiates its theoretical claims empirically by implementing the GP kernels of simple RNNs, GRUs, and transformers and testing convergence to them. The experiments show that the outputs of randomly initialized networks are approximately Gaussian distributed, even for architectures as complex as transformers.
Technical Rigor: The work provides complete proofs of the GP convergence results and of their broad applicability. Of particular interest is that Tensor Programs recast neural network computations in a mathematical form that makes these proofs tractable.
Code Availability: Open-source implementations of the GP kernels are available at github.com/thegregyang/GP4A. They corroborate the paper's analytical results and give the research community a resource for independent verification and further experimentation.
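The kind of check described under Empirical Verification can be sketched for the simplest architecture, a fully-connected ReLU network. This is not the paper's code and does not use the GP4A repository. It is a minimal NumPy illustration under stated assumptions: weights are drawn from N(0, sigma_w^2 / fan-in), biases from N(0, sigma_b^2), and the function names `relu_nngp_kernel` and `sample_outputs` are hypothetical. The analytic kernel uses the known closed form for E[relu(u) relu(v)] under a bivariate Gaussian (the degree-1 arc-cosine kernel).

```python
import numpy as np

def relu_nngp_kernel(X, depth, sigma_w=1.0, sigma_b=0.0):
    """Infinite-width output kernel of a ReLU MLP with `depth` hidden layers.

    Assumed parametrization (ours, for illustration):
    weights ~ N(0, sigma_w^2 / fan_in), biases ~ N(0, sigma_b^2).
    """
    d = X.shape[1]
    # Covariance of the first-layer preactivations.
    K = sigma_w**2 * (X @ X.T) / d + sigma_b**2
    for _ in range(depth):
        std = np.sqrt(np.diag(K))
        norm = np.outer(std, std)
        cos_t = np.clip(K / norm, -1.0, 1.0)
        theta = np.arccos(cos_t)
        # Closed form of E[relu(u) relu(v)] for (u, v) ~ N(0, K).
        expect = norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)
        K = sigma_w**2 * expect + sigma_b**2
    return K

def sample_outputs(X, depth, width, n_samples, sigma_w=1.0, sigma_b=0.0, seed=0):
    """Scalar outputs of `n_samples` independent random ReLU MLPs on the rows of X."""
    rng = np.random.default_rng(seed)
    outs = np.empty((n_samples, X.shape[0]))
    for s in range(n_samples):
        h = X
        for _ in range(depth):
            fan_in = h.shape[1]
            W = rng.standard_normal((fan_in, width)) * sigma_w / np.sqrt(fan_in)
            b = sigma_b * rng.standard_normal(width)
            h = np.maximum(h @ W + b, 0.0)
        # Linear readout with the same parametrization.
        w_out = rng.standard_normal(h.shape[1]) * sigma_w / np.sqrt(h.shape[1])
        b_out = sigma_b * rng.standard_normal()
        outs[s] = h @ w_out + b_out
    return outs

if __name__ == "__main__":
    X = np.array([[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]])
    K = relu_nngp_kernel(X, depth=2)
    outs = sample_outputs(X, depth=2, width=128, n_samples=1500)
    # The outputs have zero mean, so the second moment estimates the covariance.
    K_emp = outs.T @ outs / outs.shape[0]
    print("analytic kernel:\n", K)
    print("empirical covariance:\n", K_emp)
```

The empirical covariance of the sampled outputs should approach the analytic kernel as the width and the number of samples grow. Matching covariances alone does not establish Gaussianity. A fuller check would also compare higher moments or histograms of the outputs against the Gaussian prediction.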

Implications and Future Directions

The research has several implications. Most directly, it improves the understanding of neural network behavior at initialization and supports Bayesian inference with the corresponding GP priors. The analytical tools developed may also lead to more refined initialization strategies and training regimes that draw on these GP insights.

Future work is expected to apply Tensor Programs to the derivation of neural tangent kernels, which are central to understanding the dynamics of neural networks during training. Extending the framework to accommodate matrix transposes would also broaden the class of computations it can analyze.

Overall, the paper provides a solid theoretical and practical foundation for understanding why large, architecturally complex neural networks behave as Gaussian processes at random initialization. It closes a gap in the analytical understanding of modern architectures and serves as a stepping stone for further theoretical work in machine learning.
