
Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks

(2310.02244)
Published Oct 3, 2023 in cs.NE, cond-mat.dis-nn, and math.PR

Abstract

By classifying infinite-width neural networks and identifying the optimal limit, Tensor Programs IV and V demonstrated a universal way, called $\mu$P, for widthwise hyperparameter transfer, i.e., predicting optimal hyperparameters of wide neural networks from narrow ones. Here we investigate the analogous classification for depthwise parametrizations of deep residual networks (resnets). We classify depthwise parametrizations of block multiplier and learning rate by their infinite-width-then-depth limits. In resnets where each block has only one layer, we identify a unique optimal parametrization, called Depth-$\mu$P, that extends $\mu$P, and we show empirically that it admits depthwise hyperparameter transfer. We identify feature diversity as a crucial factor in deep networks, and Depth-$\mu$P can be characterized as maximizing both feature learning and feature diversity. Exploiting this, we find that absolute value, among all homogeneous nonlinearities, maximizes feature diversity and indeed empirically leads to significantly better performance. However, if each block is deeper (as in modern transformers), then we find fundamental limitations in all possible infinite-depth limits of such parametrizations, which we illustrate both theoretically and empirically on simple networks as well as on a Megatron transformer trained on Common Crawl.
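
As a concrete illustration (not taken from the paper or its code release), the sketch below shows the single-layer-per-block resnet setting with each residual branch scaled by $1/\sqrt{L}$, which is the block-multiplier scaling Depth-$\mu$P prescribes, and with absolute value as the block nonlinearity, as highlighted in the abstract. Names such as DepthMuPResNet and branch_scale are illustrative; widthwise initialization and learning-rate scaling are assumed to follow $\mu$P separately, and the depthwise learning-rate scaling is left as a placeholder since its exact exponent depends on the optimizer.

```python
# Minimal sketch, assuming the paper's simplest setting: one linear layer per
# residual block, branch multiplier 1/sqrt(depth) (the Depth-muP block
# multiplier), and absolute value as the homogeneous nonlinearity. Class and
# variable names are illustrative, not from the paper's code.
import math
import torch
import torch.nn as nn


class DepthMuPResNet(nn.Module):
    def __init__(self, width: int, depth: int):
        super().__init__()
        # One linear layer per residual block.
        self.blocks = nn.ModuleList(
            [nn.Linear(width, width, bias=False) for _ in range(depth)]
        )
        # Depth-muP block multiplier: each branch contributes O(1/sqrt(L)).
        self.branch_scale = 1.0 / math.sqrt(depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Absolute value nonlinearity, which the abstract reports as
            # maximizing feature diversity among homogeneous nonlinearities.
            x = x + self.branch_scale * torch.abs(block(x))
        return x


# Usage sketch. The paper also prescribes a depth-dependent learning-rate
# scaling; the exponent depends on the optimizer, so base_lr here is a
# placeholder to be tuned (e.g., transferred from a shallower proxy model),
# not the paper's exact prescription.
model = DepthMuPResNet(width=256, depth=64)
base_lr = 1e-2  # assumption: illustrative value only
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr)
```

The point of the sketch is only the structural choice the abstract describes: every residual branch is damped by the block multiplier so that, as depth grows, the sum of block contributions stays well behaved while each block still learns features.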
