Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call muTransfer: parametrize the target model in muP, tune the HPs indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all. We verify muTransfer on Transformer and ResNet. For example, 1) by transferring pretraining HPs from a model of 13M parameters, we outperform published numbers of BERT-large (350M parameters), with a total tuning cost equivalent to pretraining BERT-large once; 2) by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only 7% of total pretraining cost. A PyTorch implementation of our technique can be found at github.com/microsoft/mup and is installable via pip install mup.
The paper introduces Maximal Update Parametrization (μP), a novel approach to efficient hyperparameter tuning for large neural networks.
μTransfer allows hyperparameters tuned on smaller models using μP to be effectively transferred to larger models, significantly reducing tuning costs.
Experimental results show that μTransfer achieves superior performance and cost efficiency compared to standard parametrizations for models like GPT-3 and BERT.
Hyperparameter (HP) tuning in deep learning can be incredibly expensive, especially for neural networks with billions of parameters. Traditional HP tuning methods, like grid or random search, become cost-prohibitive at that scale because every trial means training the full model. This paper introduces a new technique called Maximal Update Parametrization ($\mu$P), which aims to make HP tuning manageable again.
The core idea here is called $\mu$Transfer. It works by tuning HPs on a smaller model that uses the $\mu$P parametrization and then transferring those HPs to a much larger model without any need for additional tuning. So, what's special about $\mu$P?
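To make this concrete, here is a minimal sketch of what parametrizing a model in $\mu$P looks like with the mup package mentioned above. The class and function names (MuReadout, set_base_shapes, MuAdam) follow the repository's documented workflow, but the model itself (MyMLP, its width argument, and the specific widths) is a hypothetical stand-in for your own architecture; check the repo's README for the authoritative usage.

```python
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam  # pip install mup

class MyMLP(nn.Module):
    """Hypothetical model; only the readout layer differs from plain PyTorch."""
    def __init__(self, width=128, d_in=32, d_out=10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d_in, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # MuReadout replaces the usual nn.Linear output layer so the logits
        # stay correctly scaled as the width grows.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(self.body(x))

# A "base" and a "delta" model tell mup which dimensions are meant to scale.
base_model = MyMLP(width=8)
delta_model = MyMLP(width=16)

# The target model you actually want to train, at full width.
model = MyMLP(width=1024)
set_base_shapes(model, base_model, delta=delta_model)

# Optimizer wrappers from mup scale per-parameter learning rates with width.
optimizer = MuAdam(model.parameters(), lr=1e-3)
```

Once the base shapes are set, the model is in $\mu$P, and HPs tuned at a small width are meant to stay near-optimal at a large width.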
In standard parametrizations (SP), as the model size increases, the optimal HPs tend to change a lot, making it hard to predict the right HPs for a large model based on a smaller model. For instance, an optimal learning rate for a small model might cause a larger model to diverge during training. This inconsistency forces researchers to tune large models directly, which is very costly.
When using $\mu$P, the optimal HPs remain stable as the model size increases. This means you can tune HPs such as the learning rate on a small, cheap proxy model and reuse them directly on the full-sized model, which never needs to be tuned itself; the transfer is zero-shot.
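Here is a schematic sketch of what that zero-shot transfer loop looks like, continuing the previous snippet (MyMLP, base_model, and delta_model are reused from it). The train_and_evaluate helper and the specific learning-rate grid are illustrative placeholders, not something from the paper.

```python
from mup import set_base_shapes, MuAdam

def train_and_evaluate(model, optimizer):
    # Placeholder for your own training/validation code; should return a
    # validation loss. It is a dummy here only so the sketch runs end to end.
    return 0.0

def tune_on_proxy(proxy_width=256, lr_grid=(1e-4, 3e-4, 1e-3, 3e-3, 1e-2)):
    """Sweep the learning rate on a small muP proxy and return the best value."""
    best_lr, best_loss = None, float("inf")
    for lr in lr_grid:
        proxy = MyMLP(width=proxy_width)                       # cheap proxy model
        set_base_shapes(proxy, base_model, delta=delta_model)  # put it in muP
        loss = train_and_evaluate(proxy, MuAdam(proxy.parameters(), lr=lr))
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr

# Zero-shot transfer: the learning rate found on the proxy is used directly
# on the full-sized model, which is trained exactly once and never re-tuned.
best_lr = tune_on_proxy()
big_model = MyMLP(width=8192)
set_base_shapes(big_model, base_model, delta=delta_model)
train_and_evaluate(big_model, MuAdam(big_model.parameters(), lr=best_lr))
```

The key point is that the expensive full-width model is trained exactly once, with the HPs found on the cheap proxy.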
The paper verifies $\mu$Transfer on Transformers and ResNets, and the headline results are striking: by transferring pretraining HPs from a 13M-parameter proxy, the authors outperform published BERT-large (350M parameters) numbers with a total tuning cost equivalent to pretraining BERT-large once, and by transferring from a 40M-parameter proxy, they outperform the published numbers of the 6.7B-parameter GPT-3 model with a tuning cost of only about 7% of the total pretraining cost.
This new tuning method changes the way we think about scaling models. It's not just about building larger and larger models anymore; it's about making the tuning process scalable too. Looking forward, this can democratize access to high-performance models, making sophisticated AI accessible to more researchers and industries.
Overall, $\mu$Transfer and $\mu$P represent a significant step forward in making HP tuning for large neural networks more efficient and less costly. This technique can potentially influence both academic research and practical applications, pushing the boundaries of what is feasible with AI. For intermediate data scientists, this means you can now aim higher with your model sizes without the dread of an unmanageable HP tuning process.