Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (muP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call muTransfer: parametrize the target model in muP, tune the HPs indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all. We verify muTransfer on Transformer and ResNet. For example, 1) by transferring pretraining HPs from a model of 13M parameters, we outperform published numbers of BERT-large (350M parameters), with a total tuning cost equivalent to pretraining BERT-large once; 2) by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only 7% of total pretraining cost. A PyTorch implementation of our technique can be found at github.com/microsoft/mup and is installable via pip install mup.
The paper introduces Maximal Update Parametrization (μP), a novel approach to efficient hyperparameter tuning for large neural networks.
μTransfer allows hyperparameters tuned on smaller models using μP to be effectively transferred to larger models, significantly reducing tuning costs.
Experimental results show that μTransfer achieves superior performance and cost efficiency compared to standard parametrizations for models like GPT-3 and BERT.
Hyperparameter (HP) tuning in deep learning can be incredibly expensive, especially for neural networks with billions of parameters. Traditional HP tuning methods, like grid or random search, become cost-prohibitive at that scale because every trial means training the full model. This paper introduces a new technique called Maximal Update Parametrization ($\mu$P), which aims to make HP tuning manageable again.
The core idea here is called $\mu$Transfer. It works by tuning HPs on a smaller model that uses the $\mu$P parametrization and then transferring those HPs to a much larger model without any need for additional tuning. So, what's special about $\mu$P?
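To make this concrete, here is a minimal sketch of what parametrizing a model in $\mu$P looks like with the mup package mentioned above. The class and function names (MuReadout, set_base_shapes, MuAdam) follow the repository's documented workflow, but the model itself (MyMLP, its width argument, and the specific widths) is a hypothetical stand-in for your own architecture; check the repo's README for the authoritative usage.

```python
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam  # pip install mup

class MyMLP(nn.Module):
    """Hypothetical model; only the readout layer differs from plain PyTorch."""
    def __init__(self, width=128, d_in=32, d_out=10):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(d_in, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
        )
        # MuReadout replaces the usual nn.Linear output layer so the logits
        # stay correctly scaled as the width grows.
        self.readout = MuReadout(width, d_out)

    def forward(self, x):
        return self.readout(self.body(x))

# A "base" and a "delta" model tell mup which dimensions are meant to scale.
base_model = MyMLP(width=8)
delta_model = MyMLP(width=16)

# The target model you actually want to train, at full width.
model = MyMLP(width=1024)
set_base_shapes(model, base_model, delta=delta_model)

# Optimizer wrappers from mup scale per-parameter learning rates with width.
optimizer = MuAdam(model.parameters(), lr=1e-3)
```

Once the base shapes are set, the model is in $\mu$P, and HPs tuned at a small width are meant to stay near-optimal at a large width.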
In standard parametrizations (SP), as the model size increases, the optimal HPs tend to change a lot, making it hard to predict the right HPs for a large model based on a smaller model. For instance, an optimal learning rate for a small model might cause a larger model to diverge during training. This inconsistency forces researchers to tune large models directly, which is very costly.
When using $\mu$P, the optimal HPs remain stable as the model size increases. This means you can tune HPs such as the learning rate on a small, cheap proxy model and reuse them directly on the full-sized model, which never needs to be tuned itself; the transfer is zero-shot.
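Here is a schematic sketch of what that zero-shot transfer loop looks like, continuing the previous snippet (MyMLP, base_model, and delta_model are reused from it). The train_and_evaluate helper and the specific learning-rate grid are illustrative placeholders, not something from the paper.

```python
from mup import set_base_shapes, MuAdam

def train_and_evaluate(model, optimizer):
    # Placeholder for your own training/validation code; should return a
    # validation loss. It is a dummy here only so the sketch runs end to end.
    return 0.0

def tune_on_proxy(proxy_width=256, lr_grid=(1e-4, 3e-4, 1e-3, 3e-3, 1e-2)):
    """Sweep the learning rate on a small muP proxy and return the best value."""
    best_lr, best_loss = None, float("inf")
    for lr in lr_grid:
        proxy = MyMLP(width=proxy_width)                       # cheap proxy model
        set_base_shapes(proxy, base_model, delta=delta_model)  # put it in muP
        loss = train_and_evaluate(proxy, MuAdam(proxy.parameters(), lr=lr))
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr

# Zero-shot transfer: the learning rate found on the proxy is used directly
# on the full-sized model, which is trained exactly once and never re-tuned.
best_lr = tune_on_proxy()
big_model = MyMLP(width=8192)
set_base_shapes(big_model, base_model, delta=delta_model)
train_and_evaluate(big_model, MuAdam(big_model.parameters(), lr=best_lr))
```

The key point is that the expensive full-width model is trained exactly once, with the HPs found on the cheap proxy.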
The paper verifies $\mu$Transfer on Transformers and ResNets, and the headline results are striking: by transferring pretraining HPs from a 13M-parameter proxy, the authors outperform published BERT-large (350M parameters) numbers with a total tuning cost equivalent to pretraining BERT-large once, and by transferring from a 40M-parameter proxy, they outperform the published numbers of the 6.7B-parameter GPT-3 model with a tuning cost of only about 7% of the total pretraining cost.
This new tuning method changes the way we think about scaling models. It's not just about building larger and larger models anymore; it's about making the tuning process scalable too. Looking forward, this can democratize access to high-performance models, making sophisticated AI accessible to more researchers and industries.
Overall, $\mu$Transfer and $\mu$P represent a significant step forward in making HP tuning for large neural networks more efficient and less costly. This technique can potentially influence both academic research and practical applications, pushing the boundaries of what is feasible with AI. For intermediate data scientists, this means you can now aim higher with your model sizes without the dread of an unmanageable HP tuning process.