
VeLO: Training Versatile Learned Optimizers by Scaling Up

(2211.09760)
Published Nov 17, 2022 in cs.LG, math.OC, and stat.ML

Abstract

While deep learning models have replaced hand-designed features across many domains, these models are still trained with hand-designed optimizers. In this work, we leverage the same scaling approach behind the success of deep learning to learn versatile optimizers. We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates. Meta-trained with approximately four thousand TPU-months of compute on a wide variety of optimization tasks, our optimizer not only exhibits compelling performance, but optimizes in interesting and unexpected ways. It requires no hyperparameter tuning, instead automatically adapting to the specifics of the problem being optimized. We open source our learned optimizer, meta-training code, the associated train and test data, and an extensive optimizer benchmark suite with baselines at velo-code.github.io.

Figure: Optimizer performance on the VeLOdrome benchmark. VeLO outperforms learning-rate-tuned baselines and NAdamW with extensive hyperparameter tuning.

Overview

  • VeLO is a meta-learned, neural network-based optimizer that outperforms traditional hand-designed optimizers while requiring no hyperparameter tuning.

  • The optimizer uses a hierarchical design combining per-tensor LSTM and per-parameter MLP components and is meta-trained with Evolution Strategies on a broad distribution of tasks to promote generalization.

  • In benchmarks, VeLO is frequently faster and more effective than the best-tuned Adam baselines, demonstrating the potential for significant advances in machine learning optimization.

VeLO: Meta-Learned Optimization

Introduction

The paper presents the design and evaluation of VeLO, a learned optimizer for deep learning that performs well across a wide variety of tasks. Unlike traditional optimizers, which rely on hand-designed update rules and hyperparameter tuning, VeLO is a neural network-based optimizer meta-trained with approximately four thousand TPU-months of compute. It adapts automatically to the task at hand and requires no hyperparameter tuning.
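
To make the idea concrete, the sketch below shows how a learned optimizer of this kind could slot into an ordinary JAX training loop: it consumes gradients (plus its own state) and emits parameter updates directly, with no learning rate or other hyperparameters to set. The `learned_opt` object and its `init`/`update` methods are hypothetical placeholders for illustration, not the actual interface released at velo-code.github.io.

```python
# Hypothetical sketch of using a learned optimizer in a JAX training loop.
# `learned_opt` and its `init`/`update` methods are illustrative placeholders,
# not the actual API released with the paper.
import jax

def fit(learned_opt, params, batches, loss_fn):
    opt_state = learned_opt.init(params)  # recurrent/accumulator state
    for step, batch in enumerate(batches):
        loss, grads = jax.value_and_grad(loss_fn)(params, batch)
        # The optimizer ingests gradients (and loss/step features) and
        # outputs additive parameter updates; nothing here is hand-tuned.
        updates, opt_state = learned_opt.update(
            grads, opt_state, params=params, loss=loss, step=step)
        params = jax.tree_util.tree_map(lambda p, u: p + u, params, updates)
    return params
```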

Training and Meta-Training of VeLO

VeLO represents a significant scaling-up of prior work on learned optimizers. The architecture is hierarchical: a per-tensor LSTM processes aggregated statistics about each parameter tensor, and a per-parameter MLP computes updates from a small set of per-parameter features. This design keeps VeLO computationally efficient while remaining expressive enough to capture complex optimization strategies.
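
A minimal sketch of this hierarchical structure is given below, assuming a simplified set of per-tensor statistics and per-parameter features and a hand-rolled LSTM cell; the actual feature set, network sizes, and wiring in VeLO differ in detail.

```python
# Simplified sketch of a per-tensor LSTM plus per-parameter MLP optimizer.
# Feature choices and shapes are illustrative assumptions, not VeLO's exact design.
import jax
import jax.numpy as jnp

def per_tensor_stats(grad, momentum):
    # Aggregate statistics summarizing an entire parameter tensor;
    # these form the input to the per-tensor LSTM.
    return jnp.stack([jnp.mean(grad), jnp.std(grad) + 1e-8,
                      jnp.mean(jnp.abs(momentum)), jnp.log(grad.size * 1.0)])

def lstm_cell(params, h, c, x):
    # Standard LSTM equations; params["W"] has shape (H + X, 4H), params["b"] shape (4H,).
    z = jnp.concatenate([h, x]) @ params["W"] + params["b"]
    i, f, g, o = jnp.split(z, 4)
    c = jax.nn.sigmoid(f) * c + jax.nn.sigmoid(i) * jnp.tanh(g)
    h = jax.nn.sigmoid(o) * jnp.tanh(c)
    return h, c

def per_parameter_update(mlp_params, grad, momentum, context_h):
    # A tiny MLP applied independently to every scalar parameter.
    # Inputs: per-parameter features plus the per-tensor LSTM context,
    # which is computed once per tensor and shared by all its parameters.
    feats = jnp.stack([grad, momentum], axis=-1)
    ctx = jnp.broadcast_to(context_h, grad.shape + context_h.shape)
    x = jnp.concatenate([feats, ctx], axis=-1)
    hidden = jax.nn.relu(x @ mlp_params["W1"] + mlp_params["b1"])
    out = hidden @ mlp_params["W2"] + mlp_params["b2"]
    direction, log_step = out[..., 0], out[..., 1]
    # The MLP outputs a direction and a (log) magnitude for each update.
    return direction * jnp.exp(log_step)
```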

The meta-training process employed Evolution Strategies (ES) to estimate gradients for the meta-objective. Meta-training covered an extensive set of tasks, including various network architectures and learning problems, ensuring that VeLO could generalize across different domains. To manage the computational cost, the training curriculum incorporated task-augmentation and rejection sampling, favoring tasks with shorter run times.
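
The core of the ES meta-gradient estimate can be sketched as follows. Here `meta_loss_fn` is assumed to unroll the learned optimizer on a sampled task and return a scalar meta-loss, and the antithetic sampling and noise scale `sigma` are simplifying assumptions; the paper's distributed meta-training pipeline is considerably more involved.

```python
# Minimal sketch of antithetic Evolution Strategies for the meta-gradient.
# `meta_loss_fn` is assumed to be JAX-traceable (otherwise a Python loop
# would replace vmap); theta is a flat vector of meta-parameters.
import jax
import jax.numpy as jnp

def es_meta_gradient(meta_loss_fn, theta, key, num_pairs=64, sigma=0.01):
    noise = jax.random.normal(key, (num_pairs, theta.shape[0]))
    # Antithetic pairs (theta + sigma*e, theta - sigma*e) reduce variance.
    pos = jax.vmap(lambda e: meta_loss_fn(theta + sigma * e))(noise)
    neg = jax.vmap(lambda e: meta_loss_fn(theta - sigma * e))(noise)
    scores = (pos - neg) / (2.0 * sigma)
    # Estimated gradient of the Gaussian-smoothed meta-objective w.r.t. theta.
    return jnp.mean(scores[:, None] * noise, axis=0)
```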

Evaluation Metrics and Baselines

VeLO was evaluated using multiple benchmarks:

  1. VeLOdrome: Includes 83 canonical tasks with diverse model architectures.
  2. MLCommons: Tests on a variety of large-scale tasks.
  3. Real-world Tasks: Applications such as object detection, LLMs, and vision transformers.

Baselines included traditional optimizers like Adam, tuned over numerous hyperparameter trials, as well as learned optimizers from prior work. Evaluations revealed that VeLO outperformed these baselines across a broad range of tasks, even when the baselines received extensive hyperparameter tuning.

Numerical Results

The performance results are compelling. On VeLOdrome tasks, VeLO was consistently faster and often achieved lower final losses than the best learning-rate-tuned Adam baselines; on more than 50% of tasks it was over four times faster. On the MLCommons benchmarks, VeLO matched or outperformed heavily tuned Adam across several large-scale tasks without any hyperparameter tuning. VeLO proved particularly effective in settings where optimization is traditionally challenging, such as training large-scale vision transformers and decision transformers.
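
As an illustration, one common way to quantify such a speedup is to count how many steps each optimizer needs to first reach the tuned baseline's best final loss. The helper below assumes this definition, which may differ from the exact metric used in the paper.

```python
# Hedged illustration of a "times faster" speedup metric: steps needed to
# first reach the tuned baseline's best loss. The paper's exact definition
# of speedup may differ.
def steps_to_reach(loss_curve, target_loss):
    for step, loss in enumerate(loss_curve, start=1):
        if loss <= target_loss:
            return step
    return None  # target never reached

def speedup_over_baseline(velo_losses, adam_losses):
    target = min(adam_losses)              # best loss of the tuned baseline
    velo_steps = steps_to_reach(velo_losses, target)
    if velo_steps is None:
        return 0.0
    return len(adam_losses) / velo_steps   # e.g. 4.0 means "4x faster"
```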

Implications and Future Work

This work illustrates a considerable advancement in optimization capabilities through meta-learning. By demonstrating that learned optimizers can outperform hand-designed counterparts without hyperparameter tuning, VeLO sets a new precedent for the capabilities of learned algorithms. Practically, this translates into reduced computational overhead in the training process and more robust performance across diverse and complex modeling tasks.

Theoretically, VeLO's success emphasizes the potential of meta-learning paradigms to discover superior optimization strategies, potentially revealing insights that inform better hand-designed heuristics as well. Future research could expand on several fronts:

  • Enhanced Architectures: Improving the expressiveness and efficiency of the learned optimizer by incorporating advanced neural network designs or leveraging second-order information.
  • Task-specific Customization: Investigating mechanisms for task-specific conditioning to tailor optimization strategies dynamically.
  • Efficiency Improvements: Streamlining the meta-training process to reduce computational requirements, and exploring partial-unroll techniques or analytic gradient methods.
  • Extended Generalization: Addressing current limitations in scaling to very large model sizes or extremely lengthy training steps by expanding the meta-training task distribution and improving the robustness of continuation strategies.

Conclusion

VeLO demonstrates the feasibility and advantages of meta-learned optimization over traditional hand-designed algorithms. By leveraging significant computational resources and a diverse training curriculum, VeLO achieves broad generalization and strong performance across a wide array of tasks. This work not only advances the state of the art in optimization but also underscores the broader potential of meta-learning to enhance other components of future machine learning pipelines.
