- The paper introduces the trace-aware knowledge-gradient acquisition function to leverage trace observations and efficiently tune hyperparameters.
- The acquisition is embedded in a multi-fidelity Bayesian optimization loop and maximized with multistart stochastic gradient ascent, yielding a provably convergent acquisition-optimization procedure.
- Empirical results on neural networks and kernel learning show significant improvements over methods like FaBOLAS, Hyperband, and BOCA.
Practical Multi-Fidelity Bayesian Optimization for Hyperparameter Tuning
The paper "Practical Multi-Fidelity Bayesian Optimization for Hyperparameter Tuning" dives into the intricacies of optimizing hyperparameters for machine learning models—specifically targeting deep neural networks and large-scale kernel learning—through a novel multi-fidelity approach. Hyperparameter tuning is crucial for improving model performance, but it often suffers from bottlenecks associated with the vast resources required for evaluating validation errors across varying hyperparameter settings. The authors address these challenges by proposing a multi-fidelity Bayesian optimization framework that employs trace-aware techniques to efficiently manage and leverage fidelity variations.
Core Contributions
The paper introduces the trace-aware knowledge-gradient (taKG) acquisition function, which is designed to exploit multi-fidelity observations, in particular the trace of model performance recorded over a sequence of training iterations. At each step, taKG selects both a hyperparameter configuration and a set of fidelities (training data size, validation data size, and number of training iterations) so as to maximize the value of the information gained per unit cost. The proposed taKG is shown to outperform existing methods such as FaBOLAS, Hyperband, and BOCA through both theoretical analysis and empirical evaluation.
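To make the notion of a trace concrete, consider a single training run that is evaluated at regular checkpoints: one run at the highest fidelity automatically produces observations at every checkpointed lower fidelity along the way. The sketch below is not code from the paper; the `train_and_trace` helper, its fake learning-curve model, and all constants are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_and_trace(hyperparams, n_iters, eval_every=5):
    """Hypothetical training routine that records validation error at
    regular checkpoints.  One run at fidelity `n_iters` therefore yields
    'trace' observations at every checkpointed lower fidelity as well."""
    lr, reg = hyperparams
    # Fake learning curve: error decays toward a floor set by the
    # hyperparameters, with a little noise (purely illustrative).
    floor = 0.05 + 0.2 * abs(np.log10(lr) + 2) + 0.1 * reg
    error, trace = 1.0, []
    for it in range(1, n_iters + 1):
        error = floor + (error - floor) * 0.97 + rng.normal(0, 0.005)
        if it % eval_every == 0:
            trace.append((it, round(error, 4)))  # (fidelity, validation error)
    return trace

# A single 50-iteration run populates the surrogate with observations at
# fidelities 5, 10, ..., 50, which is exactly the kind of data taKG is built to use.
print(train_and_trace(hyperparams=(1e-2, 0.1), n_iters=50))
```

The surrogate model can then be conditioned on every (iteration, error) pair from the run, rather than only on the final value.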
Key aspects of this approach include:
- Trace Observations: Rather than evaluating a single fidelity level in isolation, taKG uses trace observations, which capture the model's performance at every lower fidelity reached along the way (for example, the validation error after each training iteration), allowing better-informed decisions about which hyperparameters to evaluate next and at what fidelity.
- Convergent Optimization: The authors provide a provably convergent method for maximizing the taKG acquisition function even though it cannot be evaluated in closed form. The technique combines stochastic gradient estimates with multistart stochastic gradient ascent (a rough sketch of this optimization loop appears below, after the discussion of batch and derivative settings).
- 0-avoiding taKG (taKG⁰): This variant of the acquisition function mitigates a failure mode of multi-fidelity methods in which very low-fidelity evaluations consume computational resources while providing little useful information. By avoiding samples at near-zero fidelities, it offers a simple fix to a common problem in multi-fidelity optimization (a rough sketch of a cost-aware acquisition score follows this list).
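To illustrate the "information per unit cost" logic behind such an acquisition, here is a deliberately simplified sketch. It is not the authors' taKG formula: the surrogate is a set of independent Gaussian beliefs over a small grid rather than the paper's Gaussian process over hyperparameters and fidelities, and the `cost` and `observation_noise` models are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy surrogate: independent Gaussian beliefs about validation accuracy on a
# grid of 20 candidate hyperparameter settings (a stand-in for a GP posterior).
mu = rng.normal(0.70, 0.05, size=20)   # posterior means
sigma = np.full(20, 0.08)              # posterior standard deviations

def observation_noise(fidelity):
    # Fewer training iterations -> noisier estimate of the final accuracy.
    return 0.2 / np.sqrt(fidelity)

def cost(fidelity, c_fixed=30.0, c_per_iter=0.5):
    # Fixed startup overhead plus a per-iteration cost.  The fixed term keeps
    # near-zero fidelities from looking free (a crude stand-in for the concern
    # that motivates the 0-avoiding variant).
    return c_fixed + c_per_iter * fidelity

def kg_per_unit_cost(i, fidelity, n_fantasies=2000):
    """Monte Carlo, knowledge-gradient-style score: expected gain in the best
    predicted accuracy after observing point i at `fidelity`, per unit cost."""
    best_now = mu.max()
    best_others = np.delete(mu, i).max()
    tau2 = observation_noise(fidelity) ** 2
    post_var = 1.0 / (1.0 / sigma[i] ** 2 + 1.0 / tau2)
    # Fantasize outcomes of the cheap evaluation, then apply the standard
    # Gaussian update to the mean at point i.
    y = rng.normal(mu[i], np.sqrt(sigma[i] ** 2 + tau2), size=n_fantasies)
    new_mu_i = post_var * (mu[i] / sigma[i] ** 2 + y / tau2)
    return (np.maximum(new_mu_i, best_others).mean() - best_now) / cost(fidelity)

# Score every (candidate, fidelity) pair and evaluate the most valuable one.
fidelities = [10, 100, 1000]
scores = {(i, s): kg_per_unit_cost(i, s) for i in range(len(mu)) for s in fidelities}
point, iters = max(scores, key=scores.get)
print(f"evaluate candidate {point} with {iters} training iterations")
```

The fixed overhead in the cost model is only a crude way of keeping near-zero fidelities from appearing free; the paper's 0-avoiding variant addresses the same concern in a more principled way.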
The paper also extends taKG to batch settings, where multiple evaluations run concurrently, and to derivative-enabled settings, where gradient information is available, demonstrating its adaptability across different practical scenarios.
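As noted in the list above, the acquisition has no closed form and is maximized by multistart stochastic gradient ascent. The sketch below shows the shape of such a loop on a stand-in objective: the paper derives stochastic gradient estimators for taKG itself, whereas here a generic two-point (SPSA-style) estimator and a toy `noisy_acq` function, both invented for illustration, take their place.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_acq(z):
    """Stand-in for a Monte Carlo estimate of the acquisition at a point z:
    a smooth toy function plus sampling noise (illustrative only)."""
    return -np.sum((z - 0.6) ** 2) + rng.normal(0.0, 0.01)

def spsa_gradient(z, delta=0.05):
    # Two-point simultaneous-perturbation gradient estimate: perturb along a
    # random +/-1 direction and difference two noisy acquisition evaluations.
    direction = rng.choice([-1.0, 1.0], size=z.shape)
    diff = noisy_acq(z + delta * direction) - noisy_acq(z - delta * direction)
    return (diff / (2.0 * delta)) * direction

def multistart_sga(dim=3, n_starts=8, n_steps=300, lr0=0.1):
    """Multistart stochastic gradient ascent: run ascent from several random
    restarts and keep the point whose re-estimated acquisition value is best."""
    best_z, best_val = None, -np.inf
    for _ in range(n_starts):
        z = rng.uniform(0.0, 1.0, size=dim)
        for t in range(1, n_steps + 1):
            z = np.clip(z + (lr0 / np.sqrt(t)) * spsa_gradient(z), 0.0, 1.0)
        val = np.mean([noisy_acq(z) for _ in range(50)])  # de-noise the final score
        if val > best_val:
            best_z, best_val = z, val
    return best_z

print(multistart_sga())  # should land near the toy optimum at (0.6, 0.6, 0.6)
```

Random restarts guard against the non-concavity of the acquisition surface, and the decaying step size is the usual ingredient behind convergence guarantees for stochastic gradient methods.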
Numerical Results and Implications
Numerical experiments on synthetic functions, neural network hyperparameter tuning (MNIST, CIFAR-10, SVHN), and large-scale kernel learning show that taKG reaches good hyperparameter configurations at substantially lower cost than the competing methods. These results are particularly relevant for practitioners tuning models that are expensive to train, since the framework concentrates computation on the evaluations that are most informative per unit cost.
From a theoretical standpoint, the work advances Bayesian optimization by incorporating trace-aware acquisition functions, and it may inspire further research into how multi-fidelity information can be harnessed in optimization strategies.
Future Directions
While the paper is a substantial step forward for multi-fidelity Bayesian optimization, several avenues remain open. Future research could explore dynamic fidelity scaling driven by real-time resource availability, or extend the taKG framework to distributed settings in which large-scale models are trained across many computational nodes. Applying the approach to a broader range of architectures, such as transformers and recurrent networks, could further widen its applicability.
Overall, this paper contributes significantly to efficient hyperparameter optimization by pairing theory with practical execution, allowing practitioners to use multi-fidelity approaches to obtain well-tuned models at reduced computational cost.