Accelerating Bayesian Optimization for Biological Sequence Design with Denoising Autoencoders (2203.12742v2)

Published 23 Mar 2022 in cs.LG, cs.NE, q-bio.QM, and stat.ML

Abstract: Bayesian optimization (BayesOpt) is a gold standard for query-efficient continuous optimization. However, its adoption for drug design has been hindered by the discrete, high-dimensional nature of the decision variables. We develop a new approach (LaMBO) which jointly trains a denoising autoencoder with a discriminative multi-task Gaussian process head, allowing gradient-based optimization of multi-objective acquisition functions in the latent space of the autoencoder. These acquisition functions allow LaMBO to balance the explore-exploit tradeoff over multiple design rounds, and to balance objective tradeoffs by optimizing sequences at many different points on the Pareto frontier. We evaluate LaMBO on two small-molecule design tasks, and introduce new tasks optimizing in silico and in vitro properties of large-molecule fluorescent proteins. In our experiments LaMBO outperforms genetic optimizers and does not require a large pretraining corpus, demonstrating that BayesOpt is practical and effective for biological sequence design.

Summary

The paper introduces LaMBO, a novel method that integrates a denoising autoencoder with a multi-task Gaussian process for biological sequence optimization.
It leverages a continuous latent space to enable gradient-based optimization, effectively addressing the challenges of high-dimensional discrete search spaces.
Experimental results demonstrate that LaMBO outperforms genetic algorithms in optimizing protein stability and solvent-accessible surface area, advancing the Pareto frontier.

The paper presents Latent Multi-Objective Bayesian Optimization (LaMBO), a method for biological sequence design. Biological sequence optimization holds significant promise for drug development, a field impeded by high costs and complex molecular interactions. Bayesian optimization has seen limited adoption there because the decision variables are discrete and high-dimensional, and LaMBO is designed to make query-efficient optimization practical under those constraints.

Key Contributions and Methodology

The research introduces a novel architecture that integrates a denoising autoencoder (DAE) with a multi-task Gaussian process (GP) head. This setup enables efficient Bayesian optimization by mapping sequences to a continuous latent space where gradient-based optimization becomes feasible. The DAE’s role is crucial as it learns robust, noise-resistant representations of sequences, which the GP head utilizes to make informed predictions about potential sequence queries. This integration allows LaMBO to handle the explore-exploit tradeoff effectively and navigate the Pareto frontier for multi-objective optimization tasks.
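The idea of optimizing in a continuous latent space can be illustrated with a minimal sketch. Everything here is a toy assumption rather than the paper's implementation: the function `acquisition` is a hypothetical smooth stand-in for the acquisition value computed from the GP head, the two-dimensional vector `z` stands in for an encoded sequence, and the step that decodes the optimized latent back into a sequence is omitted.

```python
import math

# Hypothetical peak location of the toy acquisition surface (an assumption).
CENTER = (1.0, -0.5)


def acquisition(z):
    """Toy stand-in for an acquisition value; a smooth bump peaking at CENTER."""
    return math.exp(-sum((zi - ci) ** 2 for zi, ci in zip(z, CENTER)))


def acquisition_grad(z):
    """Analytic gradient of the toy acquisition with respect to the latent z."""
    a = acquisition(z)
    return [-2.0 * (zi - ci) * a for zi, ci in zip(z, CENTER)]


def latent_ascent(z0, steps=50, lr=0.5):
    """Plain gradient ascent on the acquisition, starting from latent z0."""
    z = list(z0)
    for _ in range(steps):
        grad = acquisition_grad(z)
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z


z_start = [0.0, 0.0]          # stands in for the encoding of a seed sequence
z_opt = latent_ascent(z_start)
print(acquisition(z_start), acquisition(z_opt))
```

The point of the sketch is only that a discrete sequence never has to be mutated directly during the inner optimization loop: gradients flow through a continuous vector, and a decoder (not shown) would map the result back to tokens.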

LaMBO was evaluated on two small-molecule design tasks and on newly introduced large-molecule tasks that optimize fluorescent proteins for properties such as folding stability and solvent-accessible surface area (SASA). LaMBO outperformed genetic algorithm (GA) baselines in both sample efficiency and solution quality, and did so without relying on a large pretraining corpus.

Experimental Results

The empirical analysis highlights LaMBO's capability to advance the Pareto frontier over successive optimization rounds. For instance, in optimizing the stability and SASA of proteins, LaMBO found non-dominated and improved variants compared to ancestor proteins, underscoring its practical relevance for real-world applications. Additionally, the results from multi-objective tasks exhibited higher hypervolume improvement when compared with traditional GA methods.
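To make "non-dominated" and "hypervolume improvement" concrete, here is a minimal two-objective sketch. It assumes both objectives are maximized and uses made-up objective values and a made-up reference point; the function names are illustrative and are not taken from the paper's code.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def hypervolume_2d(points, ref):
    """Area dominated by the Pareto front and bounded below by the reference point."""
    front = [p for p in pareto_front(points) if p[0] > ref[0] and p[1] > ref[1]]
    # Sorting by the first objective (descending) orders the second one ascending.
    front = sorted(set(front), key=lambda p: p[0], reverse=True)
    area, prev_f2 = 0.0, ref[1]
    for f1, f2 in front:
        area += (f1 - ref[0]) * (f2 - prev_f2)  # add one horizontal slab
        prev_f2 = f2
    return area


# Made-up objective values for some starting designs (assumption).
ancestors = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (1.5, 1.5)]
ref_point = (0.0, 0.0)
candidate = (2.5, 2.5)

hv_before = hypervolume_2d(ancestors, ref_point)
hv_after = hypervolume_2d(ancestors + [candidate], ref_point)
print(pareto_front(ancestors), hv_after - hv_before)
```

In this toy data the candidate is not dominated by any ancestor, so adding it enlarges the dominated area; a candidate that some ancestor dominates would leave the hypervolume unchanged.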

Theoretical and Practical Implications

The main methodological contribution is a way to navigate high-dimensional, discrete optimization landscapes without substantial pretraining. LaMBO handles multiple objectives through the NEHVI (Noisy Expected Hypervolume Improvement) acquisition function, which scores a candidate by how much it is expected to enlarge the hypervolume dominated by the current Pareto frontier. The results indicate that Bayesian optimization can manage uncertainty and make sequential decisions that lead to improved biological designs, a critical aspect of drug development pipelines.
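For reference, the hypervolume-improvement idea behind NEHVI can be written in its commonly used form. The notation is an assumption introduced here, not taken from the summary: $P$ is a set of objective vectors, $r$ a reference point, $\lambda$ the Lebesgue measure, $\mathcal{D}$ the observed data, and $P_f$ the Pareto frontier of a posterior sample $f$ evaluated at the previously queried designs, with all objectives maximized.

```latex
\begin{align}
\mathrm{HV}(P, r) &= \lambda\Big( \bigcup_{y \in P} [r, y] \Big) \\
\mathrm{HVI}(y \mid P, r) &= \mathrm{HV}(P \cup \{y\}, r) - \mathrm{HV}(P, r) \\
\alpha_{\mathrm{NEHVI}}(x) &= \mathbb{E}_{f \sim p(f \mid \mathcal{D})}
  \big[ \mathrm{HVI}\big(f(x) \mid P_f, r\big) \big]
\end{align}
```

Because the frontier $P_f$ is itself drawn from the posterior, the acquisition accounts for noise in the previously observed objective values; in practice the expectation is usually estimated by Monte Carlo over posterior samples.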

Speculations on Future Developments

The paper hints at several avenues for future exploration, including combining LaMBO with pre-trained biological models and improving the initialization step for mutation site selection. Additionally, the techniques discussed could inspire further refinements in non-myopic acquisition functions, integrating multi-modal inputs such as structural or genomic data, and extending these methods to address the complex, constrained problems often encountered in real-world drug discovery processes.

Overall, the research provides a compelling case for the use of advanced machine learning methods in biological sequence optimization. By leveraging a novel integration of denoising autoencoders and Bayesian inference, LaMBO stands to significantly enhance the efficiency and scope of drug design efforts. As the field progresses, such innovations will likely become critical tools in the continuous endeavor to improve health outcomes through advanced biotechnological development.

GitHub

GitHub - samuelstanton/lambo: Code to reproduce experiments in "Accelerating Bayesian Optimization for Protein Design with Denoising Autoencoders" (Stanton et al 2022) (61 stars)