Robust Optimization in Protein Fitness Landscapes Using Reinforcement Learning in Latent Space (2405.18986v1)
Abstract: Proteins are complex molecules responsible for different functions in nature. Enhancing the functionality of proteins and cellular fitness can significantly impact various industries. However, protein optimization using computational methods remains challenging, especially when starting from low-fitness sequences. We propose LatProtRL, an optimization method to efficiently traverse a latent space learned by an encoder-decoder leveraging a large protein language model. To escape local optima, our optimization is modeled as a Markov decision process using reinforcement learning acting directly in latent space. We evaluate our approach on two important fitness optimization tasks, demonstrating its ability to achieve comparable or superior fitness over baseline methods. Our findings and in vitro evaluation show that the generated sequences can reach high-fitness regions, suggesting a substantial potential of LatProtRL in lab-in-the-loop scenarios.
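The abstract frames optimization as a Markov decision process whose states and actions live in a learned latent space. A minimal sketch of that framing follows; it is not the authors' implementation. The names `LatentEnv`, `toy_fitness`, and `random_rollout`, the step bound, and the horizon are all illustrative assumptions. In particular, `toy_fitness` is a hypothetical stand-in for decoding a latent vector to a sequence and scoring it, and a uniformly random policy takes the place of a trained reinforcement learning policy.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Vector = List[float]


def toy_fitness(z: Vector) -> float:
    """Hypothetical stand-in for 'decode z, then score the sequence'.

    The maximum (0.0) is at z = (1, 1, ..., 1).
    """
    return -sum((x - 1.0) ** 2 for x in z)


@dataclass
class LatentEnv:
    """Toy MDP: the state is a latent vector, an action is a bounded shift."""

    fitness: Callable[[Vector], float]
    start: Vector               # latent code of the starting sequence
    max_step: float = 0.1       # assumed per-coordinate bound on an action
    horizon: int = 20           # assumed episode length
    state: Vector = field(default_factory=list)
    t: int = 0

    def reset(self) -> Vector:
        self.state = list(self.start)
        self.t = 0
        return list(self.state)

    def step(self, action: Vector) -> Tuple[Vector, float, bool]:
        # Clip the action so every move stays local in latent space.
        clipped = [max(-self.max_step, min(self.max_step, a)) for a in action]
        self.state = [s + a for s, a in zip(self.state, clipped)]
        self.t += 1
        reward = self.fitness(self.state)
        done = self.t >= self.horizon
        return list(self.state), reward, done


def random_rollout(env: LatentEnv, rng: random.Random) -> float:
    """Run one episode with a random policy; return the best fitness seen."""
    state = env.reset()
    best = env.fitness(state)
    done = False
    while not done:
        action = [rng.uniform(-env.max_step, env.max_step) for _ in state]
        state, reward, done = env.step(action)
        best = max(best, reward)
    return best


if __name__ == "__main__":
    env = LatentEnv(fitness=toy_fitness, start=[0.0, 0.0])
    print(random_rollout(env, random.Random(0)))
```

In a setting closer to the one described, the reward would come from a fitness predictor or a wet-lab measurement applied to the decoded sequence, and the random policy would be replaced by a learned one.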
Authors: Minji Lee, Luiz Felipe Vecchietti, Hyunkyu Jung, Hyun Joo Ro, Meeyoung Cha, Ho Min Kim