A Text-guided Protein Design Framework (2302.04611v4)
Abstract: Current AI-assisted protein design mainly utilizes protein sequence and structural information. Meanwhile, a vast body of human-curated knowledge describing proteins' high-level functionalities exists in text form, yet whether incorporating such text data can help protein design tasks has not been explored. To bridge this gap, we propose ProteinDT, a multi-modal framework that leverages textual descriptions for protein design. ProteinDT consists of three consecutive steps: ProteinCLAP, which aligns the representations of the two modalities; a facilitator, which generates a protein representation from the text modality; and a decoder, which generates protein sequences from that representation. To train ProteinDT, we construct SwissProtCLAP, a large dataset of 441K text-protein pairs. We quantitatively verify the effectiveness of ProteinDT on three challenging tasks: (1) over 90% accuracy in text-guided protein generation; (2) the best hit ratio on 12 zero-shot text-guided protein editing tasks; and (3) superior performance on four out of six protein property prediction benchmarks.
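The abstract outlines a three-stage pipeline: contrastive text-protein alignment (ProteinCLAP), a facilitator mapping text representations to protein representations, and a decoder producing sequences. Below is a minimal PyTorch sketch of that pipeline's shape. Every architectural specific here, including the encoders, the shared dimension `EMB_DIM`, the InfoNCE temperature, the MLP facilitator, and the recurrent decoder head, is an illustrative assumption, not the paper's actual implementation.

```python
# Minimal sketch of the three ProteinDT stages described in the abstract.
# The encoders, dimensions, temperature, and decoder head are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 256  # assumed shared latent dimension


class ProteinCLAP(nn.Module):
    """Step 1: contrastively align text and protein representations."""

    def __init__(self, text_encoder: nn.Module, protein_encoder: nn.Module):
        super().__init__()
        self.text_encoder = text_encoder        # e.g. a pretrained text LM
        self.protein_encoder = protein_encoder  # e.g. a pretrained protein LM
        self.temperature = 0.07                 # assumed InfoNCE temperature

    def forward(self, text_batch, protein_batch):
        t = F.normalize(self.text_encoder(text_batch), dim=-1)       # (B, D)
        p = F.normalize(self.protein_encoder(protein_batch), dim=-1)  # (B, D)
        logits = t @ p.T / self.temperature                          # (B, B)
        labels = torch.arange(t.size(0), device=t.device)
        # Symmetric InfoNCE: each text matches its paired protein, and vice versa.
        return 0.5 * (F.cross_entropy(logits, labels)
                      + F.cross_entropy(logits.T, labels))


class Facilitator(nn.Module):
    """Step 2: map an aligned text representation to a protein representation."""

    def __init__(self, dim: int = EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, text_repr):
        return self.net(text_repr)


class SequenceDecoder(nn.Module):
    """Step 3: decode a protein representation into per-position amino-acid
    logits (a toy recurrent head standing in for the paper's decoder)."""

    def __init__(self, vocab_size: int = 25, dim: int = EMB_DIM):
        super().__init__()
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, protein_repr, max_len: int = 64):
        h0 = protein_repr.unsqueeze(0).contiguous()          # (1, B, D) init state
        x = protein_repr.unsqueeze(1).repeat(1, max_len, 1)  # condition every step
        out, _ = self.gru(x, h0)
        return self.head(out)                                # (B, L, vocab) logits


# Toy usage with random features standing in for tokenized inputs.
text_enc, prot_enc = nn.Linear(128, EMB_DIM), nn.Linear(128, EMB_DIM)
clap_loss = ProteinCLAP(text_enc, prot_enc)(torch.randn(8, 128), torch.randn(8, 128))
seq_logits = SequenceDecoder()(Facilitator()(text_enc(torch.randn(8, 128))))
```

In this reading, the contrastive stage plays the role CLIP plays in text-to-image systems, and the facilitator bridges the residual gap between the two aligned latent spaces before decoding; the abstract states the stages are trained on the 441K SwissProtCLAP text-protein pairs.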