AlphaFold Distillation for Protein Design (2210.03488v2)

Published 5 Oct 2022 in q-bio.BM and cs.LG

Abstract: Inverse protein folding, the process of designing sequences that fold into a specific 3D structure, is crucial in bio-engineering and drug discovery. Traditional methods rely on experimentally resolved structures, but these cover only a small fraction of protein sequences. Forward folding models like AlphaFold offer a potential solution by accurately predicting structures from sequences. However, these models are too slow for integration into the optimization loop of inverse folding models during training. To address this, we propose using knowledge distillation on folding model confidence metrics, such as pTM or pLDDT scores, to create a faster and end-to-end differentiable distilled model. This model can then be used as a structure consistency regularizer in training the inverse folding model. Our technique is versatile and can be applied to other design tasks, such as sequence-based protein infilling. Experimental results show that our method outperforms non-regularized baselines, yielding up to 3% improvement in sequence recovery and up to 45% improvement in protein diversity while maintaining structural consistency in generated sequences. Code is available at https://github.com/IBM/AFDistill
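
To make the regularization idea concrete, below is a minimal PyTorch sketch of how a distilled confidence predictor could be plugged into inverse-folding training as a structure consistency regularizer, as the abstract describes. The names here (DistilledConfidenceModel, training_step, inverse_folding_model) and the loss weighting alpha are illustrative assumptions, not the AFDistill code; the distilled model would typically be trained first to mimic AlphaFold confidence scores (pTM/pLDDT) and then kept frozen while the inverse folding model trains against the combined loss. See the linked repository for the authors' actual implementation.

```python
# Minimal sketch (not the authors' implementation): inverse folding trained with
# cross-entropy plus a structure-consistency regularizer from a distilled,
# differentiable confidence predictor. Architecture and names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistilledConfidenceModel(nn.Module):
    """Tiny differentiable stand-in for the distilled folding-confidence model.

    Maps a soft sequence distribution of shape (B, L, 20) to a scalar
    pTM/pLDDT-like confidence in (0, 1) per example.
    """

    def __init__(self, vocab_size: int = 20, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, seq_probs: torch.Tensor) -> torch.Tensor:
        # Score each residue, mean-pool over the sequence, squash to (0, 1).
        return torch.sigmoid(self.net(seq_probs).mean(dim=1)).squeeze(-1)


def training_step(inverse_folding_model, confidence_model, structure,
                  target_seq, alpha: float = 0.1) -> torch.Tensor:
    """One step combining sequence recovery loss with the confidence regularizer."""
    logits = inverse_folding_model(structure)            # (B, L, 20)

    # Standard sequence recovery objective (cross-entropy over residues).
    ce = F.cross_entropy(logits.transpose(1, 2), target_seq)

    # Differentiable "soft" sequence fed to the frozen distilled model.
    seq_probs = torch.softmax(logits, dim=-1)
    confidence = confidence_model(seq_probs)             # predicted confidence score

    # Penalize designs the distilled model deems structurally implausible.
    reg = (1.0 - confidence).mean()
    return ce + alpha * reg
```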
