Knowledge from Large-Scale Protein Contact Prediction Models Can Be Transferred to the Data-Scarce RNA Contact Prediction Task (2302.06120v3)
Abstract: RNA, whose functionality is largely determined by its structure, plays an important role in many biological activities. Predicting the pairwise structural proximity between every pair of nucleotides in an RNA sequence characterizes the structural information of the RNA. Historically, this problem has been tackled by machine learning models that use expert-engineered features and are trained on scarce labeled datasets. Here, we find that the knowledge learned by a protein-coevolution Transformer-based deep neural network can be transferred to the RNA contact prediction task. Because protein datasets are orders of magnitude larger than those for RNA contact prediction, our findings and the subsequent framework greatly relieve the data-scarcity bottleneck. Experiments confirm that transfer learning from a publicly available protein model greatly improves RNA contact prediction. Our findings indicate that the learned structural patterns of proteins can be transferred to RNAs, opening up potential new avenues for research.
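To make the prediction target concrete, the following minimal sketch shows how a binary contact map can be derived from known 3D coordinates. It is an illustration only and not the authors' pipeline: the function name `contact_map`, the use of one representative atom per nucleotide, the 10 Å distance cutoff, and the `min_separation` filter are all assumptions chosen for the example.

```python
import numpy as np

def contact_map(coords, threshold=10.0, min_separation=1):
    """Return a boolean L x L matrix marking nucleotide pairs that are close in space.

    coords: (L, 3) array, one representative atom position per nucleotide (assumed).
    threshold: distance cutoff in angstroms (assumed value, not taken from the paper).
    min_separation: minimum sequence distance |i - j| for a pair to count.
    """
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances via broadcasting: (L, 1, 3) - (1, L, 3).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Mask out pairs that are too close along the sequence (including i == j).
    idx = np.arange(len(coords))
    far_enough = np.abs(idx[:, None] - idx[None, :]) >= min_separation
    return (dist < threshold) & far_enough

# Toy example: four positions on a line, spaced 6 angstroms apart.
coords = [[0, 0, 0], [6, 0, 0], [12, 0, 0], [18, 0, 0]]
cmap = contact_map(coords)
print(cmap.astype(int))
```

In this toy case only sequence neighbours fall under the cutoff, so the map is symmetric with an empty diagonal. A contact prediction model is trained to output such a matrix from sequence information alone.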