
Clustering for Protein Representation Learning (2404.00254v1)

Published 30 Mar 2024 in cs.LG, cs.CE, q-bio.BM, and q-bio.QM

Abstract: Protein representation learning is a challenging task that aims to capture the structure and function of proteins from their amino acid sequences. Previous methods largely ignored the fact that not all amino acids are equally important for protein folding and activity. In this article, we propose a neural clustering framework that can automatically discover the critical components of a protein by considering both its primary and tertiary structure information. Our framework treats a protein as a graph, where each node represents an amino acid and each edge represents a spatial or sequential connection between amino acids. We then apply an iterative clustering strategy to group the nodes into clusters based on their 1D and 3D positions and assign scores to each cluster. We select the highest-scoring clusters and use their medoid nodes for the next iteration of clustering, until we obtain a hierarchical and informative representation of the protein. We evaluate on four protein-related tasks: protein fold classification, enzyme reaction classification, gene ontology term prediction, and enzyme commission number prediction. Experimental results demonstrate that our method achieves state-of-the-art performance.
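The pipeline described in the abstract (build a residue graph, cluster, score, keep the best clusters, carry their medoids forward) can be illustrated with a toy sketch. This is not the authors' implementation. The function names, the `seq_window`, `radius`, and `keep_ratio` parameters, the choice of one cluster per surviving node, and the size-based score are all assumptions made for illustration; the abstract does not say how clusters are formed or scored.

```python
import math


def build_graph(coords, seq_window=1, radius=4.0):
    """Adjacency sets: one node per residue; an edge joins two residues that are
    close in the sequence (index gap <= seq_window) or in space (<= radius)."""
    n = len(coords)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if j - i <= seq_window or math.dist(coords[i], coords[j]) <= radius:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def medoid(members, coords):
    """Member with the smallest total distance to all members of the cluster."""
    return min(
        members,
        key=lambda m: sum(math.dist(coords[m], coords[o]) for o in members),
    )


def cluster_round(nodes, coords, adj, keep_ratio=0.5):
    """One round: form a cluster around each surviving node, score the clusters,
    keep the top fraction, and return the medoids of the kept clusters."""
    alive = set(nodes)
    clusters = [sorted({c} | (adj[c] & alive)) for c in nodes]
    # Placeholder score (assumption): cluster size stands in for the paper's score.
    ranked = sorted(clusters, key=len, reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return sorted({medoid(members, coords) for members in ranked[:keep]})


def hierarchy(coords, rounds=2, seq_window=1, radius=4.0, keep_ratio=0.5):
    """Repeat cluster_round, recording which residues survive at each level."""
    adj = build_graph(coords, seq_window, radius)
    nodes = list(range(len(coords)))
    levels = [nodes]
    for _ in range(rounds):
        nodes = cluster_round(nodes, coords, adj, keep_ratio)
        levels.append(nodes)
    return levels


# Hypothetical coordinates: two groups of three residues along one axis.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0), (11, 0, 0), (12, 0, 0)]
levels = hierarchy(coords, rounds=2, radius=1.5)
```

Each entry of `levels` lists the residue indices that survive to that level, so later levels are progressively coarser views of the same protein. In this sketch the graph is built once and clusters are simply restricted to surviving nodes; rebuilding the graph between rounds would be an equally plausible reading of the abstract.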

