
FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics (2402.16901v1)

Published 24 Feb 2024 in q-bio.GN, cs.AI, and cs.LG

Abstract: Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on k-mer representations, limiting the capture of structurally relevant gene contexts. To address these limitations and further our understanding of complex relationships between metagenomic sequences and their functions, we introduce a protein-based gene representation as a context-aware and structure-relevant tokenizer. Our approach includes Masked Gene Modeling (MGM) for gene group-level pre-training, providing insights into inter-gene contextual information, and Triple Enhanced Metagenomic Contrastive Learning (TEM-CL) for gene-level pre-training to model gene sequence-function relationships. MGM and TEM-CL constitute our novel metagenomic LLM FGBERT, pre-trained on 100 million metagenomic sequences. We demonstrate the superiority of our proposed FGBERT on eight datasets.
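
The two pre-training ideas named in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not state the masking rate, the mask token, or the exact form of the TEM-CL loss, so the 15% rate, the `[MASK]` token, the helper names, and the use of a standard triplet margin loss as a stand-in for the contrastive objective are all assumptions made for illustration.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def mask_genes(genes, mask_prob=0.15, seed=0):
    """Masked-modeling sketch: hide some gene tokens in a gene group.

    Returns (masked, targets). targets holds the original token at each
    masked position and None elsewhere. mask_prob is an assumed value.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for gene in genes:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(gene)
        else:
            masked.append(gene)
            targets.append(None)
    return masked, targets


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss, used here as an assumed stand-in.

    The loss is zero once the anchor is closer to the positive (for example,
    a gene with the same function) than to the negative by at least margin.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

In this sketch, a model would be trained to predict the entries of `targets` from `masked` (the gene group-level signal), while `triplet_loss` would pull embeddings of functionally similar genes together and push dissimilar ones apart (the gene-level signal).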
