
DNABERT-S: Pioneering Species Differentiation with Species-Aware DNA Embeddings (2402.08777v3)

Published 13 Feb 2024 in q-bio.GN, cs.AI, cs.CE, and cs.CL

Abstract: We introduce DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species in the embedding space. Differentiating species from genomic sequences (i.e., DNA and RNA) is vital yet challenging, since many real-world species remain uncharacterized, lacking known genomes for reference. Embedding-based methods are therefore used to differentiate species in an unsupervised manner. DNABERT-S builds upon a pre-trained genome foundation model named DNABERT-2. To encourage effective embeddings for error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C$^2$LR) strategy. Empirical results on 23 diverse datasets show DNABERT-S's effectiveness, especially in realistic label-scarce scenarios. For example, it identifies twice as many species from a mixture of unlabeled genomic sequences, doubles the Adjusted Rand Index (ARI) in species clustering, and outperforms the top baseline's 10-shot species classification performance with just 2-shot training. Model, code, and data are publicly available at \url{https://github.com/MAGICS-LAB/DNABERT_S}.
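
To make the abstract's description of MI-Mix more concrete, below is a minimal, illustrative PyTorch sketch of a Manifold-Instance-Mixup-style contrastive objective: hidden states of different DNA sequences are mixed at a randomly chosen encoder layer, and the loss asks the model to recover the mixing proportions at the output layer. This is an assumption-laden sketch based only on the abstract and on the related i-Mix and Manifold Mixup ideas, not the authors' implementation; the function name, the mean-pooling, the Beta-sampled mixing coefficient, and all hyperparameters are illustrative. The official code at https://github.com/MAGICS-LAB/DNABERT_S is the authoritative reference.

```python
import torch
import torch.nn.functional as F

def mi_mix_loss(remaining_layers, h_view1, h_view2, temperature=0.05, alpha=1.0):
    """Illustrative Manifold-Instance-Mixup-style contrastive loss (assumed form).

    h_view1, h_view2: (B, L, D) hidden states of two views of the same B DNA
    sequences, taken at a randomly chosen encoder layer.
    remaining_layers: the transformer blocks that follow that layer.
    """
    B = h_view1.size(0)
    # Sample a mixing coefficient and a random pairing of instances.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(B, device=h_view1.device)

    # Mix hidden representations of different sequences at the chosen layer.
    mixed = lam * h_view1 + (1.0 - lam) * h_view1[perm]

    # Finish the forward pass and mean-pool into normalized sequence embeddings.
    def encode(h):
        for layer in remaining_layers:
            h = layer(h)
        return F.normalize(h.mean(dim=1), dim=-1)

    anchors, keys = encode(mixed), encode(h_view2)   # (B, D) each
    sim = anchors @ keys.t() / temperature           # (B, B) similarity matrix

    # The model must recover the mixing proportions at the output layer:
    # anchor i should match key i with weight lam and key perm[i] with 1 - lam.
    log_prob = F.log_softmax(sim, dim=1)
    idx = torch.arange(B, device=sim.device)
    loss = -(lam * log_prob[idx, idx] + (1.0 - lam) * log_prob[idx, perm]).mean()
    return loss
```

The C$^2$LR curriculum mentioned in the abstract would then, presumably, schedule training from an easier contrastive objective toward this mixed one; the exact schedule is described in the paper and repository.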
