
DNABERT-S: Pioneering Species Differentiation with Species-Aware DNA Embeddings (2402.08777v3)

Published 13 Feb 2024 in q-bio.GN, cs.AI, cs.CE, and cs.CL

Abstract: We introduce DNABERT-S, a tailored genome model that develops species-aware embeddings to naturally cluster and segregate DNA sequences of different species in the embedding space. Differentiating species from genomic sequences (i.e., DNA and RNA) is vital yet challenging, since many real-world species remain uncharacterized, lacking known genomes for reference. Embedding-based methods are therefore used to differentiate species in an unsupervised manner. DNABERT-S builds upon a pre-trained genome foundation model named DNABERT-2. To encourage effective embeddings for error-prone long-read DNA sequences, we introduce Manifold Instance Mixup (MI-Mix), a contrastive objective that mixes the hidden representations of DNA sequences at randomly selected layers and trains the model to recognize and differentiate these mixed proportions at the output layer. We further enhance it with the proposed Curriculum Contrastive Learning (C$^2$LR) strategy. Empirical results on 23 diverse datasets show DNABERT-S's effectiveness, especially in realistic label-scarce scenarios. For example, it identifies twice as many species from a mixture of unlabeled genomic sequences, doubles the Adjusted Rand Index (ARI) in species clustering, and outperforms the top baseline's 10-shot species classification performance with just 2-shot training. Model, code, and data are publicly available at \url{https://github.com/MAGICS-LAB/DNABERT_S}.
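
To make the abstract's description of MI-Mix more concrete, below is a minimal, illustrative PyTorch sketch of a Manifold-Instance-Mixup-style contrastive objective: hidden states of different DNA sequences are mixed at a randomly chosen encoder layer, and the loss asks the model to recover the mixing proportions at the output layer. This is an assumption-laden sketch based only on the abstract and on the related i-Mix and Manifold Mixup ideas, not the authors' implementation; the function name, the mean-pooling, the Beta-sampled mixing coefficient, and all hyperparameters are illustrative. The official code at https://github.com/MAGICS-LAB/DNABERT_S is the authoritative reference.

```python
import torch
import torch.nn.functional as F

def mi_mix_loss(remaining_layers, h_view1, h_view2, temperature=0.05, alpha=1.0):
    """Illustrative Manifold-Instance-Mixup-style contrastive loss (assumed form).

    h_view1, h_view2: (B, L, D) hidden states of two views of the same B DNA
    sequences, taken at a randomly chosen encoder layer.
    remaining_layers: the transformer blocks that follow that layer.
    """
    B = h_view1.size(0)
    # Sample a mixing coefficient and a random pairing of instances.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(B, device=h_view1.device)

    # Mix hidden representations of different sequences at the chosen layer.
    mixed = lam * h_view1 + (1.0 - lam) * h_view1[perm]

    # Finish the forward pass and mean-pool into normalized sequence embeddings.
    def encode(h):
        for layer in remaining_layers:
            h = layer(h)
        return F.normalize(h.mean(dim=1), dim=-1)

    anchors, keys = encode(mixed), encode(h_view2)   # (B, D) each
    sim = anchors @ keys.t() / temperature           # (B, B) similarity matrix

    # The model must recover the mixing proportions at the output layer:
    # anchor i should match key i with weight lam and key perm[i] with 1 - lam.
    log_prob = F.log_softmax(sim, dim=1)
    idx = torch.arange(B, device=sim.device)
    loss = -(lam * log_prob[idx, idx] + (1.0 - lam) * log_prob[idx, perm]).mean()
    return loss
```

The C$^2$LR curriculum mentioned in the abstract would then, presumably, schedule training from an easier contrastive objective toward this mixed one; the exact schedule is described in the paper and repository.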
