
GapPredict: A Language Model for Resolving Gaps in Draft Genome Assemblies (2105.10552v2)

Published 21 May 2021 in q-bio.GN and cs.AI

Abstract: Short-read DNA sequencing instruments can yield over 1e+12 bases per run, typically composed of reads 150 bases long. Despite this high throughput, de novo assembly algorithms have difficulty reconstructing contiguous genome sequences using short reads due to both repetitive and difficult-to-sequence regions in these genomes. Some of the short read assembly challenges are mitigated by scaffolding assembled sequences using paired-end reads. However, unresolved sequences in these scaffolds appear as "gaps". Here, we introduce GapPredict, a tool that uses a character-level language model to predict unresolved nucleotides in scaffold gaps. We benchmarked GapPredict against the state-of-the-art gap-filling tool Sealer, and observed that the former can fill 65.6% of the sampled gaps that were left unfilled by the latter, demonstrating the practical utility of deep learning approaches to the gap-filling problem in genome sequence assembly.

Authors (5)
  1. Eric Chen (35 papers)
  2. Justin Chu (2 papers)
  3. Jessica Zhang (6 papers)
  4. Rene L. Warren (2 papers)
  5. Inanc Birol (4 papers)
Citations (3)
