Fast Processing and Querying of 170TB of Genomics Data via a Repeated And Merged BloOm Filter (RAMBO) (1910.04358v4)

Published 10 Oct 2019 in q-bio.GN and cs.IR

Abstract: DNA sequencing, especially of microbial genomes and metagenomes, has been at the core of recent research advances in large-scale comparative genomics. The data deluge has resulted in exponential growth in genomic datasets over the past years and shows no sign of slowing down. Several recent attempts have been made to tame the computational burden of sequence search on these terabyte- and petabyte-scale datasets, including raw reads and assembled genomes. However, no known implementation provides fast query and construction time, meets the low false-positive requirement, and offers cheap storage of the data structure all at once. We propose a data structure for search called RAMBO (Repeated And Merged BloOm Filter) which is significantly faster in query time than state-of-the-art genome indexing methods: COBS (Compact bit-sliced signature index), Sequence Bloom Trees, HowDeSBT, and SSBT. Furthermore, it supports insertion and query parallelism and cheap updates for streaming inputs, and it has a zero false-negative rate, a low false-positive rate, and a small index size. RAMBO converts the search problem into set membership testing among $K$ documents. Interestingly, it is a count-min-sketch-type arrangement of a membership testing utility (a Bloom filter in our case). The simplicity of the algorithm and its embarrassingly parallel architecture allow us to stream and index a 170TB whole-genome sequence dataset in a mere 9 hours on a cluster of 100 nodes, while competing methods require weeks.
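To make the construction concrete, the sketch below illustrates the idea described in the abstract: a count-min-sketch-style grid of R repetitions by B buckets, where each bucket holds a single merged Bloom filter over the union of all documents hashed to it, and a query intersects the candidate document sets across repetitions. This is a minimal illustrative sketch based only on the abstract; the class names, parameter defaults, and SHA-256-based hashing are assumptions for illustration, not the authors' implementation.

```python
import hashlib
from typing import Iterable, Set


class BloomFilter:
    """A plain Bloom filter: zero false negatives, tunable false-positive rate."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # zero-initialized bit array

    def _positions(self, item: str) -> Iterable[int]:
        # Derive num_hashes positions from seeded SHA-256 digests (illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


class RamboSketch:
    """R repetitions x B buckets of merged Bloom filters over K documents.

    Each document (e.g. a genome's k-mer set) is hashed to one bucket per
    repetition; the bucket's Bloom filter stores the union of all documents
    mapped to it. A query intersects candidates across repetitions, shrinking
    false positives while keeping zero false negatives.
    """

    def __init__(self, num_docs: int, repetitions: int = 4, buckets: int = 8):
        self.num_docs = num_docs
        self.repetitions = repetitions
        self.buckets = buckets
        self.grid = [[BloomFilter() for _ in range(buckets)]
                     for _ in range(repetitions)]

    def _bucket(self, doc_id: int, rep: int) -> int:
        digest = hashlib.sha256(f"{rep}:{doc_id}".encode()).hexdigest()
        return int(digest, 16) % self.buckets

    def insert(self, doc_id: int, kmers: Iterable[str]) -> None:
        """Stream one document's k-mers into its merged Bloom filter in every repetition."""
        kmers = list(kmers)
        for rep in range(self.repetitions):
            bf = self.grid[rep][self._bucket(doc_id, rep)]
            for kmer in kmers:
                bf.add(kmer)

    def query(self, kmer: str) -> Set[int]:
        """Return candidate documents that may contain the k-mer (no false negatives)."""
        candidates = set(range(self.num_docs))
        for rep in range(self.repetitions):
            hit_docs = set()
            for b, bf in enumerate(self.grid[rep]):
                if kmer in bf:
                    # Every document mapped to a hit bucket stays a candidate; a real
                    # index would keep a precomputed bucket -> documents map instead.
                    hit_docs.update(d for d in range(self.num_docs)
                                    if self._bucket(d, rep) == b)
            candidates &= hit_docs
        return candidates


if __name__ == "__main__":
    index = RamboSketch(num_docs=3)
    index.insert(0, ["ACGT", "CGTA"])
    index.insert(1, ["TTAA"])
    index.insert(2, ["ACGT", "GGCC"])
    print(index.query("ACGT"))  # e.g. {0, 2}, possibly with extra false positives
```

Because each document is inserted independently into its own buckets, construction is embarrassingly parallel and supports streaming updates, which is what the abstract credits for indexing the 170TB dataset in roughly 9 hours on 100 nodes.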

Citations (20)