
Compression of high throughput sequencing data with probabilistic de Bruijn graph

(1412.5932)
Published Dec 18, 2014 in cs.DS and q-bio.QM

Abstract

Motivation: The volume of data generated by next-generation sequencing (NGS) technologies is now a major concern for both storage and transmission. This has triggered the need for methods more efficient than general-purpose compression tools such as the widely used gzip. Most reference-free tools developed for NGS data compression still rely on general text compression methods and fail to benefit from algorithms already designed specifically for the analysis of NGS data. The goal of our new method, Leon, is to compress the DNA sequences of high-throughput sequencing data without a reference genome, using techniques derived from existing assembly principles that may better exploit the redundancy of NGS data.

Results: We propose a novel method, implemented in the software Leon, for compressing DNA sequences produced by high-throughput sequencing technologies. The method is lossless and does not need a reference genome. Instead, a reference is built de novo from the set of reads as a probabilistic de Bruijn graph stored in a Bloom filter. Each read is encoded as a path in this graph, storing only an anchoring k-mer and a list of bifurcations indicating which path to follow in the graph. Compressed read files produced this way also contain the underlying de Bruijn graph, making them directly reusable by the many tools that rely on this structure. Leon encoded a C. elegans read set at 0.7 bits/base, outperforming state-of-the-art reference-free methods.

Availability: Open source, under the GNU Affero GPL license, available for download at http://gatb.inria.fr/software/leon/
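
The encoding idea described in the abstract (k-mers stored in a Bloom filter, each read stored as an anchoring k-mer plus a list of bifurcation choices) can be illustrated with a small Python sketch. This is not Leon's implementation. It rests on several simplifying assumptions that are not from the paper: every k-mer of every read is inserted (no abundance filtering), the anchor is always the first k-mer of the read, reverse complements and sequencing errors are ignored, and the names `BloomFilter`, `build_graph`, `encode_read`, and `decode_read` are hypothetical.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by hashing the item with a seed prefix.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


def build_graph(reads, k):
    """Insert every k-mer of every read (assumption: no solidity threshold)."""
    bf = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bf.add(read[i:i + k])
    return bf


def successors(bf, kmer):
    """Bases that extend kmer to a k-mer the filter reports as present."""
    return [b for b in "ACGT" if kmer[1:] + b in bf]


def encode_read(read, bf, k):
    """Return (anchor, read length, bases stored at ambiguous nodes)."""
    anchor, kmer, choices = read[:k], read[:k], []
    for base in read[k:]:
        # The true successor is always present (no false negatives), so a
        # unique successor must be the right one and nothing needs storing.
        if len(successors(bf, kmer)) != 1:
            choices.append(base)
        kmer = kmer[1:] + base
    return anchor, len(read), choices


def decode_read(anchor, length, choices, bf):
    """Walk the graph from the anchor, consuming a choice at each bifurcation."""
    read, kmer, pending = anchor, anchor, iter(choices)
    while len(read) < length:
        nxt = successors(bf, kmer)
        base = nxt[0] if len(nxt) == 1 else next(pending)
        read += base
        kmer = kmer[1:] + base
    return read


reads = ["ACGTACGGTCA", "ACGTACGATCA"]
graph = build_graph(reads, k=4)
for r in reads:
    anchor, n, choices = encode_read(r, graph, k=4)
    print(anchor, n, choices, decode_read(anchor, n, choices, graph) == r)
```

In this sketch the decoder queries the same filter as the encoder, so a Bloom filter false positive only costs an extra stored base and never corrupts the read. That is why the compressed output has to carry the filter alongside the encoded reads.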
