Compression of high throughput sequencing data with probabilistic de Bruijn graph
Motivation: Data volumes generated by next-generation sequencing technologies
are now a major concern, both for storage and transmission. This has triggered
the need for methods more efficient than general-purpose compression tools,
such as the widely used gzip. Most reference-free tools developed for NGS data
compression still use general text compression methods and fail to benefit from
algorithms already designed specifically for the analysis of NGS data. The goal
of our new method, Leon, is to compress the DNA sequences of high-throughput
sequencing data without a reference genome, using techniques derived from
existing assembly principles that may better exploit NGS data redundancy.

Results: We propose a novel method, implemented in
the software Leon, for compression of DNA sequences produced by high-throughput
sequencing technologies. This is a lossless method that does not need a
reference genome. Instead, a reference is built de novo from the set of reads
as a probabilistic de Bruijn Graph, stored in a Bloom filter. Each read is
encoded as a path in this graph, storing only an anchoring kmer and a list of
bifurcations indicating which path to follow in the graph. This new method
yields compressed read files that already contain their underlying de Bruijn
graph, making them directly reusable by the many tools that rely on this structure.
Leon encoded a C. elegans read set at 0.7 bits/base,
outperforming state-of-the-art reference-free methods.

Availability: Open
source, under the GNU Affero GPL license, available for download at
http://gatb.inria.fr/software/leon/
Comment: 21 pages
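
The encoding idea described in the abstract (a de Bruijn graph held in a Bloom filter, with each read stored as an anchoring kmer plus a list of bifurcations) can be illustrated with a minimal sketch. This is not Leon's implementation. All names here (`BloomFilter`, `build_graph`, `encode_read`, `decode_read`) are hypothetical, and the sketch makes simplifying assumptions the abstract does not state: every kmer of every read is inserted into the filter, the anchor is always the first kmer of the read, the read length is stored explicitly, every read is at least k bases long, and reverse complements and sequencing errors are ignored.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings (illustrative sizes, not tuned)."""

    def __init__(self, n_bits=1 << 16, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting BLAKE2b with the hash index.
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def build_graph(reads, k):
    """Insert every kmer of every read; the filter is the (probabilistic) graph."""
    bloom = BloomFilter()
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])
    return bloom


def successors(bloom, kmer):
    """Nucleotides b such that the next kmer (kmer[1:] + b) appears in the graph."""
    return [b for b in "ACGT" if kmer[1:] + b in bloom]


def encode_read(read, bloom, k):
    """Return (anchor kmer, read length, bases chosen at each bifurcation)."""
    bifurcations = []
    for i in range(k, len(read)):
        nxt = successors(bloom, read[i - k:i])
        # Holds because all kmers were inserted and Bloom filters
        # have no false negatives.
        assert read[i] in nxt
        if len(nxt) > 1:  # ambiguous step: record which branch the read takes
            bifurcations.append(read[i])
    return read[:k], len(read), bifurcations


def decode_read(anchor, length, bifurcations, bloom):
    """Walk the graph from the anchor, consuming one stored base per bifurcation."""
    k = len(anchor)
    seq = anchor
    choices = iter(bifurcations)
    while len(seq) < length:
        nxt = successors(bloom, seq[-k:])
        seq += nxt[0] if len(nxt) == 1 else next(choices)
    return seq


reads = ["ACGTACGTTGCA", "ACGTACGATGCA"]
graph = build_graph(reads, k=5)
for r in reads:
    anchor, length, bif = encode_read(r, graph, k=5)
    print(anchor, length, bif, decode_read(anchor, length, bif, graph) == r)
```

Because the encoder and decoder query the same filter, a Bloom false positive only adds a spurious branch (one extra stored base) and never breaks the round trip under the assumptions above.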