SRComp: Short Read Sequence Compression Using Burstsort and Elias Omega Coding

Author: Jeremy John Selva (497370)
Xin Chen (14149)
Publication venue
Publication date: 13/12/2013
Field of study

<div><p>Next-generation sequencing (NGS) technologies permit the rapid production of vast amounts of data at low cost. Economical data storage and transmission hence becomes an increasingly important challenge for NGS experiments. In this paper, we introduce a new non-reference based read sequence compression tool called SRComp. It works by first employing a fast string-sorting algorithm called burstsort to sort read sequences in lexicographical order and then Elias omega-based integer coding to encode the sorted read sequences. SRComp has been benchmarked on four large NGS datasets, where experimental results show that it can run 5–35 times faster than current state-of-the-art read sequence compression tools such as BEETL and SCALCE, while retaining comparable compression efficiency for large collections of short read sequences. SRComp is a read sequence compression tool that is particularly valuable in certain applications where compression time is of major concern.</p></div>
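The abstract names Elias omega coding as the integer code applied after sorting. The following is a minimal sketch of the standard Elias omega code for positive integers, written with bit strings for readability. How SRComp maps sorted reads to integers is not stated in this listing, so the function names `elias_omega_encode` and `elias_omega_decode` and the bit-string representation are illustrative assumptions, not the tool's implementation.

```python
def elias_omega_encode(n: int) -> str:
    """Return the Elias omega codeword of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias omega coding is defined for integers >= 1")
    code = "0"                      # every codeword ends with a single 0 bit
    while n > 1:
        group = bin(n)[2:]          # binary form of n, leading bit is always 1
        code = group + code         # prepend the group
        n = len(group) - 1          # next, encode the group length minus one
    return code


def elias_omega_decode(bits: str, pos: int = 0) -> tuple[int, int]:
    """Decode one integer starting at bits[pos]; return (value, next position)."""
    n = 1
    while bits[pos] == "1":         # a leading 1 means another group follows
        length = n + 1              # the group is n + 1 bits long
        n = int(bits[pos:pos + length], 2)
        pos += length
    return n, pos + 1               # skip the terminating 0


if __name__ == "__main__":
    stream = "".join(elias_omega_encode(v) for v in (1, 4, 16))
    pos, values = 0, []
    while pos < len(stream):
        value, pos = elias_omega_decode(stream, pos)
        values.append(value)
    print(values)
```

Small integers receive short codewords (1 is a single bit), which is why such a code suits sequences of mostly small values.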

CiteSeerX

Directory of Open Access Journals

PubMed Central

FigShare

Comparison of CPU time needed to sort large collections of reads.

Author: Jeremy John Selva (497370)
Xin Chen (14149)

<p>Designations of these algorithm variants can be found in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0081414#pone.0081414-Sinha1" target="_blank">[14]</a>. Times are averaged over five runs.</p>

FigShare

Comparison of compression performance of SRComp to gzip, bzip2, BEETL and SCALCE.

Author: Jeremy John Selva (497370)
Xin Chen (14149)

<p>BEETL is run in combination with PPMd, and SCALCE in combination with gzip. In the above, p-mem and d-time denote the compression peak memory usage (megabytes) and decompression CPU time, respectively. Times are averaged over five runs.</p>

FigShare

Evaluation of SRComp on simulated datasets of varying read lengths and genome coverage depths.

Author: Jeremy John Selva (497370)
Xin Chen (14149)

<p>SCALCE's decompression crashed on two of the datasets tested, one consisting of 35 bp reads at 31× coverage and the other consisting of 50 bp reads at 44× coverage. Hence, their corresponding decompression times are indicated by a hyphen mark (−) in the above table. Times are averaged over five runs.</p>

FigShare

Comparison of compression CPU time and bit rates of Elias omega coding to gzip and bzip2 on sorted read sequences.

Author: Jeremy John Selva (497370)
Xin Chen (14149)

<p>In the above, c-time means compression CPU time and bpb denotes bits per base. Times are averaged over five runs.</p>
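As an illustration of the two quantities defined in this caption, the sketch below measures compression CPU time and bits per base for gzip and bzip2 using Python's standard library. The helper name `measure`, the newline-separated payload layout, and the toy reads are assumptions made for the example; they do not reproduce the paper's benchmark setup or its five-run averaging.

```python
import bz2
import gzip
import time


def measure(compress, reads):
    """Return (c_time, bpb) for one compressor applied to a list of reads."""
    payload = "\n".join(reads).encode("ascii")   # assumed on-disk layout
    total_bases = sum(len(r) for r in reads)
    start = time.process_time()                  # CPU time, not wall-clock time
    compressed = compress(payload)
    c_time = time.process_time() - start
    bpb = 8 * len(compressed) / total_bases      # compressed bits per input base
    return c_time, bpb


if __name__ == "__main__":
    # Toy, already sorted input; real datasets are far larger.
    reads = sorted(["ACGT" * 10, "ACGA" * 10, "TTGA" * 10] * 50)
    for name, fn in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
        c_time, bpb = measure(fn, reads)
        print(f"{name}: c-time={c_time:.4f}s bpb={bpb:.3f}")
```

A value below 2 bpb beats the naive two-bits-per-nucleotide packing for reads over the alphabet {A, C, G, T}.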

FigShare

Some basic statistics for datasets used in the experiments.

Author: Jeremy John Selva (497370)
Xin Chen (14149)

<p>As the paired-end reads from the experiment SRX006998 are of different length, we include in this dataset only reads from one end.</p>

FigShare

A burst trie built from ten read sequences.

Author: Jeremy John Selva (497370)
Xin Chen (14149)

<p>The ten read sequences used are {CGCA, CAAG, TGCT, CGTG, CGTT, GACG, CACT, TGCT, CAAT, CGTG}. This burst trie has three trie nodes and five buckets. The maximum capacity of a bucket is assumed to be three read sequences.</p>
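The structure described in this caption can be reproduced with a small burst trie. The sketch below assumes that a bucket bursts into a new trie node as soon as it holds more than three suffixes and that reads are inserted in the order listed; the names `TrieNode`, `insert`, `count_nodes`, and `sorted_reads` are hypothetical, and real burstsort implementations use far larger buckets and tuned in-bucket sorting.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # symbol -> TrieNode, or list of suffixes (a bucket)
        self.ends = 0       # reads that end exactly at this node


def insert(node, suffix, capacity):
    """Insert a read suffix, bursting any bucket that exceeds `capacity`."""
    while suffix:
        head, tail = suffix[0], suffix[1:]
        child = node.children.get(head)
        if isinstance(child, TrieNode):
            node, suffix = child, tail          # descend one level
            continue
        if child is None:
            child = node.children[head] = []    # create an empty bucket
        child.append(tail)
        if len(child) > capacity:               # burst: bucket becomes a node
            new_node = TrieNode()
            node.children[head] = new_node
            for s in child:
                insert(new_node, s, capacity)
        return
    node.ends += 1


def count_nodes(node):
    """Return (number of trie nodes, number of buckets) below and including node."""
    tries, buckets = 1, 0
    for child in node.children.values():
        if isinstance(child, TrieNode):
            t, b = count_nodes(child)
            tries, buckets = tries + t, buckets + b
        else:
            buckets += 1
    return tries, buckets


def sorted_reads(node, prefix=""):
    """Traverse symbols in order, sorting each small bucket on the way out."""
    out = [prefix] * node.ends
    for sym in sorted(node.children):
        child = node.children[sym]
        if isinstance(child, TrieNode):
            out.extend(sorted_reads(child, prefix + sym))
        else:
            out.extend(prefix + sym + s for s in sorted(child))
    return out


READS = ["CGCA", "CAAG", "TGCT", "CGTG", "CGTT",
         "GACG", "CACT", "TGCT", "CAAT", "CGTG"]
root = TrieNode()
for read in READS:
    insert(root, read, capacity=3)

if __name__ == "__main__":
    print(count_nodes(root))
    print(sorted_reads(root))
```

Under these assumptions the trie nodes are the root, the node for prefix C, and the node for prefix CG, while the buckets hang under G, T, CA, CGC, and CGT.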

FigShare

The algorithm overview for compression.

Author: Jeremy John Selva (497370)
Xin Chen (14149)

<p>(A) After five input read sequences are loaded in memory, we build two arrays of pointers. The first array (upper) contains pointers each of which points to a read sequence, whereas the second array (lower) contains pointers each of which points to an occurrence of the ambiguous base N. (B) Before burstsort starts, all the ambiguous bases are substituted with base G. During sorting, read sequences remain at the same physical place in memory and only their respective pointers in the first array are moved into sort order. At the end, read sequences are retrieved in order via the first pointer array. (C) Once the encoding of ordered read sequences is completed, all the ambiguous bases are substituted back via the second pointer array, which enables finding the location of every ambiguous base within the collection of sorted read sequences.</p>
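Steps (A) to (C) can be sketched in Python with index lists standing in for the two pointer arrays. This is an illustration under stated assumptions: the function names, the use of Python's built-in sort in place of burstsort, the five toy reads, and the `(rank, offset)` representation of N locations are all hypothetical, and the listing does not say how SRComp stores those locations in its output.

```python
def sort_with_ambiguous_bases(reads):
    """Sort reads with N treated as G; return (sorted reads, N sites in sorted order)."""
    # (A) Reads stay in place; we only keep indices (stand-ins for pointers).
    buffers = [bytearray(r, "ascii") for r in reads]
    n_sites = [(i, j) for i, buf in enumerate(buffers)
               for j, base in enumerate(buf) if base == ord("N")]

    # (B) Substitute every N with G, then sort the read indices, not the reads.
    for i, j in n_sites:
        buffers[i][j] = ord("G")
    order = sorted(range(len(buffers)), key=lambda i: bytes(buffers[i]))
    sorted_reads = [buffers[i].decode("ascii") for i in order]

    # (C) Translate each N site into (rank in sorted order, offset in read),
    # then restore the original bases in the in-memory buffers.
    rank = {read_idx: pos for pos, read_idx in enumerate(order)}
    n_in_sorted = sorted((rank[i], j) for i, j in n_sites)
    for i, j in n_sites:
        buffers[i][j] = ord("N")
    return sorted_reads, n_in_sorted


def restore_ambiguous_bases(sorted_reads, n_in_sorted):
    """Put N back into the sorted, G-substituted reads."""
    restored = [bytearray(r, "ascii") for r in sorted_reads]
    for pos, offset in n_in_sorted:
        restored[pos][offset] = ord("N")
    return [r.decode("ascii") for r in restored]


if __name__ == "__main__":
    reads = ["CGNA", "ACGT", "NACG", "CGCA", "TTGA"]
    ordered, sites = sort_with_ambiguous_bases(reads)
    print(ordered, sites)
    print(restore_ambiguous_bases(ordered, sites))
```

Sorting indices rather than the reads themselves mirrors the caption's point that read sequences never move in memory during sorting.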

FigShare