8 research outputs found

    SRComp: Short Read Sequence Compression Using Burstsort and Elias Omega Coding

    <div><p>Next-generation sequencing (NGS) technologies permit the rapid production of vast amounts of data at low cost. Economical data storage and transmission hence becomes an increasingly important challenge for NGS experiments. In this paper, we introduce a new non-reference based read sequence compression tool called SRComp. It works by first employing a fast string-sorting algorithm called burstsort to sort read sequences in lexicographical order and then Elias omega-based integer coding to encode the sorted read sequences. SRComp has been benchmarked on four large NGS datasets, where experimental results show that it can run 5–35 times faster than current state-of-the-art read sequence compression tools such as BEETL and SCALCE, while retaining comparable compression efficiency for large collections of short read sequences. SRComp is a read sequence compression tool that is particularly valuable in certain applications where compression time is of major concern.</p></div>
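The abstract names Elias omega coding as the integer code applied after sorting. The sketch below shows only the standard Elias omega code for positive integers; how SRComp maps sorted reads to integers is not described here, so that step is left out. The function names `elias_omega_encode` and `elias_omega_decode` are illustrative and do not come from SRComp.

```python
def elias_omega_encode(n: int) -> str:
    """Return the Elias omega code of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Elias omega codes are defined for integers >= 1")
    code = "0"  # every codeword ends with a terminating 0
    while n > 1:
        binary = bin(n)[2:]      # binary form of n, always starting with 1
        code = binary + code     # prepend the current group
        n = len(binary) - 1      # next group encodes (length of this group - 1)
    return code


def elias_omega_decode(bits: str) -> list[int]:
    """Decode a concatenation of Elias omega codewords into integers."""
    values = []
    pos = 0
    while pos < len(bits):
        n = 1
        # A leading 1 means "a group of n + 1 bits follows, starting here".
        while bits[pos] == "1":
            width = n + 1
            n = int(bits[pos:pos + width], 2)
            pos += width
        pos += 1  # consume the terminating 0
        values.append(n)
    return values


if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 16, 100]
    stream = "".join(elias_omega_encode(x) for x in numbers)
    print(stream)
    print(elias_omega_decode(stream))
```

Small integers get short codewords (1 is a single bit), which is why a code of this kind suits inputs that reduce to many small values.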

    Comparison of CPU time needed to sort large collections of reads.

    <p>Designations of these algorithm variants can be found in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0081414#pone.0081414-Sinha1" target="_blank">[14]</a>. Times are averaged over five runs.</p>

    Comparison of compression performance of SRComp to gzip, bzip2, BEETL and SCALCE.

    <p>BEETL is run in combination with PPMd, and SCALCE in combination with gzip. In the above, p-mem and d-time denote the compression peak memory usage (megabytes) and decompression CPU time, respectively. Times are averaged over five runs.</p>

    Evaluation of SRComp on simulated datasets of varying read lengths and genome coverage depths.

    <p>SCALCE's decompression crashed on two of the datasets tested, one consisting of 35 bp reads at 31× coverage and the other consisting of 50 bp reads at 44× coverage. Hence, their corresponding decompression times are indicated by a hyphen mark (−) in the above table. Times are averaged over five runs.</p>

    Comparison of compression CPU time and bit rates of Elias omega coding to gzip and bzip2 on sorted read sequences.

    <p>In the above, c-time means compression CPU time and bpb denotes bits per base. Times are averaged over five runs.</p>

    Some basic statistics for datasets used in the experiments.

    <p>As the paired-end reads from the experiment SRX006998 are of different lengths, we include in this dataset only reads from one end.</p>

    A burst trie built from ten read sequences.

    <p>The ten read sequences used are {CGCA, CAAG, TGCT, CGTG, CGTT, GACG, CACT, TGCT, CAAT, CGTG}. This burst trie has three trie nodes and five buckets. The maximum capacity of a bucket is assumed to be three read sequences.</p>
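The caption's example can be reproduced with a small burst trie in Python. This is a minimal sketch under stated assumptions: a bucket bursts into a new trie node once it holds more than three reads, reads are inserted in the order listed, and each bucket is sorted with Python's built-in `sorted` as a stand-in for whatever in-bucket sort SRComp's burstsort variant uses. The names `TrieNode`, `insert`, `in_order`, and `count` are illustrative.

```python
class TrieNode:
    """A trie node; each child is either another TrieNode or a list (bucket)."""
    def __init__(self):
        self.children = {}


def insert(node, read, depth, capacity):
    # The empty key collects reads that are exhausted at this depth.
    key = read[depth] if depth < len(read) else ""
    child = node.children.get(key)
    if isinstance(child, TrieNode):
        insert(child, read, depth + 1, capacity)
        return
    if child is None:
        child = node.children[key] = []
    child.append(read)
    # Burst an over-full bucket: replace it with a trie node and
    # redistribute its reads one character deeper.
    if key and len(child) > capacity:
        burst = TrieNode()
        node.children[key] = burst
        for r in child:
            insert(burst, r, depth + 1, capacity)


def in_order(node):
    """Yield all reads in lexicographic order."""
    for key in sorted(node.children):
        child = node.children[key]
        if isinstance(child, TrieNode):
            yield from in_order(child)
        else:
            yield from sorted(child)  # stand-in for the in-bucket sort


def count(node):
    """Return (number of trie nodes, number of buckets) below and including node."""
    nodes, buckets = 1, 0
    for child in node.children.values():
        if isinstance(child, TrieNode):
            n, b = count(child)
            nodes, buckets = nodes + n, buckets + b
        else:
            buckets += 1
    return nodes, buckets


reads = ["CGCA", "CAAG", "TGCT", "CGTG", "CGTT",
         "GACG", "CACT", "TGCT", "CAAT", "CGTG"]
root = TrieNode()
for read in reads:
    insert(root, read, 0, capacity=3)

print(count(root))           # (3, 5), matching the caption
print(list(in_order(root)))  # the ten reads in sorted order
```

With these ten reads the trie nodes are the root, `C`, and `CG`, and the five buckets hold the reads beginning with `CA`, `CGC`, `CGT`, `G`, and `T`.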

    The algorithm overview for compression.

    <p>(A) After five input read sequences are loaded into memory, we build two arrays of pointers. The first array (upper) contains pointers, each of which points to a read sequence, whereas the second array (lower) contains pointers, each of which points to an occurrence of the ambiguous base N. (B) Before burstsort starts, all the ambiguous bases are substituted with base G. During sorting, read sequences remain at the same physical place in memory and only their respective pointers in the first array are moved into sort order. At the end, read sequences are retrieved in order via the first pointer array. (C) Once the encoding of ordered read sequences is completed, all the ambiguous bases are substituted back via the second pointer array, which enables finding the location of every ambiguous base within the collection of sorted read sequences.</p>
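Steps (A) to (C) can be sketched in Python, with list indices standing in for memory pointers. This is an illustration under assumptions rather than SRComp's implementation: the built-in `list.sort` replaces burstsort, the encoding step is omitted, and the function name `sort_reads_with_n` and the sample reads are made up.

```python
def sort_reads_with_n(reads):
    """Sort reads via an index array while tracking ambiguous bases (N).

    Returns (order, sorted_reads_with_G, sorted_reads_with_N_restored).
    """
    # (A) Reads stay in place; `order` plays the role of the first pointer
    # array and `n_sites` the role of the second (one entry per N).
    store = [bytearray(r, "ascii") for r in reads]
    order = list(range(len(store)))
    n_sites = [(i, j) for i, r in enumerate(store)
               for j, base in enumerate(r) if base == ord("N")]

    # (B) Substitute every N with G, then sort only the index array.
    for i, j in n_sites:
        store[i][j] = ord("G")
    order.sort(key=lambda i: store[i])  # stand-in for burstsort
    sorted_with_g = [store[i].decode("ascii") for i in order]

    # (C) After the (omitted) encoding step, put the N bases back in place.
    # Because the reads never moved, the recorded sites are still valid.
    for i, j in n_sites:
        store[i][j] = ord("N")
    restored = [store[i].decode("ascii") for i in order]
    return order, sorted_with_g, restored


order, with_g, restored = sort_reads_with_n(
    ["ACNT", "AAGT", "NCGT", "ACGA", "TTTT"])
print(order)     # [1, 3, 0, 2, 4]
print(with_g)    # ['AAGT', 'ACGA', 'ACGT', 'GCGT', 'TTTT']
print(restored)  # ['AAGT', 'ACGA', 'ACNT', 'NCGT', 'TTTT']
```

Sorting the index array instead of the reads themselves is what keeps the recorded N positions valid after sorting, since no read is ever moved.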