Universal lossless source coding with the Burrows Wheeler transform
The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
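To make the word "reversible" in the abstract above concrete, here is a minimal sketch of a naive forward and inverse BWT. It is not taken from the paper: the end-of-string sentinel `$` and the function names `bwt` and `inverse_bwt` are assumptions made for illustration, and the quadratic-or-worse running time is far from what practical implementations achieve.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, keep the last column.

    The sentinel is an illustrative assumption; it must not occur in text
    and is treated as an ordinary character during sorting.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last, sentinel="$"):
    """Invert the BWT by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        # After j rounds, table holds the sorted length-j prefixes of all rotations.
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The rotation that ends with the sentinel is the original string.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)               # equal letters tend to cluster together
    print(inverse_bwt(transformed))  # recovers the input
```

The clustering of equal letters in the output is the "natural ordering" that a second-stage lossless code can try to exploit.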
Fast construction of FM-index for long sequence reads
Summary: We present a new method to incrementally construct the FM-index for
both short and long sequence reads, up to the size of a genome. It is the first
algorithm that can build the index while implicitly sorting the sequences in
the reverse (complement) lexicographical order without a separate sorting step.
The implementation is among the fastest for indexing short reads and the only
one that works in practice for reads averaging kilobases in length.
Availability and implementation: https://github.com/lh3/ropebwt2
Contact: [email protected]
Comment: 2 pages
Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform
Motivation
The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for
compression and indexing of text data, but the cost of computing the BWT of
very large string collections has prevented these techniques from being widely
applied to the large sets of sequences often encountered as the outcome of DNA
sequencing experiments. In previous work, we presented a novel algorithm that
allows the BWT of human genome scale data to be computed on very moderate
hardware, thus enabling us to investigate the BWT as a tool for the compression
of such datasets.
Results
We first used simulated reads to explore the relationship between the level
of compression and the error rate, the length of the reads and the level of
sampling of the underlying genome and compare choices of second-stage
compression algorithm.
We demonstrate that compression may be greatly improved by a particular
reordering of the sequences in the collection and give a novel 'implicit
sorting' strategy that enables these benefits to be realised without the
overhead of sorting the reads. With these techniques, a 45x coverage of real
human genome sequence data compresses losslessly to under 0.5 bits per base,
allowing the 135.3Gbp of sequence to fit into only 8.2Gbytes of space (trimming
a small proportion of low-quality bases from the reads improves the compression
still further).
This is more than 4 times smaller than the size achieved by a standard
BWT-based compressor (bzip2) on the untrimmed reads, but an important further
advantage of our approach is that it facilitates the building of compressed
full text indexes such as the FM-index on large-scale DNA sequence collections.
Comment: Version here is as submitted to Bioinformatics and is the same as the
previously archived version. This submission registers the fact that the
advanced access version is now available at
http://bioinformatics.oxfordjournals.org/content/early/2012/05/02/bioinformatics.bts173.abstract.
Bioinformatics should be considered as the original place of publication of
this article; please cite accordingly.
Lyndon Array Construction during Burrows-Wheeler Inversion
In this paper we present an algorithm to compute the Lyndon array of a string
as a byproduct of the inversion of its Burrows-Wheeler
transform. Our algorithm runs in linear time using only a stack in
addition to the data structures used for Burrows-Wheeler inversion. We compare
our algorithm with two other linear-time algorithms for Lyndon array
construction and show that computing the Burrows-Wheeler transform and then
constructing the Lyndon array is competitive compared to the known approaches.
We also propose a new balanced parenthesis representation for the Lyndon array
that uses bits of space and supports constant time access. This
representation can be built in linear time using words of space, or in
time using asymptotically the same space as
On Bijective Variants of the Burrows-Wheeler Transform
The sort transform (ST) is a modification of the Burrows-Wheeler transform
(BWT). Both transformations map an arbitrary word of length n to a pair
consisting of a word of length n and an index between 1 and n. The BWT sorts
all rotation conjugates of the input word, whereas the ST of order k only uses
the first k letters for sorting all such conjugates. If two conjugates start
with the same prefix of length k, then the indices of the rotations are used
for tie-breaking. Both transforms output the sequence of the last letters of
the sorted list and the index of the input within the sorted list. In this
paper, we discuss a bijective variant of the BWT (due to Scott), proving its
correctness and relations to other results due to Gessel and Reutenauer (1993)
and Crochemore, Desarmenien, and Perrin (2005). Further, we present a novel
bijective variant of the ST.
Comment: 15 pages, presented at the Prague Stringology Conference 2009 (PSC
2009)
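The definitions in the abstract above (sort all rotation conjugates, output the last letters plus the 1-based index of the input) can be sketched as follows. This is only an illustrative, naive implementation, not the authors' code: the function names `bwt_with_index` and `sort_transform` are hypothetical, and breaking ties between identical rotations by rotation position in the BWT is an assumption made here so that the index is well defined for periodic words.

```python
def _last_letters_and_index(word, order):
    """Given rotation start positions in sorted order, build the output pair."""
    # The rotation starting at i ends with word[i - 1] (wrapping around for i == 0).
    last = "".join(word[i - 1] for i in order)
    # 1-based position of the unrotated input within the sorted list.
    return last, order.index(0) + 1


def bwt_with_index(word):
    """BWT as a (word, index) pair: sort rotations by their full content."""
    if not word:
        raise ValueError("word must be non-empty")
    n = len(word)
    # Assumption: identical rotations are ordered by their start position.
    order = sorted(range(n), key=lambda i: (word[i:] + word[:i], i))
    return _last_letters_and_index(word, order)


def sort_transform(word, k):
    """ST of order k: compare only the first k letters of each rotation,
    breaking ties by the index of the rotation."""
    if not word:
        raise ValueError("word must be non-empty")
    n = len(word)
    order = sorted(range(n), key=lambda i: ((word[i:] + word[:i])[:k], i))
    return _last_letters_and_index(word, order)


if __name__ == "__main__":
    print(bwt_with_index("banana"))
    for k in (1, 2, 6):
        print(k, sort_transform("banana", k))
```

With k at least the word length, the sort key covers the whole rotation, so in this sketch the ST output coincides with the BWT output.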