840 research outputs found
Universal lossless source coding with the Burrows Wheeler transform
The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources
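To make the "reversible sequence transformation" in the abstract above concrete, here is a minimal sketch of a forward and inverse BWT. It is a naive sort-all-rotations version, not the construction analyzed in the paper; the function names and the `$` end marker are assumptions made for illustration, and the input is assumed not to contain `$`.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, keep the last column."""
    # Assumption: the sentinel does not occur in the text and sorts before
    # every other symbol (true for "$" against ASCII letters and digits).
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last, sentinel="$"):
    """Invert the naive BWT by rebuilding the sorted rotation table."""
    table = [""] * len(last)
    for _ in range(len(last)):
        # Prepend the last column to every row, then re-sort the rows.
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original string is the row that ends with the sentinel.
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

Both functions are quadratic or worse in the input length, so they are only suitable for short strings; practical implementations typically build the transform from a suffix array.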
Rust-Bio - a fast and safe bioinformatics library
We present Rust-Bio, the first general purpose bioinformatics library for the
innovative Rust programming language. Rust-Bio leverages the unique combination
of speed, memory safety and high-level syntax offered by Rust to provide a fast
and safe set of bioinformatics algorithms and data structures with a focus on
sequence analysis
Burrows–Wheeler compression: Principles and reflections
After a general description of the Burrows–Wheeler transform and a brief survey of recent work on processing its output, the paper examines the coding of the zero-runs from the MTF recoding stage, an aspect with little prior treatment. It is concluded that the original scheme proposed by Wheeler is extremely efficient and unlikely to be much improved. The paper then proposes some new interpretations and uses of the Burrows–Wheeler transform, with new insights and approaches to lossless compression, perhaps including techniques from error correction
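The MTF (move-to-front) recoding stage and the zero-runs mentioned in this abstract can be sketched as follows. This only shows where the zero-runs come from; it does not implement Wheeler's run-coding scheme, and the function names are hypothetical.

```python
def move_to_front(data, alphabet):
    """Recode each symbol as its current position in a self-organizing list."""
    table = list(alphabet)
    codes = []
    for ch in data:
        idx = table.index(ch)
        codes.append(idx)
        # Move the symbol just seen to the front of the list.
        table.insert(0, table.pop(idx))
    return codes


def zero_runs(codes):
    """Return the lengths of the maximal runs of zeros in an MTF output."""
    runs, current = [], 0
    for c in codes:
        if c == 0:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


codes = move_to_front("aaabbb", "ab")
runs = zero_runs(codes)
```

Because a repeated symbol is always at the front of the list, runs of identical characters in the BWT output become runs of zeros after MTF, which is why the coding of those runs matters for the overall compression rate.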
Bidirectional Text Compression in External Memory
Bidirectional compression algorithms work by substituting repeated substrings by references that, unlike in the famous LZ77-scheme, can point to either direction. We present such an algorithm that is particularly suited for an external memory implementation. We evaluate it experimentally on large data sets of size up to 128 GiB (using only 16 GiB of RAM) and show that it is significantly faster than all known LZ77 compressors, while producing a roughly similar number of factors. We also introduce an external memory decompressor for texts compressed with any uni- or bidirectional compression scheme
A visualization tool to explore alphabet orderings for the Burrows-Wheeler Transform
The Burrows-Wheeler Transform (BWT) is an efficient invertible text
transformation algorithm with the properties of tending to group identical
characters together in a run, and enabling search of the text. This
transformation has extensive uses particularly in lossless compression
algorithms, indexing, and within bioinformatics for sequence alignment tasks.
There has been recent interest in minimizing the number of identical character
runs for a transform and in finding useful alphabet orderings for the
sorting step of the matrix associated with the BWT construction. This motivates
the inspection of many transforms while developing algorithms. However, the
full Burrows-Wheeler matrix takes space quadratic in the input length and is therefore very difficult to
display and inspect for large input sizes. In this paper we present a graphical
user interface (GUI) for working with BWTs, which includes features for
searching for matrix row prefixes, skipping over sections in the right-most
column (the transform), and displaying BWTs while exploring alphabet orderings
with the goal of minimizing the number of runs. Comment: 8 pages, 2 figures
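The effect of the alphabet ordering on the number of runs, as described in this abstract, can be explored on small inputs with a brute-force sketch like the one below. It is not the tool presented in the paper: the function names are hypothetical, the `$` end marker is assumed to be absent from the text and to sort first under every ordering, and trying all permutations is only feasible for very small alphabets.

```python
from itertools import groupby, permutations


def bwt_with_order(text, order):
    """Naive BWT where rotations are sorted under a custom alphabet ordering.

    order lists the symbols of text from smallest to largest.
    """
    rank = {ch: i + 1 for i, ch in enumerate(order)}
    rank["$"] = 0  # assumption: the end marker is always the smallest symbol
    s = text + "$"
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    rotations.sort(key=lambda row: [rank[c] for c in row])
    return "".join(row[-1] for row in rotations)


def count_runs(s):
    """Number of maximal runs of identical characters in s."""
    return sum(1 for _ in groupby(s))


def runs_per_ordering(text):
    """Map every ordering of the text's alphabet to the run count of its BWT."""
    alphabet = sorted(set(text))
    return {
        "".join(p): count_runs(bwt_with_order(text, "".join(p)))
        for p in permutations(alphabet)
    }


results = runs_per_ordering("banana")
best = min(results, key=results.get)
```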
Burrows Wheeler Compression Algorithm (BWCA) in Lossless Image Compression
This paper discusses the implementation of BWCA in
lossless image compression. BWCA uses the Burrows Wheeler
Transform (BWT) as its main transform. As a combinatorial
compression algorithm that reorders symbols
according to their following context, it is a promising
approach to context-modeling compression. BWT was initially
created for text compression, and here we study the impact of
the BWCA method and its improvements when applied to image
compression. Since this application is quite different from the
original aim of the method, we analyze the pre- and post-processing
influences of BWT
Lossy Compressor preserving variant calling through Extended BWT
A standard format used for storing the output of high-throughput sequencing
experiments is the FASTQ format. It comprises three main components: (i)
headers, (ii) bases (nucleotide sequences), and (iii) quality scores. FASTQ
files are widely used for variant calling, where sequencing data are mapped
into a reference genome to discover variants that may be used for further
analysis. There are many specialized compressors that exploit redundancy in
FASTQ data with the focus only on either the bases or the quality scores
components. In this paper we consider the novel problem of lossy compressing,
in a reference-free way, FASTQ data by modifying both components at the same
time, while preserving the important information of the original FASTQ. We
introduce a general strategy, based on the Extended Burrows-Wheeler Transform
(EBWT) and positional clustering, and we present implementations in both
internal memory and external memory. Experimental results show that the lossy
compression performed by our tool is able to achieve good compression while
preserving information relating to variant calling more than the competitors.
Availability: the software is freely available at
https://github.com/veronicaguerrini/BFQzip. Comment: Proceedings of the 15th International Joint Conference on Biomedical
Engineering Systems and Technologies
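The three FASTQ components named in this abstract (headers, bases, quality scores) can be pulled apart with a short sketch like the following. It assumes the simple four-line record layout and the Phred+33 quality encoding, neither of which is stated in the abstract; it has no connection to the BFQzip code, and the function names are hypothetical.

```python
def parse_fastq(lines):
    """Split four-line FASTQ records into (header, bases, qualities) tuples."""
    records = []
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        bases = next(it).rstrip("\n")
        plus = next(it).rstrip("\n")
        quals = next(it).rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record")
        if len(bases) != len(quals):
            raise ValueError("bases and quality scores differ in length")
        records.append((header[1:], bases, quals))
    return records


def phred_scores(quals, offset=33):
    """Decode a quality string into integer scores (assumes Phred+33)."""
    return [ord(c) - offset for c in quals]


example = ["@read1\n", "ACGT\n", "+\n", "II!I\n"]
parsed = parse_fastq(example)
scores = phred_scores(parsed[0][2])
```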