Bidirectional Text Compression in External Memory
Bidirectional compression algorithms work by replacing repeated substrings with references that, unlike in the well-known LZ77 scheme, can point in either direction. We present such an algorithm that is particularly suited for an external memory implementation. We evaluate it experimentally on large data sets of size up to 128 GiB (using only 16 GiB of RAM) and show that it is significantly faster than all known LZ77 compressors, while producing a roughly similar number of factors. We also introduce an external memory decompressor for texts compressed with any uni- or bidirectional compression scheme.
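To make the idea of references that "can point in either direction" concrete, here is a minimal in-memory sketch of decoding a bidirectional parse. It is not the external memory decompressor described in the abstract. The factor encoding (a literal string, or a `(source_position, length)` pair of absolute text positions) and the name `decode_bidirectional` are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple, Union

# Hypothetical factor format: a literal string, or (source_position, length),
# where source_position is an absolute position in the decoded text and may
# lie before OR after the factor itself.
Factor = Union[str, Tuple[int, int]]


def decode_bidirectional(factors: List[Factor]) -> str:
    """Decode a bidirectional parse by following references to literals."""
    chars: List[Optional[str]] = []   # known character per position, else None
    source: List[Optional[int]] = []  # position to copy from, for unknown slots

    # Pass 1: lay out the text; every position is a literal or a pointer.
    for factor in factors:
        if isinstance(factor, str):
            for ch in factor:
                chars.append(ch)
                source.append(None)
        else:
            start, length = factor
            for k in range(length):
                chars.append(None)
                source.append(start + k)

    n = len(chars)

    # Pass 2: resolve each pointer chain until it reaches a known character.
    for i in range(n):
        chain = []
        j = i
        while chars[j] is None:
            chain.append(j)
            if len(chain) > n:
                raise ValueError("cyclic references: parse is not decodable")
            j = source[j]
            if not 0 <= j < n:
                raise ValueError("reference points outside the text")
        for p in chain:  # cache the result for every position on the chain
            chars[p] = chars[j]

    return "".join(chars)
```

For example, `decode_bidirectional([(3, 3), "abc", (0, 2)])` first copies from text that appears later in the output, which a unidirectional LZ77 decoder could not do. A parse whose references form a cycle cannot be decoded, so the sketch rejects it.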
One-pass adaptive universal vector quantization
The authors introduce a one-pass adaptive universal quantization technique for real-valued, bounded-alphabet, stationary sources. The algorithm runs online without any prior knowledge of the statistics of the sources it might encounter, and it asymptotically achieves ideal performance on all sources that it sees. The system consists of an encoder and a decoder. At increasing intervals, the encoder refines its codebook using knowledge about incoming data symbols. This codebook is then described to the decoder in the form of updates to the previous codebook. The accuracy to which the codebook is described increases as the number of symbols seen, and thus the accuracy to which the codebook is known, grows.
On empirical cumulant generating functions of code lengths for individual sequences
We consider the problem of lossless compression of individual sequences using
finite-state (FS) machines, from the perspective of the best achievable
empirical cumulant generating function (CGF) of the code length, i.e., the
normalized logarithm of the empirical average of the exponentiated code length.
Since the probabilistic CGF is minimized in terms of the Rényi entropy of the
source, one of the motivations of this study is to derive an
individual-sequence analogue of the Rényi entropy, in the same way that the
FS compressibility is the individual-sequence counterpart of the Shannon
entropy. We consider the CGF of the code-length both from the perspective of
fixed-to-variable (F-V) length coding and the perspective of
variable-to-variable (V-V) length coding, where the latter turns out to yield a
better result, that coincides with the FS compressibility. We also extend our
results to compression with side information, available at both the encoder and
decoder. In this case, the V-V version no longer coincides with the FS
compressibility, but results in a different complexity measure. Comment: 15 pages; submitted for publication
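One way to read "the normalized logarithm of the empirical average of the exponentiated code length" is sketched below. The notation is assumed for illustration and is not taken from the paper: the sequence is split into $m$ blocks $x_1, \dots, x_m$ of length $\ell$, $L(x_i)$ is the code length in bits assigned to block $x_i$, and $s > 0$ is the exponent parameter. The normalization used in the paper itself may differ.

```latex
% Empirical CGF of the code length (assumed notation, base-2 logarithm)
\hat{C}(s) \;=\; \frac{1}{s\,\ell}\,
  \log_2\!\left( \frac{1}{m} \sum_{i=1}^{m} 2^{\,s L(x_i)} \right),
  \qquad s > 0 .

% As s -> 0 this reduces to the empirical average code length per symbol:
\lim_{s \to 0} \hat{C}(s) \;=\; \frac{1}{m\,\ell} \sum_{i=1}^{m} L(x_i) .
```

Larger values of $s$ weight long codewords more heavily, which is why the quantity is tied to the Rényi entropy rather than the Shannon entropy in the probabilistic setting.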
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
We study the approximate string matching and regular expression matching
problems for the case when the text to be searched is compressed with the
Ziv-Lempel adaptive dictionary compression schemes. We present a time-space
trade-off that leads to algorithms improving the previously known complexities
for both problems. In particular, we significantly improve the space bounds,
which in practical applications are likely to be a bottleneck.
Prospects and limitations of full-text index structures in genome analysis
The combination of continual advances in sequencing technology, which produce large amounts of data, and innovative bioinformatics approaches designed to cope with this data flood has led to interesting new results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article provides a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, and their practical limitations are explained and compared.