168 research outputs found
Lightweight Lempel-Ziv Parsing
We introduce a new approach to LZ77 factorization that uses O(n/d) words of
working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet
sizes). We also describe carefully engineered implementations of alternative
approaches to lightweight LZ77 factorization. Extensive experiments show that
the new algorithm is superior in most cases, particularly at the lowest memory
levels and for highly repetitive data. As a part of the algorithm, we describe
new methods for computing matching statistics which may be of independent
interest.Comment: 12 page
Lempel-Ziv Parsing in External Memory
For decades, computing the LZ factorization (or LZ77 parsing) of a string has
been a requisite and computationally intensive step in many diverse
applications, including text indexing and data compression. Many algorithms for
LZ77 parsing have been discovered over the years; however, despite the
increasing need to apply LZ77 to massive data sets, no algorithm to date scales
to inputs that exceed the size of internal memory. In this paper we describe
the first algorithm for computing the LZ77 parsing in external memory. Our
algorithm is fast in practice and will allow the next generation of text
indexes to be realised for massive strings and string collections.Comment: 10 page
Bidirectional Text Compression in External Memory
Bidirectional compression algorithms work by substituting repeated substrings by references that, unlike in the famous LZ77-scheme, can point to either direction. We present such an algorithm that is particularly suited for an external memory implementation. We evaluate it experimentally on large data sets of size up to 128 GiB (using only 16 GiB of RAM) and show that it is significantly faster than all known LZ77 compressors, while producing a roughly similar number of factors. We also introduce an external memory decompressor for texts compressed with any uni- or bidirectional compression scheme
Lempel-Ziv Compression in a Sliding Window
We present new algorithms for the sliding window Lempel-Ziv (LZ77) problem and the approximate rightmost LZ77 parsing problem.
Our main result is a new and surprisingly simple algorithm that computes the sliding window LZ77 parse in O(w) space and either O(n) expected time or O(n log log w+z log log s) deterministic time. Here, w is the window size, n is the size of the input string, z is the number of phrases in the parse, and s is the size of the alphabet. This matches the space and time bounds of previous results while removing constant size restrictions on the alphabet size.
To achieve our result, we combine a simple modification and augmentation of the suffix tree with periodicity properties of sliding windows. We also apply this new technique to obtain an algorithm for the approximate rightmost LZ77 problem that uses O(n(log z + log log n)) time and O(n) space and produces a (1+e)-approximation of the rightmost parsing (any constant e>0). While this does not improve the best known time-space trade-offs for exact rightmost parsing, our algorithm is significantly simpler and exposes a direct connection between sliding window parsing and the approximate rightmost matching problem
Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries
We present the first thorough practical study of the Lempel-Ziv-78 and the
Lempel-Ziv-Welch computation based on trie data structures. With a careful
selection of trie representations we can beat well-tuned popular trie data
structures like Judy, m-Bonsai or Cedar
Indexing Highly Repetitive String Collections
Two decades ago, a breakthrough in indexing string collections made it
possible to represent them within their compressed space while at the same time
offering indexed search functionalities. As this new technology permeated
through applications like bioinformatics, the string collections experienced a
growth that outperforms Moore's Law and challenges our ability of handling them
even in compressed form. It turns out, fortunately, that many of these rapidly
growing string collections are highly repetitive, so that their information
content is orders of magnitude lower than their plain size. The statistical
compression methods used for classical collections, however, are blind to this
repetitiveness, and therefore a new set of techniques has been developed in
order to properly exploit it. The resulting indexes form a new generation of
data structures able to handle the huge repetitive string collections that we
are facing.
In this survey we cover the algorithmic developments that have led to these
data structures. We describe the distinct compression paradigms that have been
used to exploit repetitiveness, the fundamental algorithmic ideas that form the
base of all the existing indexes, and the various structures that have been
proposed, comparing them both in theoretical and practical aspects. We conclude
with the current challenges in this fascinating field
Computing Runs on a General Alphabet
We describe a RAM algorithm computing all runs (maximal repetitions) of a
given string of length over a general ordered alphabet in
time and linear space. Our algorithm outperforms all
known solutions working in time provided , where is the alphabet size. We conjecture that there
exists a linear time RAM algorithm finding all runs.Comment: 4 pages, 2 figure
String attractors and combinatorics on words
The notion of string attractor has recently been introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word w = w[1]w[2] · · · w[n] is a subset Γ of the positions 1, . . ., n, such that all distinct factors of w have an occurrence crossing at least one of the elements of Γ. While finding the smallest string attractor for a word is a NP-complete problem, it has been proved in [Kempa and Prezza, 2018] that dictionary compressors can be interpreted as algorithms approximating the smallest string attractor for a given word. In this paper we explore the notion of string attractor from a combinatorial point of view, by focusing on several families of finite words. The results presented in the paper suggest that the notion of string attractor can be used to define new tools to investigate combinatorial properties of the words
- …