Faster Compact On-Line Lempel-Ziv Factorization
We present a new on-line algorithm for computing the Lempel-Ziv factorization
of a string that runs in $O(N\log N)$ time and uses only $O(N\log\sigma)$ bits
of working space, where $N$ is the length of the string and $\sigma$ is the
size of the alphabet. This is a notable improvement compared to the performance
of previous on-line algorithms using the same order of working space but
running in either $O(N\log^3 N)$ time (Okanohara & Sadakane 2009) or
$O(N\log^2 N)$ time (Starikovskaya 2012). The key to our new algorithm is in the
utilization of an elegant but less popular index structure called Directed
Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an
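opportunistic variant of our algorithm, which, given the run-length encoding of
size $m$ of a string of length $N$, computes the Lempel-Ziv factorization
on-line, in $O\left(m \cdot \min\left\{\frac{(\log\log m)(\log\log N)}{\log\log\log N}, \sqrt{\frac{\log m}{\log\log m}}\right\}\right)$ time
and $O(m\log N)$ bits of space, which is faster and more space efficient when
the string is run-length compressible.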
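For readers unfamiliar with the object being computed, the following is a minimal sketch of one common variant of the Lempel-Ziv factorization, in which each factor is the longest prefix of the remaining text that also starts at an earlier position (overlap allowed), or a single fresh character. It is a naive brute-force illustration, not the DAWG-based on-line algorithm of the abstract, and the function name `lz_factorize` is hypothetical.

```python
def lz_factorize(s: str) -> list[str]:
    """Naive LZ factorization (illustrative only; far slower than the paper's bounds)."""
    factors = []
    n = len(s)
    i = 0
    while i < n:
        longest = 0
        # Try every earlier starting position j < i; the match may overlap position i.
        for j in range(i):
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            longest = max(longest, k)
        # A character with no earlier occurrence forms a factor of length 1.
        step = max(longest, 1)
        factors.append(s[i:i + step])
        i += step
    return factors


if __name__ == "__main__":
    print(lz_factorize("abababab"))
```

Concatenating the returned factors always reproduces the input string, which is a convenient sanity check when experimenting with other variants of the definition.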
Lempel-Ziv Factorization May Be Harder Than Computing All Runs
The complexity of computing the Lempel-Ziv factorization and the set of all
runs (= maximal repetitions) is studied in the decision tree model of
computation over an ordered alphabet. It is known that both these problems can be
solved by RAM algorithms in $O(n\log\sigma)$ time, where $n$ is the length of
the input string and $\sigma$ is the number of distinct letters in it. We prove
an $\Omega(n\log\sigma)$ lower bound on the number of comparisons required to
construct the Lempel-Ziv factorization and thereby conclude that a popular
technique for computing runs using the Lempel-Ziv factorization cannot
achieve an $o(n\log\sigma)$ time bound. In contrast with this, we exhibit an
$O(n)$ decision tree algorithm finding all runs in a string. Therefore, in the
decision tree model the runs problem is easier than the Lempel-Ziv
factorization. Thus we support the conjecture that there is a linear RAM
algorithm finding all runs. Comment: 12 pages, 3 figures, submitte
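To make the notion of a run (maximal repetition) concrete, here is a brute-force sketch that enumerates all runs of a string as triples `(start, end, period)` with inclusive 0-based indices. It assumes the usual definition: a substring whose smallest period `p` satisfies length >= 2p and which cannot be extended in either direction while keeping that period. The function names are hypothetical, and this is neither the decision tree algorithm nor the Lempel-Ziv-based technique discussed in the abstract.

```python
def smallest_period(w: str) -> int:
    """Smallest p >= 1 such that w[k] == w[k + p] for all valid k."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[k] == w[k + p] for k in range(n - p)):
            return p
    return n


def find_runs(s: str) -> list[tuple[int, int, int]]:
    """Brute-force enumeration of runs; polynomial time, for illustration only."""
    n = len(s)
    runs = []
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if s[k] != s[k + p]:
                k += 1
                continue
            start = k
            # Extend while the period-p relation keeps holding.
            while k + p < n and s[k] == s[k + p]:
                k += 1
            end = k + p - 1  # inclusive end of the maximal period-p stretch
            long_enough = end - start + 1 >= 2 * p
            if long_enough and smallest_period(s[start:end + 1]) == p:
                runs.append((start, end, p))
    return sorted(runs)


if __name__ == "__main__":
    print(find_runs("mississippi"))
```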
A Grammar Compression Algorithm based on Induced Suffix Sorting
We introduce GCIS, a grammar compression algorithm based on the induced
suffix sorting algorithm SAIS, introduced by Nong et al. in 2009. Our solution
builds on the factorization performed by SAIS during suffix sorting. We
construct a context-free grammar on the input string which can be further
reduced into a shorter string by substituting each substring by its
corresponding factor. The resulting grammar is encoded by exploiting some
redundancies, such as common prefixes between suffix rules, which are sorted
according to the SAIS framework. When compared to well-known compression tools such
as Re-Pair and 7-zip, our algorithm is competitive and very effective at
handling repetitive strings with respect to compression ratio and compression and
decompression running time.
Fast online Lempel-Ziv factorization in compressed space
Let $T$ be a text of length $n$ on an alphabet $\Sigma$ of size $\sigma$, and let $H_0$ be the zero-order empirical entropy of $T$. We show that the LZ77 factorization of $T$ can be computed in $nH_0 + o(n\log\sigma) + O(\sigma\log n)$ bits of working space with an online algorithm running in $O(n\log n)$ time. Previous space-efficient online solutions either work in compact space and $O(n\log n)$ time, or in succinct space and $O(n\log^3 n)$ time.
Universal Compressed Text Indexing
The rise of repetitive datasets has lately generated a lot of interest in
compressed self-indexes based on dictionary compression, a rich and
heterogeneous family that exploits text repetitions in different ways. For each
such compression scheme, several different indexing solutions have been
proposed in the last two decades. To date, the fastest indexes for repetitive
texts are based on the run-length compressed Burrows-Wheeler transform and on
the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on
the other hand, are based on the Lempel-Ziv parsing and on grammar compression.
Indexes for more universal schemes such as collage systems and macro schemes
have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed
that all dictionary compressors can be interpreted as approximation algorithms
for the smallest string attractor, that is, a set of text positions capturing
all distinct substrings. Starting from this observation, in this paper we
develop the first universal compressed self-index, that is, the first indexing
data structure based on string attractors, which can therefore be built on top
of any dictionary-compressed text representation. Let $\gamma$ be the size of a
string attractor for a text of length $n$. Our index takes
$O(\gamma\log(n/\gamma))$ words of space and supports locating the $occ$
occurrences of any pattern of length $m$ in $O(m\log n + occ\log^{\epsilon} n)$
time, for any constant $\epsilon > 0$. This is, in particular, the first index
for general macro schemes and collage systems. Our result shows that the
relation between indexing and compression is much deeper than what was
previously thought: the simple property standing at the core of all dictionary
compressors is sufficient to support fast indexed queries. Comment: Fixed with reviewer's comment
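The string attractor property mentioned in the abstract (a set of text positions such that every distinct substring has at least one occurrence crossing a position in the set) can be checked directly on small inputs. The sketch below is a brute-force verifier written only to illustrate the definition; the function name `is_string_attractor` and the 0-based position convention are assumptions, and it says nothing about how the index itself is built.

```python
def is_string_attractor(text: str, positions: set[int]) -> bool:
    """Check the attractor property by enumerating all substrings (small inputs only)."""
    n = len(text)
    for length in range(1, n + 1):
        seen = set()     # all distinct substrings of this length
        covered = set()  # those with an occurrence crossing an attractor position
        for i in range(n - length + 1):
            sub = text[i:i + length]
            seen.add(sub)
            if any(i <= p < i + length for p in positions):
                covered.add(sub)
        if seen != covered:
            return False
    return True


if __name__ == "__main__":
    print(is_string_attractor("abab", {0, 1}))
    print(is_string_attractor("abab", {1}))
```

The set of all positions is trivially an attractor; the interesting question, as the abstract notes, is how small such a set can be for repetitive texts.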