Lempel-Ziv Factorization May Be Harder Than Computing All Runs
The complexity of computing the Lempel-Ziv factorization and the set of all
runs (= maximal repetitions) is studied in the decision tree model of
computation over an ordered alphabet. It is known that both these problems can be
solved by RAM algorithms in $O(n\log\sigma)$ time, where $n$ is the length of
the input string and $\sigma$ is the number of distinct letters in it. We prove
an $\Omega(n\log\sigma)$ lower bound on the number of comparisons required to
construct the Lempel-Ziv factorization and thereby conclude that a popular
technique of computation of runs using the Lempel-Ziv factorization cannot
achieve an $o(n\log\sigma)$ time bound. In contrast with this, we exhibit an
$O(n)$ decision tree algorithm finding all runs in a string. Therefore, in the
decision tree model the runs problem is easier than the Lempel-Ziv
factorization. Thus we support the conjecture that there is a linear RAM
algorithm finding all runs.
Comment: 12 pages, 3 figures, submitted
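To make the object of study concrete, the following is a minimal sketch of a Lempel-Ziv factorization computed directly from its definition. It assumes the common variant in which each factor is the longest prefix of the remaining text that also occurs starting at an earlier position (overlaps allowed), or a single letter if there is none. The function name `lz_factorize` is hypothetical, and this brute-force version is far slower than the bounds discussed in the abstract; it only illustrates what is being computed.

```python
def lz_factorize(s: str) -> list[str]:
    """Naive Lempel-Ziv factorization (earlier occurrences may overlap the factor)."""
    factors = []
    n = len(s)
    i = 0
    while i < n:
        length = 0
        # Try every earlier start j < i and extend the match as far as possible.
        for j in range(i):
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            length = max(length, k)
        # A letter with no earlier occurrence forms a factor of length 1.
        length = max(length, 1)
        factors.append(s[i:i + length])
        i += length
    return factors


print(lz_factorize("abababab"))  # ['a', 'b', 'ababab']
```

Only equality tests between letters are used here; the lower bound in the abstract concerns how many comparisons any algorithm needs in the decision tree model, not this particular procedure.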
Computing Runs on a General Alphabet
We describe a RAM algorithm computing all runs (maximal repetitions) of a
given string of length $n$ over a general ordered alphabet in
$O(n\log^{2/3} n)$ time and linear space. Our algorithm outperforms all
known solutions working in $\Theta(n\log\sigma)$ time provided $\sigma = n^{\Omega(1)}$, where $\sigma$ is the alphabet size. We conjecture that there
exists a linear time RAM algorithm finding all runs.
Comment: 4 pages, 2 figures
Algorithms to Compute the Lyndon Array
We first describe three algorithms for computing the Lyndon array that have
been suggested in the literature, but for which no structured exposition has
been given. Two of these algorithms execute in quadratic time in the worst
case, the third achieves linear time, but at the expense of prior computation
of both the suffix array and the inverse suffix array of x. We then go on to
describe two variants of a new algorithm that avoids prior computation of
global data structures and executes in worst-case n log n time. Experimental
evidence suggests that all but one of these five algorithms require only linear
execution time in practice, with the two new algorithms faster by a small
factor. We conjecture that there exists a fast and worst-case linear-time
algorithm to compute the Lyndon array that is also elementary (making no use of
global data structures such as the suffix array).
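For readers unfamiliar with the Lyndon array, here is a minimal elementary sketch. It assumes the usual definition: entry i is the length of the longest Lyndon word starting at position i. Each entry is obtained with the first step of Duval's factorization algorithm, so the whole computation is quadratic in the worst case. This is an illustration only, not one of the five algorithms the abstract describes; the names `longest_lyndon_prefix` and `lyndon_array` are hypothetical.

```python
def longest_lyndon_prefix(x: str, start: int) -> int:
    """Length of the longest Lyndon word that is a prefix of x[start:]."""
    n = len(x)
    k = start      # position being compared against inside the current period
    j = start + 1  # scanning position
    while j < n and x[k] <= x[j]:
        # A strictly larger letter restarts the period; an equal one continues it.
        k = start if x[k] < x[j] else k + 1
        j += 1
    return j - k


def lyndon_array(x: str) -> list[int]:
    """Lyndon array computed entry by entry, without any global data structure."""
    return [longest_lyndon_prefix(x, i) for i in range(len(x))]


print(lyndon_array("banana"))  # [1, 2, 1, 2, 1, 1]
```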
Compression with the tudocomp Framework
We present a framework facilitating the implementation and comparison of text compression algorithms. We evaluate its features by a case study on two novel compression algorithms based on the Lempel-Ziv compression schemes that perform well on highly repetitive texts.
Optimal-Time Text Indexing in BWT-runs Bounded Space
Indexing highly repetitive texts --- such as genomic databases, software
repositories and versioned text collections --- has become an important problem
since the turn of the millennium. A relevant compressibility measure for
repetitive texts is $r$, the number of runs in their Burrows-Wheeler Transform
(BWT). One of the earliest indexes for repetitive collections, the Run-Length
FM-index, used $O(r)$ space and was able to efficiently count the number of
occurrences of a pattern of length $m$ in the text (in loglogarithmic time per
pattern symbol, with current techniques). However, it was unable to locate the
positions of those occurrences efficiently within a space bounded in terms of
$r$. Since then, a number of other indexes with space bounded by other measures
of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size
of the smallest grammar generating the text, the size of the smallest automaton
recognizing the text factors --- have been proposed for efficiently locating,
but not directly counting, the occurrences of a pattern. In this paper we close
this long-standing problem, showing how to extend the Run-Length FM-index so
that it can locate the $occ$ occurrences efficiently within $O(r)$ space (in
loglogarithmic time each), and reaching optimal time $O(m+occ)$ within
$O(r\log(n/r))$ space, on a RAM machine of $w=\Omega(\log n)$ bits. Within
$O(r\log(n/r))$ space, our index can also count in optimal time $O(m)$.
Raising the space to $O(rw\log_\sigma(n/r))$, we support count and locate in
$O(m\log(\sigma)/w)$ and $O(m\log(\sigma)/w+occ)$ time, which is optimal in the
packed setting and had not been obtained before in compressed space. We also
describe a structure using $O(r\log(n/r))$ space that replaces the text and
extracts any text substring of length $\ell$ in almost-optimal time
$O(\log(n/r)+\ell\log(\sigma)/w)$. (...continues...)
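The compressibility measure used above, the number of runs in the BWT, can be illustrated with a minimal sketch. It assumes the textbook construction of the BWT by sorting all rotations of the text with an appended sentinel that is smaller than every text symbol; the function names `bwt` and `count_runs` and the choice of `$` as sentinel are assumptions for illustration. Practical indexes build the BWT far more efficiently.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (sentinel assumed smallest)."""
    t = text + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s: str) -> int:
    """Number of maximal blocks of equal consecutive symbols in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


transformed = bwt("banana")
print(transformed)              # annb$aa
print(count_runs(transformed))  # 5
```

On repetitive inputs the transform tends to group equal symbols together, which is why the run count is small relative to the text length for such collections.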
Almost Linear Time Computation of Maximal Repetitions in Run Length Encoded Strings
We consider the problem of computing all maximal repetitions contained in a string that is given in run-length encoding.
Given a run-length encoding of a string, we show that the maximum number of maximal repetitions contained in the string is at most m+k-1, where m is the size of the run-length encoding, and k is the number of run-length factors whose exponent is at least 2.
We also show an algorithm for computing all maximal repetitions in O(m alpha(m)) time and O(m) space, where alpha denotes the inverse Ackermann function.
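The two parameters in the bound above can be read off a run-length encoding directly. The sketch below is a small illustration, assuming a non-empty input string; the names `run_length_encode` and `repetition_bound` are hypothetical. It evaluates the stated bound m+k-1 and does not compute the maximal repetitions themselves.

```python
from itertools import groupby


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Run-length encoding as (letter, exponent) pairs."""
    return [(letter, len(list(group))) for letter, group in groupby(s)]


def repetition_bound(s: str) -> int:
    """Evaluate m + k - 1 for a non-empty string s."""
    rle = run_length_encode(s)
    m = len(rle)                                          # size of the encoding
    k = sum(1 for _, exponent in rle if exponent >= 2)    # factors with exponent >= 2
    return m + k - 1


print(run_length_encode("aabbbc"))  # [('a', 2), ('b', 3), ('c', 1)]
print(repetition_bound("aabbbc"))   # 4
```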
Finding the Leftmost Critical Factorization on Unordered Alphabet
We present a linear time and space algorithm computing the leftmost critical
factorization of a given string on an unordered alphabet.
Comment: 13 pages, 13 figures (accepted to Theor. Comp. Sci.)
Compressibility-Aware Quantum Algorithms on Strings
Sublinear time quantum algorithms have been established for many fundamental
problems on strings. This work demonstrates that new, faster quantum algorithms
can be designed when the string is highly compressible. We focus on two popular
and theoretically significant compression algorithms -- the Lempel-Ziv77
algorithm (LZ77) and the Run-length-encoded Burrows-Wheeler Transform (RL-BWT),
and obtain the results below.
We first provide a quantum algorithm running in $\tilde{O}(\sqrt{zn})$ time
for finding the LZ77 factorization of an input string with $z$
factors. Combined with multiple existing results, this yields an
$\tilde{O}(\sqrt{rn})$ time quantum algorithm for finding the RL-BWT encoding
with $r$ BWT runs. Note that $r = \tilde{\Theta}(z)$. We complement these
results with lower bounds proving that our algorithms are optimal (up to
polylog factors).
Next, we study the problem of compressed indexing, where we provide a
$\tilde{O}(\sqrt{rn})$ time quantum algorithm for constructing a recently
designed $\tilde{O}(r)$ space structure with equivalent capabilities as the
suffix tree. This data structure is then applied to numerous problems to obtain
sublinear time quantum algorithms when the input is highly compressible. For
example, we show that the longest common substring of two strings of total
length $n$ can be computed in $\tilde{O}(\sqrt{zn})$ time, where $z$ is the
number of factors in the LZ77 factorization of their concatenation. This beats
the best known $\tilde{O}(n^{2/3})$ time quantum algorithm when $z$ is
sufficiently small.
Linear Time Runs Over General Ordered Alphabets
A run in a string is a maximal periodic substring. For example, the string
$\texttt{bananatree}$ contains the runs $\texttt{anana} = (\texttt{an})^{5/2}$
and $\texttt{ee} = \texttt{e}^2$. There are fewer than $n$ runs in any
length-$n$ string, and computing all runs for a string over a linearly-sortable
alphabet takes $\mathcal{O}(n)$ time (Bannai et al., SODA 2015). Kosolobov
conjectured that there also exists a linear time runs algorithm for general
ordered alphabets (Inf. Process. Lett. 2016). The conjecture was almost proven
by Crochemore et al., who presented an $\mathcal{O}(n\alpha(n))$ time algorithm
(where $\alpha(n)$ is the extremely slowly growing inverse Ackermann function).
We show how to achieve $\mathcal{O}(n)$ time by exploiting combinatorial
properties of the Lyndon array, thus proving Kosolobov's conjecture.
Comment: This work has been submitted to ICALP 202
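The definition of a run can be checked on the example string with a brute-force sketch. It assumes the standard formalization: a run is a substring whose smallest period p satisfies length >= 2p and that cannot be extended to the left or right without breaking that period. The names `smallest_period` and `runs` are hypothetical, and this enumeration is polynomially slower than the linear-time result described above; it serves only to make the definition concrete.

```python
def smallest_period(w: str) -> int:
    """Smallest p >= 1 such that w[k] == w[k + p] for every valid k."""
    return next(p for p in range(1, len(w) + 1)
                if all(w[k] == w[k + p] for k in range(len(w) - p)))


def runs(s: str) -> list[tuple[int, int, int]]:
    """All runs of s as (start, end_exclusive, period), found by brute force."""
    n = len(s)
    found = []
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if s[k] != s[k + p]:
                k += 1
                continue
            a = k
            while k + p < n and s[k] == s[k + p]:
                k += 1
            # s[a:k+p] has period p and cannot be extended in either direction.
            # Keep it only if it spans two periods and p is its smallest period.
            if k - a >= p and smallest_period(s[a:k + p]) == p:
                found.append((a, k + p, p))
    return sorted(found)


text = "bananatree"
for start, end, period in runs(text):
    print(text[start:end], "period", period)
# anana period 2
# ee period 1
```

The exponent of a run is its length divided by its period, which gives 5/2 for `anana` and 2 for `ee`.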
One-dimensional staged self-assembly
17th International Conference, DNA 17, Pasadena, CA, USA, September 19-23, 2011. Proceedings.
We introduce the problem of staged self-assembly of one-dimensional nanostructures, which becomes interesting when the elements are labeled (e.g., representing functional units that must be placed at specific locations). In a restricted model in which each operation has a single terminal assembly, we prove that assembling a given string of labels with the fewest stages is equivalent, up to constant factors, to compressing the string to be uniquely derived from the smallest possible context-free grammar (a well-studied O(log n)-approximable problem). Without this restriction, we show that the optimal assembly can be substantially smaller than the optimal context-free grammar, by a factor of Ω(√(n/log n)) even for binary strings of length n. Fortunately, we can bound this separation in model power by a quadratic function in the number of distinct glues or tiles allowed in the assembly, which is typically small in practice.