14 research outputs found
Optimal Substring-Equality Queries with Applications to Sparse Text Indexing
We consider the problem of encoding a string of length n from an integer
alphabet of size σ so that access and substring equality queries (that
is, determining the equality of any two substrings) can be answered
efficiently. Any uniquely-decodable encoding supporting access must take
n log σ + Θ(log n) bits. We describe a new data structure matching this lower
bound for suitable values of σ while supporting both queries in optimal
constant time. Furthermore, we show that the string can be overwritten
in place with this structure. The small redundancy and the constant query
time break exponentially a lower bound that is known to hold in the
read-only model. Using our new string representation, we obtain the first
in-place subquadratic (indeed, even sublinear in some cases) algorithms for
several string-processing problems in the restore model: the input string is
rewritable and must be restored before the computation terminates. In
particular, we describe the first in-place subquadratic Monte Carlo solutions
to the sparse suffix sorting, sparse LCP array construction, and suffix
selection problems. With the sole exception of suffix selection, our
algorithms are also the first running in sublinear time for small enough sets
of input suffixes. Combining these solutions, we obtain the first
sublinear-time Monte Carlo algorithm for building the sparse suffix tree in
compact space. We also show how to derandomize our algorithms using small
space. This leads to the first Las Vegas in-place algorithm computing the full
LCP array and to the first Las Vegas in-place algorithms solving the sparse
suffix sorting and sparse LCP array construction problems. The running times
of these Las Vegas algorithms hold in the worst case with high probability.
Comment: Refactored according to TALG's reviews. New w.h.p. bounds and Las
Vegas algorithm
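As an illustration of the kind of query involved, the following sketch answers substring-equality queries in constant time via Karp-Rabin fingerprints. This is the standard Monte Carlo baseline, not the deterministic encoding described above; the class and method names are illustrative only.

```python
# Monte Carlo substring equality: precompute Karp-Rabin prefix
# fingerprints so any two substrings compare in O(1) time, with a
# negligibly small probability of a false positive on unequal strings.
import random

class SubstringEquality:
    def __init__(self, s, prime=(1 << 61) - 1):
        self.p = prime
        self.base = random.randrange(2, prime - 1)
        self.pref = [0]   # pref[i] = fingerprint of s[:i]
        self.pow = [1]    # base^i mod p
        for c in s:
            self.pref.append((self.pref[-1] * self.base + ord(c)) % self.p)
            self.pow.append(self.pow[-1] * self.base % self.p)

    def fp(self, i, j):
        """Fingerprint of s[i:j]."""
        return (self.pref[j] - self.pref[i] * self.pow[j - i]) % self.p

    def equal(self, i, j, length):
        """True iff s[i:i+length] == s[j:j+length] (w.h.p.)."""
        return self.fp(i, i + length) == self.fp(j, j + length)
```

For example, on "abracadabra" the query equal(0, 7, 4) compares the two occurrences of "abra" without reading them character by character.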
Fully dynamic data structure for LCE queries in compressed space
A Longest Common Extension (LCE) query on a text T of length N asks for
the length of the longest common prefix of the suffixes starting at two given
positions. We show that the signature encoding G of T of size
O(min(z log N log* M, N)) [Mehlhorn et al., Algorithmica 17(2):183-198,
1997], which can be seen as a compressed representation of T, has the
capability to support LCE queries in O(log N + log ℓ log* M) time,
where ℓ is the answer to the query, z is the size of the Lempel-Ziv77
(LZ77) factorization of T, and M is an integer that can be handled
in constant time under the word RAM model. In compressed space, this is the
fastest deterministic LCE data structure in many cases. Moreover, G can be
enhanced to support efficient update operations: after preprocessing G,
we can insert/delete any (sub)string into/from an arbitrary position of T
efficiently. This yields the first fully dynamic LCE data structure. We also
present efficient construction algorithms from various types of inputs: we can
construct G from the uncompressed string T; from a grammar-compressed string
represented by a straight-line program; and from the LZ77-compressed string
with z factors. On top of the above contributions, we show several
applications of our data structures which improve the previous best known
results on grammar-compressed string processing.
Comment: arXiv admin note: text overlap with arXiv:1504.0695
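For concreteness, a naive baseline exposing the interface of a fully dynamic LCE structure might look as follows. Queries here cost O(ℓ) time and updates are plain list splices, whereas the compressed structure above achieves polylogarithmic bounds; all names are illustrative.

```python
# Naive dynamic LCE baseline: correct but slow, useful only to pin
# down the operations a fully dynamic LCE data structure must support.
class DynamicLCENaive:
    def __init__(self, text):
        self.t = list(text)

    def lce(self, i, j):
        """Length of the longest common prefix of suffixes i and j."""
        n, ell = len(self.t), 0
        while i + ell < n and j + ell < n and self.t[i + ell] == self.t[j + ell]:
            ell += 1
        return ell

    def insert(self, pos, s):
        """Insert string s at position pos."""
        self.t[pos:pos] = list(s)

    def delete(self, pos, length):
        """Delete the substring of the given length starting at pos."""
        del self.t[pos:pos + length]
```

On "banana", lce(1, 3) returns 3, since the suffixes "anana" and "ana" share the prefix "ana"; after an insertion or deletion the same method keeps answering correctly because it rescans the current text.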
Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries
We present the first thorough practical study of Lempel-Ziv-78 and
Lempel-Ziv-Welch computation based on trie data structures. With a careful
selection of trie representations, we can beat well-tuned popular trie data
structures like Judy, m-Bonsai, or Cedar.
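A minimal sketch of how a trie drives the LZ78 factorization (the classic textbook algorithm, not any of the tuned trie representations benchmarked above):

```python
# LZ78 via a trie stored as a dictionary mapping (node id, char) to a
# child node id. Each new phrase extends an existing trie node by one
# character and becomes a new node itself.
def lz78(text):
    """Return LZ78 factors as (referenced factor index, extension char)."""
    trie = {}        # (node id, char) -> node id
    next_id = 1      # node 0 is the trie root (empty phrase)
    factors = []
    node = 0
    for c in text:
        if (node, c) in trie:
            node = trie[(node, c)]      # extend the current phrase
        else:
            trie[(node, c)] = next_id   # new phrase = phrase `node` + c
            next_id += 1
            factors.append((node, c))
            node = 0
    if node != 0:
        factors.append((node, ''))      # final phrase ended with the text
    return factors
```

For example, lz78("abab") yields [(0, 'a'), (0, 'b'), (1, 'b')]: the phrases "a", "b", and "ab" (factor 1 extended by 'b').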
Space Efficient Construction of Lyndon Arrays in Linear Time
Given a string S of length n, its Lyndon array identifies for each suffix S[i..n] the next lexicographically smaller suffix S[j..n], i.e. the minimal index j > i with S[i..n] > S[j..n]. Apart from its plain (n log n)-bit array representation, the Lyndon array can also be encoded as a succinct parentheses sequence that requires only 2n bits of space. While linear time construction algorithms for both representations exist, it has previously been unknown if the same time bound can be achieved with less than Ω(n lg n) bits of additional working space. We show that, in fact, o(n) additional bits are sufficient to compute the succinct 2n-bit version of the Lyndon array in linear time. For the plain (n log n)-bit version, we only need O(1) additional words to achieve linear time. Our space efficient construction algorithm makes the Lyndon array more accessible as a fundamental data structure in applications like full-text indexing.
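The definition above can be made concrete with a short sketch that computes the Lyndon array as a next-smaller-suffix computation over suffix ranks. It uses naive suffix sorting for clarity and ignores the space bounds achieved in the paper.

```python
# Lyndon array via suffix ranks: lam[i] is the distance from i to the
# next lexicographically smaller suffix (or the suffix length if none),
# computed with a right-to-left monotone stack over the rank array.
def lyndon_array(s):
    n = len(s)
    # rank of each suffix (inverse suffix array), naive sort for clarity
    sa = sorted(range(n), key=lambda i: s[i:])
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lam = [0] * n
    stack = []  # indices of candidate smaller suffixes to the right
    for i in range(n - 1, -1, -1):
        while stack and rank[stack[-1]] > rank[i]:
            stack.pop()
        lam[i] = (stack[-1] - i) if stack else (n - i)
        stack.append(i)
    return lam
```

On "banana" this returns [1, 2, 1, 2, 1, 1]; e.g. the suffix "anana" at position 1 has next smaller suffix "ana" at position 3, so its entry is 2.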
Tight lower bounds for the longest common extension problem
The longest common extension problem is to preprocess a given string of length n into a data structure that uses S(n) bits on top of the input and answers in T(n) time the queries LCE(i, j), computing the length of the longest string that occurs at both positions i and j in the input. We prove that the trade-off S(n)T(n) = Ω(n log n) holds in the non-uniform cell-probe model provided that the input string is read-only, each letter occupies a separate memory cell, S(n) = Ω(n), and the size of the input alphabet is at least 2^(8⌈S(n)/n⌉). It is known that this trade-off is tight.
Locally Consistent Parsing for Text Indexing in Small Space
We consider two closely related problems of text indexing in a sub-linear
working space. The first problem is the Sparse Suffix Tree (SST) construction
of a set of b suffixes using only O(b) words of space. The second problem
is the Longest Common Extension (LCE) problem, where for some parameter
τ, the goal is to construct a data structure that uses O(n/τ) words of
space and can compute the longest common prefix length of any pair of
suffixes. We show how to use ideas based on the Locally Consistent
Parsing technique, which was introduced by Sahinalp and Vishkin [STOC '94], in
some non-trivial ways in order to improve the known results for the above
problems. We introduce new Las-Vegas and deterministic algorithms for both
problems.
We introduce the first Las-Vegas SST construction algorithm that takes O(n)
time. This is an improvement over the last result of Gawrychowski and
Kociumaka [SODA '17], who obtained O(n) time for a Monte-Carlo algorithm and a
slower running time for a Las-Vegas algorithm. In addition, we introduce a
randomized Las-Vegas construction for an LCE data structure that can be
constructed in linear time and answers queries in O(τ) time.
For the deterministic algorithms, we introduce an SST construction algorithm
that takes almost linear time for a suitable range of b. This is the first
almost linear time deterministic SST construction algorithm; all previous
deterministic algorithms take substantially more time. For the LCE problem, we
introduce a deterministic data structure that improves both the query time and
the construction time upon the results of Tanimura et al. [CPM '16].
Comment: Extended abstract to appear in SODA 202
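To illustrate the link between LCE queries and sparse suffix sorting, the following sketch sorts a set of suffix positions using an LCE oracle, so that each suffix comparison costs one LCE query plus one character probe. The naive lce() here is a stand-in for the small-space structures above; all names are illustrative.

```python
# Sparse suffix sorting driven by an LCE oracle: after the LCE of two
# suffixes is known, a single character comparison decides their order.
import functools

def sparse_suffix_sort(s, positions):
    n = len(s)

    def lce(i, j):  # placeholder for a real LCE data structure
        ell = 0
        while i + ell < n and j + ell < n and s[i + ell] == s[j + ell]:
            ell += 1
        return ell

    def cmp(i, j):
        ell = lce(i, j)
        if i + ell == n:
            return -1   # suffix i is a proper prefix of suffix j
        if j + ell == n:
            return 1
        return -1 if s[i + ell] < s[j + ell] else 1

    return sorted(positions, key=functools.cmp_to_key(cmp))
```

For example, sorting the positions {0, 2, 4} of "banana" orders the suffixes "banana" < "na" < "nana", giving [0, 4, 2].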