Longest Common Extensions in Sublinear Space
The longest common extension problem (LCE problem) is to construct a data
structure for an input string $T$ of length $n$ that supports $LCE(i,j)$
queries. Such a query returns the length of the longest common prefix of the
suffixes starting at positions $i$ and $j$ in $T$. This classic problem has a
well-known solution that uses $O(n)$ space and $O(1)$ query time. In this paper
we show that for any trade-off parameter $1 \leq \tau \leq n$, the problem can
be solved in $O(n/\tau)$ space and $O(\tau)$ query time. This
significantly improves the previously best known time-space trade-offs, and
almost matches the best known time-space product lower bound.
Comment: An extended abstract of this paper has been accepted to CPM 201
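To make the query in the abstract above concrete, here is a minimal sketch of an LCE query answered by direct character comparison. It is not the data structure from the paper: it stores nothing beyond the string itself and takes time proportional to the answer. The function name `lce` and the 0-based positions are assumptions made for this illustration.

```python
def lce(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of text[i:] and text[j:].

    Positions are 0-based (an assumption of this sketch). The scan
    compares characters one by one, so a query costs O(answer) time
    and uses no extra space beyond the input string.
    """
    n = len(text)
    length = 0
    while (i + length < n and j + length < n
           and text[i + length] == text[j + length]):
        length += 1
    return length


if __name__ == "__main__":
    # Suffixes "abracadabra" and "abra" share the prefix "abra".
    print(lce("abracadabra", 0, 7))
```

The trade-off results listed here sit between this scan (no extra space, slow queries) and the classic linear-space, constant-time solution.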
Optimal Substring-Equality Queries with Applications to Sparse Text Indexing
We consider the problem of encoding a string of length $n$ from an integer
alphabet of size $\sigma$ so that access and substring equality queries (that
is, determining the equality of any two substrings) can be answered
efficiently. Any uniquely-decodable encoding supporting access must take
$n\log\sigma + \Theta(\log(n\log\sigma))$ bits. We describe a new data
structure matching this lower bound when $\sigma \leq n^{O(1)}$ while supporting
both queries in optimal $O(1)$ time. Furthermore, we show that the string can
be overwritten in-place with this structure. The redundancy of $\Theta(\log n)$
bits and the constant query time break exponentially a lower bound that is
known to hold in the read-only model. Using our new string representation, we
obtain the first in-place subquadratic (indeed, even sublinear in some cases)
algorithms for several string-processing problems in the restore model: the
input string is rewritable and must be restored before the computation
terminates. In particular, we describe the first in-place subquadratic Monte
Carlo solutions to the sparse suffix sorting, sparse LCP array construction,
and suffix selection problems. With the sole exception of suffix selection, our
algorithms are also the first running in sublinear time for small enough sets
of input suffixes. Combining these solutions, we obtain the first
sublinear-time Monte Carlo algorithm for building the sparse suffix tree in
compact space. We also show how to derandomize our algorithms using small
space. This leads to the first Las Vegas in-place algorithm computing the full
LCP array in $O(n\log n)$ time and to the first Las Vegas in-place algorithms
solving the sparse suffix sorting and sparse LCP array construction problems in
$O(n^{1.5}\sqrt{\log\sigma})$ time. Running times of these Las Vegas
algorithms hold in the worst case with high probability.
Comment: Refactored according to TALG's reviews. New w.h.p. bounds and Las
Vegas algorithm
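As an illustration of what a Monte Carlo substring-equality query looks like, the following sketch uses Karp-Rabin style polynomial fingerprints over prefix hashes. This is a swapped-in textbook technique, not the in-place encoding of the paper: it keeps two auxiliary arrays of $n+1$ integers, so it does not attempt to meet the space bound discussed above. The class name `SubstringEquality`, the modulus, and the `seed` parameter are all assumptions of this sketch.

```python
import random


class SubstringEquality:
    """Monte Carlo substring equality via polynomial fingerprints.

    Equal substrings always compare equal; distinct substrings may
    collide with small probability over the random choice of base.
    """

    MOD = (1 << 61) - 1  # a Mersenne prime, chosen for this sketch

    def __init__(self, text: str, seed=None):
        rng = random.Random(seed)
        self.base = rng.randrange(256, self.MOD - 1)
        n = len(text)
        self.prefix = [0] * (n + 1)  # prefix[k] = fingerprint of text[:k]
        self.power = [1] * (n + 1)   # power[k] = base**k mod MOD
        for k, ch in enumerate(text):
            self.prefix[k + 1] = (self.prefix[k] * self.base + ord(ch) + 1) % self.MOD
            self.power[k + 1] = (self.power[k] * self.base) % self.MOD

    def fingerprint(self, start: int, length: int) -> int:
        """Fingerprint of text[start:start + length] in O(1) time."""
        end = start + length
        return (self.prefix[end] - self.prefix[start] * self.power[length]) % self.MOD

    def equal(self, i: int, j: int, length: int) -> bool:
        """True if text[i:i+length] and text[j:j+length] have equal fingerprints."""
        return self.fingerprint(i, length) == self.fingerprint(j, length)


if __name__ == "__main__":
    index = SubstringEquality("abracadabra", seed=1)
    print(index.equal(0, 7, 4))  # compares "abra" with "abra"
```

A `True` answer can be wrong with small probability, which is the Monte Carlo behaviour the abstract refers to; a Las Vegas variant would additionally verify answers so that only the running time is randomized.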
Faster Longest Common Extension Queries in Strings over General Alphabets
Longest common extension queries (often called longest common prefix queries)
constitute a fundamental building block in multiple string algorithms, for
example computing runs and approximate pattern matching. We show that a
sequence of $q$ LCE queries for a string of size $n$ over a general ordered
alphabet can be realized in $O(q \log\log n + n \log^* n)$ time making only
$O(q+n)$ symbol comparisons. Consequently, all runs in a string over a general
ordered alphabet can be computed in $O(n \log\log n)$ time making $O(n)$
symbol comparisons. Our results improve upon a solution by Kosolobov
(Information Processing Letters, 2016), who gave an algorithm with
$O(n \log^{2/3} n)$ running time and conjectured that $O(n)$ time is possible. We
make significant progress towards resolving this conjecture. Our techniques
extend to the case of general unordered alphabets, when the time increases to
$O(q \log n + n \log^* n)$. The main tools are difference covers and the
disjoint-sets data structure.
Comment: Accepted to CPM 201
Substring Complexity in Sublinear Space
Shannon's entropy is a definitive lower bound for statistical compression.
Unfortunately, no such clear measure exists for the compressibility of
repetitive strings. Thus, ad-hoc measures are employed to estimate the
repetitiveness of strings, e.g., the size of the Lempel-Ziv parse or the
number of equal-letter runs of the Burrows-Wheeler transform. A more recent
one is the size $\gamma$ of a smallest string attractor. Unfortunately, Kempa
and Prezza [STOC 2018] showed that computing $\gamma$ is NP-hard. Kociumaka et
al. [LATIN 2020] considered a new measure that is based on the function $S_T$
counting the cardinalities of the sets of substrings of each length of $T$,
also known as the substring complexity. This new measure is defined as
$\delta = \sup\{S_T(k)/k, k \geq 1\}$ and lower bounds all the measures previously
considered. In particular, $\delta \leq \gamma$ always holds and $\delta$ can be
computed in $O(n)$ time using $\Omega(n)$ working space. Kociumaka et
al. showed that if $\delta$ is given, one can construct an
$O(\delta \log\frac{n}{\delta})$-sized representation of $T$ supporting efficient direct
access and efficient pattern matching queries on $T$. Given that for highly
compressible strings, $\delta$ is significantly smaller than $n$, it is natural
to pose the following question: Can we compute $\delta$ efficiently using
sublinear working space?
It is straightforward to show that any algorithm computing $\delta$ using
$O(b)$ space requires $\Omega(n^{2-o(1)}/b)$ time through a reduction
from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present
the following results: an $O(n^3/b^2)$-time and
$O(b)$-space algorithm to compute $\delta$, for any $b \in [1,n]$; and
an $\tilde{O}(n^2/b)$-time and $O(b)$-space algorithm to
compute $\delta$, for any $b \in [n^{2/3},n]$
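The measure discussed above can be checked on small inputs with a brute-force sketch. It takes the substring complexity to be the number of distinct substrings of each length $k$, and the measure to be the maximum over $k$ of that count divided by $k$ (for a finite string the supremum is attained for some $k$ between 1 and the string length). This is not one of the sublinear-space algorithms from the paper: it materializes every substring and so uses far more than linear working space. The function name `substring_complexity_delta` is an assumption of this illustration.

```python
from fractions import Fraction


def substring_complexity_delta(text: str) -> Fraction:
    """Brute-force maximum over k of (#distinct length-k substrings) / k.

    Exact rational arithmetic avoids floating-point ties. Time and
    space are polynomial but far from the bounds in the abstract.
    """
    n = len(text)
    best = Fraction(0)
    for k in range(1, n + 1):
        distinct = {text[s:s + k] for s in range(n - k + 1)}
        best = max(best, Fraction(len(distinct), k))
    return best


if __name__ == "__main__":
    # For "0001011100" all 8 substrings of length 3 are distinct.
    print(substring_complexity_delta("0001011100"))
```

On `"0001011100"` the maximum is reached at $k = 3$, which shows that the measure is not always determined by the alphabet size alone.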
String Indexing with Compressed Patterns
Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is given in compressed form and the goal is to achieve query time that is fast in terms of the compressed size of the pattern. This captures the common client-server scenario, where a client submits a query and communicates it in compressed form to a server. Instead of the server decompressing the query before processing it, we consider how to efficiently process the compressed query directly. Our main result is a novel linear space data structure that achieves near-optimal query time for patterns compressed with the classic Lempel-Ziv 1977 (LZ77) compression scheme. Along the way we develop several data structural techniques of independent interest, including a novel data structure that compactly encodes all LZ77 compressed suffixes of a string in linear space and a general decomposition of tries that reduces the search time from logarithmic in the size of the trie to logarithmic in the length of the pattern
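For readers unfamiliar with the compressed pattern format mentioned above, here is a minimal sketch of one common LZ77-style greedy factorization and its decoder. LZ77 has several variants; this sketch assumes a self-referential one where each phrase is either a single new character or a `(source, length)` copy of the longest match starting at an earlier position. The quadratic-time search and the names `lz77_factorize` and `lz77_decode` are assumptions for illustration and are unrelated to the index described in the abstract.

```python
def lz77_factorize(text: str) -> list:
    """Greedy LZ77-style parse into literals and (source, length) copies."""
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        for src in range(i):  # try every earlier starting position
            length = 0
            # The copy may overlap the phrase being produced (self-reference).
            while i + length < n and text[src + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, src
        if best_len == 0:
            phrases.append(text[i])  # first occurrence of this character
            i += 1
        else:
            phrases.append((best_src, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases: list) -> str:
    """Invert lz77_factorize by replaying literals and copies left to right."""
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.append(phrase)
        else:
            src, length = phrase
            for k in range(length):
                out.append(out[src + k])  # character-wise, so overlaps work
    return "".join(out)


if __name__ == "__main__":
    parse = lz77_factorize("abababa")
    print(parse, lz77_decode(parse))
```

A highly repetitive pattern such as `"abababa"` parses into only three phrases, which is the kind of compressed size that the query time in the abstract is measured against.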
Deterministic sub-linear space LCE data structures with efficient construction
Given a string $S$ of $n$ symbols, a longest common extension query $LCE(i,j)$
asks for the length of the longest common prefix of the
$i$th and $j$th suffixes of $S$. LCE queries have several important
applications in string processing, perhaps most notably to suffix sorting.
Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015:
65-76) described several data structures for answering LCE queries that offer
a space-time trade-off between data structure size and query time. In
particular, for a parameter $1 \leq \tau \leq n$, their best deterministic
solution is a data structure of size $O(n/\tau)$ which allows LCE queries to be
answered in $O(\tau)$ time. However, the construction time for all
deterministic versions of their data structure is quadratic in $n$. In this
paper, we propose a deterministic solution that achieves a similar space-time
trade-off of $O(\tau \min\{\log\tau, \log\frac{n}{\tau}\})$ query time using
$O(n/\tau)$ space, but significantly improve the construction time to
$O(n\tau)$.
Comment: updated titl