1,442 research outputs found
Weighted ancestors in suffix trees
The classical, ubiquitous, predecessor problem is to construct a data
structure for a set of integers that supports fast predecessor queries. Its
generalization to weighted trees, a.k.a. the weighted ancestor problem, has
been extensively explored and successfully reduced to the predecessor problem.
It is known that any solution for both problems with an input set from a
polynomially bounded universe that preprocesses a weighted tree in O(n
polylog(n)) space requires \Omega(loglogn) query time. Perhaps the most
important and frequent application of the weighted ancestors problem is for
suffix trees. It has been a long-standing open question whether the weighted
ancestors problem has better bounds for suffix trees. We answer this question
positively: we show that a suffix tree built for a text w[1..n] can be
preprocessed using O(n) extra space, so that queries can be answered in O(1)
time. Thus we improve the running times of several applications. Our
improvement is based on a number of data structure tools and a
periodicity-based insight into the combinatorial structure of a suffix tree.Comment: 27 pages, LNCS format. A condensed version will appear in ESA 201
Weighted ancestors in suffix trees revisited
The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require O(log log n) time for queries provided O(n polylog n) space is available and weights are from [0..n], where n is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an O(n)-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string s, we need a data structure that can locate in the tree any substring s[p..q] of s in O(1) time (as if one descended from the root reading s[p..q] along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.Peer reviewe
Size-constrained Weighted Ancestors with Applications
The weighted ancestor problem on a rooted node-weighted tree is a
generalization of the classic predecessor problem: construct a data structure
for a set of integers that supports fast predecessor queries. Both problems are
known to require time for queries provided
space is available, where is the input
size. The weighted ancestor problem has attracted a lot of attention by the
combinatorial pattern matching community due to its direct application to
suffix trees. In this formulation of the problem, the nodes are weighted by
string depth. This attention has culminated in a data structure for weighted
ancestors in suffix trees with query time and an
-time construction algorithm [Belazzougui et al., CPM 2021]. In
this paper, we consider a different version of the weighted ancestor problem,
where the nodes are weighted by any function that maps the
nodes of to positive integers, such that for any node and if node is a descendant of node , where
is the number of nodes in the subtree rooted at . In the
size-constrained weighted ancestor (SWAQ) problem, for any node of and
any integer , we are asked to return the lowest ancestor of with
weight at least . We show that for any rooted tree with nodes, we can
locate node in time after -time
preprocessing. In particular, this implies a data structure for the SWAQ
problem in suffix trees with query time and
-time preprocessing, when the nodes are weighted by
. We also show several string-processing applications of this
result
Heaviest Induced Ancestors and Longest Common Substrings
Suppose we have two trees on the same set of leaves, in which nodes are
weighted such that children are heavier than their parents. We say a node from
the first tree and a node from the second tree are induced together if they
have a common leaf descendant. In this paper we describe data structures that
efficiently support the following heaviest-induced-ancestor query: given a node
from the first tree and a node from the second tree, find an induced pair of
their ancestors with maximum combined weight. Our solutions are based on a
geometric interpretation that enables us to find heaviest induced ancestors
using range queries. We then show how to use these results to build an
LZ-compressed index with which we can quickly find with high probability a
longest substring common to the indexed string and a given pattern
Random Access to Grammar Compressed Strings
Grammar based compression, where one replaces a long string by a small
context-free grammar that generates the string, is a simple and powerful
paradigm that captures many popular compression schemes. In this paper, we
present a novel grammar representation that allows efficient random access to
any character or substring without decompressing the string.
Let be a string of length compressed into a context-free grammar
of size . We present two representations of
achieving random access time, and either
construction time and space on the pointer machine model, or
construction time and space on the RAM. Here, is the inverse of
the row of Ackermann's function. Our representations also efficiently
support decompression of any substring in : we can decompress any substring
of length in the same complexity as a single random access query and
additional time. Combining these results with fast algorithms for
uncompressed approximate string matching leads to several efficient algorithms
for approximate string matching on grammar-compressed strings without
decompression. For instance, we can find all approximate occurrences of a
pattern with at most errors in time , where is the number of occurrences of in . Finally, we
generalize our results to navigation and other operations on grammar-compressed
ordered trees.
All of the above bounds significantly improve the currently best known
results. To achieve these bounds, we introduce several new techniques and data
structures of independent interest, including a predecessor data structure, two
"biased" weighted ancestor data structures, and a compact representation of
heavy paths in grammars.Comment: Preliminary version in SODA 201
Speeding-up -gram mining on grammar-based compressed texts
We present an efficient algorithm for calculating -gram frequencies on
strings represented in compressed form, namely, as a straight line program
(SLP). Given an SLP of size that represents string , the
algorithm computes the occurrence frequencies of all -grams in , by
reducing the problem to the weighted -gram frequencies problem on a
trie-like structure of size , where
is a quantity that represents the amount of
redundancy that the SLP captures with respect to -grams. The reduced problem
can be solved in linear time. Since , the running time of our
algorithm is , improving our
previous algorithm when
Cross-Document Pattern Matching
We study a new variant of the string matching problem called cross-document
string matching, which is the problem of indexing a collection of documents to
support an efficient search for a pattern in a selected document, where the
pattern itself is a substring of another document. Several variants of this
problem are considered, and efficient linear-space solutions are proposed with
query time bounds that either do not depend at all on the pattern size or
depend on it in a very limited way (doubly logarithmic). As a side result, we
propose an improved solution to the weighted level ancestor problem
Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic
Countless variants of the Lempel-Ziv compression are widely used in many
real-life applications. This paper is concerned with a natural modification of
the classical pattern matching problem inspired by the popularity of such
compression methods: given an uncompressed pattern s[1..m] and a Lempel-Ziv
representation of a string t[1..N], does s occur in t? Farach and Thorup gave a
randomized O(nlog^2(N/n)+m) time solution for this problem, where n is the size
of the compressed representation of t. We improve their result by developing a
faster and fully deterministic O(nlog(N/n)+m) time algorithm with the same
space complexity. Note that for highly compressible texts, log(N/n) might be of
order n, so for such inputs the improvement is very significant. A (tiny)
fragment of our method can be used to give an asymptotically optimal solution
for the substring hashing problem considered by Farach and Muthukrishnan.Comment: submitte
Wavelet Trees Meet Suffix Trees
We present an improved wavelet tree construction algorithm and discuss its
applications to a number of rank/select problems for integer keys and strings.
Given a string of length n over an alphabet of size , our
method builds the wavelet tree in time,
improving upon the state-of-the-art algorithm by a factor of .
As a consequence, given an array of n integers we can construct in time a data structure consisting of machine words and
capable of answering rank/select queries for the subranges of the array in
time. This is a -factor improvement in
query time compared to Chan and P\u{a}tra\c{s}cu and a -factor
improvement in construction time compared to Brodal et al.
Next, we switch to stringological context and propose a novel notion of
wavelet suffix trees. For a string w of length n, this data structure occupies
words, takes time to construct, and simultaneously
captures the combinatorial structure of substrings of w while enabling
efficient top-down traversal and binary search. In particular, with a wavelet
suffix tree we are able to answer in time the following two
natural analogues of rank/select queries for suffixes of substrings: for
substrings x and y of w count the number of suffixes of x that are
lexicographically smaller than y, and for a substring x of w and an integer k,
find the k-th lexicographically smallest suffix of x.
We further show that wavelet suffix trees allow to compute a
run-length-encoded Burrows-Wheeler transform of a substring x of w in time, where s denotes the length of the resulting run-length encoding.
This answers a question by Cormode and Muthukrishnan, who considered an
analogous problem for Lempel-Ziv compression.Comment: 33 pages, 5 figures; preliminary version published at SODA 201
- …