1,442 research outputs found

    Weighted ancestors in suffix trees

    Full text link
    The classical, ubiquitous, predecessor problem is to construct a data structure for a set of integers that supports fast predecessor queries. Its generalization to weighted trees, a.k.a. the weighted ancestor problem, has been extensively explored and successfully reduced to the predecessor problem. It is known that any solution for both problems with an input set from a polynomially bounded universe that preprocesses a weighted tree in O(n polylog(n)) space requires \Omega(loglogn) query time. Perhaps the most important and frequent application of the weighted ancestors problem is for suffix trees. It has been a long-standing open question whether the weighted ancestors problem has better bounds for suffix trees. We answer this question positively: we show that a suffix tree built for a text w[1..n] can be preprocessed using O(n) extra space, so that queries can be answered in O(1) time. Thus we improve the running times of several applications. Our improvement is based on a number of data structure tools and a periodicity-based insight into the combinatorial structure of a suffix tree.Comment: 27 pages, LNCS format. A condensed version will appear in ESA 201

    Weighted ancestors in suffix trees revisited

    Get PDF
    The weighted ancestor problem is a well-known generalization of the predecessor problem to trees. It is known to require O(log log n) time for queries provided O(n polylog n) space is available and weights are from [0..n], where n is the number of tree nodes. However, when applied to suffix trees, the problem, surprisingly, admits an O(n)-space solution with constant query time, as was shown by Gawrychowski, Lewenstein, and Nicholson (Proc. ESA 2014). This variant of the problem can be reformulated as follows: given the suffix tree of a string s, we need a data structure that can locate in the tree any substring s[p..q] of s in O(1) time (as if one descended from the root reading s[p..q] along the way). Unfortunately, the data structure of Gawrychowski et al. has no efficient construction algorithm, limiting its wider usage as an algorithmic tool. In this paper we resolve this issue, describing a data structure for weighted ancestors in suffix trees with constant query time and a linear construction algorithm. Our solution is based on a novel approach using so-called irreducible LCP values.Peer reviewe

    Size-constrained Weighted Ancestors with Applications

    Full text link
    The weighted ancestor problem on a rooted node-weighted tree TT is a generalization of the classic predecessor problem: construct a data structure for a set of integers that supports fast predecessor queries. Both problems are known to require Ω(loglogn)\Omega(\log\log n) time for queries provided O(n polylogn)\mathcal{O}(n\text{ poly} \log n) space is available, where nn is the input size. The weighted ancestor problem has attracted a lot of attention by the combinatorial pattern matching community due to its direct application to suffix trees. In this formulation of the problem, the nodes are weighted by string depth. This attention has culminated in a data structure for weighted ancestors in suffix trees with O(1)\mathcal{O}(1) query time and an O(n)\mathcal{O}(n)-time construction algorithm [Belazzougui et al., CPM 2021]. In this paper, we consider a different version of the weighted ancestor problem, where the nodes are weighted by any function weight\textsf{weight} that maps the nodes of TT to positive integers, such that weight(u)size(u)\textsf{weight}(u)\le \textsf{size}(u) for any node uu and weight(u1)weight(u2)\textsf{weight}(u_1)\le \textsf{weight}(u_2) if node u1u_1 is a descendant of node u2u_2, where size(u)\textsf{size}(u) is the number of nodes in the subtree rooted at uu. In the size-constrained weighted ancestor (SWAQ) problem, for any node uu of TT and any integer kk, we are asked to return the lowest ancestor ww of uu with weight at least kk. We show that for any rooted tree with nn nodes, we can locate node ww in O(1)\mathcal{O}(1) time after O(n)\mathcal{O}(n)-time preprocessing. In particular, this implies a data structure for the SWAQ problem in suffix trees with O(1)\mathcal{O}(1) query time and O(n)\mathcal{O}(n)-time preprocessing, when the nodes are weighted by weight\textsf{weight}. We also show several string-processing applications of this result

    Heaviest Induced Ancestors and Longest Common Substrings

    Full text link
    Suppose we have two trees on the same set of leaves, in which nodes are weighted such that children are heavier than their parents. We say a node from the first tree and a node from the second tree are induced together if they have a common leaf descendant. In this paper we describe data structures that efficiently support the following heaviest-induced-ancestor query: given a node from the first tree and a node from the second tree, find an induced pair of their ancestors with maximum combined weight. Our solutions are based on a geometric interpretation that enables us to find heaviest induced ancestors using range queries. We then show how to use these results to build an LZ-compressed index with which we can quickly find with high probability a longest substring common to the indexed string and a given pattern

    Random Access to Grammar Compressed Strings

    Full text link
    Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let SS be a string of length NN compressed into a context-free grammar S\mathcal{S} of size nn. We present two representations of S\mathcal{S} achieving O(logN)O(\log N) random access time, and either O(nαk(n))O(n\cdot \alpha_k(n)) construction time and space on the pointer machine model, or O(n)O(n) construction time and space on the RAM. Here, αk(n)\alpha_k(n) is the inverse of the kthk^{th} row of Ackermann's function. Our representations also efficiently support decompression of any substring in SS: we can decompress any substring of length mm in the same complexity as a single random access query and additional O(m)O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern PP with at most kk errors in time O(n(min{Pk,k4+P}+logN)+occ)O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ), where occocc is the number of occurrences of PP in SS. Finally, we generalize our results to navigation and other operations on grammar-compressed ordered trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.Comment: Preliminary version in SODA 201

    Speeding-up qq-gram mining on grammar-based compressed texts

    Full text link
    We present an efficient algorithm for calculating qq-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP T\mathcal{T} of size nn that represents string TT, the algorithm computes the occurrence frequencies of all qq-grams in TT, by reducing the problem to the weighted qq-gram frequencies problem on a trie-like structure of size m=Tdup(q,T)m = |T|-\mathit{dup}(q,\mathcal{T}), where dup(q,T)\mathit{dup}(q,\mathcal{T}) is a quantity that represents the amount of redundancy that the SLP captures with respect to qq-grams. The reduced problem can be solved in linear time. Since m=O(qn)m = O(qn), the running time of our algorithm is O(min{Tdup(q,T),qn})O(\min\{|T|-\mathit{dup}(q,\mathcal{T}),qn\}), improving our previous O(qn)O(qn) algorithm when q=Ω(T/n)q = \Omega(|T|/n)

    Cross-Document Pattern Matching

    Get PDF
    We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem

    Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic

    Full text link
    Countless variants of the Lempel-Ziv compression are widely used in many real-life applications. This paper is concerned with a natural modification of the classical pattern matching problem inspired by the popularity of such compression methods: given an uncompressed pattern s[1..m] and a Lempel-Ziv representation of a string t[1..N], does s occur in t? Farach and Thorup gave a randomized O(nlog^2(N/n)+m) time solution for this problem, where n is the size of the compressed representation of t. We improve their result by developing a faster and fully deterministic O(nlog(N/n)+m) time algorithm with the same space complexity. Note that for highly compressible texts, log(N/n) might be of order n, so for such inputs the improvement is very significant. A (tiny) fragment of our method can be used to give an asymptotically optimal solution for the substring hashing problem considered by Farach and Muthukrishnan.Comment: submitte

    Wavelet Trees Meet Suffix Trees

    Full text link
    We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length n over an alphabet of size σn\sigma\leq n, our method builds the wavelet tree in O(nlogσ/logn)O(n \log \sigma/ \sqrt{\log{n}}) time, improving upon the state-of-the-art algorithm by a factor of logn\sqrt{\log n}. As a consequence, given an array of n integers we can construct in O(nlogn)O(n \sqrt{\log n}) time a data structure consisting of O(n)O(n) machine words and capable of answering rank/select queries for the subranges of the array in O(logn/loglogn)O(\log n / \log \log n) time. This is a loglogn\log \log n-factor improvement in query time compared to Chan and P\u{a}tra\c{s}cu and a logn\sqrt{\log n}-factor improvement in construction time compared to Brodal et al. Next, we switch to stringological context and propose a novel notion of wavelet suffix trees. For a string w of length n, this data structure occupies O(n)O(n) words, takes O(nlogn)O(n \sqrt{\log n}) time to construct, and simultaneously captures the combinatorial structure of substrings of w while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in O(logx)O(\log |x|) time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings x and y of w count the number of suffixes of x that are lexicographically smaller than y, and for a substring x of w and an integer k, find the k-th lexicographically smallest suffix of x. We further show that wavelet suffix trees allow to compute a run-length-encoded Burrows-Wheeler transform of a substring x of w in O(slogx)O(s \log |x|) time, where s denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression.Comment: 33 pages, 5 figures; preliminary version published at SODA 201
    corecore