464 research outputs found

    Wavelet Trees Meet Suffix Trees

    Full text link
    We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length n over an alphabet of size σn\sigma\leq n, our method builds the wavelet tree in O(nlogσ/logn)O(n \log \sigma/ \sqrt{\log{n}}) time, improving upon the state-of-the-art algorithm by a factor of logn\sqrt{\log n}. As a consequence, given an array of n integers we can construct in O(nlogn)O(n \sqrt{\log n}) time a data structure consisting of O(n)O(n) machine words and capable of answering rank/select queries for the subranges of the array in O(logn/loglogn)O(\log n / \log \log n) time. This is a loglogn\log \log n-factor improvement in query time compared to Chan and P\u{a}tra\c{s}cu and a logn\sqrt{\log n}-factor improvement in construction time compared to Brodal et al. Next, we switch to stringological context and propose a novel notion of wavelet suffix trees. For a string w of length n, this data structure occupies O(n)O(n) words, takes O(nlogn)O(n \sqrt{\log n}) time to construct, and simultaneously captures the combinatorial structure of substrings of w while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in O(logx)O(\log |x|) time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings x and y of w count the number of suffixes of x that are lexicographically smaller than y, and for a substring x of w and an integer k, find the k-th lexicographically smallest suffix of x. We further show that wavelet suffix trees allow to compute a run-length-encoded Burrows-Wheeler transform of a substring x of w in O(slogx)O(s \log |x|) time, where s denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression.Comment: 33 pages, 5 figures; preliminary version published at SODA 201

    Minimal Suffix and Rotation of a Substring in Optimal Time

    Get PDF
    For a text given in advance, the substring minimal suffix queries ask to determine the lexicographically minimal non-empty suffix of a substring specified by the location of its occurrence in the text. We develop a data structure answering such queries optimally: in constant time after linear-time preprocessing. This improves upon the results of Babenko et al. (CPM 2014), whose trade-off solution is characterized by Θ(nlogn)\Theta(n\log n) product of these time complexities. Next, we extend our queries to support concatenations of O(1)O(1) substrings, for which the construction and query time is preserved. We apply these generalized queries to compute lexicographically minimal and maximal rotations of a given substring in constant time after linear-time preprocessing. Our data structures mainly rely on properties of Lyndon words and Lyndon factorizations. We combine them with further algorithmic and combinatorial tools, such as fusion trees and the notion of order isomorphism of strings

    Cross-Document Pattern Matching

    Get PDF
    We study a new variant of the string matching problem called cross-document string matching, which is the problem of indexing a collection of documents to support an efficient search for a pattern in a selected document, where the pattern itself is a substring of another document. Several variants of this problem are considered, and efficient linear-space solutions are proposed with query time bounds that either do not depend at all on the pattern size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose an improved solution to the weighted level ancestor problem

    Deterministic sub-linear space LCE data structures with efficient construction

    Get PDF
    Given a string SS of nn symbols, a longest common extension query LCE(i,j)\mathsf{LCE}(i,j) asks for the length of the longest common prefix of the iith and jjth suffixes of SS. LCE queries have several important applications in string processing, perhaps most notably to suffix sorting. Recently, Bille et al. (J. Discrete Algorithms 25:42-50, 2014, Proc. CPM 2015: 65-76) described several data structures for answering LCE queries that offers a space-time trade-off between data structure size and query time. In particular, for a parameter 1τn1 \leq \tau \leq n, their best deterministic solution is a data structure of size O(n/τ)O(n/\tau) which allows LCE queries to be answered in O(τ)O(\tau) time. However, the construction time for all deterministic versions of their data structure is quadratic in nn. In this paper, we propose a deterministic solution that achieves a similar space-time trade-off of O(τmin{logτ,lognτ})O(\tau\min\{\log\tau,\log\frac{n}{\tau}\}) query time using O(n/τ)O(n/\tau) space, but significantly improve the construction time to O(nτ)O(n\tau).Comment: updated titl

    Repetition Detection in a Dynamic String

    Get PDF
    A string UU for a non-empty string U is called a square. Squares have been well-studied both from a combinatorial and an algorithmic perspective. In this paper, we are the first to consider the problem of maintaining a representation of the squares in a dynamic string S of length at most n. We present an algorithm that updates this representation in n^o(1) time. This representation allows us to report a longest square-substring of S in O(1) time and all square-substrings of S in O(output) time. We achieve this by introducing a novel tool - maintaining prefix-suffix matches of two dynamic strings. We extend the above result to address the problem of maintaining a representation of all runs (maximal repetitions) of the string. Runs are known to capture the periodic structure of a string, and, as an application, we show that our representation of runs allows us to efficiently answer periodicity queries for substrings of a dynamic string. These queries have proven useful in static pattern matching problems and our techniques have the potential of offering solutions to these problems in a dynamic text setting

    Heaviest Induced Ancestors and Longest Common Substrings

    Full text link
    Suppose we have two trees on the same set of leaves, in which nodes are weighted such that children are heavier than their parents. We say a node from the first tree and a node from the second tree are induced together if they have a common leaf descendant. In this paper we describe data structures that efficiently support the following heaviest-induced-ancestor query: given a node from the first tree and a node from the second tree, find an induced pair of their ancestors with maximum combined weight. Our solutions are based on a geometric interpretation that enables us to find heaviest induced ancestors using range queries. We then show how to use these results to build an LZ-compressed index with which we can quickly find with high probability a longest substring common to the indexed string and a given pattern

    Faster algorithms for 1-mappability of a sequence

    Full text link
    In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k = 1. The fastest known algorithm for k = 1 requires time O(mn log n/ log log n) and space O(n). We present two algorithms that require worst-case time O(mn) and O(n log^2 n), respectively, and space O(n), thus greatly improving the state of the art. Moreover, we present an algorithm that requires average-case time and space O(n) for integer alphabets if m = {\Omega}(log n/ log {\sigma}), where {\sigma} is the alphabet size
    corecore