959 research outputs found

    Deterministic indexing for packed strings

    Get PDF
    Given a string S of length n, the classic string indexing problem is to preprocess S into a compact data structure that supports efficient subsequent pattern queries. In the deterministic variant the goal is to solve the string indexing problem without any randomization (at preprocessing time or query time). In the packed variant the strings are stored with several character in a single word, giving us the opportunity to read multiple characters simultaneously. Our main result is a new string index in the deterministic and packed setting. Given a packed string S of length n over an alphabet s, we show how to preprocess S in O(n) (deterministic) time and space O(n) such that given a packed pattern string of length m we can support queries in (deterministic) time O(m/a + log m + log log s), where a = w /log s is the number of characters packed in a word of size w = log n. Our query time is always at least as good as the previous best known bounds and whenever several characters are packed in a word, i.e., log s << w, the query times are faster

    String Synchronizing Sets: Sublinear-Time BWT Construction and Optimal LCE Data Structure

    Full text link
    Burrows-Wheeler transform (BWT) is an invertible text transformation that, given a text TT of length nn, permutes its symbols according to the lexicographic order of suffixes of TT. BWT is one of the most heavily studied algorithms in data compression with numerous applications in indexing, sequence analysis, and bioinformatics. Its construction is a bottleneck in many scenarios, and settling the complexity of this task is one of the most important unsolved problems in sequence analysis that has remained open for 25 years. Given a binary string of length nn, occupying O(n/logn)O(n/\log n) machine words, the BWT construction algorithm due to Hon et al. (SIAM J. Comput., 2009) runs in O(n)O(n) time and O(n/logn)O(n/\log n) space. Recent advancements (Belazzougui, STOC 2014, and Munro et al., SODA 2017) focus on removing the alphabet-size dependency in the time complexity, but they still require Ω(n)\Omega(n) time. In this paper, we propose the first algorithm that breaks the O(n)O(n)-time barrier for BWT construction. Given a binary string of length nn, our procedure builds the Burrows-Wheeler transform in O(n/logn)O(n/\sqrt{\log n}) time and O(n/logn)O(n/\log n) space. We complement this result with a conditional lower bound proving that any further progress in the time complexity of BWT construction would yield faster algorithms for the very well studied problem of counting inversions: it would improve the state-of-the-art O(mlogm)O(m\sqrt{\log m})-time solution by Chan and P\v{a}tra\c{s}cu (SODA 2010). Our algorithm is based on a novel concept of string synchronizing sets, which is of independent interest. As one of the applications, we show that this technique lets us design a data structure of the optimal size O(n/logn)O(n/\log n) that answers Longest Common Extension queries (LCE queries) in O(1)O(1) time and, furthermore, can be deterministically constructed in the optimal O(n/logn)O(n/\log n) time.Comment: Full version of a paper accepted to STOC 201

    Fully-Functional Suffix Trees and Optimal Text Searching in BWT-runs Bounded Space

    Get PDF
    Indexing highly repetitive texts - such as genomic databases, software repositories and versioned text collections - has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time, O(m + occ), within O(r log log w ({\sigma} + n/r)) space, for a text of length n over an alphabet of size {\sigma} on a RAM machine with words of w = {\Omega}(log n) bits. Within that space, our index can also count in optimal time, O(m). Multiplying the space by O(w/ log {\sigma}), we support count and locate in O(dm log({\sigma})/we) and O(dm log({\sigma})/we + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ` in almost-optimal time O(log(n/r) + ` log({\sigma})/w). Within that space, we similarly provide direct access to suffix array, inverse suffix array, and longest common prefix array cells, and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation.Comment: submitted version; optimal count and locate in smaller space: O(r log log_w(n/r + sigma)

    Optimal-Time Text Indexing in BWT-runs Bounded Space

    Full text link
    Indexing highly repetitive texts --- such as genomic databases, software repositories and versioned text collections --- has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is rr, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r)O(r) space and was able to efficiently count the number of occurrences of a pattern of length mm in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of rr. Since then, a number of other indexes with space bounded by other measures of repetitiveness --- the number of phrases in the Lempel-Ziv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors --- have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occocc occurrences efficiently within O(r)O(r) space (in loglogarithmic time each), and reaching optimal time O(m+occ)O(m+occ) within O(rlog(n/r))O(r\log(n/r)) space, on a RAM machine of w=Ω(logn)w=\Omega(\log n) bits. Within O(rlog(n/r))O(r\log (n/r)) space, our index can also count in optimal time O(m)O(m). Raising the space to O(rwlogσ(n/r))O(r w\log_\sigma(n/r)), we support count and locate in O(mlog(σ)/w)O(m\log(\sigma)/w) and O(mlog(σ)/w+occ)O(m\log(\sigma)/w+occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(rlog(n/r))O(r\log(n/r)) space that replaces the text and extracts any text substring of length \ell in almost-optimal time O(log(n/r)+log(σ)/w)O(\log(n/r)+\ell\log(\sigma)/w). (...continues...

    Optimal Substring-Equality Queries with Applications to Sparse Text Indexing

    Full text link
    We consider the problem of encoding a string of length nn from an integer alphabet of size σ\sigma so that access and substring equality queries (that is, determining the equality of any two substrings) can be answered efficiently. Any uniquely-decodable encoding supporting access must take nlogσ+Θ(log(nlogσ))n\log\sigma + \Theta(\log (n\log\sigma)) bits. We describe a new data structure matching this lower bound when σnO(1)\sigma\leq n^{O(1)} while supporting both queries in optimal O(1)O(1) time. Furthermore, we show that the string can be overwritten in-place with this structure. The redundancy of Θ(logn)\Theta(\log n) bits and the constant query time break exponentially a lower bound that is known to hold in the read-only model. Using our new string representation, we obtain the first in-place subquadratic (indeed, even sublinear in some cases) algorithms for several string-processing problems in the restore model: the input string is rewritable and must be restored before the computation terminates. In particular, we describe the first in-place subquadratic Monte Carlo solutions to the sparse suffix sorting, sparse LCP array construction, and suffix selection problems. With the sole exception of suffix selection, our algorithms are also the first running in sublinear time for small enough sets of input suffixes. Combining these solutions, we obtain the first sublinear-time Monte Carlo algorithm for building the sparse suffix tree in compact space. We also show how to derandomize our algorithms using small space. This leads to the first Las Vegas in-place algorithm computing the full LCP array in O(nlogn)O(n\log n) time and to the first Las Vegas in-place algorithms solving the sparse suffix sorting and sparse LCP array construction problems in O(n1.5logσ)O(n^{1.5}\sqrt{\log \sigma}) time. Running times of these Las Vegas algorithms hold in the worst case with high probability.Comment: Refactored according to TALG's reviews. New w.h.p. bounds and Las Vegas algorithm

    c-trie++: A Dynamic Trie Tailored for Fast Prefix Searches

    Full text link
    Given a dynamic set KK of kk strings of total length nn whose characters are drawn from an alphabet of size σ\sigma, a keyword dictionary is a data structure built on KK that provides locate, prefix search, and update operations on KK. Under the assumption that α=w/lgσ\alpha = w / \lg \sigma characters fit into a single machine word ww, we propose a keyword dictionary that represents KK in nlgσ+Θ(klgn)n \lg \sigma + \Theta(k \lg n) bits of space, supporting all operations in O(m/α+lgα)O(m / \alpha + \lg \alpha) expected time on an input string of length mm in the word RAM model. This data structure is underlined with an exhaustive practical evaluation, highlighting the practical usefulness of the proposed data structure, especially for prefix searches - one of the most elementary keyword dictionary operations

    Full-fledged Real-Time Indexing for Constant Size Alphabets

    Full text link
    In this paper we describe a data structure that supports pattern matching queries on a dynamically arriving text over an alphabet ofconstant size. Each new symbol can be prepended to TT in O(1) worst-case time. At any moment, we can report all occurrences of a pattern PP in the current text in O(P+k)O(|P|+k) time, where P|P| is the length of PP and kk is the number of occurrences. This resolves, under assumption of constant-size alphabet, a long-standing open problem of existence of a real-time indexing method for string matching (see \cite{AmirN08})
    corecore