14 research outputs found

    Universal Compressed Text Indexing

    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let $\gamma$ be the size of a string attractor for a text of length $n$. Our index takes $O(\gamma\log(n/\gamma))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m\log n + occ\log^{\epsilon}n)$ time, for any constant $\epsilon>0$. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries. Comment: Fixed with reviewer's comment
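
The attractor property mentioned in the abstract above can be illustrated with a brute-force check. This is only a sketch under assumptions: it uses the standard definition (every distinct substring has at least one occurrence that covers a marked position), 0-based positions, and a hypothetical function name `is_string_attractor`. It is not the index from the paper and is far too slow for real texts.

```python
def is_string_attractor(text, positions):
    """Brute-force check of the string attractor property (0-based positions).

    Returns True if every distinct substring of `text` has at least one
    occurrence that contains a position from `positions`.
    """
    n = len(text)
    marked = set(positions)
    for length in range(1, n + 1):
        # Substrings of this length with an occurrence covering a marked position.
        covered = set()
        for start in range(n - length + 1):
            if any(p in marked for p in range(start, start + length)):
                covered.add(text[start:start + length])
        # Every substring of this length must appear in the covered set.
        for start in range(n - length + 1):
            if text[start:start + length] not in covered:
                return False
    return True
```

For example, position 0 alone is an attractor for "aaaa", while "abab" needs at least one marked "a" and one marked "b".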

    Decompressing Lempel-Ziv Compressed Text

    We consider the problem of decompressing the Lempel--Ziv 77 representation of a string $S$ of length $n$ using a working space as close as possible to the size $z$ of the input. The folklore solution for the problem runs in $O(n)$ time but requires random access to the whole decompressed text. Another folklore solution is to convert LZ77 into a grammar of size $O(z\log(n/z))$ and then stream $S$ in linear time. In this paper, we show that $O(n)$ time and $O(z)$ working space can be achieved for constant-size alphabets. On general alphabets of size $\sigma$, we describe (i) a trade-off achieving $O(n\log^{\delta} \sigma)$ time and $O(z\log^{1-\delta}\sigma)$ space for any $0\leq \delta\leq 1$, and (ii) a solution achieving $O(n)$ time and $O(z\log\log (n/z))$ space. The latter solution, in particular, dominates both folklore algorithms for the problem. Our solutions can, more generally, extract any specified subsequence of $S$ with little overheads on top of the linear running time and working space. As an immediate corollary, we show that our techniques yield improved results for pattern matching problems on LZ77-compressed text.
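
The first folklore solution named in the abstract above (linear time, but with random access to the whole decompressed text) can be sketched as follows. The phrase encoding is an assumption made for illustration: each phrase is either a literal character or a hypothetical `(source, length)` pair that copies from the text decoded so far. The function name `lz77_decode` is likewise hypothetical, and this is the baseline the paper improves on, not its small-space algorithm.

```python
def lz77_decode(phrases):
    """Decode an LZ77-style parse while keeping the whole output in memory.

    Each phrase is either a single literal character or a (source, length)
    pair meaning "copy `length` characters starting at 0-based output
    position `source`". Copies may overlap the text being produced.
    """
    out = []
    for phrase in phrases:
        if isinstance(phrase, str):
            out.append(phrase)  # literal phrase
        else:
            source, length = phrase
            # Copy one character at a time so self-overlapping copies work.
            for i in range(length):
                out.append(out[source + i])
    return "".join(out)
```

With this encoding, `lz77_decode(["a", "b", (0, 4), "c"])` yields "abababc". The output buffer grows to the full text length, which is exactly the working space the abstract aims to avoid.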

    Approximating Edit Distance in the Fully Dynamic Model

    The edit distance is a fundamental measure of sequence similarity, defined as the minimum number of character insertions, deletions, and substitutions needed to transform one string into the other. Given two strings of length at most $n$, simple dynamic programming computes their edit distance exactly in $O(n^2)$ time, which is also the best possible (up to subpolynomial factors) assuming the Strong Exponential Time Hypothesis (SETH). The last few decades have seen tremendous progress in edit distance approximation, where the runtime has been brought down to subquadratic, near-linear, and even sublinear at the cost of approximation. In this paper, we study the dynamic edit distance problem, where the strings change dynamically as the characters are substituted, inserted, or deleted over time. Each change may happen at any location of either of the two strings. The goal is to maintain the (exact or approximate) edit distance of such dynamic strings while minimizing the update time. The exact edit distance can be maintained in $\tilde{O}(n)$ time per update (Charalampopoulos, Kociumaka, Mozes; 2020), which is again tight assuming SETH. Unfortunately, even with the unprecedented progress in edit distance approximation in the static setting, strikingly little is known regarding dynamic edit distance approximation. Utilizing the off-the-shelf tools, it is possible to achieve an $O(n^{c})$-approximation in $n^{0.5-c+o(1)}$ update time for any constant $c\in [0,\frac16]$. Improving upon this trade-off remains open. The contribution of this work is a dynamic $n^{o(1)}$-approximation algorithm with amortized expected update time of $n^{o(1)}$. In other words, we bring the approximation-ratio and update-time product down to $n^{o(1)}$. Our solution utilizes an elegant framework of precision sampling tree for edit distance approximation (Andoni, Krauthgamer, Onak; 2010). Comment: Accepted to FOCS 202
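
The "simple dynamic programming" baseline from the abstract above can be sketched as the textbook quadratic-time recurrence over prefixes. This is the static exact algorithm only, not the dynamic approximation scheme of the paper, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a, b):
    """Exact edit distance via the classic quadratic-time dynamic program.

    Keeps only two rows of the table, so it uses O(len(b)) extra space.
    """
    # prev[j] = distance between the current prefix of `a` and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]  # transforming a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = cur
    return prev[-1]
```

For instance, `edit_distance("kitten", "sitting")` evaluates to 3.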

    Breaking the $O(n)$-Barrier in the Construction of Compressed Suffix Arrays

    The suffix array, describing the lexicographic order of suffixes of a given text, is the central data structure in string algorithms. The suffix array of a length-$n$ text uses $\Theta(n \log n)$ bits, which is prohibitive in many applications. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. For a length-$n$ text over an alphabet of size $\sigma$, these data structures use only $O(n \log \sigma)$ bits. Immediately after their discovery, they almost completely replaced plain suffix arrays in practical applications, and a race started to develop efficient construction procedures. Yet, after more than 20 years, even for $\sigma=2$, the fastest algorithm remains stuck at $O(n)$ time [Hon et al., FOCS 2003], which is slower by a $\Theta(\log n)$ factor than the lower bound of $\Omega(n / \log n)$ (following simply from the necessity to read the entire input). We break this long-standing barrier with a new data structure that takes $O(n \log \sigma)$ bits, answers suffix array queries in $O(\log^{\epsilon} n)$ time, and can be constructed in $O(n\log \sigma / \sqrt{\log n})$ time using $O(n\log \sigma)$ bits of space. Our result is based on several new insights into the recently developed notion of string synchronizing sets [STOC 2019]. In particular, compared to their previous applications, we eliminate orthogonal range queries, replacing them with new queries that we dub prefix rank and prefix selection queries. As a further demonstration of our techniques, we present a new pattern-matching index that simultaneously minimizes the construction time and the query time among all known compact indexes (i.e., those using $O(n \log \sigma)$ bits). Comment: 41 page
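
To make the object in the abstract above concrete, here is a deliberately naive sketch of a plain (uncompressed) suffix array: the starting positions of all suffixes, listed in lexicographic order. Sorting whole suffixes this way is far slower than any of the constructions the abstract discusses, and the function name `naive_suffix_array` is hypothetical; the sketch only shows what the data structure stores.

```python
def naive_suffix_array(text):
    """Return the 0-based starting positions of the suffixes of `text`,
    sorted by the lexicographic order of the suffixes themselves.

    Illustration only: comparing full suffixes makes this much slower than
    dedicated suffix array construction algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])
```

For "banana" this gives `[5, 3, 1, 0, 4, 2]`, corresponding to the suffixes "a", "ana", "anana", "banana", "na", "nana".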

    Small space and streaming pattern matching with k edits

    In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer $k$, a pattern $P$ of length $m$, and a text $T$ of length $n \ge m$, the task is to find substrings of $T$ that are within edit distance $k$ from $P$. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space and $\tilde{O}(k^8)$ amortised time per character of the text, providing answers correct with high probability. (Hereafter, $\tilde{O}(\cdot)$ hides a $\mathrm{poly}(\log n)$ factor.) This answers a decade-old question: since the discovery of a $\mathrm{poly}(k\log n)$-space streaming algorithm for pattern matching under Hamming distance by Porat and Porat [FOCS 2009], the existence of an analogous result for edit distance remained open. Up to this work, no $\mathrm{poly}(k\log n)$-space algorithm was known even in the simpler semi-streaming model, where $T$ comes as a stream but $P$ is available for read-only access. In this model, we give a deterministic algorithm that achieves slightly better complexity. In order to develop the fully streaming algorithm, we introduce a new edit distance sketch parametrised by integers $n\ge k$. For any string of length at most $n$, the sketch is of size $\tilde{O}(k^2)$ and it can be computed with an $\tilde{O}(k^2)$-space streaming algorithm. Given the sketches of two strings, in $\tilde{O}(k^3)$ time we can compute their edit distance or certify that it is larger than $k$. This result improves upon $\tilde{O}(k^8)$-size sketches of Belazzougui and Zhu [FOCS 2016] and very recent $\tilde{O}(k^3)$-size sketches of Jin, Nelson, and Wu [STACS 2021].
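
The problem statement in the abstract above can be illustrated with the classical column-by-column dynamic program for approximate matching (usually attributed to Sellers). This sketch reads the text one character at a time but stores the whole pattern and a column of length m + 1, so it is not the small-space streaming algorithm of the paper. The function name `approx_match_ends` and the convention of reporting end positions are assumptions made for illustration.

```python
def approx_match_ends(pattern, text, k):
    """Report every end position j (1-based, i.e. the match is text[s:j] for
    some s) such that a substring of `text` ending there is within edit
    distance k of `pattern`.

    Uses O(len(pattern)) space and O(len(pattern)) time per text character.
    """
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and a suffix of the
    # text read so far; before reading anything, only the empty suffix exists.
    col = list(range(m + 1))
    ends = []
    for j, ct in enumerate(text, start=1):
        new = [0]  # the empty pattern prefix matches the empty suffix for free
        for i in range(1, m + 1):
            new.append(min(
                col[i] + 1,                          # skip the text character
                new[i - 1] + 1,                      # skip the pattern character
                col[i - 1] + (pattern[i - 1] != ct), # substitute or match
            ))
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

As a usage example, `approx_match_ends("abc", "xxabcxx", 0)` returns `[5]`, and with `k = 1` it returns `[4, 5, 6]`.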