17 research outputs found

    Algorithms and data structures for grammar-compressed strings

    From LZ77 to the run-length encoded Burrows-Wheeler transform, and back

    The Lempel-Ziv factorization (LZ77) and the Run-Length encoded Burrows-Wheeler Transform (RLBWT) are two important tools in text compression and indexing; their sizes z and r are closely related to the amount of self-repetitiveness in the text. In this paper we consider the problem of converting each representation into the other within working space proportional to the input and the output. Let n be the text length. We show that the RLBWT can be converted to LZ77 in O(n log r) time and O(r) words of working space. Conversely, we provide an algorithm to convert LZ77 to the RLBWT in O(n(log r + log z)) time and O(r + z) words of working space. Note that r and z can be constant if the text is highly repetitive, and our algorithms can operate with (up to) exponentially less space than naive solutions based on full decompression.
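To make the two sizes z and r concrete, here is a minimal brute-force sketch in Python. It is not the space-efficient conversion described in the abstract: it works on the fully decompressed text, which is exactly the naive approach the paper avoids. The function names, the `$` sentinel, and the particular LZ77 variant (greedy, overlapping sources allowed, a single fresh character when no earlier match exists) are assumptions made for illustration.

```python
def lz77_phrases(text):
    """Greedy LZ77-style parse (one common variant, assumed here):
    each phrase is the longest prefix of the remaining text that also
    starts at an earlier position (overlap allowed), or one new character."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best = 0
        for j in range(i):  # try every earlier starting position
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            best = max(best, k)
        length = max(best, 1)  # no earlier match: emit a single character
        phrases.append(text[i:i + length])
        i += length
    return phrases


def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic space).
    Assumes the sentinel does not occur in text and sorts before every symbol."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def run_length_encode(s):
    """Collapse maximal runs of equal symbols into (symbol, length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, k) for c, k in runs]


text = "ab" * 8                      # a highly repetitive text of length n = 16
z = len(lz77_phrases(text))          # number of LZ77 phrases
r = len(run_length_encode(bwt(text)))  # number of runs in the BWT
print(z, r)
```

On this periodic input both z and r stay constant however many times the period is repeated, which illustrates why representations of size O(r + z) can be exponentially smaller than the text.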

    Space-efficient conversions from SLPs

    We give algorithms that, given a straight-line program (SLP) with $g$ rules that generates (only) a text $T[1..n]$, build within $O(g)$ space the Lempel-Ziv (LZ) parse of $T$ (of $z$ phrases) in time $O(n\log^2 n)$ or in time $O(gz\log^2(n/z))$. We also show how to build a locally consistent grammar (LCG) of optimal size $g_{lc} = O(\delta\log\frac{n}{\delta})$ from the SLP within $O(g+g_{lc})$ space and in $O(n\log g)$ time, where $\delta$ is the substring complexity measure of $T$. Finally, we show how to build the LZ parse of $T$ from such an LCG within $O(g_{lc})$ space and in time $O(z\log^2 n \log^2(n/z))$. All our results hold with high probability.
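The abstract uses the substring complexity measure without defining it. Assuming the usual definition, delta is the maximum over k of d_k / k, where d_k is the number of distinct length-k substrings of T. The Python sketch below computes it by brute force on an uncompressed string; the function name is hypothetical and the approach is only meant to illustrate the measure, not the paper's algorithms.

```python
from fractions import Fraction


def substring_complexity(text):
    """Brute-force delta = max over k of (distinct length-k substrings) / k.
    Assumes the standard definition; cubic time, for illustration only."""
    n = len(text)
    best = Fraction(0)
    for k in range(1, n + 1):
        distinct = {text[i:i + k] for i in range(n - k + 1)}
        best = max(best, Fraction(len(distinct), k))
    return best


print(substring_complexity("ab" * 8))       # periodic text: delta stays small
print(substring_complexity("abcdefgh"))     # all symbols distinct: delta equals n
```

For the periodic text every length k below n has exactly two distinct substrings, so the maximum is reached at k = 1.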

    Grammar Boosting: A New Technique for Proving Lower Bounds for Computation over Compressed Data

    Grammar compression is a general compression framework in which a string $T$ of length $N$ is represented as a context-free grammar of size $n$ whose language contains only $T$. In this paper, we focus on studying the limitations of algorithms and data structures operating on strings in grammar-compressed form. Previous work focused on proving lower bounds for grammars constructed using algorithms that achieve the approximation ratio $\rho=\mathcal{O}(\text{polylog } N)$. Unfortunately, for the majority of grammar compressors, $\rho$ is either unknown or satisfies $\rho=\omega(\text{polylog } N)$. In their seminal paper, Charikar et al. [IEEE Trans. Inf. Theory 2005] studied seven popular grammar compression algorithms: RePair, Greedy, LongestMatch, Sequential, Bisection, LZ78, and $\alpha$-Balanced. Only one of them ($\alpha$-Balanced) is known to achieve $\rho=\mathcal{O}(\text{polylog } N)$. We develop the first technique for proving lower bounds for data structures and algorithms on grammars that is fully general and does not depend on the approximation ratio $\rho$ of the used grammar compressor. Using this technique, we first prove that $\Omega(\log N/\log \log N)$ time is required for random access on RePair, Greedy, LongestMatch, Sequential, and Bisection, while $\Omega(\log\log N)$ time is required for random access to LZ78. All these lower bounds hold within space $\mathcal{O}(n\text{ polylog } N)$ and match the existing upper bounds. We also generalize this technique to prove several conditional lower bounds for compressed computation. For example, we prove that unless the Combinatorial $k$-Clique Conjecture fails, there is no combinatorial algorithm for CFG parsing on Bisection (for which it holds $\rho=\tilde{\Theta}(N^{1/2})$) that runs in $\mathcal{O}(n^c\cdot N^{3-\epsilon})$ time for all constants $c>0$ and $\epsilon>0$. Previously, this was known only for $c<2\epsilon$.

    Compression by Contracting Straight-Line Programs

    In grammar-based compression a string is represented by a context-free grammar, also called a straight-line program (SLP), that generates only that string. We refine a recent balancing result stating that one can transform an SLP of size $g$ in linear time into an equivalent SLP of size $O(g)$ so that the height of the unique derivation tree is $O(\log N)$ where $N$ is the length of the represented string (FOCS 2019). We introduce a new class of balanced SLPs, called contracting SLPs, where for every rule $A \to \beta_1 \dots \beta_k$ the string length of every variable $\beta_i$ on the right-hand side is smaller by a constant factor than the string length of $A$. In particular, the derivation tree of a contracting SLP has the property that every subtree has logarithmic height in its leaf size. We show that a given SLP of size $g$ can be transformed in linear time into an equivalent contracting SLP of size $O(g)$ with rules of constant length. We present an application to the navigation problem in compressed unranked trees, represented by forest straight-line programs (FSLPs). We extend a linear space data structure by Reh and Sieber (2020) by the operation of moving to the $i$-th child in time $O(\log d)$ where $d$ is the degree of the current node. Contracting SLPs are also applied to the finger search problem over SLP-compressed strings where one wants to access positions near to a pre-specified finger position, ideally in $O(\log d)$ time where $d$ is the distance between the accessed position and the finger. We give a linear space solution where one can access symbols or move the finger in time $O(\log d + \log^{(t)} N)$ for any constant $t$ where $\log^{(t)} N$ is the $t$-fold logarithm of $N$. This improves a previous solution by Bille, Christiansen, Cording, and Gørtz (2018) with access/move time $O(\log d + \log \log N)$.
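As a rough illustration of why derivation-tree height matters, the Python sketch below stores an SLP as a dictionary of rules and answers a random-access query by walking down from the start symbol, using precomputed expansion lengths. The cost of one query is proportional to the depth of the path followed, which is what balancing and contracting SLPs bound. The dictionary encoding, the function names, and the small example grammar are assumptions for illustration; this is not the finger-search structure from the abstract.

```python
def expansion_lengths(rules, start):
    """Length of the string generated by each nonterminal reachable from start.
    Any symbol that is not a key of `rules` is treated as a terminal of length 1."""
    lengths = {}

    def length(sym):
        if sym not in rules:
            return 1
        if sym not in lengths:
            lengths[sym] = sum(length(s) for s in rules[sym])
        return lengths[sym]

    length(start)
    return lengths


def access(rules, lengths, start, i):
    """Return the i-th symbol (0-based) of the string generated by `start`
    by descending the derivation tree without expanding the grammar."""
    total = lengths.get(start, 1)
    if not 0 <= i < total:
        raise IndexError("position out of range")
    sym = start
    while sym in rules:
        for child in rules[sym]:
            size = lengths.get(child, 1)
            if i < size:        # the target lies inside this child
                sym = child
                break
            i -= size           # skip this child's whole expansion
    return sym


# Hypothetical example grammar: X3 -> a b, X4 -> X3 a, X5 -> X4 X3
rules = {"X3": ("a", "b"), "X4": ("X3", "a"), "X5": ("X4", "X3")}
lengths = expansion_lengths(rules, "X5")
print("".join(access(rules, lengths, "X5", i) for i in range(lengths["X5"])))
```

In a contracting SLP each step of this descent shrinks the expansion length by a constant factor, so the loop would run for a logarithmic number of iterations.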