193 research outputs found

    Faster Compact On-Line Lempel-Ziv Factorization

    Get PDF
    We present a new on-line algorithm for computing the Lempel-Ziv factorization of a string that runs in O(NlogN)O(N\log N) time and uses only O(Nlogσ)O(N\log\sigma) bits of working space, where NN is the length of the string and σ\sigma is the size of the alphabet. This is a notable improvement compared to the performance of previous on-line algorithms using the same order of working space but running in either O(Nlog3N)O(N\log^3N) time (Okanohara & Sadakane 2009) or O(Nlog2N)O(N\log^2N) time (Starikovskaya 2012). The key to our new algorithm is in the utilization of an elegant but less popular index structure called Directed Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an opportunistic variant of our algorithm, which, given the run length encoding of size mm of a string of length NN, computes the Lempel-Ziv factorization on-line, in O(mmin{(loglogm)(loglogN)logloglogN,logmloglogm})O\left(m \cdot \min \left\{\frac{(\log\log m)(\log \log N)}{\log\log\log N}, \sqrt{\frac{\log m}{\log \log m}} \right\}\right) time and O(mlogN)O(m\log N) bits of space, which is faster and more space efficient when the string is run-length compressible

    Lempel-Ziv Factorization May Be Harder Than Computing All Runs

    Get PDF
    The complexity of computing the Lempel-Ziv factorization and the set of all runs (= maximal repetitions) is studied in the decision tree model of computation over ordered alphabet. It is known that both these problems can be solved by RAM algorithms in O(nlogσ)O(n\log\sigma) time, where nn is the length of the input string and σ\sigma is the number of distinct letters in it. We prove an Ω(nlogσ)\Omega(n\log\sigma) lower bound on the number of comparisons required to construct the Lempel-Ziv factorization and thereby conclude that a popular technique of computation of runs using the Lempel-Ziv factorization cannot achieve an o(nlogσ)o(n\log\sigma) time bound. In contrast with this, we exhibit an O(n)O(n) decision tree algorithm finding all runs in a string. Therefore, in the decision tree model the runs problem is easier than the Lempel-Ziv factorization. Thus we support the conjecture that there is a linear RAM algorithm finding all runs.Comment: 12 pages, 3 figures, submitte

    A Grammar Compression Algorithm based on Induced Suffix Sorting

    Full text link
    We introduce GCIS, a grammar compression algorithm based on the induced suffix sorting algorithm SAIS, introduced by Nong et al. in 2009. Our solution builds on the factorization performed by SAIS during suffix sorting. We construct a context-free grammar on the input string which can be further reduced into a shorter string by substituting each substring by its correspondent factor. The resulting grammar is encoded by exploring some redundancies, such as common prefixes between suffix rules, which are sorted according to SAIS framework. When compared to well-known compression tools such as Re-Pair and 7-zip, our algorithm is competitive and very effective at handling repetitive string regarding compression ratio, compression and decompression running time

    Fast online Lempel-Ziv factorization in compressed space

    Get PDF
    Let T be a text of length n on an alphabet \u3a3 of size \u3c3, and let H0 be the zero-order empirical entropy of T. We show that the LZ77 factorization of T can be computed in nH0+o(nlog\u3c3)+O(\u3c3logn) bits of working space with an online algorithm running in O(nlogn) time. Previous space-efficient online solutions either work in compact space and O(nlogn) time, or in succinct space and O(nlog3n) time

    Universal Compressed Text Indexing

    Get PDF
    The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let γ\gamma be the size of a string attractor for a text of length nn. Our index takes O(γlog(n/γ))O(\gamma\log(n/\gamma)) words of space and supports locating the occocc occurrences of any pattern of length mm in O(mlogn+occlogϵn)O(m\log n + occ\log^{\epsilon}n) time, for any constant ϵ>0\epsilon>0. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.Comment: Fixed with reviewer's comment
    corecore