15 research outputs found

    Computing Lempel-Ziv Factorization Online

    Full text link
    We present an algorithm which computes the Lempel-Ziv factorization of a word WW of length nn on an alphabet Σ\Sigma of size σ\sigma online in the following sense: it reads WW starting from the left, and, after reading each r=O(logσn)r = O(\log_{\sigma} n) characters of WW, updates the Lempel-Ziv factorization. The algorithm requires O(nlogσ)O(n \log \sigma) bits of space and O(n \log^2 n) time. The basis of the algorithm is a sparse suffix tree combined with wavelet trees

    Computing LZ77 in Run-Compressed Space

    Get PDF
    In this paper, we show that the LZ77 factorization of a text T {\in\Sigma^n} can be computed in O(R log n) bits of working space and O(n log R) time, R being the number of runs in the Burrows-Wheeler transform of T reversed. For extremely repetitive inputs, the working space can be as low as O(log n) bits: exponentially smaller than the text itself. As a direct consequence of our result, we show that a class of repetition-aware self-indexes based on a combination of run-length encoded BWT and LZ77 can be built in asymptotically optimal O(R + z) words of working space, z being the size of the LZ77 parsing

    Lightweight Lempel-Ziv Parsing

    Full text link
    We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.Comment: 12 page

    Lempel-Ziv Factorization May Be Harder Than Computing All Runs

    Get PDF
    The complexity of computing the Lempel-Ziv factorization and the set of all runs (= maximal repetitions) is studied in the decision tree model of computation over ordered alphabet. It is known that both these problems can be solved by RAM algorithms in O(nlogσ)O(n\log\sigma) time, where nn is the length of the input string and σ\sigma is the number of distinct letters in it. We prove an Ω(nlogσ)\Omega(n\log\sigma) lower bound on the number of comparisons required to construct the Lempel-Ziv factorization and thereby conclude that a popular technique of computation of runs using the Lempel-Ziv factorization cannot achieve an o(nlogσ)o(n\log\sigma) time bound. In contrast with this, we exhibit an O(n)O(n) decision tree algorithm finding all runs in a string. Therefore, in the decision tree model the runs problem is easier than the Lempel-Ziv factorization. Thus we support the conjecture that there is a linear RAM algorithm finding all runs.Comment: 12 pages, 3 figures, submitte

    Fast online Lempel-Ziv factorization in compressed space

    Get PDF
    Let T be a text of length n on an alphabet \u3a3 of size \u3c3, and let H0 be the zero-order empirical entropy of T. We show that the LZ77 factorization of T can be computed in nH0+o(nlog\u3c3)+O(\u3c3logn) bits of working space with an online algorithm running in O(nlogn) time. Previous space-efficient online solutions either work in compact space and O(nlogn) time, or in succinct space and O(nlog3n) time

    Faster Compact On-Line Lempel-Ziv Factorization

    Get PDF
    We present a new on-line algorithm for computing the Lempel-Ziv factorization of a string that runs in O(NlogN)O(N\log N) time and uses only O(Nlogσ)O(N\log\sigma) bits of working space, where NN is the length of the string and σ\sigma is the size of the alphabet. This is a notable improvement compared to the performance of previous on-line algorithms using the same order of working space but running in either O(Nlog3N)O(N\log^3N) time (Okanohara & Sadakane 2009) or O(Nlog2N)O(N\log^2N) time (Starikovskaya 2012). The key to our new algorithm is in the utilization of an elegant but less popular index structure called Directed Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an opportunistic variant of our algorithm, which, given the run length encoding of size mm of a string of length NN, computes the Lempel-Ziv factorization on-line, in O(mmin{(loglogm)(loglogN)logloglogN,logmloglogm})O\left(m \cdot \min \left\{\frac{(\log\log m)(\log \log N)}{\log\log\log N}, \sqrt{\frac{\log m}{\log \log m}} \right\}\right) time and O(mlogN)O(m\log N) bits of space, which is faster and more space efficient when the string is run-length compressible

    Lempel-Ziv Compression in a Sliding Window

    Get PDF
    We present new algorithms for the sliding window Lempel-Ziv (LZ77) problem and the approximate rightmost LZ77 parsing problem. Our main result is a new and surprisingly simple algorithm that computes the sliding window LZ77 parse in O(w) space and either O(n) expected time or O(n log log w+z log log s) deterministic time. Here, w is the window size, n is the size of the input string, z is the number of phrases in the parse, and s is the size of the alphabet. This matches the space and time bounds of previous results while removing constant size restrictions on the alphabet size. To achieve our result, we combine a simple modification and augmentation of the suffix tree with periodicity properties of sliding windows. We also apply this new technique to obtain an algorithm for the approximate rightmost LZ77 problem that uses O(n(log z + log log n)) time and O(n) space and produces a (1+e)-approximation of the rightmost parsing (any constant e>0). While this does not improve the best known time-space trade-offs for exact rightmost parsing, our algorithm is significantly simpler and exposes a direct connection between sliding window parsing and the approximate rightmost matching problem

    Lempel-Ziv Compression in a Sliding Window

    Get PDF

    Maintaining the size of LZ77 on semi-dynamic strings

    Get PDF
    We consider the problem of maintaining the size of the LZ77 factorization of a string S of length at most n under the following operations: (a) appending a given letter to S and (b) deleting the first letter of S. Our main result is an algorithm for this problem with amortized update time Õ(√n). As a corollary, we obtain an Õ(n√n)-time algorithm for computing the most LZ77-compressible rotation of a length-n string - a naive approach for this problem would compute the LZ77 factorization of each possible rotation and would thus take quadratic time in the worst case. We also show an Ω(√n) lower bound for the additive sensitivity of LZ77 with respect to the rotation operation. Our algorithm employs dynamic trees to maintain the longest-previous-factor array information and depends on periodicity-based arguments that bound the number of the required updates and enable their efficient computation
    corecore