229 research outputs found

    Faster Compact On-Line Lempel-Ziv Factorization

    Get PDF
    We present a new on-line algorithm for computing the Lempel-Ziv factorization of a string that runs in O(NlogN)O(N\log N) time and uses only O(Nlogσ)O(N\log\sigma) bits of working space, where NN is the length of the string and σ\sigma is the size of the alphabet. This is a notable improvement compared to the performance of previous on-line algorithms using the same order of working space but running in either O(Nlog3N)O(N\log^3N) time (Okanohara & Sadakane 2009) or O(Nlog2N)O(N\log^2N) time (Starikovskaya 2012). The key to our new algorithm is in the utilization of an elegant but less popular index structure called Directed Acyclic Word Graphs, or DAWGs (Blumer et al. 1985). We also present an opportunistic variant of our algorithm, which, given the run length encoding of size mm of a string of length NN, computes the Lempel-Ziv factorization on-line, in O(mmin{(loglogm)(loglogN)logloglogN,logmloglogm})O\left(m \cdot \min \left\{\frac{(\log\log m)(\log \log N)}{\log\log\log N}, \sqrt{\frac{\log m}{\log \log m}} \right\}\right) time and O(mlogN)O(m\log N) bits of space, which is faster and more space efficient when the string is run-length compressible

    Computing LZ77 in Run-Compressed Space

    Get PDF
    In this paper, we show that the LZ77 factorization of a text T {\in\Sigma^n} can be computed in O(R log n) bits of working space and O(n log R) time, R being the number of runs in the Burrows-Wheeler transform of T reversed. For extremely repetitive inputs, the working space can be as low as O(log n) bits: exponentially smaller than the text itself. As a direct consequence of our result, we show that a class of repetition-aware self-indexes based on a combination of run-length encoded BWT and LZ77 can be built in asymptotically optimal O(R + z) words of working space, z being the size of the LZ77 parsing

    A Grammar Compression Algorithm based on Induced Suffix Sorting

    Full text link
    We introduce GCIS, a grammar compression algorithm based on the induced suffix sorting algorithm SAIS, introduced by Nong et al. in 2009. Our solution builds on the factorization performed by SAIS during suffix sorting. We construct a context-free grammar on the input string which can be further reduced into a shorter string by substituting each substring by its correspondent factor. The resulting grammar is encoded by exploring some redundancies, such as common prefixes between suffix rules, which are sorted according to SAIS framework. When compared to well-known compression tools such as Re-Pair and 7-zip, our algorithm is competitive and very effective at handling repetitive string regarding compression ratio, compression and decompression running time

    Practical Aspects of Implementing a Suffix Array-based Lempel-Ziv Data Compressor

    Get PDF
    Lempel-Ziv factorization of a string is a fundamental tool that is used by myriad data compressors. Despite its optimality regarding the number of produced factors, it is rarely used without modification, for reasons of its computational cost. In recent years, Lempel-Ziv factorization has been a busy research subject, and we have witnessed the state-of-the-art being completely changed. In this thesis, I explore the properties of the latest suffix array-based Lempel-Ziv factorization algorithms, while I experiment with turning them into an efficient general-purpose data compressor. The setting of this thesis is purely exploratory, guided by reliable and repeatable benchmarking. I explore all aspects of the suffix array-based Lempel-Ziv data compressor. I describe how the chosen factorization method affects the development of encoding and other components of a functional data compressor. I show how the chosen factorization technique, together with capabilities of modern hardware, allows determining the length of the longest common prefix of two strings over 80% faster compared to the baseline approach. I also present a novel approach to optimizing the encoding cost of the Lempel-Ziv factorization of a string, i.e., bit-optimality, using a dynamic programming approach to the Single-Source Shortest Path problem. I observed that, in its current state, the process of suffix array construction is a major computational bottleneck in suffix array-based Lempel-Ziv factorization. Additionally, using a suffix array to produce a Lempel-Ziv factorization leads to optimality regarding the number of factors, which does not necessarily correspond to bit-optimality. Finally, a comparison with common third-party data compressors revealed that relying exclusively on Lempel-Ziv factorization prevents reaching the highest compression efficiency. For these reasons, I conclude that current suffix array-based Lempel-Ziv factorization is unsuitable for general-purpose data compression
    corecore