    Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries

    We present the first thorough practical study of the Lempel-Ziv-78 and the Lempel-Ziv-Welch computation based on trie data structures. With a careful selection of trie representations we can beat well-tuned popular trie data structures like Judy, m-Bonsai or Cedar.
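
    To make the role of the trie concrete, the following is a minimal sketch of LZ78 factorization, assuming a plain Python dictionary as the trie representation. It is not one of the representations evaluated in the paper, and the function names `lz78_factorize` and `lz78_decode` are hypothetical.

```python
def lz78_factorize(text):
    """Return LZ78 factors as (parent_id, char) pairs.

    Phrase id 0 is the empty phrase (the trie root). The trie is stored
    as a dictionary mapping (node_id, char) -> child node_id, which is
    an illustrative choice only.
    """
    trie = {}
    next_id = 1
    factors = []
    node = 0  # current trie node, starting at the root
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child  # keep extending the current phrase
        else:
            trie[(node, ch)] = next_id  # new phrase = old phrase + ch
            next_id += 1
            factors.append((node, ch))
            node = 0
    if node != 0:
        # The text ended inside an already known phrase.
        factors.append((node, ""))
    return factors


def lz78_decode(factors):
    """Rebuild the text from (parent_id, char) pairs."""
    phrases = [""]
    out = []
    for parent, ch in factors:
        phrase = phrases[parent] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)
```

    Each input character costs one trie lookup or one trie insertion, which is why the choice of trie representation dominates the running time and memory use in practice.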

    Lempel-Ziv Parsing in External Memory

    For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed the size of internal memory. In this paper we describe the first algorithm for computing the LZ77 parsing in external memory. Our algorithm is fast in practice and will allow the next generation of text indexes to be realised for massive strings and string collections.

    Small-Space LCE Data Structure with Constant-Time Queries

    The longest common extension (LCE) problem is to preprocess a given string $w$ of length $n$ so that the length of the longest common prefix between suffixes of $w$ that start at any two given positions can be answered quickly. In this paper, we present a data structure of $O(z\tau^2 + \frac{n}{\tau})$ words of space which answers LCE queries in $O(1)$ time and can be built in $O(n \log \sigma)$ time, where $1 \leq \tau \leq \sqrt{n}$ is a parameter, $z$ is the size of the Lempel-Ziv 77 factorization of $w$, and $\sigma$ is the alphabet size. The proposed LCE data structure does not access the input string $w$ when answering queries, and thus $w$ can be deleted after preprocessing. On top of this main result, we obtain further results using (variants of) our LCE data structure, which include the following:
    - For highly repetitive strings where the $z\tau^2$ term is dominated by $\frac{n}{\tau}$, we obtain a constant-time and sub-linear space LCE query data structure.
    - Even when the input string is not well compressible via Lempel-Ziv 77 factorization, we can still obtain a constant-time and sub-linear space LCE data structure for suitable $\tau$ and for $\sigma \leq 2^{o(\log n)}$.
    - The time-space trade-off lower bounds for the LCE problem by Bille et al. [J. Discrete Algorithms, 25:42-50, 2014] and by Kosolobov [CoRR, abs/1611.02891, 2016] do not apply in some cases with our LCE data structure.
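
    For reference, the query itself can be stated as a naive character-by-character scan. This is only a sketch of the problem definition, assuming 0-based positions and a hypothetical function name `lce_naive`. It is not the constant-time data structure described above, and unlike that structure it needs access to the string at query time.

```python
def lce_naive(w, i, j):
    """Length of the longest common prefix of the suffixes w[i:] and w[j:].

    Runs in time proportional to the answer and reads w directly.
    """
    n = len(w)
    k = 0
    while i + k < n and j + k < n and w[i + k] == w[j + k]:
        k += 1
    return k
```

    For example, on `w = "abracadabra"` the suffixes starting at positions 0 and 7 share the prefix `abra`, so `lce_naive(w, 0, 7)` is 4.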

    Efficient LZ78 factorization of grammar compressed text

    We present an efficient algorithm for computing the LZ78 factorization of a text, where the text is represented as a straight line program (SLP), which is a context free grammar in the Chomsky normal form that generates a single string. Given an SLP of size $n$ representing a text $S$ of length $N$, our algorithm computes the LZ78 factorization of $S$ in $O(n\sqrt{N}+m\log N)$ time and $O(n\sqrt{N}+m)$ space, where $m$ is the number of resulting LZ78 factors. We also show how to improve the algorithm so that the $n\sqrt{N}$ term in the time and space complexities becomes either $nL$, where $L$ is the length of the longest LZ78 factor, or $(N - \alpha)$, where $\alpha \geq 0$ is a quantity which depends on the amount of redundancy that the SLP captures with respect to substrings of $S$ of a certain length. Since $m = O(N/\log_\sigma N)$, where $\sigma$ is the alphabet size, the latter is asymptotically at least as fast as a linear time algorithm which runs on the uncompressed string when $\sigma$ is constant, and can be more efficient when the text is compressible, i.e. when $m$ and $n$ are small. Comment: SPIRE 201
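
    As a small illustration of the input format, the sketch below models an SLP as a list of rules, assuming rule `i` is either a single character or a pair `(l, r)` of earlier rule indices, with the last rule as the start symbol. This encoding and the names `slp_lengths` and `slp_expand` are hypothetical and are not taken from the paper.

```python
def slp_lengths(rules):
    """Length of the string derived by each rule, without expanding it."""
    lengths = []
    for rule in rules:
        if isinstance(rule, str):
            lengths.append(1)  # terminal rule X -> c
        else:
            l, r = rule        # binary rule X -> X_l X_r
            lengths.append(lengths[l] + lengths[r])
    return lengths


def slp_expand(rules):
    """Fully expand the last rule (the start symbol) into its string."""
    expansions = []
    for rule in rules:
        if isinstance(rule, str):
            expansions.append(rule)
        else:
            l, r = rule
            expansions.append(expansions[l] + expansions[r])
    return expansions[-1]


# Five rules (n = 5) that derive a text of length N = 8.
example_rules = ["a", "b", (0, 1), (2, 2), (3, 3)]
```

    The example shows why working on the grammar can pay off: the number of rules can be much smaller than the length of the derived text, so an algorithm whose cost depends on $n$ rather than $N$ avoids the full expansion that `slp_expand` performs.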

    Lightweight Lempel-Ziv Parsing

    We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.

    Lempel-Ziv Compression in a Sliding Window

    We present new algorithms for the sliding window Lempel-Ziv (LZ77) problem and the approximate rightmost LZ77 parsing problem. Our main result is a new and surprisingly simple algorithm that computes the sliding window LZ77 parse in O(w) space and either O(n) expected time or O(n log log w + z log log s) deterministic time. Here, w is the window size, n is the size of the input string, z is the number of phrases in the parse, and s is the size of the alphabet. This matches the space and time bounds of previous results while removing constant size restrictions on the alphabet size. To achieve our result, we combine a simple modification and augmentation of the suffix tree with periodicity properties of sliding windows. We also apply this new technique to obtain an algorithm for the approximate rightmost LZ77 problem that uses O(n(log z + log log n)) time and O(n) space and produces a (1+e)-approximation of the rightmost parsing (for any constant e > 0). While this does not improve the best known time-space trade-offs for exact rightmost parsing, our algorithm is significantly simpler and exposes a direct connection between sliding window parsing and the approximate rightmost matching problem.
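
    The sliding window parse itself can be stated with a brute-force greedy scan. The sketch below is not the suffix-tree algorithm from the paper and does not meet its time bounds. It assumes a hypothetical phrase encoding in which a phrase is either a literal character or an `(offset, length)` pair whose source starts at most `w` positions back, and the function names are illustrative.

```python
def lz77_sliding_window(text, w):
    """Greedy LZ77 parse where each copy source starts within the last w positions."""
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_off = 0, 0
        for j in range(max(0, i - w), i):  # candidate source positions
            k = 0
            # The source may run past position i (self-overlapping copy).
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len == 0:
            phrases.append(text[i])  # no match in the window: literal
            i += 1
        else:
            phrases.append((best_off, best_len))
            i += best_len
    return phrases


def lz77_window_decode(phrases):
    """Invert lz77_sliding_window by copying one character at a time."""
    out = []
    for p in phrases:
        if isinstance(p, str):
            out.append(p)
        else:
            off, length = p
            for _ in range(length):
                out.append(out[-off])  # -off tracks the growing output
    return "".join(out)
```

    Shrinking `w` can only shorten the available matches, so the number of phrases z grows as the window gets smaller. With `w = 1` on `"abababab"` every phrase degenerates to a literal.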

    Efficient string algorithmics across alphabet realms

    Stringology is a subfield of computer science dedicated to analyzing and processing sequences of symbols. It plays a crucial role in various applications, including lossless compression, information retrieval, natural language processing, and bioinformatics. Recent algorithms often assume that the strings to be processed are over a polynomial integer alphabet, i.e., each symbol is an integer that is at most polynomial in the lengths of the strings. In contrast, the earlier days of stringology were shaped by the weaker comparison model, in which strings can only be accessed by mere equality comparisons of symbols, or (if the symbols are totally ordered) order comparisons of symbols. Nowadays, these flavors of the comparison model are respectively referred to as general unordered alphabet and general ordered alphabet. In this dissertation, we dive into the realm of both integer alphabets and general alphabets. We present new algorithms and lower bounds for classic problems, including Lempel-Ziv compression, computing the Lyndon array, and the detection of squares and runs. Our results show that, instead of only assuming the standard model of computation, it is important to also consider both weaker and stronger models. Particularly, we should not discard the older and weaker comparison-based models too quickly, as they are not only powerful theoretical tools, but also lead to fast and elegant practical solutions, even by today's standards.

    Decompressing Lempel-Ziv Compressed Text

    We consider the problem of decompressing the Lempel-Ziv 77 representation of a string $S$ of length $n$ using a working space as close as possible to the size $z$ of the input. The folklore solution for the problem runs in $O(n)$ time but requires random access to the whole decompressed text. Another folklore solution is to convert LZ77 into a grammar of size $O(z\log(n/z))$ and then stream $S$ in linear time. In this paper, we show that $O(n)$ time and $O(z)$ working space can be achieved for constant-size alphabets. On general alphabets of size $\sigma$, we describe (i) a trade-off achieving $O(n\log^\delta \sigma)$ time and $O(z\log^{1-\delta}\sigma)$ space for any $0\leq \delta\leq 1$, and (ii) a solution achieving $O(n)$ time and $O(z\log\log (n/z))$ space. The latter solution, in particular, dominates both folklore algorithms for the problem. Our solutions can, more generally, extract any specified subsequence of $S$ with little overhead on top of the linear running time and working space. As an immediate corollary, we show that our techniques yield improved results for pattern matching problems on LZ77-compressed text.
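
    The first folklore solution mentioned above can be sketched as follows. The phrase encoding is an assumption made for illustration: a phrase is either a literal character or a pair `(src, length)` that copies from absolute position `src` of the already decoded text. The name `lz77_decode_folklore` is hypothetical.

```python
def lz77_decode_folklore(phrases):
    """Decode LZ77 phrases in time linear in the output length.

    The whole decoded text is kept in `out` so that any earlier position
    can be read directly. This random access is what forces the working
    space to be proportional to n rather than to z.
    """
    out = []
    for p in phrases:
        if isinstance(p, str):
            out.append(p)  # literal phrase
        else:
            src, length = p
            # Copy one character at a time so that a source overlapping
            # the part being written is handled correctly.
            for k in range(length):
                out.append(out[src + k])
    return "".join(out)
```

    For instance, the three phrases `["a", "b", (0, 6)]` decode to a string of length 8, so here z = 3 while the decoder holds all n = 8 characters in memory.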