    Practical Evaluation of Lempel-Ziv-78 and Lempel-Ziv-Welch Tries

    We present the first thorough practical study of Lempel-Ziv-78 and Lempel-Ziv-Welch computation based on trie data structures. With a careful selection of trie representations we can beat well-tuned popular trie data structures like Judy, m-Bonsai, or Cedar.
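    As a point of reference for what these tries compute, here is a minimal sketch of LZ78 parsing over a plain dictionary-of-dicts trie; it is a naive baseline, not one of the tuned trie representations the paper evaluates:

        # LZ78 parse with a dictionary-based trie: each node maps a character
        # to a child node id; each phrase is a (parent phrase id, char) pair.
        def lz78_parse(text: str) -> list[tuple[int, str]]:
            trie = {0: {}}                # node id -> {char: child node id}
            phrases = []
            node, next_id = 0, 1
            for ch in text:
                if ch in trie[node]:      # longest match: walk down the trie
                    node = trie[node][ch]
                else:                     # phrase ends: emit it, grow the trie
                    trie[node][ch] = next_id
                    trie[next_id] = {}
                    phrases.append((node, ch))
                    node, next_id = 0, next_id + 1
            if node != 0:                 # flush an unfinished final phrase
                phrases.append((node, ""))
            return phrases

        print(lz78_parse("abababb"))      # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'b')]

    LZW parsing works the same way, except that the trie is pre-seeded with all single characters and the output is a sequence of phrase ids alone.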

    Small-Space LCE Data Structure with Constant-Time Queries

    The longest common extension (LCE) problem is to preprocess a given string w of length n so that the length of the longest common prefix between the suffixes of w starting at any two given positions can be answered quickly. In this paper, we present a data structure of O(zτ^2 + n/τ) words of space which answers LCE queries in O(1) time and can be built in O(n log σ) time, where 1 ≤ τ ≤ √n is a parameter, z is the size of the Lempel-Ziv 77 factorization of w, and σ is the alphabet size. The proposed LCE data structure does not access the input string w when answering queries, and thus w can be deleted after preprocessing. On top of this main result, we obtain further results using (variants of) our LCE data structure, including the following:
    - For highly repetitive strings where the zτ^2 term is dominated by n/τ, we obtain a constant-time, sub-linear-space LCE query data structure.
    - Even when the input string is not well compressible via Lempel-Ziv 77 factorization, we can still obtain a constant-time, sub-linear-space LCE data structure for suitable τ and for σ ≤ 2^{o(log n)}.
    - The time-space trade-off lower bounds for the LCE problem by Bille et al. [J. Discrete Algorithms, 25:42-50, 2014] and by Kosolobov [CoRR, abs/1611.02891, 2016] do not apply in some cases with our LCE data structure.
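    For intuition, the supported query is shown below as the naive linear-time scan that the O(1)-time structure replaces (the structure additionally avoids keeping w itself):

        # Naive longest common extension: compare the suffixes of w starting
        # at i and j character by character. Worst-case O(n) per query; the
        # paper's structure answers the same query in O(1) without storing w.
        def lce(w: str, i: int, j: int) -> int:
            k = 0
            while i + k < len(w) and j + k < len(w) and w[i + k] == w[j + k]:
                k += 1
            return k

        print(lce("abracadabra", 0, 7))   # 4, since both suffixes begin "abra"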

    Lempel-Ziv Compression in a Sliding Window

    We present new algorithms for the sliding window Lempel-Ziv (LZ77) problem and the approximate rightmost LZ77 parsing problem. Our main result is a new and surprisingly simple algorithm that computes the sliding window LZ77 parse in O(w) space and either O(n) expected time or O(n log log w + z log log s) deterministic time. Here, w is the window size, n is the size of the input string, z is the number of phrases in the parse, and s is the size of the alphabet. This matches the space and time bounds of previous results while removing constant-size restrictions on the alphabet. To achieve our result, we combine a simple modification and augmentation of the suffix tree with periodicity properties of sliding windows. We also apply this new technique to obtain an algorithm for the approximate rightmost LZ77 problem that uses O(n(log z + log log n)) time and O(n) space and produces a (1+ε)-approximation of the rightmost parsing (for any constant ε > 0). While this does not improve the best known time-space trade-offs for exact rightmost parsing, our algorithm is significantly simpler and exposes a direct connection between sliding window parsing and the approximate rightmost matching problem.
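    For contrast with the O(w)-space result, here is the obvious quadratic-time sliding-window parser that the paper improves on (a simplified sketch; phrase format and tie-breaking are not the paper's):

        # Naive sliding-window LZ77: at each position, scan the previous w
        # characters for the longest match, then emit (distance, length,
        # next char). Quadratic time; the paper achieves O(n) expected time.
        def lz77_sliding(text: str, w: int) -> list[tuple[int, int, str]]:
            phrases, i = [], 0
            while i < len(text):
                best_len, best_dist = 0, 0
                for dist in range(1, min(i, w) + 1):
                    l = 0
                    # stop one char early so a literal always remains to emit
                    while i + l < len(text) - 1 and text[i + l - dist] == text[i + l]:
                        l += 1
                    if l > best_len:
                        best_len, best_dist = l, dist
                phrases.append((best_dist, best_len, text[i + best_len]))
                i += best_len + 1
            return phrases

        print(lz77_sliding("aacaacabcaba", 6))
        # [(0, 0, 'a'), (1, 1, 'c'), (3, 4, 'b'), (3, 3, 'a')]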

    Compression with the tudocomp Framework

    We present a framework facilitating the implementation and comparison of text compression algorithms. We evaluate its features with a case study on two novel compression algorithms, based on the Lempel-Ziv compression schemes, that perform well on highly repetitive texts.

    Computing All Distinct Squares in Linear Time for Integer Alphabets

    Given a string over an integer alphabet, we present an algorithm that computes the set of all distinct squares occurring in this string in time linear in the string length. As an application, we show how to compute the tree topology of the minimal augmented suffix tree in linear time. Aside from that, we give an algorithm that computes the longest previous factor table in a succinct representation using compressed working space.
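    A square here is a substring of the form xx. A brute-force enumeration makes the object of study concrete; it runs in roughly cubic time, where the paper's algorithm is linear:

        # Naive enumeration of distinct squares (substrings of the form xx).
        # Roughly cubic-time baseline; the paper computes the same set in
        # linear time for integer alphabets.
        def distinct_squares(s: str) -> set[str]:
            squares = set()
            n = len(s)
            for i in range(n):
                for half in range(1, (n - i) // 2 + 1):
                    if s[i:i + half] == s[i + half:i + 2 * half]:
                        squares.add(s[i:i + 2 * half])
            return squares

        print(sorted(distinct_squares("abaaba")))  # ['aa', 'abaaba']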

    Indexing Highly Repetitive String Collections

    Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionality. As this new technology permeated applications like bioinformatics, string collections experienced growth that outpaces Moore's Law and challenges our ability to handle them even in compressed form. Fortunately, it turns out that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the basis of all the existing indexes, and the various structures that have been proposed, comparing them in both theoretical and practical aspects. We conclude with the current challenges in this fascinating field.

    Text and Image Compression based on Data Mining Perspective

    Data compression has been one of the enabling technologies of the ongoing digital multimedia revolution for decades, yielding renowned algorithms like Huffman coding, LZ77, gzip, RLE, and JPEG. Researchers have explored character- and word-based approaches to text and image compression, missing the larger opportunity of pattern mining from large databases. The central theme of our compression research is the compression perspective of data mining suggested by Naren Ramakrishnan et al., wherein efficient versions of seminal text/image compression algorithms are developed using various frequent pattern mining (FPM) and clustering techniques. This paper proposes a set of novel, hybrid text and image compression algorithms employing efficient data structures such as hash tables and graphs. We retrieve an optimal set of patterns through pruning, which reduces the code table size and is efficient in terms of database scans and storage space. Moreover, a detailed analysis of time and space complexity is performed for some of our approaches, and various text structures are proposed. Simulation results over various sparse/dense benchmark text corpora indicate an 18% to 751% improvement in compression ratio over other state-of-the-art techniques. In image compression, our results show up to 45% improvement in compression ratio and up to 40% in image quality efficiency.