1,806 research outputs found

    Lempel-Ziv Parsing in External Memory

    Full text link
    For decades, computing the LZ factorization (or LZ77 parsing) of a string has been a requisite and computationally intensive step in many diverse applications, including text indexing and data compression. Many algorithms for LZ77 parsing have been discovered over the years; however, despite the increasing need to apply LZ77 to massive data sets, no algorithm to date scales to inputs that exceed the size of internal memory. In this paper we describe the first algorithm for computing the LZ77 parsing in external memory. Our algorithm is fast in practice and will allow the next generation of text indexes to be realised for massive strings and string collections.Comment: 10 page

    On Match Lengths, Zero Entropy and Large Deviations - with Application to Sliding Window Lempel-Ziv Algorithm

    Full text link
    The Sliding Window Lempel-Ziv (SWLZ) algorithm that makes use of recurrence times and match lengths has been studied from various perspectives in information theory literature. In this paper, we undertake a finer study of these quantities under two different scenarios, i) \emph{zero entropy} sources that are characterized by strong long-term memory, and ii) the processes with weak memory as described through various mixing conditions. For zero entropy sources, a general statement on match length is obtained. It is used in the proof of almost sure optimality of Fixed Shift Variant of Lempel-Ziv (FSLZ) and SWLZ algorithms given in literature. Through an example of stationary and ergodic processes generated by an irrational rotation we establish that for a window of size nwn_w, a compression ratio given by O(lognwnwa)O(\frac{\log n_w}{{n_w}^a}) where aa depends on nwn_w and approaches 1 as nwn_w \rightarrow \infty, is obtained under the application of FSLZ and SWLZ algorithms. Also, we give a general expression for the compression ratio for a class of stationary and ergodic processes with zero entropy. Next, we extend the study of Ornstein and Weiss on the asymptotic behavior of the \emph{normalized} version of recurrence times and establish the \emph{large deviation property} (LDP) for a class of mixing processes. Also, an estimator of entropy based on recurrence times is proposed for which large deviation principle is proved for sources satisfying similar mixing conditions.Comment: accepted to appear in IEEE Transactions on Information Theor

    Pattern matching in Lempel-Ziv compressed strings: fast, simple, and deterministic

    Full text link
    Countless variants of the Lempel-Ziv compression are widely used in many real-life applications. This paper is concerned with a natural modification of the classical pattern matching problem inspired by the popularity of such compression methods: given an uncompressed pattern s[1..m] and a Lempel-Ziv representation of a string t[1..N], does s occur in t? Farach and Thorup gave a randomized O(nlog^2(N/n)+m) time solution for this problem, where n is the size of the compressed representation of t. We improve their result by developing a faster and fully deterministic O(nlog(N/n)+m) time algorithm with the same space complexity. Note that for highly compressible texts, log(N/n) might be of order n, so for such inputs the improvement is very significant. A (tiny) fragment of our method can be used to give an asymptotically optimal solution for the substring hashing problem considered by Farach and Muthukrishnan.Comment: submitte

    Pushdown Compression

    Get PDF
    The pressing need for eficient compression schemes for XML documents has recently been focused on stack computation [6, 9], and in particular calls for a formulation of information-lossless stack or pushdown compressors that allows a formal analysis of their performance and a more ambitious use of the stack in XML compression, where so far it is mainly connected to parsing mechanisms. In this paper we introduce the model of pushdown compressor, based on pushdown transducers that compute a single injective function while keeping the widest generality regarding stack computation. The celebrated Lempel-Ziv algorithm LZ78 [10] was introduced as a general purpose compression algorithm that outperforms finite-state compressors on all sequences. We compare the performance of the Lempel-Ziv algorithm with that of the pushdown compressors, or compression algorithms that can be implemented with a pushdown transducer. This comparison is made without any a priori assumption on the data's source and considering the asymptotic compression ratio for infinite sequences. We prove that Lempel-Ziv is incomparable with pushdown compressors
    corecore