Entropy Lower Bounds for Dictionary Compression
We show that a wide class of dictionary compression methods (including LZ77, LZ78, and grammar compressors, as well as parsing-based structures) require |S|H_k(S) + Omega(|S| k log sigma / log_sigma |S|) bits to encode their output. This matches known upper bounds and improves the information-theoretic lower bound of |S|H_k(S). To this end, we abstract the crucial properties of parsings created by those methods, construct a certain family of strings, and analyze the parsings of those strings. We also show that for k = alpha log_sigma |S|, where 0 < alpha < 1 is a constant, the aforementioned methods produce an output of size at least (1/(1-alpha)) |S|H_k(S) bits. Thus our results separate dictionary compressors from context-based ones (such as PPM) and BWT-based ones, as the latter include methods achieving |S|H_k(S) + O(sigma^k log sigma) bits, i.e., the redundancy depends on k and sigma but not on |S|.
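For concreteness, H_k(S) above denotes the empirical k-th order entropy of S. Below is a minimal Python sketch of that quantity under its standard definition; the function names (`empirical_hk`, `h0`) are illustrative and not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def h0(counts, total):
    """Zeroth-order empirical entropy (bits/char) of a character distribution."""
    return sum((c / total) * math.log2(total / c) for c in counts.values() if c > 0)

def empirical_hk(s, k):
    """Empirical k-th order entropy H_k(S) in bits per character.

    H_k(S) = (1/|S|) * sum over length-k contexts w of |S_w| * H_0(S_w),
    where S_w collects the characters that follow occurrences of w in S.
    """
    n = len(s)
    if k == 0:
        return h0(Counter(s), n)
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[s[i - k:i]][s[i]] += 1
    total_bits = 0.0
    for ctx, counts in followers.items():
        m = sum(counts.values())
        total_bits += m * h0(counts, m)
    return total_bits / n

# Example: a highly repetitive string has small H_k for k >= 1.
print(empirical_hk("abababababab", 1))   # ~0.0
print(empirical_hk("abracadabra", 0))    # ~2.04
```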
Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties
We study a generalization of deduplication, which enables lossless
deduplication of highly similar data and show that standard deduplication with
fixed chunk length is a special case. We provide bounds on the expected length
of coded sequences for generalized deduplication and show that the coding has
asymptotic near-entropy cost under the proposed source model. More importantly,
we show that generalized deduplication allows for multiple orders of magnitude
faster convergence than standard deduplication. This means that generalized
deduplication can provide compression benefits much earlier than standard
deduplication, which is key in practical systems. Numerical examples
demonstrate our results, showing that our lower bounds are achievable, and
illustrating the potential gain of using the generalization over standard
deduplication. In fact, we show that even for a simple case of generalized
deduplication, the gain in convergence speed is linear with the size of the
data chunks.
Comment: 15 pages, 4 figures. This is the full version of a paper accepted for GLOBECOM 201
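As a baseline for the generalization described above, here is a minimal sketch of standard deduplication with fixed chunk length, the special case mentioned in the abstract. The function names and SHA-256 fingerprinting are illustrative choices, not the paper's construction.

```python
import hashlib

def dedup_fixed_chunks(data: bytes, chunk_len: int):
    """Standard deduplication with fixed-length chunks (illustrative sketch).

    Splits the data into fixed-length chunks, stores each distinct chunk once,
    and represents the input as a sequence of references into the chunk store.
    """
    store = {}          # fingerprint -> chunk bytes
    references = []     # encoded sequence: one fingerprint per chunk position
    for i in range(0, len(data), chunk_len):
        chunk = data[i:i + chunk_len]
        fp = hashlib.sha256(chunk).hexdigest()
        store.setdefault(fp, chunk)
        references.append(fp)
    return store, references

def restore(store, references):
    """Lossless reconstruction from the chunk store and the reference sequence."""
    return b"".join(store[fp] for fp in references)

data = b"ABCDABCDABCDXYZW" * 4
store, refs = dedup_fixed_chunks(data, chunk_len=4)
assert restore(store, refs) == data
print(len(store), "distinct chunks for", len(refs), "chunk positions")
```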
New Algorithms and Lower Bounds for Sequential-Access Data Compression
This thesis concerns sequential-access data compression, i.e., compression by
algorithms that read the input one or more times from beginning to end. In one
chapter we
consider adaptive prefix coding, for which we must read the input character by
character, outputting each character's self-delimiting codeword before reading
the next one. We show how to encode and decode each character in constant
worst-case time while producing an encoding whose length is worst-case optimal.
In another chapter we consider one-pass compression with memory bounded in
terms of the alphabet size and context length, and prove a nearly tight
tradeoff between the amount of memory we can use and the quality of the
compression we can achieve. In a third chapter we consider compression in the
read/write streams model, which allows a number of passes and an amount of
memory both
polylogarithmic in the size of the input. We first show how to achieve
universal compression using only one pass over one stream. We then show that
one stream is not sufficient for achieving good grammar-based compression.
Finally, we show that two streams are necessary and sufficient for achieving
entropy-only bounds.
Comment: draft of PhD thesis
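To make the adaptive prefix-coding setting of the first chapter concrete, a toy one-pass coder is sketched below. It is not the thesis's constant-time, worst-case-optimal algorithm; it only illustrates emitting a self-delimiting codeword per character before reading the next one, using Elias gamma codes over frequency ranks as an arbitrary illustrative choice.

```python
def elias_gamma(n: int) -> str:
    """Self-delimiting Elias gamma code for a positive integer n."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def adaptive_prefix_encode(text: str) -> str:
    """Toy one-pass adaptive prefix coder (not the thesis's algorithm).

    Each character is emitted as the Elias gamma code of its current rank
    (1 = most frequent so far); an unseen character gets rank = #seen + 1
    followed by 8 literal bits (ASCII assumed). Counts are updated only after
    each character, so a decoder can mirror the model without lookahead.
    """
    from collections import Counter
    counts = Counter()
    out = []
    for ch in text:
        ranking = sorted(counts, key=lambda c: (-counts[c], c))
        if ch in counts:
            out.append(elias_gamma(ranking.index(ch) + 1))
        else:
            out.append(elias_gamma(len(ranking) + 1) + format(ord(ch), "08b"))
        counts[ch] += 1
    return "".join(out)

encoded = adaptive_prefix_encode("abracadabra")
print(len(encoded), "bits for 11 characters")
```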
Universal quantum information compression and degrees of prior knowledge
We describe a universal information compression scheme that compresses any
pure quantum i.i.d. source asymptotically to its von Neumann entropy, with no
prior knowledge of the structure of the source. We introduce a diagonalisation
procedure that enables any classical compression algorithm to be utilised in a
quantum context. Our scheme is then based on the corresponding quantum
translation of the classical Lempel-Ziv algorithm. Our methods lead to a
conceptually simple way of estimating the entropy of a source in terms of the
measurement of an associated length parameter while maintaining high fidelity
for long blocks. As a by-product we also estimate the eigenbasis of the source.
Since our scheme is based on the Lempel-Ziv method, it can be applied also to
target sequences that are not i.i.d.
Comment: 17 pages, no figures. A preliminary version of this work was presented at EQIS '02, Tokyo, September 2002
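Since the scheme builds on the classical Lempel-Ziv algorithm, a purely classical sketch may help: LZ78 parsing together with the standard phrase-count estimate of the entropy rate, a classical analogue of reading the entropy off a length parameter of the compressed output. The function names and the Bernoulli test source are illustrative, not from the paper.

```python
import math, random

def lz78_phrases(s: str):
    """LZ78 parse: greedily split s into distinct phrases, each extending a
    previously seen phrase by one character."""
    dictionary, phrases, current = {""}, [], ""
    for ch in s:
        if current + ch in dictionary:
            current += ch
        else:
            phrases.append(current + ch)
            dictionary.add(current + ch)
            current = ""
    if current:
        phrases.append(current)
    return phrases

def lz_entropy_estimate(s: str) -> float:
    """Entropy-rate estimate (bits/char) from the LZ code length,
    roughly c*log2(c)/n for c phrases over n characters."""
    c, n = len(lz78_phrases(s)), len(s)
    return c * math.log2(c) / n

random.seed(0)
biased = "".join(random.choices("01", weights=[0.9, 0.1], k=20000))
print(lz_entropy_estimate(biased))  # approaches H(0.9, 0.1) ~ 0.47 bits/char for long inputs
```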
On empirical cumulant generating functions of code lengths for individual sequences
We consider the problem of lossless compression of individual sequences using
finite-state (FS) machines, from the perspective of the best achievable
empirical cumulant generating function (CGF) of the code length, i.e., the
normalized logarithm of the empirical average of the exponentiated code length.
Since the probabilistic CGF is minimized in terms of the R\'enyi entropy of the
source, one of the motivations of this study is to derive an
individual-sequence analogue of the R\'enyi entropy, in the same way that the
FS compressibility is the individual-sequence counterpart of the Shannon
entropy. We consider the CGF of the code length both from the perspective of
fixed-to-variable (F-V) length coding and the perspective of
variable-to-variable (V-V) length coding, where the latter turns out to yield a
better result, which coincides with the FS compressibility. We also extend our
results to compression with side information, available at both the encoder and
decoder. In this case, the V-V version no longer coincides with the FS
compressibility, but results in a different complexity measure.
Comment: 15 pages; submitted for publication
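A small sketch of the central quantity may help. Following the abstract's wording, the empirical CGF below is the normalized logarithm of the empirical average of the exponentiated code length; the normalization by block length and the example code lengths are illustrative assumptions and may differ in detail from the paper.

```python
import math

def empirical_cgf(code_lengths, lam, block_len):
    """Empirical cumulant generating function of code lengths (sketch).

    Lambda(lam) = (1 / block_len) * log2( mean_i 2^(lam * L_i) ),
    i.e., the normalized logarithm of the empirical average of the
    exponentiated code length; larger lam penalizes long codewords more.
    """
    avg = sum(2.0 ** (lam * L) for L in code_lengths) / len(code_lengths)
    return math.log2(avg) / block_len

# Hypothetical per-block code lengths (bits) produced by some finite-state
# encoder on blocks of 100 characters each.
lengths = [48, 52, 51, 70, 47, 49, 95, 50]
for lam in (0.0, 0.5, 1.0, 2.0):
    print(lam, round(empirical_cgf(lengths, lam, block_len=100), 3))
```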