36 research outputs found
Universal Compression of Power-Law Distributions
English words and the outputs of many other natural processes are well-known
to follow a Zipf distribution. Yet this thoroughly-established property has
never been shown to help compress or predict these important processes. We show
that the expected redundancy of Zipf distributions of order is
roughly the power of the expected redundancy of unrestricted
distributions. Hence for these orders, Zipf distributions can be better
compressed and predicted than was previously known. Unlike the expected case,
we show that worst-case redundancy is roughly the same for Zipf and for
unrestricted distributions. Hence Zipf distributions have significantly
different worst-case and expected redundancies, making them the first natural
distribution class shown to have such a difference. Comment: 20 pages
Large alphabets: Finite, infinite, and scaling models
How can we effectively model situations with large alphabets? On a pragmatic level, any engineered system, be it for inference, communication, or encryption, requires working with a finite number of symbols. Therefore, the most straightforward model is a finite alphabet. However, to emphasize the disproportionate size of the alphabet, one may want to compare its finite size with the length of the data at hand. More generally, this gives rise to scaling models that strive to capture regimes of operation where one anticipates such imbalance. Large alphabets may also be idealized as infinite. The caveat then is that such generality strips away much of the convenient machinery of finite settings. However, some of it may be salvaged by refocusing the tasks of interest, such as by moving from sequence to pattern compression, or by minimally restricting the classes of infinite models, such as via tail properties. In this paper we present an overview of models for large alphabets, some recent results, and possible directions in this area.
Functional central limit theorems for occupancies and missing mass process in infinite urn models
We study the infinite urn scheme when the balls are sequentially distributed
over an infinite number of urns labelled 1, 2, ... so that the urn j at every
draw gets a ball with probability p_j, where ∑_j p_j = 1. We prove functional
central limit theorems, in discrete time and in the poissonised version, for
the urn occupancies process, the odd-occupancy process, and the missing mass
process, extending the known non-functional central limit theorems.
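The three processes named in this abstract can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes the infinite scheme is truncated to a finite number of urns, and the geometric probabilities, the function name `simulate_urn_paths`, and its parameters are all hypothetical choices made for illustration. It tracks the number of occupied urns, the number of urns holding an odd number of balls, and the missing mass (the total probability of the still-empty urns) after each draw.

```python
import random


def simulate_urn_paths(probs, n_draws, seed=0):
    """Simulate a (truncated) urn scheme and record three processes.

    probs   -- probabilities p_j of the urns (assumed to sum to 1)
    n_draws -- number of balls thrown
    Returns a list of (occupied, odd, missing_mass) tuples, one per draw.
    """
    rng = random.Random(seed)
    counts = [0] * len(probs)
    occupied = 0          # urns containing at least one ball
    odd = 0               # urns containing an odd number of balls
    missing = sum(probs)  # total probability of the empty urns
    path = []
    for j in rng.choices(range(len(probs)), weights=probs, k=n_draws):
        if counts[j] == 0:
            occupied += 1
            missing -= probs[j]
        counts[j] += 1
        # A new ball flips the parity of urn j.
        odd += 1 if counts[j] % 2 == 1 else -1
        path.append((occupied, odd, missing))
    return path


# Illustrative assumption: p_j proportional to 2^-j, truncated at 40 urns.
raw = [2.0 ** -j for j in range(1, 41)]
total = sum(raw)
probs = [p / total for p in raw]

path = simulate_urn_paths(probs, n_draws=1000)
print(path[-1])
```

Rescaling and centring such paths is what a functional central limit theorem is about; the sketch only generates the raw trajectories.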
Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties
We study a generalization of deduplication, which enables lossless
deduplication of highly similar data and show that standard deduplication with
fixed chunk length is a special case. We provide bounds on the expected length
of coded sequences for generalized deduplication and show that the coding has
asymptotic near-entropy cost under the proposed source model. More importantly,
we show that generalized deduplication allows for multiple orders of magnitude
faster convergence than standard deduplication. This means that generalized
deduplication can provide compression benefits much earlier than standard
deduplication, which is key in practical systems. Numerical examples
demonstrate our results, showing that our lower bounds are achievable, and
illustrating the potential gain of using the generalization over standard
deduplication. In fact, we show that even for a simple case of generalized
deduplication, the gain in convergence speed is linear in the size of the
data chunks. Comment: 15 pages, 4 figures. This is the full version of a paper accepted for
GLOBECOM 201
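To make the relation between the two schemes concrete, here is a minimal sketch of fixed-chunk deduplication with an optional base/deviation split. The abstract does not say which transformation the paper uses, so the suffix split in `split_chunk` is purely an illustrative assumption, and all function and parameter names are hypothetical. With `dev_bytes=0` the deviation is empty and the code reduces to standard deduplication with fixed chunk length.

```python
def split_chunk(chunk, dev_bytes):
    """Hypothetical transformation: the last dev_bytes bytes are the deviation."""
    if dev_bytes == 0:
        return chunk, b""
    return chunk[:-dev_bytes], chunk[-dev_bytes:]


def dedup_encode(data, chunk_len, dev_bytes=0):
    """Encode data as a list of unique bases plus (base index, deviation) pairs."""
    if len(data) % chunk_len != 0:
        raise ValueError("data length must be a multiple of chunk_len")
    if not 0 <= dev_bytes < chunk_len:
        raise ValueError("dev_bytes must satisfy 0 <= dev_bytes < chunk_len")
    bases = []   # unique bases, in order of first appearance
    index = {}   # base -> position in bases
    refs = []    # one (base index, deviation) pair per chunk
    for start in range(0, len(data), chunk_len):
        base, dev = split_chunk(data[start:start + chunk_len], dev_bytes)
        if base not in index:
            index[base] = len(bases)
            bases.append(base)
        refs.append((index[base], dev))
    return bases, refs


def dedup_decode(bases, refs):
    """Losslessly rebuild the original data from bases and references."""
    return b"".join(bases[i] + dev for i, dev in refs)


data = b"AAAA0001AAAA0002BBBB0001AAAA0001"
std_bases, std_refs = dedup_encode(data, chunk_len=8)               # standard
gen_bases, gen_refs = dedup_encode(data, chunk_len=8, dev_bytes=4)  # generalized
print(len(std_bases), len(gen_bases))  # 3 2
```

In this toy input, similar chunks share a base, so the generalized variant stores fewer dictionary entries at the cost of keeping a deviation per chunk.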
Functional Central Limit Theorems for Occupancies and Missing Mass Process in Infinite Urn Models
We study the infinite urn scheme when the balls are sequentially distributed over an infinite number of urns labeled 1, 2, ... so that the urn j at every draw gets a ball with probability p_j, where ∑_j p_j = 1. We prove functional central limit theorems for discrete time and the Poissonized version for the urn occupancies process, for the odd occupancy and for the missing mass processes extending the known non-functional central limit theorems.
Universal Coding on Infinite Alphabets: Exponentially Decreasing Envelopes
This paper deals with the problem of universal lossless coding on a countably
infinite alphabet. It focuses on some classes of sources defined by an envelope
condition on the marginal distribution, namely exponentially decreasing
envelope classes with exponent . The minimax redundancy of
exponentially decreasing envelope classes is proved to be equivalent to
. Then a coding strategy is proposed, with
a Bayes redundancy equivalent to the maximin redundancy. Finally, an adaptive
algorithm is provided, whose redundancy is equivalent to the minimax redundancy.