36 research outputs found

Universal Compression of Power-Law Distributions

Author: Falahatgar Moein
Jafarpour Ashkan
Orlitsky Alon
Pichapati Venkatadheeraj
Suresh Ananda Theertha
Publication date: 30/04/2015

English words and the outputs of many other natural processes are well-known to follow a Zipf distribution. Yet this thoroughly-established property has never been shown to help compress or predict these important processes. We show that the expected redundancy of Zipf distributions of order $\alpha > 1$ is roughly the $1/\alpha$ power of the expected redundancy of unrestricted distributions. Hence for these orders, Zipf distributions can be better compressed and predicted than was previously known. Unlike the expected case, we show that worst-case redundancy is roughly the same for Zipf and for unrestricted distributions. Hence Zipf distributions have significantly different worst-case and expected redundancies, making them the first natural distribution class shown to have such a difference. Comment: 20 pages
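As a rough illustration of the distribution class named in this abstract, the sketch below builds a Zipf distribution of order $\alpha$ with $p_i \propto i^{-\alpha}$ and computes its entropy in bits. The finite support size `k` and the function names `zipf_pmf` and `entropy_bits` are assumptions made for the example and are not taken from the paper; the code does not reproduce the paper's redundancy bounds.

```python
import math


def zipf_pmf(alpha: float, k: int) -> list[float]:
    """Zipf probabilities p_i proportional to i**(-alpha) on {1, ..., k}.

    Truncating to a finite support k is an assumption of this sketch.
    """
    weights = [i ** -alpha for i in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy_bits(pmf: list[float]) -> float:
    """Shannon entropy of a probability vector, in bits."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)


if __name__ == "__main__":
    for alpha in (1.2, 1.5, 2.0):
        pmf = zipf_pmf(alpha, k=1000)
        print(alpha, round(entropy_bits(pmf), 3))
```

Larger orders concentrate more mass on the first few symbols, so the entropy printed by the loop decreases as `alpha` grows.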

arXiv.org e-Print Archive

Crossref

Large alphabets: Finite, infinite, and scaling models

Author: Dahleh Munther A.
Ohannessian Mesrob I.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/03/2012

How can we effectively model situations with large alphabets? On a pragmatic level, any engineered system, be it for inference, communication, or encryption, requires working with a finite number of symbols. Therefore, the most straightforward model is a finite alphabet. However, to emphasize the disproportionate size of the alphabet, one may want to compare its finite size with the length of data at hand. More generally, this gives rise to scaling models that strive to capture regimes of operation where one anticipates such imbalance. Large alphabets may also be idealized as infinite. The caveat then is that such generality strips away much of the convenient machinery of finite settings. However, some of it may be salvaged by refocusing the tasks of interest, such as by moving from sequence to pattern compression, or by minimally restricting the classes of infinite models, such as via tail properties. In this paper we present an overview of models for large alphabets, some recent results, and possible directions in this area.

DSpace@MIT

Crossref

Functional central limit theorems for occupancies and missing mass process in infinite urn models

Author: Chebunin Mikhail
Zuyev Sergei
Publication date: 26/06/2019

We study the infinite urn scheme when the balls are sequentially distributed over an infinite number of urns labelled 1, 2, ... so that the urn $j$ at every draw gets a ball with probability $p_j$, where $\sum_j p_j = 1$. We prove functional central limit theorems for discrete time and the poissonised version for the urn occupancies process, for the odd-occupancy and for the missing mass processes extending the known non-functional central limit theorems.
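The quantities named in this abstract can be made concrete with a small simulation. The sketch below throws `n` balls into urns and reports the number of occupied urns, the number of urns holding an odd count, and the missing mass (the total probability of urns that are still empty). Truncating the infinite scheme to finitely many urns, the choice $p_j \propto j^{-2}$, and the function name `simulate_urns` are assumptions of this example, not part of the paper.

```python
import random
from collections import Counter


def simulate_urns(probs: list[float], n: int, seed: int = 0):
    """Throw n balls into urns 1..len(probs) with the given probabilities.

    Returns (occupied urns, urns with an odd count, missing mass).
    """
    rng = random.Random(seed)
    urns = range(1, len(probs) + 1)
    draws = rng.choices(urns, weights=probs, k=n)
    counts = Counter(draws)
    occupied = len(counts)
    odd = sum(1 for c in counts.values() if c % 2 == 1)
    # Missing mass: probability of all urns that received no ball.
    missing_mass = sum(p for j, p in zip(urns, probs) if j not in counts)
    return occupied, odd, missing_mass


if __name__ == "__main__":
    # Assumed example: p_j proportional to j**-2 on a truncated set of urns.
    weights = [j ** -2 for j in range(1, 10_001)]
    total = sum(weights)
    probs = [w / total for w in weights]
    for n in (10, 100, 1000):
        print(n, simulate_urns(probs, n))
```

Repeating the simulation over many seeds and growing `n` gives empirical trajectories of these three processes, which are the objects the stated limit theorems describe.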

arXiv.org e-Print Archive

Chalmers Research

Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties

Author: Lucani Daniel E.
Vestergaard Rasmus
Zhang Qi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 07/08/2019

We study a generalization of deduplication, which enables lossless deduplication of highly similar data and show that standard deduplication with fixed chunk length is a special case. We provide bounds on the expected length of coded sequences for generalized deduplication and show that the coding has asymptotic near-entropy cost under the proposed source model. More importantly, we show that generalized deduplication allows for multiple orders of magnitude faster convergence than standard deduplication. This means that generalized deduplication can provide compression benefits much earlier than standard deduplication, which is key in practical systems. Numerical examples demonstrate our results, showing that our lower bounds are achievable, and illustrating the potential gain of using the generalization over standard deduplication. In fact, we show that even for a simple case of generalized deduplication, the gain in convergence speed is linear with the size of the data chunks. Comment: 15 pages, 4 figures. This is the full version of a paper accepted for GLOBECOM 201
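For orientation, standard deduplication with fixed chunk length, which the abstract describes as a special case, can be sketched as follows. This is only the baseline technique, not the generalized scheme studied in the paper; the function names `deduplicate` and `reconstruct` and the in-memory chunk store are assumptions of the example.

```python
def deduplicate(data: bytes, chunk_len: int):
    """Split data into fixed-length chunks and store each distinct chunk once.

    Returns (unique chunks in first-seen order, list of chunk references).
    """
    index = {}   # chunk -> position in `chunks`
    chunks = []  # distinct chunks
    refs = []    # one reference per chunk of the input
    for start in range(0, len(data), chunk_len):
        chunk = data[start:start + chunk_len]
        if chunk not in index:
            index[chunk] = len(chunks)
            chunks.append(chunk)
        refs.append(index[chunk])
    return chunks, refs


def reconstruct(chunks, refs) -> bytes:
    """Losslessly rebuild the original data from chunks and references."""
    return b"".join(chunks[i] for i in refs)


if __name__ == "__main__":
    data = b"abcdabcdabcdxyz!"
    chunks, refs = deduplicate(data, chunk_len=4)
    print(chunks, refs)
    print(reconstruct(chunks, refs) == data)
```

In this baseline, two chunks share storage only when they are identical, which is why highly similar but non-identical data gains nothing until exact repeats appear.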

arXiv.org e-Print Archive

Crossref

Functional Central Limit Theorems for Occupancies and Missing Mass Process in Infinite Urn Models

Author: Chebunin Mikhail
Zuev Sergey
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

We study the infinite urn scheme when the balls are sequentially distributed over an infinite number of urns labeled 1, 2, ... so that the urn $j$ at every draw gets a ball with probability $p_j$, where $\sum_j p_j = 1$. We prove functional central limit theorems for discrete time and the Poissonized version for the urn occupancies process, for the odd occupancy and for the missing mass processes extending the known non-functional central limit theorems.

Chalmers Research

Universal Coding on Infinite Alphabets: Exponentially Decreasing Envelopes

Author: Bontemps Dominique
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 31/05/2010

This paper deals with the problem of universal lossless coding on a countable infinite alphabet. It focuses on some classes of sources defined by an envelope condition on the marginal distribution, namely exponentially decreasing envelope classes with exponent $\alpha$. The minimax redundancy of exponentially decreasing envelope classes is proved to be equivalent to $\frac{1}{4 \alpha \log e} \log^2 n$. Then a coding strategy is proposed, with a Bayes redundancy equivalent to the maximin redundancy. At last, an adaptive algorithm is provided, whose redundancy is equivalent to the minimax redundancy.
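To get a feel for the size of the leading term quoted in this abstract, the sketch below evaluates $\frac{1}{4 \alpha \log e} \log^2 n$ numerically. It assumes that logarithms are taken base 2, so the result is in bits; the abstract does not state the base. The function name `envelope_redundancy_term` is hypothetical, and the value is only the asymptotic equivalent, not an exact redundancy for any finite $n$.

```python
import math


def envelope_redundancy_term(alpha: float, n: int) -> float:
    """Evaluate log^2(n) / (4 * alpha * log e), assuming base-2 logarithms."""
    log_e = math.log2(math.e)
    return math.log2(n) ** 2 / (4 * alpha * log_e)


if __name__ == "__main__":
    for n in (2 ** 10, 2 ** 20):
        for alpha in (0.5, 1.0, 2.0):
            print(n, alpha, round(envelope_redundancy_term(alpha, n), 3))
```

The term grows with the square of $\log n$ and is inversely proportional to $\alpha$, so faster-decaying envelopes give a smaller leading term.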

arXiv.org e-Print Archive

CiteSeerX