
    Estimating Renyi Entropy of Discrete Distributions

    It was recently shown that estimating the Shannon entropy $H(p)$ of a discrete $k$-symbol distribution $p$ requires $\Theta(k/\log k)$ samples, a number that grows near-linearly in the support size. In many applications $H(p)$ can be replaced by the more general Rényi entropy of order $\alpha$, $H_\alpha(p)$. We determine the number of samples needed to estimate $H_\alpha(p)$ for all $\alpha$, showing that $\alpha < 1$ requires a super-linear, roughly $k^{1/\alpha}$, number of samples, noninteger $\alpha > 1$ requires a near-linear $k$ samples, but, perhaps surprisingly, integer $\alpha > 1$ requires only $\Theta(k^{1-1/\alpha})$ samples. Furthermore, building on a recently established connection between polynomial approximation and estimation of additive functions of the form $\sum_x f(p_x)$, we reduce the sample complexity for noninteger values of $\alpha$ by a factor of $\log k$ compared to the empirical estimator. The estimators achieving these bounds are simple and run in time linear in the number of samples. Our lower bounds provide explicit constructions of distributions with different Rényi entropies that are hard to distinguish.
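    As a point of reference for the estimators above, the empirical (plug-in) approach simply raises the observed symbol frequencies to the power $\alpha$. A minimal sketch in Python, with the function name and interface chosen here for illustration (the paper's improved estimators apply bias corrections beyond this):

    ```python
    from collections import Counter
    import math

    def empirical_renyi_entropy(samples, alpha):
        """Plug-in Renyi entropy estimate of order alpha (alpha != 1).

        H_alpha = log(sum_x p_x^alpha) / (1 - alpha), with p_x replaced
        by the empirical frequency of symbol x in the sample.
        """
        n = len(samples)
        counts = Counter(samples)
        power_sum = sum((c / n) ** alpha for c in counts.values())
        return math.log(power_sum) / (1 - alpha)
    ```

    For a sample drawn from the uniform distribution over two symbols, the estimate approaches $H_2 = \log 2$, as expected.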

    String Reconstruction from Substring Compositions

    Motivated by mass-spectrometry protein sequencing, we consider a simply stated problem of reconstructing a string from the multiset of its substring compositions. We show that all strings of length 7, one less than a prime, or one less than twice a prime can be reconstructed uniquely up to reversal. For all other lengths we show that reconstruction is not always possible and provide sometimes-tight bounds on the largest number of strings with the same multiset of substring compositions. The lower bounds are derived by combinatorial arguments and the upper bounds by algebraic considerations that precisely characterize the set of strings with the same substring compositions in terms of the factorization of bivariate polynomials. The problem can be viewed as a combinatorial simplification of the turnpike problem, and its solution may shed light on that long-standing problem as well. Using well-known results on the transience of multi-dimensional random walks, we also provide a reconstruction algorithm that reconstructs random strings over alphabets of size $\ge 4$ in optimal near-quadratic time.
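    To make the problem statement concrete, the multiset of substring compositions can be computed directly. The sketch below (the function name is mine, not from the paper) also illustrates why reconstruction is only possible up to reversal: a string and its reversal always produce the same multiset.

    ```python
    from collections import Counter

    def substring_compositions(s):
        """Multiset of compositions of all substrings of s.

        A composition records how many times each symbol occurs in a
        substring, ignoring order; frozen Counter items make each
        composition hashable so it can key the outer multiset.
        """
        comps = Counter()
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                comps[frozenset(Counter(s[i:j]).items())] += 1
        return comps
    ```

    A string of length $n$ contributes $n(n+1)/2$ substrings in total, and `substring_compositions("abc")` equals `substring_compositions("cba")`, reflecting the reversal ambiguity.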

    Universal Compression of Power-Law Distributions

    English words and the outputs of many other natural processes are well known to follow a Zipf distribution. Yet this thoroughly established property has never been shown to help compress or predict these important processes. We show that the expected redundancy of Zipf distributions of order $\alpha > 1$ is roughly the $1/\alpha$ power of the expected redundancy of unrestricted distributions. Hence for these orders, Zipf distributions can be compressed and predicted better than was previously known. Unlike the expected case, we show that worst-case redundancy is roughly the same for Zipf and for unrestricted distributions. Hence Zipf distributions have significantly different worst-case and expected redundancies, making them the first natural distribution class shown to have such a difference.
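    For concreteness, a Zipf distribution of order $\alpha$ over $k$ symbols assigns the $i$-th most frequent symbol probability proportional to $i^{-\alpha}$. A small Python sketch (the function name is illustrative, not from the paper):

    ```python
    def zipf_probs(k, alpha):
        """Zipf distribution of order alpha over k ranked symbols:
        p_i proportional to i**(-alpha) for i = 1, ..., k."""
        weights = [i ** (-alpha) for i in range(1, k + 1)]
        total = sum(weights)
        return [w / total for w in weights]
    ```

    For $\alpha > 1$ the normalizing sum converges even as $k$ grows, which is the regime in which the improved expected-redundancy bound above applies.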