Estimating Renyi Entropy of Discrete Distributions
It was recently shown that estimating the Shannon entropy H(p) of a
discrete k-symbol distribution p requires Θ(k/log k) samples,
a number that grows near-linearly in the support size. In many applications
H(p) can be replaced by the more general Rényi entropy of order
α, H_α(p). We determine the number of samples needed to
estimate H_α(p) for all α, showing that α < 1
requires a super-linear, roughly k^{1/α} samples, noninteger α > 1
requires a near-linear k samples, but, perhaps surprisingly, integer α > 1
requires only Θ(k^{1−1/α}) samples. Furthermore,
developing on a recently established connection between polynomial
approximation and estimation of additive functions of the form
Σ_x f(p_x), we reduce the sample complexity for noninteger values of α by a
factor of log k compared to the empirical estimator. The estimators
achieving these bounds are simple and run in time linear in the number of
samples. Our lower bounds provide explicit constructions of distributions with
different Rényi entropies that are hard to distinguish.
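The integer-order case admits a simple bias correction of the plug-in estimator: for integer α ≥ 2, the falling factorial N_x(N_x−1)…(N_x−α+1) of each symbol count, normalized by n(n−1)…(n−α+1), is an unbiased estimate of p_x^α, so summing over symbols estimates the power sum Σ_x p_x^α directly. A minimal Python sketch of this standard idea (the function name is ours, and this is an illustration rather than the paper's exact estimator):

```python
import math
from collections import Counter

def renyi_entropy_integer_order(samples, alpha):
    """Estimate the order-alpha Renyi entropy H_a = log(sum_x p_x^a) / (1 - a)
    for integer alpha >= 2. Uses the unbiased power-sum estimator: for
    multinomial counts, E[N_x (N_x-1) ... (N_x-a+1)] = n (n-1) ... (n-a+1) p_x^a."""
    n = len(samples)
    counts = Counter(samples)
    # Unbiased estimate of the power sum sum_x p_x^alpha.
    num = sum(math.prod(c - i for i in range(alpha)) for c in counts.values())
    den = math.prod(n - i for i in range(alpha))
    power_sum = num / den
    return math.log(power_sum) / (1 - alpha)
```

For a near-uniform sample over k symbols, the estimate approaches H_α = log k for every α, which gives a quick sanity check.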
String Reconstruction from Substring Compositions
Motivated by mass-spectrometry protein sequencing, we consider a
simply-stated problem of reconstructing a string from the multiset of its
substring compositions. We show that all strings of length 7, one less than a
prime, or one less than twice a prime, can be reconstructed uniquely up to
reversal. For all other lengths we show that reconstruction is not always
possible and provide sometimes-tight bounds on the largest number of strings
with given substring compositions. The lower bounds are derived by
combinatorial arguments and the upper bounds by algebraic considerations that
precisely characterize the set of strings with the same substring compositions
in terms of the factorization of bivariate polynomials. The problem can be
viewed as a combinatorial simplification of the turnpike problem, and its
solution may shed light on this long-standing problem as well. Using well-known
results on transience of multi-dimensional random walks, we also provide a
reconstruction algorithm that reconstructs random strings over alphabets of
size ≥ 4 in optimal near-quadratic time.
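To make the problem statement concrete, the following short sketch (naming ours) computes the multiset of substring compositions, where a composition records only how many times each character occurs in a substring. Because compositions ignore character order, a string and its reversal always yield the same multiset, which is why reconstruction is at best unique up to reversal:

```python
from collections import Counter

def substring_compositions(s):
    """Return the multiset of compositions of all contiguous substrings of s.
    Each composition is encoded as a sorted tuple of characters, so the
    multiset of compositions can itself be stored in a Counter."""
    comps = Counter()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            comps[tuple(sorted(s[i:j]))] += 1
    return comps
```

A length-n string contributes n(n+1)/2 substrings in total, so the reconstruction problem asks when this quadratic-size multiset pins down the string.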
Universal Compression of Power-Law Distributions
English words and the outputs of many other natural processes are well known
to follow a Zipf distribution. Yet this thoroughly established property has
never been shown to help compress or predict these important processes. We show
that the expected redundancy of Zipf distributions of order α > 1 is
roughly the 1/α power of the expected redundancy of unrestricted
distributions. Hence for these orders, Zipf distributions can be better
compressed and predicted than was previously known. Unlike the expected case,
we show that worst-case redundancy is roughly the same for Zipf and for
unrestricted distributions. Hence Zipf distributions have significantly
different worst-case and expected redundancies, making them the first natural
distribution class shown to have such a difference.

Comment: 20 pages
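Redundancy here is the gap between a universal coder's code length and the ideal code length under the true source. A generic illustration (not the paper's construction; all names are ours) using the standard Krichevsky-Trofimov add-1/2 sequential estimator on a truncated Zipf source:

```python
import math

def zipf_probs(k, alpha):
    """Truncated Zipf distribution over k symbols: p_i proportional to i^(-alpha)."""
    weights = [i ** -alpha for i in range(1, k + 1)]
    z = sum(weights)
    return [w / z for w in weights]

def kt_code_length(seq, k):
    """Code length in nats assigned to seq by the Krichevsky-Trofimov add-1/2
    sequential estimator over a k-symbol alphabet (symbols 0..k-1)."""
    counts = [0] * k
    n, total = 0, 0.0
    for x in seq:
        total += -math.log((counts[x] + 0.5) / (n + k / 2))
        counts[x] += 1
        n += 1
    return total

def redundancy(seq, probs):
    """Excess of the universal code length over the ideal code length
    -log p(seq) under the true distribution probs."""
    ideal = sum(-math.log(probs[x]) for x in seq)
    return kt_code_length(seq, len(probs)) - ideal
```

Averaging `redundancy` over sequences drawn from `zipf_probs(k, alpha)` gives an empirical estimate of the expected redundancy discussed above; skewed sequences cost the coder far less than balanced ones.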