
    Estimating Renyi Entropy of Discrete Distributions

    It was recently shown that estimating the Shannon entropy $H(p)$ of a discrete $k$-symbol distribution $p$ requires $\Theta(k/\log k)$ samples, a number that grows near-linearly in the support size. In many applications $H(p)$ can be replaced by the more general Rényi entropy of order $\alpha$, $H_\alpha(p)$. We determine the number of samples needed to estimate $H_\alpha(p)$ for all $\alpha$, showing that $\alpha < 1$ requires a super-linear, roughly $k^{1/\alpha}$, number of samples, noninteger $\alpha > 1$ requires a near-linear $k$ samples, but, perhaps surprisingly, integer $\alpha > 1$ requires only $\Theta(k^{1-1/\alpha})$ samples. Furthermore, building on a recently established connection between polynomial approximation and estimation of additive functions of the form $\sum_x f(p_x)$, we reduce the sample complexity for noninteger values of $\alpha$ by a factor of $\log k$ compared to the empirical estimator. The estimators achieving these bounds are simple and run in time linear in the number of samples. Our lower bounds provide explicit constructions of distributions with different Rényi entropies that are hard to distinguish.
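    As a point of reference for the estimators above, the empirical (plug-in) approach simply raises the observed symbol frequencies to the power $\alpha$. A minimal sketch in Python, with the function name and interface chosen here for illustration (the paper's improved estimators apply bias corrections beyond this):

    ```python
    from collections import Counter
    import math

    def empirical_renyi_entropy(samples, alpha):
        """Plug-in Renyi entropy estimate of order alpha (alpha != 1).

        H_alpha = log(sum_x p_x^alpha) / (1 - alpha), with p_x replaced
        by the empirical frequency of symbol x in the sample.
        """
        n = len(samples)
        counts = Counter(samples)
        power_sum = sum((c / n) ** alpha for c in counts.values())
        return math.log(power_sum) / (1 - alpha)
    ```

    For a sample drawn from the uniform distribution over two symbols, the estimate approaches $H_2 = \log 2$, as expected.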

    String Reconstruction from Substring Compositions

    Motivated by mass-spectrometry protein sequencing, we consider a simply stated problem of reconstructing a string from the multiset of its substring compositions. We show that all strings of length 7, one less than a prime, or one less than twice a prime can be reconstructed uniquely up to reversal. For all other lengths we show that reconstruction is not always possible and provide sometimes-tight bounds on the largest number of strings with the same multiset of substring compositions. The lower bounds are derived by combinatorial arguments and the upper bounds by algebraic considerations that precisely characterize the set of strings with the same substring compositions in terms of the factorization of bivariate polynomials. The problem can be viewed as a combinatorial simplification of the turnpike problem, and its solution may shed light on that long-standing problem as well. Using well-known results on the transience of multi-dimensional random walks, we also provide a reconstruction algorithm that reconstructs random strings over alphabets of size $\ge 4$ in optimal near-quadratic time.
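    To make the problem statement concrete, the multiset of substring compositions can be computed directly. The sketch below (the function name is mine, not from the paper) also illustrates why reconstruction is only possible up to reversal: a string and its reversal always produce the same multiset.

    ```python
    from collections import Counter

    def substring_compositions(s):
        """Multiset of compositions of all substrings of s.

        A composition records how many times each symbol occurs in a
        substring, ignoring order; frozen Counter items make each
        composition hashable so it can key the outer multiset.
        """
        comps = Counter()
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                comps[frozenset(Counter(s[i:j]).items())] += 1
        return comps
    ```

    A string of length $n$ contributes $n(n+1)/2$ substrings in total, and `substring_compositions("abc")` equals `substring_compositions("cba")`, reflecting the reversal ambiguity.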

    Universal Compression of Power-Law Distributions

    English words and the outputs of many other natural processes are well known to follow a Zipf distribution. Yet this thoroughly established property has never been shown to help compress or predict these important processes. We show that the expected redundancy of Zipf distributions of order $\alpha > 1$ is roughly the $1/\alpha$ power of the expected redundancy of unrestricted distributions. Hence for these orders, Zipf distributions can be compressed and predicted better than was previously known. Unlike the expected case, we show that worst-case redundancy is roughly the same for Zipf and for unrestricted distributions. Hence Zipf distributions have significantly different worst-case and expected redundancies, making them the first natural distribution class shown to have such a difference.
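    For concreteness, a Zipf distribution of order $\alpha$ over $k$ symbols assigns the $i$-th most frequent symbol probability proportional to $i^{-\alpha}$. A small Python sketch (the function name is illustrative, not from the paper):

    ```python
    def zipf_probs(k, alpha):
        """Zipf distribution of order alpha over k ranked symbols:
        p_i proportional to i**(-alpha) for i = 1, ..., k."""
        weights = [i ** (-alpha) for i in range(1, k + 1)]
        total = sum(weights)
        return [w / total for w in weights]
    ```

    For $\alpha > 1$ the normalizing sum converges even as $k$ grows, which is the regime in which the improved expected-redundancy bound above applies.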