
    Missing $g$-mass: Investigating the Missing Parts of Distributions

    Estimating the underlying distribution from i.i.d. samples is a classical and important problem in statistics. When the alphabet size is large compared to the number of samples, a portion of the distribution is highly likely to be unobserved or sparsely observed. The missing mass, defined as the sum of probabilities $\text{Pr}(x)$ over the missing letters $x$, and the Good-Turing estimator for the missing mass have been important tools in large-alphabet distribution estimation. In this article, given a positive function $g$ from $[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(\text{Pr}(x))$ over the missing letters $x$, is introduced and studied. The missing $g$-mass can be used to investigate the structure of the missing part of the distribution. Specific applications for special cases such as the order-$\alpha$ missing mass ($g(p)=p^{\alpha}$) and the missing Shannon entropy ($g(p)=-p\log p$) include estimating the distance from uniformity of the missing distribution and its partial estimation. Minimax estimation is studied for the order-$\alpha$ missing mass for integer values of $\alpha$, and exact minimax convergence rates are obtained. Concentration is studied for a class of functions $g$, and specific results are derived for the order-$\alpha$ missing mass and the missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal worst-case variance factors are derived. Two new notions of concentration, named strongly sub-Gamma and filtered sub-Gaussian concentration, are introduced and shown to yield right-tail bounds that are better than those obtained from sub-Gaussian concentration.
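    As a concrete illustration of these definitions, the minimal sketch below draws a sample from a toy Zipf-like distribution, computes the true missing $g$-mass for several choices of $g$, and compares the classical missing mass ($g(p)=p$) with its Good-Turing estimate $N_1/n$. The alphabet size, sample size, and distribution are illustrative assumptions; the paper's minimax estimators for the order-$\alpha$ missing mass are not reproduced here.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# Toy large-alphabet distribution: Zipf-like over k symbols (illustrative choice).
k = 10_000
p = 1.0 / np.arange(1, k + 1)
p /= p.sum()

n = 2_000  # sample size small relative to the alphabet
sample = rng.choice(k, size=n, p=p)
counts = Counter(sample.tolist())

seen = set(counts)
missing = np.array([j for j in range(k) if j not in seen])

def missing_g_mass(g):
    """True missing g-mass: the sum of g(Pr(x)) over unobserved letters x."""
    return g(p[missing]).sum()

# Classical missing mass (g(p) = p) and its Good-Turing estimate N1 / n,
# where N1 is the number of symbols seen exactly once in the sample.
true_m0 = missing_g_mass(lambda q: q)
n1 = sum(1 for c in counts.values() if c == 1)
good_turing = n1 / n

# Order-2 missing mass, g(p) = p**2 (true value only).
true_m2 = missing_g_mass(lambda q: q**2)
# Missing Shannon entropy, g(p) = -p log p.
true_mH = missing_g_mass(lambda q: -q * np.log(q))

print(f"true missing mass    : {true_m0:.4f}")
print(f"Good-Turing estimate : {good_turing:.4f}")
print(f"true order-2 missing : {true_m2:.6f}")
print(f"true missing entropy : {true_mH:.4f}")
```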

    Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications

    An infinite urn scheme is defined by a probability mass function $(p_j)_{j \geq 1}$ over the positive integers. A random allocation consists of a sample of $N$ independent drawings according to this probability distribution, where $N$ may be deterministic or Poisson-distributed. This paper is concerned with occupancy counts, that is, with the number of symbols with $r$ or at least $r$ occurrences in the sample, and with the missing mass, that is, the total probability of all symbols that do not occur in the sample. Without any further assumption on the sampling distribution, these random quantities are shown to satisfy Bernstein-type concentration inequalities. The variance factors in these concentration inequalities are shown to be tight if the sampling distribution satisfies a regular variation property, which reads as follows. Let the number of symbols with probability larger than $x$ be $\vec{\nu}(x) = |\{j : p_j \geq x\}|$. In a regularly varying urn scheme, $\vec{\nu}$ satisfies $\lim_{\tau \to 0} \vec{\nu}(\tau x)/\vec{\nu}(\tau) = x^{-\alpha}$ for $\alpha \in [0,1]$, and the variance of the number of distinct symbols in a sample tends to infinity as the sample size tends to infinity. Among other applications, these concentration inequalities allow us to derive tight confidence intervals for the Good--Turing estimator of the missing mass.
    Published at http://dx.doi.org/10.3150/15-BEJ743 in Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
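    Below is a minimal simulation of a regularly varying urn scheme, assuming the standard power-law example $p_j \propto j^{-1/\alpha}$ (truncated to a finite support so it can be sampled), for which $\vec{\nu}(x)$ varies like $x^{-\alpha}$. It tabulates the occupancy counts, compares the missing mass with its Good-Turing estimate, and checks the concentration of the Good-Turing error empirically over repeated samples; the paper's exact variance factors are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Regularly varying urn: p_j proportional to j**(-1/alpha), so that
# nu(x) = |{j : p_j >= x}| varies like x**(-alpha). Here alpha = 0.5.
alpha = 0.5
J = 100_000                          # truncation of the infinite support
p = np.arange(1, J + 1, dtype=float) ** (-1.0 / alpha)
p /= p.sum()

N = 5_000
sample = rng.choice(J, size=N, p=p)
counts = np.bincount(sample, minlength=J)
occupied = counts > 0

# Occupancy counts: K_r[r] = number of symbols occurring exactly r times;
# K_distinct = number of distinct symbols in the sample.
K_r = np.bincount(counts[occupied])
K_distinct = int(occupied.sum())
n1 = int(K_r[1]) if len(K_r) > 1 else 0

# Missing mass and its Good-Turing estimator N1 / N.
true_m0 = p[~occupied].sum()
gt = n1 / N

print(f"distinct symbols     : {K_distinct}")
print(f"symbols seen once    : {n1}")
print(f"true missing mass    : {true_m0:.4f}")
print(f"Good-Turing estimate : {gt:.4f}")

# Empirical concentration check: the spread of the Good-Turing error over
# repeated samples, which the paper controls with Bernstein-type bounds.
errors = []
for _ in range(200):
    c = np.bincount(rng.choice(J, size=N, p=p), minlength=J)
    errors.append((c == 1).sum() / N - p[c == 0].sum())
errors = np.array(errors)
print(f"Good-Turing error    : mean {errors.mean():+.5f}, std {errors.std():.5f}")
```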

    On the Impossibility of Learning the Missing Mass

    This paper shows that one cannot learn the probability of rare events without imposing further structural assumptions. The event of interest is that of obtaining an outcome outside the coverage of an i.i.d. sample from a discrete distribution. The probability of this event is referred to as the “missing mass”. The impossibility result can then be stated as follows: the missing mass is not distribution-free learnable in relative error. The proof is semi-constructive and relies on a coupling argument using a dithered geometric distribution. Via a reduction, this impossibility also extends to both discrete and continuous tail estimation. These results formalize the folklore that in order to predict rare events without restrictive modeling, one necessarily needs distributions with “heavy tails”.
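    The impossibility concerns relative error, and light-tailed distributions are where it bites. The sketch below is an illustrative experiment, not the paper's proof: it uses a plain geometric source rather than the dithered construction of the coupling argument, and suggests that the relative error of the Good-Turing estimate stays away from zero as the sample size grows, even though the absolute error shrinks.

```python
import numpy as np

rng = np.random.default_rng(2)
q = 0.5  # light-tailed geometric source: Pr(j) = (1-q) * q**j on {0, 1, ...}

def relative_error(n):
    """Relative error of Good-Turing for the missing mass on one sample."""
    sample = rng.geometric(1 - q, size=n) - 1      # shift support to {0, 1, ...}
    counts = np.bincount(sample, minlength=int(sample.max()) + 200)
    probs = (1 - q) * q ** np.arange(len(counts))
    # True missing mass; the tail is truncated 200 symbols past the sample
    # maximum, so the neglected remainder is vanishingly small.
    m0 = probs[counts == 0].sum()
    gt = (counts == 1).sum() / n                   # Good-Turing estimate N1 / n
    return abs(gt - m0) / m0

# Does the relative error shrink with n? For this light-tailed source it
# should not, in line with the folklore the paper formalizes.
for n in (10**3, 10**4, 10**5):
    errs = [relative_error(n) for _ in range(100)]
    print(f"n = {n:>6}: mean relative error {np.mean(errs):.2f}")
```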