Missing $g$-mass: Investigating the Missing Parts of Distributions
Estimating the underlying distribution from \textit{iid} samples is a
classical and important problem in statistics. When the alphabet size is large
compared to the number of samples, a portion of the distribution is highly
likely to be unobserved or sparsely observed. The missing mass, defined as the
sum of probabilities over the missing letters, and the Good-Turing estimator
for the missing mass have been important tools in large-alphabet distribution
estimation. In this article, given a positive function $g$ from $[0,1]$ to the
reals, the missing $g$-mass, defined as the sum of $g(p_x)$ over the missing
letters $x$, is introduced and studied. The missing $g$-mass can be used to
investigate the structure of the missing part of the distribution. Specific
applications for special cases such as the order-$\alpha$ missing mass
($g(p)=p^{\alpha}$) and the missing Shannon entropy ($g(p)=-p\log p$) include
estimating the distance from uniformity of the missing distribution and its
partial estimation. Minimax estimation is studied for the order-$\alpha$
missing mass for integer values of $\alpha$, and exact minimax convergence
rates are obtained. Concentration is studied for a class of functions $g$, and
specific results are derived for the order-$\alpha$ missing mass and the
missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal worst-case
variance factors are derived. Two new notions of concentration, named strongly
sub-Gamma and filtered sub-Gaussian concentration, are introduced and shown to
yield right tail bounds that are better than those obtained from sub-Gaussian
concentration.
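The special cases above can be illustrated numerically. In this sketch (the
function names and the uniform test distribution are my own choices, not the
article's), the missing $g$-mass is computed against a known distribution for
$g(p)=p$ (the ordinary missing mass) and $g(p)=-p\log p$ (the missing Shannon
entropy), alongside the classical Good-Turing estimate:

```python
import collections
import math
import random

def missing_g_mass(dist, sample, g):
    """Sum g(p_x) over letters x that never appear in the sample.
    Illustrative only: unlike an estimator, this helper assumes the
    true distribution `dist` is known."""
    seen = set(sample)
    return sum(g(p) for x, p in dist.items() if x not in seen)

random.seed(0)
K, n = 1000, 200                        # alphabet much larger than sample
dist = {x: 1.0 / K for x in range(K)}   # uniform over K letters
sample = random.choices(list(dist), k=n)  # n iid draws

m1 = missing_g_mass(dist, sample, lambda p: p)                 # g(p) = p
h = missing_g_mass(dist, sample, lambda p: -p * math.log(p))   # g(p) = -p log p

# Good-Turing estimator of the ordinary missing mass:
# (number of letters appearing exactly once) / n.
singletons = sum(1 for c in collections.Counter(sample).values() if c == 1)
gt = singletons / n
```

With the alphabet five times larger than the sample, a sizeable fraction of
the mass goes unobserved, and the Good-Turing estimate `gt` tracks the true
missing mass `m1` closely.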
Concentration inequalities in the infinite urn scheme for occupancy counts and the missing mass, with applications
An infinite urn scheme is defined by a probability mass function
$(p_j)_{j \geq 1}$ over the positive integers. A random allocation consists of
a sample of $n$ independent drawings according to this probability
distribution, where $n$ may be deterministic or Poisson-distributed. This
paper is concerned with occupancy counts, that is, with the number of symbols
with $r$ or at least $r$ occurrences in the sample, and with the missing mass,
that is, the total probability of all symbols that do not occur in the sample.
Without any further assumption on the sampling distribution, these random
quantities are shown to satisfy Bernstein-type concentration inequalities. The
variance factors in these concentration inequalities are shown to be tight if
the sampling distribution satisfies a regular variation property. This regular
variation property reads as follows. Let the number of symbols with
probability larger than $x$ be $\vec{\nu}(x) = |\{ j : p_j \geq x \}|$. In a
regularly varying urn scheme, $\vec{\nu}$ satisfies
$\lim_{\tau \to 0} \vec{\nu}(\tau x)/\vec{\nu}(\tau) = x^{-\alpha}$ for
$\alpha \in [0,1]$, and the variance of the number of distinct symbols in a
sample tends to infinity as the sample size tends to infinity. Among other
applications, these concentration inequalities allow us to derive tight
confidence intervals for the Good--Turing estimator of the missing mass.

Comment: Published at http://dx.doi.org/10.3150/15-BEJ743 in the Bernoulli
(http://isi.cbs.nl/bernoulli/) by the International Statistical
Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm).
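As a rough illustration (not from the paper; the geometric sampling
distribution and all names below are my own), this sketch simulates an
infinite urn scheme with $p_j = (1-q)q^{j-1}$, tallies the occupancy counts
$K_r$ (symbols seen exactly $r$ times), and compares the missing mass with its
Good-Turing estimate $K_1/n$:

```python
import collections
import math
import random

def occupancy_counts(sample):
    """Map r -> number of distinct symbols with exactly r occurrences."""
    freq = collections.Counter(sample)
    return dict(collections.Counter(freq.values()))

random.seed(1)
q, n = 0.97, 500
# Inverse-transform sampling from the geometric pmf p_j = (1-q) q^(j-1), j >= 1.
sample = [1 + int(math.log(random.random()) // math.log(q)) for _ in range(n)]

counts = occupancy_counts(sample)

# Missing mass: total probability of all symbols absent from the sample.
seen = set(sample)
missing = 1.0 - sum((1 - q) * q ** (j - 1) for j in seen)

# Good-Turing estimate: K_1 / n.
gt = counts.get(1, 0) / n
```

Because the support is infinite, only finitely many symbols ever appear, so
the missing mass is strictly positive for every sample size; the geometric pmf
is also a standard example on which the concentration bounds can be checked.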
On the Impossibility of Learning the Missing Mass
This paper shows that one cannot learn the probability of rare events without imposing further structural assumptions. The event of interest is that of obtaining an outcome outside the coverage of an i.i.d. sample from a discrete distribution. The probability of this event is referred to as the “missing mass”. The impossibility result can then be stated as: the missing mass is not distribution-free learnable in relative error. The proof is semi-constructive and relies on a coupling argument using a dithered geometric distribution. Via a reduction, this impossibility also extends to both discrete and continuous tail estimation. These results formalize the folklore that in order to predict rare events without restrictive modeling, one necessarily needs distributions with “heavy tails”.