Missing $g$-mass: Investigating the Missing Parts of Distributions

Authors: Chandra Prafulla, Thangaraj Andrew
Publication date: 27/05/2023

Estimating the underlying distribution from i.i.d. samples is a classical and important problem in statistics. When the alphabet size is large compared to the number of samples, a portion of the distribution is highly likely to be unobserved or sparsely observed. The missing mass, defined as the sum of probabilities $\text{Pr}(x)$ over the missing letters $x$, and the Good-Turing estimator for missing mass have been important tools in large-alphabet distribution estimation. In this article, given a positive function $g$ from $[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(\text{Pr}(x))$ over the missing letters $x$, is introduced and studied. The missing $g$-mass can be used to investigate the structure of the missing part of the distribution. Specific applications for special cases such as order-$\alpha$ missing mass ($g(p)=p^{\alpha}$) and the missing Shannon entropy ($g(p)=-p\log p$) include estimating distance from uniformity of the missing distribution and its partial estimation. Minimax estimation is studied for order-$\alpha$ missing mass for integer values of $\alpha$ and exact minimax convergence rates are obtained. Concentration is studied for a class of functions $g$ and specific results are derived for order-$\alpha$ missing mass and missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal worst-case variance factors are derived. Two new notions of concentration, named strongly sub-Gamma and filtered sub-Gaussian concentration, are introduced and shown to result in right tail bounds that are better than those obtained from sub-Gaussian concentration.
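
To make the definitions in the abstract concrete, here is a minimal Python sketch that evaluates the missing $g$-mass on a toy distribution and compares the ordinary missing mass with the classical Good-Turing estimate (number of letters seen exactly once, divided by the sample size). The function names, the four-letter distribution, the sample, and the use of the natural logarithm for the entropy case are all illustrative assumptions, not taken from the paper.

```python
from collections import Counter
import math


def missing_g_mass(probs, sample, g):
    """Sum of g(Pr(x)) over letters x that never appear in the sample.

    `probs` maps each letter to its true probability (hypothetical input;
    in practice the true distribution is unknown and this is the quantity
    one wants to estimate).
    """
    seen = set(sample)
    return sum(g(p) for x, p in probs.items() if x not in seen)


def good_turing_missing_mass(sample):
    """Classical Good-Turing estimate: (# letters seen exactly once) / n."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


def shannon_term(p):
    """g(p) = -p log p, with the usual convention g(0) = 0 (natural log assumed)."""
    return 0.0 if p == 0 else -p * math.log(p)


# Toy alphabet and sample (illustrative values only).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
sample = ["a", "a", "b", "a"]  # letters "c" and "d" are missing

missing_mass = missing_g_mass(probs, sample, lambda p: p)          # g(p) = p
order2_mass = missing_g_mass(probs, sample, lambda p: p ** 2)      # g(p) = p^2
missing_entropy = missing_g_mass(probs, sample, shannon_term)      # g(p) = -p log p
gt_estimate = good_turing_missing_mass(sample)

print(missing_mass, order2_mass, missing_entropy, gt_estimate)
```

In this toy case the true missing mass is 0.25 and the Good-Turing estimate also happens to be 0.25, since only "b" appears exactly once among four samples; on other samples the two generally differ. The estimators and concentration bounds studied in the paper itself are not reproduced here.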

arXiv.org e-Print Archive