Missing Mass of Rank-2 Markov Chains

Authors: Chandra Prafulla; Rajaraman Nived; Thangaraj Andrew
Publication date: 06/02/2021

Estimation of missing mass with the popular Good-Turing (GT) estimator is well understood in the case where samples are independent and identically distributed (iid). In this article, we consider the same problem when the samples come from a stationary Markov chain with a rank-2 transition matrix, which is one of the simplest extensions of the iid case. We develop an upper bound on the absolute bias of the GT estimator in terms of the spectral gap of the chain and a tail bound on the occupancy of states. Borrowing tail bounds from known concentration results for Markov chains, we evaluate the bound using other parameters of the chain. The analysis, supported by simulations, suggests that, for rank-2 irreducible chains, the GT estimator has bias and mean-squared error falling with the number of samples at a rate that depends loosely on the connectivity of the states in the chain.
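As a rough illustration of the quantity discussed in this abstract, the following is a minimal sketch of the standard Good-Turing missing-mass estimate (the number of symbols seen exactly once, divided by the sample size) next to the true missing mass for a known distribution. The function names `good_turing_missing_mass` and `true_missing_mass` are hypothetical and not taken from the paper. The sketch does not model the rank-2 Markov chain setting; the estimator is computed the same way for any sample sequence, and only its bias analysis depends on how the samples were generated.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Good-Turing estimate: (number of symbols seen exactly once) / n."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


def true_missing_mass(samples, probs):
    """Sum of probabilities of the symbols that never appear in the sample.

    `probs` maps each symbol to its probability (assumed known here,
    purely for illustration).
    """
    seen = set(samples)
    return sum(p for x, p in probs.items() if x not in seen)


if __name__ == "__main__":
    # Toy distribution and sample, chosen only for illustration.
    probs = {"a": 0.5, "b": 0.25, "c": 0.25}
    samples = ["a", "b", "a", "a"]
    print(good_turing_missing_mass(samples))  # "b" is the only singleton
    print(true_missing_mass(samples, probs))  # "c" is the only unseen symbol
```

In this toy sample both quantities happen to be equal; in general the estimate only approximates the true missing mass.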

arXiv.org e-Print Archive

Missing $g$-mass: Investigating the Missing Parts of Distributions

Authors: Chandra Prafulla; Thangaraj Andrew
Publication date: 27/05/2023

Estimating the underlying distribution from iid samples is a classical and important problem in statistics. When the alphabet size is large compared to number of samples, a portion of the distribution is highly likely to be unobserved or sparsely observed. The missing mass, defined as the sum of probabilities $\text{Pr}(x)$ over the missing letters $x$, and the Good-Turing estimator for missing mass have been important tools in large-alphabet distribution estimation. In this article, given a positive function $g$ from $[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(\text{Pr}(x))$ over the missing letters $x$, is introduced and studied. The missing $g$-mass can be used to investigate the structure of the missing part of the distribution. Specific applications for special cases such as order-$\alpha$ missing mass ($g(p)=p^{\alpha}$) and the missing Shannon entropy ($g(p)=-p\log p$) include estimating distance from uniformity of the missing distribution and its partial estimation. Minimax estimation is studied for order-$\alpha$ missing mass for integer values of $\alpha$ and exact minimax convergence rates are obtained. Concentration is studied for a class of functions $g$ and specific results are derived for order-$\alpha$ missing mass and missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal worst-case variance factors are derived. Two new notions of concentration, named strongly sub-Gamma and filtered sub-Gaussian concentration, are introduced and shown to result in right tail bounds that are better than those obtained from sub-Gaussian concentration.

arXiv.org e-Print Archive

On consistent and rate optimal estimation of the missing mass

Authors: Ayed Fadhel; Battiston Marco; Camerlenghi Federico; Favaro Stefano
Publication date: 12/11/2020

Lancaster E-Prints

Institutional Research Information System University of Turin

Minimax risk for missing mass estimation

Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref