Missing Mass of Rank-2 Markov Chains
Estimation of missing mass with the popular Good-Turing (GT) estimator is
well-understood in the case where samples are independent and identically
distributed (iid). In this article, we consider the same problem when the
samples come from a stationary Markov chain with a rank-2 transition matrix,
which is one of the simplest extensions of the iid case. We develop an upper
bound on the absolute bias of the GT estimator in terms of the spectral gap of
the chain and a tail bound on the occupancy of states. Borrowing tail bounds
from known concentration results for Markov chains, we evaluate the bound using
other parameters of the chain. The analysis, supported by simulations, suggests
that, for rank-2 irreducible chains, the GT estimator has bias and mean-squared
error falling with the number of samples at a rate that depends loosely on the
connectivity of the states in the chain.
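
As a rough illustration of the setting described above, the following sketch computes the Good-Turing estimate of missing mass (the fraction of samples whose symbol appears exactly once) on a path drawn from a chain with a rank-2 transition matrix. The helper names `rank2_transition_matrix` and `simulate_chain`, the mixture construction of the matrix, and the fixed starting state are assumptions made for illustration; they are not taken from the article, and a fixed start only approximates the stationary case it studies.

```python
import random
from collections import Counter


def good_turing_missing_mass(samples):
    """Good-Turing estimate: number of symbols seen exactly once, divided by n."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


def rank2_transition_matrix(u, v, weights):
    """Hypothetical construction: row i is weights[i]*u + (1-weights[i])*v.

    Every row is a mixture of the two distributions u and v,
    so the matrix is row-stochastic with rank at most 2.
    """
    return [[w * ui + (1 - w) * vi for ui, vi in zip(u, v)] for w in weights]


def simulate_chain(P, n, rng, start=0):
    """Draw a path of length n; starts from a fixed state (an assumption),
    not from the stationary distribution."""
    states = list(range(len(P)))
    path = [start]
    for _ in range(n - 1):
        path.append(rng.choices(states, weights=P[path[-1]])[0])
    return path


if __name__ == "__main__":
    u = [0.5, 0.5, 0.0, 0.0]
    v = [0.0, 0.0, 0.5, 0.5]
    P = rank2_transition_matrix(u, v, weights=[0.9, 0.9, 0.1, 0.1])
    path = simulate_chain(P, n=20, rng=random.Random(0))
    print(good_turing_missing_mass(path))
```

With only four states the estimate quickly drops to zero as the path grows; a larger state space would be needed to see the slower decay that the abstract relates to connectivity.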
Missing $g$-mass: Investigating the Missing Parts of Distributions
Estimating the underlying distribution from \textit{iid} samples is a
classical and important problem in statistics. When the alphabet size is large
compared to the number of samples, a portion of the distribution is highly likely
to be unobserved or sparsely observed. The missing mass, defined as the sum of
probabilities $\Pr(x)$ over the missing letters $x$, and the Good-Turing
estimator for missing mass have been important tools in large-alphabet
distribution estimation. In this article, given a positive function $g$ from
$[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(\Pr(x))$
over the missing letters $x$, is introduced and studied. The
missing $g$-mass can be used to investigate the structure of the missing part
of the distribution. Specific applications for special cases such as
order-$\alpha$ missing mass ($g(p)=p^{\alpha}$) and the missing Shannon entropy
($g(p)=-p\log p$) include estimating distance from uniformity of the missing
distribution and its partial estimation. Minimax estimation is studied for
order-$\alpha$ missing mass for integer values of $\alpha$ and exact minimax
convergence rates are obtained. Concentration is studied for a class of
functions $g$ and specific results are derived for order-$\alpha$ missing mass
and missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal
worst-case variance factors are derived. Two new notions of concentration,
named strongly sub-Gamma and filtered sub-Gaussian concentration, are
introduced and shown to result in right tail bounds that are better than those
obtained from sub-Gaussian concentration.
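
To make the definition concrete, here is a minimal sketch that evaluates the missing mass of a function over the letters absent from a sample, given a known distribution. It assumes the special cases take the forms g(p) = p^alpha for the order-alpha missing mass and g(p) = -p log p for the missing Shannon entropy; the function names and the toy distribution are hypothetical and chosen only for illustration. In practice the distribution is unknown and these quantities must be estimated, which is the problem the article addresses.

```python
import math


def missing_g_mass(probs, samples, g):
    """Sum g(Pr(x)) over letters x that never appear in the samples.

    probs: dict mapping each letter to its (known) probability.
    """
    seen = set(samples)
    return sum(g(p) for x, p in probs.items() if x not in seen)


def order_alpha(alpha):
    """Assumed form g(p) = p**alpha for the order-alpha missing mass."""
    return lambda p: p ** alpha


def shannon_term(p):
    """Assumed form g(p) = -p*log(p), with the convention g(0) = 0."""
    return -p * math.log(p) if p > 0 else 0.0


if __name__ == "__main__":
    probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # toy distribution
    samples = ["a", "b", "a"]                              # "c" and "d" are missing
    print(missing_g_mass(probs, samples, order_alpha(1)))  # ordinary missing mass
    print(missing_g_mass(probs, samples, order_alpha(2)))  # order-2 missing mass
    print(missing_g_mass(probs, samples, shannon_term))    # missing Shannon entropy
```

Taking alpha = 1 recovers the ordinary missing mass, so the same routine covers the classical quantity and its generalizations.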