Missing $g$-mass: Investigating the Missing Parts of Distributions
Estimating the underlying distribution from \textit{iid} samples is a
classical and important problem in statistics. When the alphabet size is large
compared to the number of samples, a portion of the distribution is highly likely
to be unobserved or sparsely observed. The missing mass, defined as the sum of
probabilities $\Pr(x)$ over the missing letters $x$, and the Good-Turing
estimator for missing mass have been important tools in large-alphabet
distribution estimation. In this article, given a positive function $g$ from
$[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(\Pr(x))$
over the missing letters $x$, is introduced and studied. The
missing $g$-mass can be used to investigate the structure of the missing part
of the distribution. Specific applications for special cases such as
order-$\alpha$ missing mass ($g(p) = p^\alpha$) and the missing Shannon entropy
($g(p) = -p \log p$) include estimating distance from uniformity of the missing
distribution and its partial estimation. Minimax estimation is studied for
order-$\alpha$ missing mass for integer values of $\alpha$ and exact minimax
convergence rates are obtained. Concentration is studied for a class of
functions $g$ and specific results are derived for order-$\alpha$ missing mass
and missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal
worst-case variance factors are derived. Two new notions of concentration,
named strongly sub-Gamma and filtered sub-Gaussian concentration, are
introduced and shown to result in right tail bounds that are better than those
obtained from sub-Gaussian concentration.
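The quantities named in this abstract can be illustrated with a minimal sketch. It assumes a toy Zipf-like distribution on 1000 letters and a sample of size 500, neither of which comes from the paper. The function names `good_turing_missing_mass` and `missing_g_mass` are hypothetical. The Good-Turing estimate used here is the standard one (the fraction of the sample made up of letters seen exactly once). The missing $g$-mass is computed directly from the known toy distribution, which is only possible in simulation.

```python
import math
import random
from collections import Counter

def good_turing_missing_mass(sample):
    """Standard Good-Turing estimate: (number of letters seen exactly once) / n."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)

def missing_g_mass(probs, sample, g):
    """Sum of g(p_x) over the letters x that do not appear in the sample."""
    seen = set(sample)
    return sum(g(p) for x, p in probs.items() if x not in seen)

# Toy distribution (an assumption for illustration): p_k proportional to 1/k.
alphabet = range(1, 1001)
weights = [1.0 / k for k in alphabet]
total = sum(weights)
probs = {k: w / total for k, w in zip(alphabet, weights)}

# Draw an iid sample with a fixed seed so the run is reproducible.
rng = random.Random(0)
sample = rng.choices(list(probs), weights=list(probs.values()), k=500)

true_missing = missing_g_mass(probs, sample, lambda p: p)        # g(p) = p
order2 = missing_g_mass(probs, sample, lambda p: p ** 2)         # g(p) = p^2
entropy = missing_g_mass(probs, sample, lambda p: -p * math.log(p))
gt = good_turing_missing_mass(sample)

print(f"missing mass      : {true_missing:.4f}")
print(f"Good-Turing       : {gt:.4f}")
print(f"order-2 missing   : {order2:.6f}")
print(f"missing entropy   : {entropy:.4f}")
```

Comparing `true_missing` with `gt` across seeds gives a rough feel for how the Good-Turing estimate tracks the missing mass; the sketch says nothing about the minimax rates or tail bounds discussed in the abstract.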
Selective machine learning of doubly robust functionals
While model selection is a well-studied topic in parametric and nonparametric
regression or density estimation, selection of possibly high-dimensional
nuisance parameters in semiparametric problems is far less developed. In this
paper, we propose a selective machine learning framework for making inferences
about a finite-dimensional functional defined on a semiparametric model, when
the latter admits a doubly robust estimating function and several candidate
machine learning algorithms are available for estimating the nuisance
parameters. We introduce two new selection criteria for bias reduction in
estimating the functional of interest, each based on a novel definition of
pseudo-risk for the functional that embodies the double robustness property and
thus is used to select the pair of learners that is nearest to fulfilling this
property. We establish an oracle property for a multi-fold cross-validation
version of the new selection criteria which states that our empirical criteria
perform nearly as well as an oracle with a priori knowledge of the pseudo-risk
for each pair of candidate learners. We also describe a smooth approximation to
the selection criteria which allows for valid post-selection inference.
Finally, we apply the approach to model selection of a semiparametric estimator
of average treatment effect given an ensemble of candidate machine learners to
account for confounding in an observational study.
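As a rough illustration of what a doubly robust estimating function looks like for the average treatment effect, the sketch below uses the well-known augmented inverse-probability-weighted (AIPW) form. It does not implement the selection criteria or the pseudo-risk from this abstract, whose definitions are not given here. The simulated data, the true effect of 2, and the function name `aipw_ate` are all assumptions made for illustration.

```python
import random

def aipw_ate(y, a, pi_hat, m1_hat, m0_hat):
    """AIPW estimate of the average treatment effect.

    y: outcomes, a: binary treatments, pi_hat: estimated propensity scores,
    m1_hat / m0_hat: estimated outcome regressions under treatment / control.
    """
    total = 0.0
    for yi, ai, pi, m1, m0 in zip(y, a, pi_hat, m1_hat, m0_hat):
        total += (m1 - m0
                  + ai * (yi - m1) / pi
                  - (1 - ai) * (yi - m0) / (1 - pi))
    return total / len(y)

# Simulated observational study (assumed): one confounder x, true effect = 2.
rng = random.Random(1)
n = 20000
x = [rng.random() for _ in range(n)]
true_pi = [0.3 + 0.4 * xi for xi in x]                 # treatment depends on x
a = [1 if rng.random() < p else 0 for p in true_pi]
y = [2.0 * ai + xi + rng.gauss(0.0, 1.0) for ai, xi in zip(a, x)]

# Correct propensity score, deliberately wrong outcome model (all zeros).
est_good_pi = aipw_ate(y, a, true_pi, [0.0] * n, [0.0] * n)

# Deliberately wrong propensity score (constant 0.5), correct outcome model.
est_good_m = aipw_ate(y, a, [0.5] * n, [2.0 + xi for xi in x], list(x))

print(f"correct propensity, wrong outcome model: {est_good_pi:.3f}")
print(f"wrong propensity, correct outcome model: {est_good_m:.3f}")
```

Both estimates should land close to the simulated effect, which is the double robustness property: the estimator stays consistent when either nuisance model is correct. Choosing among candidate learners for the two nuisance functions is the problem the abstract addresses.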
Optimal estimation of high-order missing masses, and the rare-type match problem
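Consider a random sample $\mathbf{X}_n = (X_1, \ldots, X_n)$ from an unknown discrete
distribution $P = \sum_{j \geq 1} p_j \delta_{s_j}$ on a countable alphabet
$\mathbb{S}$, and let $(Y_{n,j})_{j \geq 1}$ be the empirical frequencies of
the distinct symbols $s_j$ in the sample. We consider the problem of estimating
the $r$-order missing mass, which is a discrete functional of $P$ defined as
$$\theta_r(P; \mathbf{X}_n) = \sum_{j \geq 1} p_j^r \, I(Y_{n,j} = 0).$$
This is a
generalization of the missing mass, whose estimation is a classical problem in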
statistics, being the subject of numerous studies both in theory and methods.
First, we introduce a nonparametric estimator of the $r$-order missing mass
and a corresponding non-asymptotic confidence interval through concentration
properties of this functional. Then, we investigate minimax
estimation of the $r$-order missing mass, which is the main contribution of
our work. We show that minimax estimation is not feasible over the class of all
discrete distributions on the alphabet, and not even for distributions with
regularly varying tails, which only guarantee that our estimator is
consistent. This leads us to introduce the stronger
assumption of second-order regular variation for the tail behaviour of the
distribution, which is proved to be sufficient for minimax estimation of
the $r$-order missing mass, making the proposed estimator an optimal minimax
estimator. Our interest in the $r$-order
missing mass arises from forensic statistics, where the estimation of a
high-order missing mass appears in connection to the estimation of the
likelihood ratio, a problem
known as the "fundamental problem of forensic mathematics". We present
theoretical guarantees for nonparametric estimation of
…