Missing $g$-mass: Investigating the Missing Parts of Distributions

Authors: Chandra Prafulla, Thangaraj Andrew
Publication date: 27/05/2023

Estimating the underlying distribution from i.i.d. samples is a classical and important problem in statistics. When the alphabet size is large compared to the number of samples, a portion of the distribution is highly likely to be unobserved or sparsely observed. The missing mass, defined as the sum of probabilities $\text{Pr}(x)$ over the missing letters $x$, and the Good-Turing estimator for missing mass have been important tools in large-alphabet distribution estimation. In this article, given a positive function $g$ from $[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(\text{Pr}(x))$ over the missing letters $x$, is introduced and studied. The missing $g$-mass can be used to investigate the structure of the missing part of the distribution. Specific applications for special cases such as order-$\alpha$ missing mass ($g(p)=p^{\alpha}$) and the missing Shannon entropy ($g(p)=-p\log p$) include estimating distance from uniformity of the missing distribution and its partial estimation. Minimax estimation is studied for order-$\alpha$ missing mass for integer values of $\alpha$, and exact minimax convergence rates are obtained. Concentration is studied for a class of functions $g$, and specific results are derived for order-$\alpha$ missing mass and missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal worst-case variance factors are derived. Two new notions of concentration, named strongly sub-Gamma and filtered sub-Gaussian concentration, are introduced and shown to result in right tail bounds that are better than those obtained from sub-Gaussian concentration.
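To make the definitions in this abstract concrete, the following minimal Python sketch computes the missing $g$-mass for a small, known distribution and compares the ordinary missing mass with the Good-Turing estimate (the fraction of samples whose letter appears exactly once). The function names, the toy distribution `probs`, and the sample are hypothetical illustrations, not taken from the paper, and the sketch does not implement the paper's minimax estimators or concentration bounds.

```python
import math
from collections import Counter


def missing_g_mass(probs, sample, g):
    """Sum of g(Pr(x)) over letters x that never appear in the sample."""
    seen = set(sample)
    return sum(g(p) for x, p in probs.items() if x not in seen)


def good_turing_missing_mass(sample):
    """Good-Turing estimate of the missing mass: (# letters seen once) / n."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


def shannon_term(p):
    """g(p) = -p log p, with the usual convention g(0) = 0."""
    return -p * math.log(p) if p > 0 else 0.0


# Hypothetical toy distribution and sample (assumed for illustration only).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
sample = ["a", "b", "a", "a"]  # letters "c" and "d" are missing

missing_mass = missing_g_mass(probs, sample, lambda p: p)        # g(p) = p
order_2_mass = missing_g_mass(probs, sample, lambda p: p ** 2)   # g(p) = p^2
missing_entropy = missing_g_mass(probs, sample, shannon_term)    # g(p) = -p log p
gt_estimate = good_turing_missing_mass(sample)

print(missing_mass, order_2_mass, missing_entropy, gt_estimate)
```

In this toy case the true missing mass and the Good-Turing estimate happen to coincide; in general the estimate is computed from the sample alone, while the true value requires knowing the distribution.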

arXiv.org e-Print Archive

Selective machine learning of doubly robust functionals

Authors: Cui Yifan, Tchetgen Eric Tchetgen
Publication date: 12/04/2021

While model selection is a well-studied topic in parametric and nonparametric regression or density estimation, selection of possibly high-dimensional nuisance parameters in semiparametric problems is far less developed. In this paper, we propose a selective machine learning framework for making inferences about a finite-dimensional functional defined on a semiparametric model, when the latter admits a doubly robust estimating function and several candidate machine learning algorithms are available for estimating the nuisance parameters. We introduce two new selection criteria for bias reduction in estimating the functional of interest, each based on a novel definition of pseudo-risk for the functional that embodies the double robustness property and thus is used to select the pair of learners that is nearest to fulfilling this property. We establish an oracle property for a multi-fold cross-validation version of the new selection criteria which states that our empirical criteria perform nearly as well as an oracle with a priori knowledge of the pseudo-risk for each pair of candidate learners. We also describe a smooth approximation to the selection criteria which allows for valid post-selection inference. Finally, we apply the approach to model selection of a semiparametric estimator of average treatment effect given an ensemble of candidate machine learners to account for confounding in an observational study.

arXiv.org e-Print Archive

Optimal estimation of high-order missing masses, and the rare-type match problem

Authors: Favaro Stefano, Naulet Zacharie
Publication date: 26/06/2023

Consider a random sample $(X_{1},\ldots,X_{n})$ from an unknown discrete distribution $P=\sum_{j\geq1}p_{j}\delta_{s_{j}}$ on a countable alphabet $\mathbb{S}$, and let $(Y_{n,j})_{j\geq1}$ be the empirical frequencies of the distinct symbols $s_{j}$ in the sample. We consider the problem of estimating the $r$-order missing mass, which is a discrete functional of $P$ defined as

$$\theta_{r}(P;\mathbf{X}_{n})=\sum_{j\geq1}p^{r}_{j}I(Y_{n,j}=0).$$

This is a generalization of the missing mass, whose estimation is a classical problem in statistics and has been the subject of numerous studies in both theory and methods. First, we introduce a nonparametric estimator of $\theta_{r}(P;\mathbf{X}_{n})$ and a corresponding non-asymptotic confidence interval through concentration properties of $\theta_{r}(P;\mathbf{X}_{n})$. Then, we investigate minimax estimation of $\theta_{r}(P;\mathbf{X}_{n})$, which is the main contribution of our work. We show that minimax estimation is not feasible over the class of all discrete distributions on $\mathbb{S}$, and not even for distributions with regularly varying tails, which only guarantee that our estimator is consistent for $\theta_{r}(P;\mathbf{X}_{n})$. This leads us to introduce the stronger assumption of second-order regular variation for the tail behaviour of $P$, which is proved to be sufficient for minimax estimation of $\theta_{r}(P;\mathbf{X}_{n})$, making the proposed estimator an optimal minimax estimator of $\theta_{r}(P;\mathbf{X}_{n})$. Our interest in the $r$-order missing mass arises from forensic statistics, where the estimation of the $2$-order missing mass appears in connection with the estimation of the likelihood ratio $T(P,\mathbf{X}_{n})=\theta_{1}(P;\mathbf{X}_{n})/\theta_{2}(P;\mathbf{X}_{n})$, known as the "fundamental problem of forensic mathematics". We present theoretical guarantees for nonparametric estimation of $T(P,\mathbf{X}_{n})$.
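As a small illustration of the functional defined in this abstract, the Python sketch below evaluates $\theta_{r}(P;\mathbf{X}_{n})$ and the ratio $T(P,\mathbf{X}_{n})=\theta_{1}/\theta_{2}$ directly from their definitions, assuming the probabilities $p_j$ are known. The names `theta_r`, `ratio_T`, the probability vector `p`, and the sample are hypothetical and chosen only for illustration; this is an oracle computation on a finite alphabet, not the nonparametric estimator proposed in the paper.

```python
from collections import Counter


def theta_r(p, sample, r):
    """r-order missing mass: sum of p_j**r over symbols j with Y_{n,j} = 0.

    `p` is a list of probabilities indexed by symbol j (assumed known here),
    and `sample` is a list of observed symbol indices.
    """
    freq = Counter(sample)  # empirical frequencies Y_{n,j}; unseen symbols give 0
    return sum(pj ** r for j, pj in enumerate(p) if freq[j] == 0)


def ratio_T(p, sample):
    """T(P, X_n) = theta_1 / theta_2, defined only when some symbol is missing."""
    t2 = theta_r(p, sample, 2)
    if t2 == 0:
        raise ValueError("theta_2 is zero: no symbol with positive mass is missing")
    return theta_r(p, sample, 1) / t2


# Hypothetical finite distribution and sample (illustration only).
p = [0.4, 0.3, 0.2, 0.1]
sample = [0, 0, 1, 0, 1]  # symbols 2 and 3 are unobserved

theta_1 = theta_r(p, sample, 1)  # ordinary missing mass
theta_2 = theta_r(p, sample, 2)  # 2-order missing mass
T = ratio_T(p, sample)

print(theta_1, theta_2, T)
```

Setting `r = 1` recovers the classical missing mass, which matches the abstract's description of $\theta_{r}$ as a generalization of it.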

arXiv.org e-Print Archive