Optimal estimation of high-order missing masses, and the rare-type match problem

Authors: Favaro Stefano, Naulet Zacharie
Publication date: 26/06/2023

Consider a random sample $(X_{1},\ldots,X_{n})$ from an unknown discrete distribution $P=\sum_{j\geq1}p_{j}\delta_{s_{j}}$ on a countable alphabet $\mathbb{S}$, and let $(Y_{n,j})_{j\geq1}$ be the empirical frequencies of the distinct symbols $s_{j}$ in the sample. We consider the problem of estimating the $r$-order missing mass, which is a discrete functional of $P$ defined as

$$\theta_{r}(P;\mathbf{X}_{n})=\sum_{j\geq1}p^{r}_{j}I(Y_{n,j}=0).$$

This is a generalization of the missing mass, whose estimation is a classical problem in statistics and has been the subject of numerous studies in both theory and methods. First, we introduce a nonparametric estimator of $\theta_{r}(P;\mathbf{X}_{n})$ and a corresponding non-asymptotic confidence interval through concentration properties of $\theta_{r}(P;\mathbf{X}_{n})$. Then, we investigate minimax estimation of $\theta_{r}(P;\mathbf{X}_{n})$, which is the main contribution of our work. We show that minimax estimation is not feasible over the class of all discrete distributions on $\mathbb{S}$, and not even over the class of distributions with regularly varying tails, an assumption that only guarantees that our estimator is consistent for $\theta_{r}(P;\mathbf{X}_{n})$. This leads us to introduce the stronger assumption of second-order regular variation for the tail behaviour of $P$, which is proved to be sufficient for minimax estimation of $\theta_{r}(P;\mathbf{X}_{n})$, making the proposed estimator an optimal minimax estimator of $\theta_{r}(P;\mathbf{X}_{n})$. Our interest in the $r$-order missing mass arises from forensic statistics, where the estimation of the $2$-order missing mass appears in connection with the estimation of the likelihood ratio $T(P,\mathbf{X}_{n})=\theta_{1}(P;\mathbf{X}_{n})/\theta_{2}(P;\mathbf{X}_{n})$, known as the "fundamental problem of forensic mathematics". We present theoretical guarantees for nonparametric estimation of $T(P,\mathbf{X}_{n})$.

arXiv.org e-Print Archive
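
To make the definition of the $r$-order missing mass in the abstract above concrete, the following is a minimal sketch that evaluates $\theta_{r}(P;\mathbf{X}_{n})$ directly from its defining sum when $P$ is known, which is only possible in a simulation. The truncated power-law distribution, the alphabet size `K`, the sample size, and the function names are all assumptions made for illustration. The abstract does not state the authors' estimator, so it is not reproduced here; for comparison the sketch includes only the classical Good-Turing estimate of the ordinary ($r=1$) missing mass, namely the number of symbols seen exactly once divided by $n$.

```python
import random
from collections import Counter

def missing_mass(probs, sample, r=1):
    """Evaluate theta_r(P; X_n) = sum_j p_j^r * I(Y_{n,j} = 0) for a known P.

    probs[j] is p_j; sample is a list of observed symbol indices.
    """
    seen = set(sample)
    return sum(p ** r for j, p in enumerate(probs) if j not in seen)

def good_turing(sample):
    """Classical Good-Turing estimate of the r = 1 missing mass: M_1 / n,
    where M_1 is the number of symbols observed exactly once.
    (This is NOT the estimator proposed in the paper.)"""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)

# Hypothetical setup: a power-law distribution truncated to K symbols
# (the paper allows a countably infinite alphabet; truncation is an assumption).
rng = random.Random(0)
K = 2000
weights = [1.0 / (j + 1) ** 2 for j in range(K)]
total = sum(weights)
probs = [w / total for w in weights]
sample = rng.choices(range(K), weights=probs, k=500)

theta1 = missing_mass(probs, sample, r=1)
theta2 = missing_mass(probs, sample, r=2)
print("theta_1 (true missing mass):", theta1)
print("theta_2 (2-order missing mass):", theta2)
print("Good-Turing estimate of theta_1:", good_turing(sample))
if theta2 > 0:
    # The ratio T = theta_1 / theta_2 as written in the abstract.
    print("T = theta_1 / theta_2:", theta1 / theta2)
```

Because every $p_j \leq 1$, the sketch always yields $\theta_{2} \leq \theta_{1}$, which is a quick sanity check on any implementation of the definition.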

Minimax Estimation of Kernel Mean Embeddings

Authors: Muandet Krikamol, Sriperumbudur Bharath, Tolstikhin Ilya
Publication date: 01/01/2017

In this paper, we study the minimax estimation of the Bochner integral

$$\mu_k(P):=\int_{\mathcal{X}} k(\cdot,x)\,dP(x),$$

also called the kernel mean embedding, based on random samples drawn i.i.d. from $P$, where $k:\mathcal{X}\times\mathcal{X}\rightarrow\mathbb{R}$ is a positive definite kernel. Various estimators $\hat{\theta}_n$ of $\mu_k(P)$ (including the empirical estimator) are studied in the literature, all of which satisfy $\bigl\| \hat{\theta}_n-\mu_k(P)\bigr\|_{\mathcal{H}_k}=O_P(n^{-1/2})$, with $\mathcal{H}_k$ being the reproducing kernel Hilbert space induced by $k$. The main contribution of the paper is in showing that the above-mentioned rate of $n^{-1/2}$ is minimax in the $\|\cdot\|_{\mathcal{H}_k}$ and $\|\cdot\|_{L^2(\mathbb{R}^d)}$ norms over the class of discrete measures and the class of measures that have an infinitely differentiable density, with $k$ being a continuous translation-invariant kernel on $\mathbb{R}^d$. The interesting aspect of this result is that the minimax rate is independent of the smoothness of the kernel and the density of $P$ (if it exists). This result has practical consequences in statistical applications, as the mean embedding has been widely employed in non-parametric hypothesis testing, density estimation, causal inference and feature selection, through its relation to energy distance (and distance covariance).

arXiv.org e-Print Archive

MPG.PuRe
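
As an illustration of the empirical estimator mentioned in the kernel mean embedding abstract above, the sketch below builds $\hat{\mu}(t)=\frac{1}{n}\sum_{i}k(t,X_i)$ and computes the squared RKHS distance between two empirical embeddings through the kernel trick, $\|\hat{\mu}_X-\hat{\mu}_Y\|^2_{\mathcal{H}_k}=\overline{k(X,X)}-2\,\overline{k(X,Y)}+\overline{k(Y,Y)}$, where each bar denotes an average over all pairs. The Gaussian kernel, its bandwidth, the standard normal data, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import math
import random

def gaussian_kernel(x, y, bandwidth=1.0):
    """A continuous translation-invariant kernel on R^d (assumed choice)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def empirical_embedding(sample, kernel=gaussian_kernel):
    """Return the function t -> (1/n) * sum_i k(t, X_i)."""
    n = len(sample)
    def mu_hat(t):
        return sum(kernel(t, x) for x in sample) / n
    return mu_hat

def rkhs_distance_sq(xs, ys, kernel=gaussian_kernel):
    """Squared H_k distance between the empirical embeddings of xs and ys."""
    def mean_k(a, b):
        return sum(kernel(u, v) for u in a for v in b) / (len(a) * len(b))
    return mean_k(xs, xs) - 2.0 * mean_k(xs, ys) + mean_k(ys, ys)

# Hypothetical experiment: two independent samples from the same
# one-dimensional standard normal distribution, for growing n.
rng = random.Random(0)
for n in (25, 100, 400):
    xs = [(rng.gauss(0.0, 1.0),) for _ in range(n)]
    ys = [(rng.gauss(0.0, 1.0),) for _ in range(n)]
    dist = math.sqrt(max(rkhs_distance_sq(xs, ys), 0.0))
    print(n, dist)
```

Since both samples come from the same distribution, the printed distance reflects estimation error only, and on average it should shrink roughly like $n^{-1/2}$, in line with the $O_P(n^{-1/2})$ rate quoted in the abstract. A single random run is noisy, so individual values need not decrease monotonically.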