
    Selective machine learning of doubly robust functionals

    While model selection is a well-studied topic in parametric and nonparametric regression or density estimation, selection of possibly high-dimensional nuisance parameters in semiparametric problems is far less developed. In this paper, we propose a selective machine learning framework for making inferences about a finite-dimensional functional defined on a semiparametric model, when the latter admits a doubly robust estimating function and several candidate machine learning algorithms are available for estimating the nuisance parameters. We introduce two new selection criteria for bias reduction in estimating the functional of interest, each based on a novel definition of pseudo-risk for the functional that embodies the double robustness property and thus is used to select the pair of learners that is nearest to fulfilling this property. We establish an oracle property for a multi-fold cross-validation version of the new selection criteria, which states that our empirical criteria perform nearly as well as an oracle with a priori knowledge of the pseudo-risk for each pair of candidate learners. We also describe a smooth approximation to the selection criteria which allows for valid post-selection inference. Finally, we apply the approach to model selection of a semiparametric estimator of the average treatment effect, given an ensemble of candidate machine learners to account for confounding in an observational study.
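The double robustness property mentioned in this abstract can be illustrated with a minimal sketch. It assumes the augmented inverse probability weighting (AIPW) estimating function for the average treatment effect, which is one standard doubly robust choice; the abstract does not say which estimating function the paper uses, and the paper's selection criteria are not reproduced here. The function name `aipw_ate`, the simulated data, and the deliberately misspecified nuisance values are all hypothetical.

```python
import math
import random


def aipw_ate(y, a, propensity, mu1, mu0):
    """AIPW estimate of E[Y(1)] - E[Y(0)] from plug-in nuisance values.

    propensity[i] is a fitted P(A=1 | X_i); mu1[i] and mu0[i] are fitted
    outcome regressions E[Y | A=1, X_i] and E[Y | A=0, X_i].
    """
    total = 0.0
    for yi, ai, ei, m1, m0 in zip(y, a, propensity, mu1, mu0):
        treated = m1 + ai * (yi - m1) / ei
        control = m0 + (1 - ai) * (yi - m0) / (1 - ei)
        total += treated - control
    return total / len(y)


# Simulated observational study (an assumption): the true effect is 2.
rng = random.Random(0)
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
e_true = [1 / (1 + math.exp(-xi)) for xi in x]        # true propensity score
a = [1 if rng.random() < ei else 0 for ei in e_true]  # confounded treatment
y = [2 * ai + xi + rng.gauss(0, 1) for ai, xi in zip(a, x)]

mu1_true = [2 + xi for xi in x]  # correct outcome regressions
mu0_true = list(x)
e_wrong = [0.5] * n              # misspecified propensity score
mu_wrong = [0.0] * n             # misspecified outcome regressions

# Consistent if either nuisance is correct, biased if both are wrong.
ate_outcome_ok = aipw_ate(y, a, e_wrong, mu1_true, mu0_true)
ate_propensity_ok = aipw_ate(y, a, e_true, mu_wrong, mu_wrong)
ate_both_wrong = aipw_ate(y, a, e_wrong, mu_wrong, mu_wrong)
print(ate_outcome_ok, ate_propensity_ok, ate_both_wrong)
```

In this simulation the first two estimates land near the true effect while the third does not, which is the behaviour a selection criterion built around double robustness would try to exploit when comparing pairs of candidate learners.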

    Adaptive estimation of High-Dimensional Signal-to-Noise Ratios

    We consider the equivalent problems of estimating the residual variance, the proportion of explained variance $\eta$ and the signal strength in a high-dimensional linear regression model with Gaussian random design. Our aim is to understand the impact of not knowing the sparsity of the regression parameter and not knowing the distribution of the design on minimax estimation rates of $\eta$. Depending on the sparsity $k$ of the regression parameter, optimal estimators of $\eta$ either rely on estimating the regression parameter or are based on U-type statistics, and have minimax rates depending on $k$. In the important situation where $k$ is unknown, we build an adaptive procedure whose convergence rate simultaneously achieves the minimax risk over all $k$ up to a logarithmic loss, which we prove to be unavoidable. Finally, the knowledge of the design distribution is shown to play a critical role. When the distribution of the design is unknown, consistent estimation of explained variance is indeed possible in much narrower regimes than for known design distribution.

    Optimal estimation of high-order missing masses, and the rare-type match problem

    Consider a random sample $(X_{1},\ldots,X_{n})$ from an unknown discrete distribution $P=\sum_{j\geq1}p_{j}\delta_{s_{j}}$ on a countable alphabet $\mathbb{S}$, and let $(Y_{n,j})_{j\geq1}$ be the empirical frequencies of the distinct symbols $s_{j}$ in the sample. We consider the problem of estimating the $r$-order missing mass, which is a discrete functional of $P$ defined as $\theta_{r}(P;\mathbf{X}_{n})=\sum_{j\geq1}p^{r}_{j}I(Y_{n,j}=0)$. This is a generalization of the missing mass, whose estimation is a classical problem in statistics, being the subject of numerous studies both in theory and methods. First, we introduce a nonparametric estimator of $\theta_{r}(P;\mathbf{X}_{n})$ and a corresponding non-asymptotic confidence interval through concentration properties of $\theta_{r}(P;\mathbf{X}_{n})$. Then, we investigate minimax estimation of $\theta_{r}(P;\mathbf{X}_{n})$, which is the main contribution of our work. We show that minimax estimation is not feasible over the class of all discrete distributions on $\mathbb{S}$, and not even for distributions with regularly varying tails, which only guarantee that our estimator is consistent for $\theta_{r}(P;\mathbf{X}_{n})$. This leads us to introduce the stronger assumption of second-order regular variation for the tail behaviour of $P$, which is proved to be sufficient for minimax estimation of $\theta_{r}(P;\mathbf{X}_{n})$, making the proposed estimator an optimal minimax estimator of $\theta_{r}(P;\mathbf{X}_{n})$. Our interest in the $r$-order missing mass arises from forensic statistics, where the estimation of the $2$-order missing mass appears in connection to the estimation of the likelihood ratio $T(P,\mathbf{X}_{n})=\theta_{1}(P;\mathbf{X}_{n})/\theta_{2}(P;\mathbf{X}_{n})$, known as the "fundamental problem of forensic mathematics". We present theoretical guarantees for nonparametric estimation of $T(P,\mathbf{X}_{n})$.
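To make the definition of the $r$-order missing mass concrete, the following sketch evaluates $\theta_{r}(P;\mathbf{X}_{n})$ and the ratio $T(P,\mathbf{X}_{n})$ directly from their formulas for a small, fully known distribution. This only illustrates the target quantities: in the estimation problem $P$ is unknown, and the paper's estimator is not reproduced here. The function name `missing_mass`, the four-symbol distribution, and the sample are hypothetical.

```python
from collections import Counter


def missing_mass(probs, sample, r=1):
    """theta_r(P; X_n): sum of p_j**r over symbols s_j absent from the sample.

    probs maps each symbol s_j to its probability p_j (known here only for
    illustration); sample is the observed sequence X_1, ..., X_n.
    """
    counts = Counter(sample)  # empirical frequencies Y_{n,j}
    return sum(p ** r for s, p in probs.items() if counts[s] == 0)


# Hypothetical distribution and sample: symbols "c" and "d" are unobserved.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
sample = ["a", "b", "a", "a", "b"]

theta1 = missing_mass(probs, sample, r=1)  # 0.125 + 0.125
theta2 = missing_mass(probs, sample, r=2)  # 0.125**2 + 0.125**2
ratio = theta1 / theta2                    # T(P, X_n) = theta_1 / theta_2
print(theta1, theta2, ratio)               # 0.25 0.03125 8.0
```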