
    The Discrepancy Principle for Choosing Bandwidths in Kernel Density Estimation

    We investigate the discrepancy principle for choosing smoothing parameters in kernel density estimation. The method is based on the distance between the empirical and estimated distribution functions. We prove new positive and negative results on the L_1-consistency of kernel estimators with bandwidths chosen by the discrepancy principle. Consistency crucially depends on a rather weak Hölder condition on the distribution function. We also unify and extend previous results on the behaviour of the chosen bandwidth under stricter smoothness assumptions. Furthermore, we compare the discrepancy principle to standard methods in a simulation study. Surprisingly, some of the proposals work reasonably well over a large set of different densities and sample sizes, and the performance of the methods, at least up to n=2500, can be quite different from their asymptotic behaviour.
    Comment: 17 pages, 3 figures. Section on histograms removed; new (positive and negative) consistency results for kernel density estimators added.
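    The principle described above can be sketched in a few lines of numpy: pick the largest bandwidth whose Kolmogorov distance between the empirical and the kernel-smoothed distribution functions stays below a threshold of order n^(-1/2). The Gaussian kernel, the constant kappa, and the candidate grid below are illustrative assumptions, not the paper's exact choices.

    ```python
    import numpy as np
    from math import erf, sqrt

    _erf = np.vectorize(erf)

    def kolmogorov_distance(data, h):
        """Sup-distance between the empirical CDF and the CDF of a
        Gaussian-kernel density estimate with bandwidth h."""
        xs = np.sort(np.asarray(data, dtype=float))
        n = len(xs)
        # smoothed distribution function evaluated at every sample point
        Fh = np.mean(0.5 * (1.0 + _erf((xs[:, None] - xs[None, :]) / (h * sqrt(2.0)))), axis=1)
        i = np.arange(1, n + 1)
        return max(np.max(np.abs(i / n - Fh)), np.max(np.abs((i - 1) / n - Fh)))

    def discrepancy_bandwidth(data, kappa=1.0, grid=None):
        """Largest bandwidth whose Kolmogorov distance stays below kappa/sqrt(n)
        (one common form of the discrepancy principle; kappa and the grid
        are assumptions made for this sketch)."""
        n = len(data)
        if grid is None:
            grid = np.geomspace(0.01, 2.0, 30)
        tol = kappa / sqrt(n)
        admissible = [h for h in grid if kolmogorov_distance(data, h) <= tol]
        return max(admissible) if admissible else min(grid)
    ```

    Undersmoothing always makes the distance small (the smoothed CDF hugs the empirical one), so the principle acts by pushing h up until the fit is just within the sampling noise.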

    K-rank: an evolution of y-rank for the multiple solutions problem

    Y-rank can fail when dealing with non-linear problems. A methodology is proposed to improve the selection of data in situations where y-rank is fragile. The proposed alternative, called k-rank, consists of splitting the data set into clusters using the k-means algorithm and then applying y-rank to the generated clusters. Models were calibrated and tested with subsets split by y-rank and k-rank. For the Heating Tank case study, models calibrated with k-rank subsets achieved better results in 59% of the simulations. For the Propylene/Propane Separation Unit case, when dealing with a small number of sample points, the y-rank models had errors almost three times higher than the k-rank models on the test subset, meaning that the fitted model could not deal properly with new, unseen data. The proposed methodology was successful in splitting the data, especially in cases with a limited number of samples.
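    The split above can be sketched as follows. The k-means step is a bare-bones Lloyd iteration, and sending every third point (in y-order, per cluster) to the test set is one illustrative y-rank convention, not necessarily the authors' exact rule.

    ```python
    import numpy as np

    def kmeans_labels(X, k, iters=50, seed=0):
        """Bare-bones Lloyd's algorithm returning a cluster label per row of X."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(iters):
            labels = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return labels

    def k_rank_split(X, y, k=3, test_every=3):
        """k-rank: cluster X with k-means, then apply a y-rank split inside
        each cluster (every test_every-th point in y-order goes to the test
        set, skipping the cluster minimum so extremes stay in calibration)."""
        labels = kmeans_labels(X, k)
        cal, test = [], []
        for j in range(k):
            idx = np.where(labels == j)[0]
            order = idx[np.argsort(y[idx])]
            t = order[1::test_every]
            cal.extend(np.setdiff1d(order, t))
            test.extend(t)
        return np.sort(cal), np.sort(test)
    ```

    Because the y-rank pass runs inside each cluster, every region of the input space contributes points to both the calibration and the test subset.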

    Optimal cross-validation in density estimation with the L^2-loss

    We analyze the performance of cross-validation (CV) in the density estimation framework with two purposes: (i) risk estimation and (ii) model selection. The main focus is given to the so-called leave-p-out CV procedure (Lpo), where p denotes the cardinality of the test set. Closed-form expressions are settled for the Lpo estimator of the risk of projection estimators. These expressions provide a great improvement upon V-fold cross-validation in terms of variability and computational complexity. From a theoretical point of view, closed-form expressions also enable us to study the Lpo performance in terms of risk estimation. The optimality of leave-one-out (Loo), that is, Lpo with p=1, is proved among CV procedures used for risk estimation. Two model selection frameworks are also considered: estimation, as opposed to identification. For estimation with finite sample size n, optimality is achieved for p large enough [with p/n = o(1)] to balance the overfitting resulting from the structure of the model collection. For identification, model selection consistency is settled for Lpo as long as p/n is conveniently related to the rate of convergence of the best estimator in the collection: (i) p/n → 1 as n → +∞ with a parametric rate, and (ii) p/n = o(1) with some nonparametric estimators. These theoretical results are validated by simulation experiments.
    Comment: Published at http://dx.doi.org/10.1214/14-AOS1240 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
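    The closed-form expressions mean the CV risk can be computed without refitting. As a concrete special case, the classical leave-one-out (p=1) estimate of the L^2 risk of a regular histogram with bin width h reduces to a formula in the bin counts alone; the Rudemo-type identity below is shown as an illustration, not the paper's general projection-estimator formula.

    ```python
    import numpy as np

    def loo_histogram_risk(data, h):
        """Closed-form leave-one-out estimate of the L2 risk of a regular
        histogram with bin width h and bin counts N_j:
        2/((n-1)h) - (n+1)/(n^2 (n-1) h) * sum_j N_j^2."""
        data = np.asarray(data, dtype=float)
        n = len(data)
        edges = np.arange(data.min(), data.max() + h, h)
        counts, _ = np.histogram(data, bins=edges)
        s = float(np.sum(counts.astype(float) ** 2))
        return 2.0 / ((n - 1) * h) - (n + 1) / (n ** 2 * (n - 1) * h) * s

    def select_bin_width(data, grid):
        """Pick the bin width minimizing the closed-form LOO risk over a grid."""
        return min(grid, key=lambda h: loo_histogram_risk(data, h))
    ```

    No data splitting occurs anywhere: the whole leave-one-out computation is absorbed into the bin counts, which is exactly what makes the variance and complexity comparison with V-fold possible.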

    Constructing irregular histograms by penalized likelihood

    We propose a fully automatic procedure for the construction of irregular histograms. For a given number of bins, the maximum likelihood histogram is known to be the result of a dynamic programming algorithm. To choose the number of bins, we propose two different penalties motivated by recent work in model selection by Castellan [6] and Massart [26]. We give a complete description of the algorithm and a proper tuning of the penalties. Finally, we compare our procedure to other existing proposals for a wide range of different densities and sample sizes.
    Keywords: irregular histogram, density estimation, penalized likelihood, dynamic programming
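    For a fixed number of bins D, the maximum-likelihood irregular histogram can indeed be found by dynamic programming over candidate cut points. The sketch below restricts cuts to a uniform candidate grid, an illustrative simplification; the paper's penalties and finer grid construction are not reproduced here.

    ```python
    import numpy as np

    def best_histogram(data, D, n_grid=41):
        """Maximum-likelihood irregular histogram with exactly D bins, cuts
        restricted to a uniform candidate grid.  Maximizes
        sum_j N_j * log(N_j / (n * w_j)) by dynamic programming."""
        x = np.sort(np.asarray(data, dtype=float))
        n = len(x)
        grid = np.linspace(x[0] - 1e-9, x[-1], n_grid)
        cum = np.searchsorted(x, grid, side="right")   # points <= grid[i]

        def cell(i, j):
            # log-likelihood contribution of the single bin (grid[i], grid[j]]
            N = cum[j] - cum[i]
            return 0.0 if N == 0 else N * np.log(N / (n * (grid[j] - grid[i])))

        L = np.full((D + 1, n_grid), -np.inf)          # L[d, j]: best with d bins up to grid[j]
        back = np.zeros((D + 1, n_grid), dtype=int)
        L[0, 0] = 0.0
        for d in range(1, D + 1):
            for j in range(d, n_grid):
                for i in range(d - 1, j):
                    v = L[d - 1, i] + cell(i, j)
                    if v > L[d, j]:
                        L[d, j], back[d, j] = v, i
        cuts = [n_grid - 1]                            # recover the breakpoints
        for d in range(D, 0, -1):
            cuts.append(back[d, cuts[-1]])
        return grid[np.array(cuts[::-1])], L[D, n_grid - 1]
    ```

    The number of bins itself would then be chosen by minimizing the penalized criterion, i.e. minus this log-likelihood plus one of the two proposed penalties, over D.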

    On efficient estimators of the proportion of true null hypotheses in a multiple testing setup

    We consider the problem of estimating the proportion θ of true null hypotheses in a multiple testing context. The setup is classically modeled through a semiparametric mixture with two components: a uniform distribution on the interval [0,1] with prior probability θ, and a nonparametric density f. We discuss asymptotic efficiency results and establish that two different cases occur according to whether f vanishes on a set of non-null Lebesgue measure or not. In the first case, we exhibit estimators converging at parametric rate, compute the optimal asymptotic variance, and conjecture that no estimator is asymptotically efficient (i.e., attains the optimal asymptotic variance). In the second case, we prove that the quadratic risk of any estimator does not converge at parametric rate. We illustrate these results on simulated data.
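    As a concrete baseline (not one of the efficient estimators analyzed in the paper), a Storey-type estimator exploits the mixture structure directly: p-values above a cutoff λ come mostly from the uniform null component, so their frequency rescaled by 1-λ estimates θ.

    ```python
    import numpy as np

    def storey_estimator(pvalues, lam=0.5):
        """Estimate theta = P(true null) in the two-component mixture
        theta * U[0,1] + (1 - theta) * f  by counting p-values above lam."""
        p = np.asarray(pvalues, dtype=float)
        return float(np.mean(p > lam)) / (1.0 - lam)
    ```

    This estimator is biased upward whenever f puts mass above λ, which is precisely why the question of efficient (and even parametric-rate) estimation of θ is delicate.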

    Theoretical analysis of cross-validation for estimating the risk of the k-Nearest Neighbor classifier

    The present work aims at deriving theoretical guarantees on the behavior of some cross-validation procedures applied to the k-nearest-neighbor (kNN) rule in the context of binary classification. Here we focus on the leave-p-out cross-validation (LpO) used to assess the performance of the kNN classifier. Remarkably, this LpO estimator can be efficiently computed in this context using closed-form formulas derived by Celisse and Mary-Huard (2011). We describe a general strategy to derive moment and exponential concentration inequalities for the LpO estimator applied to the kNN classifier. Such results are obtained first by exploiting the connection between the LpO estimator and U-statistics, and second by making intensive use of the generalized Efron-Stein inequality applied to the L1O estimator. Another important contribution is made by deriving new quantifications of the discrepancy between the LpO estimator and the classification error/risk of the kNN classifier. The optimality of these bounds is discussed by means of several lower bounds as well as simulation experiments.
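    The L1O estimator on which the concentration argument builds is simply the exact leave-one-out error of the kNN rule, computable in one pass over a pairwise distance matrix. The direct O(n²) sketch below illustrates that quantity; the paper's closed-form LpO formulas are more general.

    ```python
    import numpy as np

    def knn_loo_error(X, y, k):
        """Exact leave-one-out error of the k-nearest-neighbor classifier:
        each point is classified by majority vote among its k nearest
        neighbors in the remaining n - 1 points."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        np.fill_diagonal(D, np.inf)        # a point never votes for itself
        errors = 0
        for i in range(len(X)):
            nbrs = np.argsort(D[i])[:k]
            errors += np.bincount(y[nbrs]).argmax() != y[i]
        return errors / len(X)
    ```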

    Nonparametric density estimation by exact leave-p-out cross-validation

    The problem of density estimation is addressed by minimization of the L2 risk for both histogram and kernel estimators. This quadratic risk is estimated by leave-p-out cross-validation (LPO), which, contrary to common belief, is made possible by closed-form formulas. The potential gain of LPO over V-fold cross-validation (V-fold) in terms of the bias-variance trade-off is highlighted. An exact quantification of the extra variability induced by the preliminary random partition of the data in V-fold is proposed. Furthermore, exact expressions are derived for both the bias and the variance of the risk estimator with histograms. Plug-in estimates of these quantities are provided, and their accuracy is assessed by means of concentration inequalities. An adaptive selection procedure for p in the case of histograms is subsequently presented, relying on minimization of the mean square error of the LPO risk estimator. Finally, a simulation study is carried out which first illustrates the higher reliability of LPO with respect to V-fold, and then assesses the behavior of the selection procedure. For instance, the optimality of leave-one-out (LOO) is shown, at least empirically, in the context of regular histograms.
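    The extra variability being quantified comes from the random partition: a V-fold estimate of the same L2 risk moves with the partition seed, whereas a closed-form LOO/LPO value is partition-free. The sketch below, with illustrative names, estimates the L2 risk of a regular histogram by V-fold under different random partitions.

    ```python
    import numpy as np

    def vfold_histogram_risk(data, edges, V, rng):
        """V-fold CV estimate of the L2 risk of a regular histogram: for each
        fold, fit on the rest and score  int f_hat^2 - 2 * mean f_hat(test)."""
        data = np.asarray(data, dtype=float)
        idx = rng.permutation(len(data))
        h = edges[1] - edges[0]
        total = 0.0
        for fold in np.array_split(idx, V):
            mask = np.ones(len(data), dtype=bool)
            mask[fold] = False
            counts, _ = np.histogram(data[mask], bins=edges)
            dens = counts / (mask.sum() * h)                 # fitted density
            quad = float(np.sum(dens ** 2)) * h              # int f_hat^2
            j = np.clip(np.searchsorted(edges, data[fold]) - 1, 0, len(dens) - 1)
            total += quad - 2.0 * float(np.mean(dens[j]))
        return total / V
    ```

    Running this with several seeds for the partition gives visibly different risk values for the same data, which is the variability that the exact quantification in the paper measures.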