The Discrepancy Principle for Choosing Bandwidths in Kernel Density Estimation
We investigate the discrepancy principle for choosing smoothing parameters
for kernel density estimation. The method is based on the distance between the
empirical and estimated distribution functions. We prove some new positive and
negative results on L_1-consistency of kernel estimators with bandwidths chosen
using the discrepancy principle. Consistency crucially depends on a rather weak
Hölder condition on the distribution function. We also unify and extend
previous results on the behaviour of the chosen bandwidth under more strict
smoothness assumptions. Furthermore, we compare the discrepancy principle to
standard methods in a simulation study. Surprisingly, some of the proposals
work reasonably well over a large set of different densities and sample sizes,
and the performance of the methods at least up to n=2500 can be quite different
from their asymptotic behavior.
Comment: 17 pages, 3 figures. Section on histograms removed, new (positive and negative) consistency results for kernel density estimators added.
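The principle above admits a simple sketch for a Gaussian kernel: the estimated distribution function is an average of normal CDFs, and the bandwidth is taken as large as possible while the Kolmogorov distance to the empirical distribution function stays within a kappa/sqrt(n) band. The threshold constant `kappa`, the candidate grid, and the function names below are illustrative choices, not the paper's:

```python
import numpy as np
from math import erf, sqrt

def phi(z):
    # Standard normal CDF.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def kde_cdf(x, data, h):
    # CDF of a Gaussian-kernel density estimate, evaluated at x.
    return np.mean([phi((x - xi) / h) for xi in data])

def ecdf(x, data):
    # Empirical distribution function at x.
    return np.mean(data <= x)

def discrepancy_bandwidth(data, candidates, kappa=1.0):
    """Pick the largest candidate h whose Kolmogorov distance between the
    empirical and estimated distribution functions stays below kappa/sqrt(n).
    Illustrative sketch of a discrepancy-principle rule, not the paper's code."""
    n = len(data)
    thresh = kappa / np.sqrt(n)
    best = None
    for h in sorted(candidates):
        # Sup-distance approximated over the data points.
        d = max(abs(ecdf(x, data) - kde_cdf(x, data, h)) for x in data)
        if d <= thresh:
            best = h  # keep the largest admissible bandwidth
    return best
```

Small bandwidths always track the empirical CDF closely, so the rule effectively finds the point where smoothing starts to distort the distribution function beyond the sampling-noise band.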
K-rank: an evolution of y-rank for the multiple solutions problem
Y-rank can present faults when dealing with non-linear problems. A methodology is proposed to improve the selection of data in situations where y-rank is fragile. The proposed alternative, called k-rank, consists of splitting the data set into clusters using the k-means algorithm and then applying y-rank to the generated clusters. Models were calibrated and tested with subsets split by y-rank and k-rank. For the Heating Tank case study, models calibrated with k-rank subsets achieved better results in 59% of the simulations. For the Propylene/Propane Separation Unit case, when dealing with a small number of sample points, the y-rank models had errors on the test subset almost three times higher than the k-rank models, meaning that the fitted model could not deal properly with new, unseen data. The proposed methodology was successful in splitting the data, especially in cases with a limited number of samples.
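As a rough illustration of the k-rank idea (cluster with k-means, then apply a y-rank-style systematic split inside each cluster), the sketch below uses a minimal Lloyd's-algorithm k-means and an "every m-th ranked sample goes to the test set" rule. The function names and the `every` parameter are hypothetical, not taken from the paper:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    # Minimal Lloyd's-algorithm k-means; a library implementation would also do.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def y_rank_split(y, every=4):
    # y-rank-style split: sort samples by response, send every `every`-th to test.
    order = np.argsort(y)
    test = order[::every]
    train = np.setdiff1d(order, test)
    return train, test

def k_rank_split(X, y, k=3, every=4):
    # k-rank: cluster first, then apply the y-rank split inside each cluster.
    labels = kmeans(X, k)
    train, test = [], []
    for j in range(k):
        idx = np.where(labels == j)[0]
        tr, te = y_rank_split(y[idx], every)
        train.extend(idx[tr])
        test.extend(idx[te])
    return np.array(train), np.array(test)
```

Splitting per cluster guarantees that every region of the input space found by k-means contributes points to both the calibration and the test subsets, which is the failure mode of plain y-rank on multi-modal data.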
Optimal cross-validation in density estimation with the L_2-loss
We analyze the performance of cross-validation (CV) in the density estimation
framework with two purposes: (i) risk estimation and (ii) model selection. The
main focus is given to the so-called leave-p-out CV procedure (Lpo), where p
denotes the cardinality of the test set. Closed-form expressions are settled
for the Lpo estimator of the risk of projection estimators. These expressions
provide a great improvement upon V-fold cross-validation in terms of
variability and computational complexity. From a theoretical point of view,
closed-form expressions also make it possible to study the Lpo performance in
terms of risk estimation. The optimality of leave-one-out (Loo), that is, Lpo
with p = 1, is proved among CV procedures used for risk estimation. Two model
selection frameworks are also considered: estimation, as opposed to
identification. For estimation with finite sample size n, optimality is
achieved for p large enough (with p/n = o(1)) to balance the overfitting
resulting from the structure of the model collection. For identification,
model selection consistency is settled for Lpo as long as p/n is conveniently
related to the rate of convergence of the best estimator in the collection:
p/n must tend to 1 with a parametric rate, while a suitably slower choice of
p/n suffices with some nonparametric estimators. These theoretical results are
validated by simulation
experiments.
Comment: Published at http://dx.doi.org/10.1214/14-AOS1240 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
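The flavour of such closed-form CV expressions can be seen in the simplest special case: leave-one-out (p = 1) for a regular histogram, where the L_2 risk estimator reduces to the classical least-squares CV formula in the bin counts, with no refitting per held-out point. This is only an illustration of the p = 1 case, not the paper's general Lpo expressions:

```python
import numpy as np

def loo_risk_histogram(data, n_bins):
    """Closed-form leave-one-out estimate of the L_2 risk of a regular
    histogram (classical least-squares CV; the p = 1 case of Lpo)."""
    n = len(data)
    lo, hi = data.min(), data.max()
    h = (hi - lo) / n_bins                       # bin width
    counts, _ = np.histogram(data, bins=n_bins, range=(lo, hi))
    p_hat = counts / n                           # empirical bin probabilities
    # Standard least-squares CV formula in the bin counts.
    return 2.0 / ((n - 1) * h) - (n + 1) / ((n - 1) * h) * np.sum(p_hat ** 2)

def best_n_bins(data, candidates):
    # Scan candidate bin counts and keep the risk minimizer.
    risks = [loo_risk_histogram(data, b) for b in candidates]
    return candidates[int(np.argmin(risks))]
```

The whole scan costs one histogram per candidate, which is exactly the variability/complexity advantage over V-fold resampling that the abstract refers to.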
Constructing irregular histograms by penalized likelihood
We propose a fully automatic procedure for the construction of irregular histograms. For a given number of bins, the maximum likelihood histogram is known to be the result of a dynamic programming algorithm. To choose the number of bins, we propose two different penalties motivated by recent work in model selection by Castellan [6] and Massart [26]. We give a complete description of the algorithm and a proper tuning of the penalties. Finally, we compare our procedure to other existing proposals for a wide range of different densities and sample sizes.
Keywords: irregular histogram, density estimation, penalized likelihood, dynamic programming
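Since the histogram log-likelihood is additive over bins, the maximum-likelihood D-bin histogram over a fixed grid of T candidate cut points can indeed be found by dynamic programming in O(D·T^2). The sketch below shows the recursion; the grid choice and function names are illustrative, not the paper's code:

```python
import numpy as np

def best_histogram(data, grid, D):
    """Maximum-likelihood histogram with exactly D bins whose cut points are
    chosen among `grid`, via dynamic programming. Illustrative sketch."""
    data = np.sort(data)
    n = len(data)
    T = len(grid) - 1
    # cum[i] = number of data points <= grid[i]
    cum = np.searchsorted(data, grid, side="right")

    def score(i, j):
        # Log-likelihood contribution of one bin (grid[i], grid[j]].
        N = cum[j] - cum[i]
        if N == 0:
            return 0.0
        return N * np.log(N / (n * (grid[j] - grid[i])))

    best = np.full((D + 1, T + 1), -np.inf)
    back = np.zeros((D + 1, T + 1), dtype=int)
    best[0][0] = 0.0
    for d in range(1, D + 1):            # number of bins used so far
        for j in range(d, T + 1):        # right edge of the d-th bin
            for i in range(d - 1, j):    # left edge of the d-th bin
                cand = best[d - 1][i] + score(i, j)
                if cand > best[d][j]:
                    best[d][j], back[d][j] = cand, i
    # Recover the cut points of the optimal D-bin histogram.
    cuts, j = [grid[T]], T
    for d in range(D, 0, -1):
        j = back[d][j]
        cuts.append(grid[j])
    return cuts[::-1], best[D][T]
```

Running this for each D and adding a penalty term pen(D) to `best[D][T]` is exactly the shape of the penalized-likelihood selection described above.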
On efficient estimators of the proportion of true null hypotheses in a multiple testing setup
We consider the problem of estimating the proportion of true null hypotheses
in a multiple testing context. The setup is classically modeled through a
semiparametric mixture with two components: a uniform distribution on the
interval [0,1] with prior probability θ, and a nonparametric density f. We
discuss asymptotic efficiency results and establish that two different cases
occur depending on whether f vanishes on a set with non-null Lebesgue measure
or not. In the first case, we exhibit estimators converging at parametric
rate, compute the optimal asymptotic variance, and conjecture that no
estimator is asymptotically efficient (i.e. attains the optimal asymptotic
variance). In the second case, we prove that the quadratic risk of any
estimator does not converge at parametric rate. We illustrate these results on
simulated data.
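A concrete baseline in this mixture setup is a Storey-type estimator: if the nonparametric component f is small above a threshold λ, p-values exceeding λ come essentially from the uniform null component, giving θ̂(λ) = #{p_i > λ} / (n(1−λ)). This is a standard illustration of the model, not necessarily one of the estimators analyzed in the paper:

```python
import numpy as np

def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true nulls: p-values above
    `lam` are assumed to come (mostly) from the uniform null component."""
    pvalues = np.asarray(pvalues)
    n = len(pvalues)
    # Fraction above lam, rescaled by the uniform mass of (lam, 1].
    return min(1.0, np.sum(pvalues > lam) / (n * (1.0 - lam)))
```

When f does not vanish above λ, the alternative p-values in (λ, 1] inflate the count, so the estimator is biased upward, which is one intuition for why the problem is genuinely harder when f is everywhere positive.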
Theoretical analysis of cross-validation for estimating the risk of the k-Nearest Neighbor classifier
The present work aims at deriving theoretical guarantees on the behavior of
some cross-validation procedures applied to the k-nearest neighbors (kNN)
rule in the context of binary classification. Here we focus on the
leave-p-out cross-validation (LpO) used to assess the performance of the
kNN classifier. Remarkably, this LpO estimator can be efficiently computed
in this context using closed-form formulas derived by
\cite{CelisseMaryHuard11}. We describe a general strategy to derive moment and
exponential concentration inequalities for the LpO estimator applied to the
kNN classifier. Such results are obtained first by exploiting the connection
between the LpO estimator and U-statistics, and second by making an intensive
use of the generalized Efron-Stein inequality applied to the LpO estimator.
One other important contribution is made by deriving new quantifications of the
discrepancy between the LpO estimator and the classification error/risk of
the kNN classifier. The optimality of these bounds is discussed by means of
several lower bounds as well as simulation experiments.
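For reference, the quantity those closed-form formulas compute at p = 1 can always be obtained by brute force: classify each point by its k nearest neighbors among the remaining n − 1 points. The sketch below does exactly that for binary labels; it is illustrative only and does not reproduce the closed-form route of \cite{CelisseMaryHuard11}:

```python
import numpy as np

def loo_error_knn(X, y, k):
    """Leave-one-out classification error of the kNN rule, computed by brute
    force (the closed-form formulas avoid this per-point refitting)."""
    n = len(X)
    errors = 0
    # Pairwise squared Euclidean distances.
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(D, np.inf)  # a point may not vote for itself
    for i in range(n):
        nn = np.argsort(D[i])[:k]            # k nearest among the others
        vote = np.round(y[nn].mean())        # majority vote, labels in {0, 1}
        errors += int(vote != y[i])
    return errors / n
```

Using an odd k avoids voting ties; the brute-force cost is O(n^2 log n) here, which is what makes the closed-form computation of the LpO estimator attractive.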
Nonparametric density estimation by exact leave-p-out cross-validation
The problem of density estimation is addressed by minimization of the L-2-risk for both histogram and kernel estimators. This quadratic risk is estimated by leave-p-out cross-validation (LPO), which is made possible thanks to closed formulas, contrary to common belief. The potential gain in the use of LPO with respect to V-fold cross-validation (V-fold) in terms of the bias-variance trade-off is highlighted. An exact quantification of this extra variability, induced by the preliminary random partition of the data in the V-fold, is proposed. Furthermore, exact expressions are derived for both the bias and the variance of the risk estimator with histograms. Plug-in estimates of these quantities are provided, while their accuracy is assessed thanks to concentration inequalities. An adaptive selection procedure for p in the case of histograms is subsequently presented. This relies on minimization of the mean square error of the LPO risk estimator. Finally a simulation study is carried out which first illustrates the higher reliability of the LPO with respect to the V-fold, and then assesses the behavior of the selection procedure. For instance optimality of leave-one-out (LOO) is shown, at least empirically, in the context of regular histograms