3,138 research outputs found
Detection and Feature Selection in Sparse Mixture Models
We consider Gaussian mixture models in high dimensions and concentrate on the
twin tasks of detection and feature selection. Under sparsity assumptions on
the difference in means, we derive information bounds and establish the
performance of various procedures, including the top sparse eigenvalue of the
sample covariance matrix and other projection tests based on moments, such as
the skewness and kurtosis tests of Malkovich and Afifi (1973), and other
variants which we were better able to control under the null.Comment: 70 page
Bandwidth selection in kernel empirical risk minimization via the gradient
In this paper, we deal with the data-driven selection of multidimensional and
possibly anisotropic bandwidths in the general framework of kernel empirical
risk minimization. We propose a universal selection rule, which leads to
optimal adaptive results in a large variety of statistical models such as
nonparametric robust regression and statistical learning with errors in
variables. These results are stated in the context of smooth loss functions,
where the gradient of the risk appears as a good criterion to measure the
performance of our estimators. The selection rule consists of a comparison of
gradient empirical risks. It can be viewed as a nontrivial improvement of the
so-called Goldenshluger-Lepski method to nonlinear estimators. Furthermore, one
main advantage of our selection rule is the nondependency on the Hessian matrix
of the risk, usually involved in standard adaptive procedures.Comment: Published at http://dx.doi.org/10.1214/15-AOS1318 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Minimax Structured Normal Means Inference
We provide a unified treatment of a broad class of noisy structure recovery
problems, known as structured normal means problems. In this setting, the goal
is to identify, from a finite collection of Gaussian distributions with
different means, the distribution that produced some observed data. Recent work
has studied several special cases including sparse vectors, biclusters, and
graph-based structures. We establish nearly matching upper and lower bounds on
the minimax probability of error for any structured normal means problem, and
we derive an optimality certificate for the maximum likelihood estimator, which
can be applied to many instantiations. We also consider an experimental design
setting, where we generalize our minimax bounds and derive an algorithm for
computing a design strategy with a certain optimality property. We show that
our results give tight minimax bounds for many structure recovery problems and
consider some consequences for interactive sampling
Model Assisted Variable Clustering: Minimax-optimal Recovery and Algorithms
Model-based clustering defines population level clusters relative to a model
that embeds notions of similarity. Algorithms tailored to such models yield
estimated clusters with a clear statistical interpretation. We take this view
here and introduce the class of G-block covariance models as a background model
for variable clustering. In such models, two variables in a cluster are deemed
similar if they have similar associations will all other variables. This can
arise, for instance, when groups of variables are noise corrupted versions of
the same latent factor. We quantify the difficulty of clustering data generated
from a G-block covariance model in terms of cluster proximity, measured with
respect to two related, but different, cluster separation metrics. We derive
minimax cluster separation thresholds, which are the metric values below which
no algorithm can recover the model-defined clusters exactly, and show that they
are different for the two metrics. We therefore develop two algorithms, COD and
PECOK, tailored to G-block covariance models, and study their
minimax-optimality with respect to each metric. Of independent interest is the
fact that the analysis of the PECOK algorithm, which is based on a corrected
convex relaxation of the popular K-means algorithm, provides the first
statistical analysis of such algorithms for variable clustering. Additionally,
we contrast our methods with another popular clustering method, spectral
clustering, specialized to variable clustering, and show that ensuring exact
cluster recovery via this method requires clusters to have a higher separation,
relative to the minimax threshold. Extensive simulation studies, as well as our
data analyses, confirm the applicability of our approach.Comment: Maintext: 38 pages; supplementary information: 37 page
A Quasi-Bayesian Perspective to Online Clustering
When faced with high frequency streams of data, clustering raises theoretical
and algorithmic pitfalls. We introduce a new and adaptive online clustering
algorithm relying on a quasi-Bayesian approach, with a dynamic (i.e.,
time-dependent) estimation of the (unknown and changing) number of clusters. We
prove that our approach is supported by minimax regret bounds. We also provide
an RJMCMC-flavored implementation (called PACBO, see
https://cran.r-project.org/web/packages/PACBO/index.html) for which we give a
convergence guarantee. Finally, numerical experiments illustrate the potential
of our procedure
- …