Nearly optimal minimax estimator for high-dimensional sparse linear regression
We present estimators for a well-studied statistical estimation problem:
estimation in the linear regression model with soft sparsity constraints
($\ell_q$ constraint with $0 < q \leq 1$) in the high-dimensional setting. We first
present a family of estimators, called the projected nearest neighbor estimator,
and show, using results from convex geometry, that such an estimator is within
a logarithmic factor of the optimal for any design matrix. Then by utilizing a
semi-definite programming relaxation technique developed in [SIAM J. Comput. 36
(2007) 1764-1776], we obtain an approximation algorithm for computing the
minimax risk for any such estimation task and also a polynomial time nearly
optimal estimator for the important case of the $\ell_1$ sparsity constraint. Such
results were only known before for special cases, despite decades of studies on
this problem. We also extend the method to the adaptive case when the parameter
radius is unknown. Comment: Published at http://dx.doi.org/10.1214/13-AOS1141 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Classification with the nearest neighbor rule in general finite dimensional spaces: necessary and sufficient conditions
Given an $n$-sample of random vectors $(X_i, Y_i)_{1 \leq i \leq n}$ whose
joint law is unknown, the long-standing problem of supervised classification
aims to \textit{optimally} predict the label $Y$ of a given new observation
$X$. In this context, the nearest neighbor rule is a popular, flexible, and
intuitive method in non-parametric situations.
Although this algorithm is commonly used in the machine learning and
statistics communities, little is known about its prediction ability in general
finite dimensional spaces, especially when the support of the density of the
observations is $\mathbb{R}^d$. This paper is devoted to the study of the
statistical properties of the nearest neighbor rule in various situations. In
particular, attention is paid to the marginal law of $X$, as well as the
smoothness and margin properties of the \textit{regression function}
$\eta(X) = \mathbb{E}[Y | X]$. We identify two necessary and sufficient conditions to
obtain uniform consistency rates of classification and to derive sharp
estimates in the case of the nearest neighbor rule. Some numerical experiments
are presented at the end of the paper to help illustrate the discussion. Comment: 53 pages, 3 figures
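For readers unfamiliar with the nearest neighbor rule discussed in this abstract, a minimal sketch follows. It assumes Euclidean distance and a single nearest neighbor; the function name `nearest_neighbor_predict` and the toy data are illustrative choices, not taken from the paper.

```python
import math

def nearest_neighbor_predict(train_points, train_labels, query):
    """Return the label of the training point closest to `query`.

    Assumption: Euclidean distance; ties go to the earliest training point.
    """
    best_index = min(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    return train_labels[best_index]

# Toy sample (hypothetical): two small clusters in the plane with binary labels.
points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
labels = [0, 0, 1, 1]
prediction = nearest_neighbor_predict(points, labels, (0.8, 0.9))
```

The rule needs no model fitting, which is why its behaviour depends so directly on the marginal law of the observations: a query in a low-density region may have its nearest neighbor far away.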
Optimal Calibration for Multiple Testing against Local Inhomogeneity in Higher Dimension
Based on two independent samples X_1,...,X_m and X_{m+1},...,X_n drawn from
multivariate distributions with unknown Lebesgue densities p and q
respectively, we propose an exact multiple test in order to identify
simultaneously regions of significant deviations between p and q. The
construction is built from randomized nearest-neighbor statistics. It does not
require any preliminary information about the multivariate densities such as
compact support, strict positivity or smoothness and shape properties. The
properly adjusted multiple testing procedure is shown to be sharp-optimal for
typical arrangements of the observation values which appear with probability
close to one. The proof relies on a new coupling Bernstein type exponential
inequality, reflecting the non-subgaussian tail behavior of a combinatorial
process. To investigate the power of the proposed method, a reparametrized
minimax set-up is introduced, reducing the composite hypothesis "p=q" to a
simple one with the multivariate mixed density (m/n)p+(1-m/n)q as infinite
dimensional nuisance parameter. Within this framework, the test is shown to be
spatially and sharply asymptotically adaptive with respect to uniform loss on
isotropic H\"older classes. The exact minimax risk asymptotics are obtained in
terms of solutions of the optimal recovery
Classification with unknown class-conditional label noise on non-compact feature spaces
We investigate the problem of classification in the presence of unknown
class-conditional label noise in which the labels observed by the learner have
been corrupted with some unknown class dependent probability. In order to
obtain finite sample rates, previous approaches to classification with unknown
class-conditional label noise have required that the regression function is
close to its extrema on sets of large measure. We shall consider this problem
in the setting of non-compact metric spaces, where the regression function need
not attain its extrema.
In this setting we determine the minimax optimal learning rates (up to
logarithmic factors). The rate displays interesting threshold behaviour: When
the regression function approaches its extrema at a sufficient rate, the
optimal learning rates are of the same order as those obtained in the
label-noise free setting. If the regression function approaches its extrema
more gradually then classification performance necessarily degrades. In
addition, we present an adaptive algorithm which attains these rates without
prior knowledge of either the distributional parameters or the local density.
This identifies for the first time a scenario in which finite sample rates are
achievable in the label noise setting, but they differ from the optimal rates
without label noise.
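The role of the extrema of the regression function can be seen from a standard identity for class-conditional noise. The notation below is an assumption made for illustration and may differ from the paper's: $Y \in \{0,1\}$ is the true label, $\tilde{Y}$ the observed label, $\eta(x) = P(Y = 1 \mid X = x)$, and the flip probabilities $\pi_0 = P(\tilde{Y} = 1 \mid Y = 0)$ and $\pi_1 = P(\tilde{Y} = 0 \mid Y = 1)$ do not depend on $x$.

```latex
\begin{align*}
\tilde{\eta}(x) &= P(\tilde{Y} = 1 \mid X = x) \\
                &= (1 - \pi_1)\,\eta(x) + \pi_0\,\bigl(1 - \eta(x)\bigr) \\
                &= \pi_0 + (1 - \pi_0 - \pi_1)\,\eta(x).
\end{align*}
% If \inf_x \eta(x) = 0 and \sup_x \eta(x) = 1, then
% \pi_0 = \inf_x \tilde{\eta}(x) and \pi_1 = 1 - \sup_x \tilde{\eta}(x).
```

Under these assumptions the noise levels can be read off from the extrema of the corrupted regression function, so how quickly $\eta$ approaches 0 and 1 plausibly governs how well they can be estimated from a finite sample.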
Global and Local Two-Sample Tests via Regression
Two-sample testing is a fundamental problem in statistics. Despite its long
history, there has been renewed interest in this problem with the advent of
high-dimensional and complex data. Specifically, in the machine learning
literature, there have been recent methodological developments such as
classification accuracy tests. The goal of this work is to present a regression
approach to comparing multivariate distributions of complex data. Depending on
the chosen regression model, our framework can efficiently handle different
types of variables and various structures in the data, with competitive power
under many practical scenarios. Whereas previous work has been largely limited
to global tests which conceal much of the local information, our approach
naturally leads to a local two-sample testing framework in which we identify
local differences between multivariate distributions with statistical
confidence. We demonstrate the efficacy of our approach both theoretically and
empirically, under some well-known parametric and nonparametric regression
methods. Our proposed methods are applied to simulated data as well as a
challenging astronomy data set to assess their practical usefulness.
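One way to read the regression approach to global two-sample testing is sketched below: pool the two samples, label each point by its sample of origin, fit a regression of the label on the features, and measure how far the fit departs from the constant label proportion. This is a minimal sketch, not the authors' procedure. The k-nearest-neighbor regressor, the permutation calibration, one-dimensional features, and all function names are assumptions made for illustration.

```python
import random

def knn_regression(xs, ys, query, k):
    """Average the labels of the k points in xs nearest to query (1-D features)."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))[:k]
    return sum(ys[i] for i in nearest) / k

def regression_statistic(xs, ys, k):
    """Mean squared deviation of the fitted regression from the label proportion."""
    proportion = sum(ys) / len(ys)
    fitted = [knn_regression(xs, ys, x, k) for x in xs]
    return sum((f - proportion) ** 2 for f in fitted) / len(xs)

def permutation_p_value(sample_a, sample_b, k=5, n_permutations=200, seed=0):
    """Permutation p-value for the null hypothesis that both samples share a law."""
    rng = random.Random(seed)
    xs = list(sample_a) + list(sample_b)
    ys = [1] * len(sample_a) + [0] * len(sample_b)
    observed = regression_statistic(xs, ys, k)
    exceed = 0
    for _ in range(n_permutations):
        shuffled = ys[:]
        rng.shuffle(shuffled)  # break any link between features and labels
        if regression_statistic(xs, shuffled, k) >= observed:
            exceed += 1
    return (exceed + 1) / (n_permutations + 1)

# Hypothetical data: two well-separated samples on the real line.
sample_a = [i * 0.1 for i in range(20)]
sample_b = [5 + i * 0.1 for i in range(20)]
p_value = permutation_p_value(sample_a, sample_b)
```

Because the fitted values are local, the same quantity evaluated point by point indicates where the two distributions differ, which is the idea behind the local tests mentioned in the abstract.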