The fused Kolmogorov filter: A nonparametric model-free screening method
A new model-free screening method called the fused Kolmogorov filter is
proposed for high-dimensional data analysis. This new method is fully
nonparametric and can work with many types of covariates and response
variables, including continuous, discrete and categorical variables. We apply
the fused Kolmogorov filter to deal with variable screening problems emerging
from a wide range of applications, such as multiclass classification,
nonparametric regression and Poisson regression, among others. It is shown that
the fused Kolmogorov filter enjoys the sure screening property under weak
regularity conditions that are much milder than those required for many
existing nonparametric screening methods. In particular, the fused Kolmogorov
filter can still be powerful when covariates are strongly dependent on each
other. We further demonstrate the superior performance of the fused Kolmogorov
filter over existing screening methods by simulations and real data examples.
Comment: Published at http://dx.doi.org/10.1214/14-AOS1303 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
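The core idea of the fused Kolmogorov filter can be sketched in a few lines: slice the response into quantile groups, score each covariate by the largest pairwise Kolmogorov-Smirnov statistic between its conditional distributions across groups, and fuse the scores over several slice counts. The sketch below is a minimal illustration under those assumptions, not the authors' implementation; the function names and the default slice counts are my own choices.

```python
import numpy as np

def ks_two_sample(a, b):
    # Two-sample Kolmogorov-Smirnov statistic sup_x |F_a(x) - F_b(x)|,
    # evaluated over the pooled sample points.
    pooled = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), pooled, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), pooled, side="right") / len(b)
    return np.max(np.abs(cdf_a - cdf_b))

def fused_kolmogorov_filter(X, y, slice_counts=(3, 5, 10)):
    # For each number of slices G, partition y into G quantile groups and
    # score covariate j by the largest pairwise KS statistic between its
    # conditional distributions across groups; "fuse" by summing over G.
    n, p = X.shape
    scores = np.zeros(p)
    for G in slice_counts:
        edges = np.quantile(y, np.linspace(0.0, 1.0, G + 1))
        labels = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, G - 1)
        for j in range(p):
            stat = 0.0
            for g1 in range(G):
                for g2 in range(g1 + 1, G):
                    a, b = X[labels == g1, j], X[labels == g2, j]
                    if len(a) > 0 and len(b) > 0:
                        stat = max(stat, ks_two_sample(a, b))
            scores[j] += stat
    return scores  # screen by keeping covariates with the largest scores
```

Because only ranks of the covariate within response slices matter, the statistic needs no smoothing parameters and is insensitive to monotone transformations of the covariates, which is what makes the method fully nonparametric.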
Robust rank correlation based screening
Independence screening is a variable selection method that uses a ranking
criterion to select significant variables, particularly for statistical models
with nonpolynomial dimensionality or "large p, small n" paradigms when p can be
as large as an exponential of the sample size n. In this paper we propose a
robust rank correlation screening (RRCS) method to deal with ultra-high
dimensional data. The new procedure is based on the Kendall \tau correlation
coefficient between response and predictor variables rather than the Pearson
correlation of existing methods. The new method has four desirable features
compared with existing independence screening methods. First, the sure
independence screening property holds under only a second-moment condition
on the predictor variables, rather than exponential-tail conditions or the
like, even when the number of predictor variables grows exponentially with
the sample size. Second, it can handle semiparametric models such as
transformation regression models and single-index models under a
monotonicity constraint on the link function, without involving
nonparametric estimation even when the models contain nonparametric
functions. Third, the procedure is robust to outliers and influential
points in the observations. Last, the use of indicator functions in rank
correlation screening greatly simplifies the theoretical derivation due to the
boundedness of the resulting statistics, compared with previous studies on
variable screening. Simulations are carried out for comparisons with existing
methods and a real data example is analyzed.
Comment: Published at http://dx.doi.org/10.1214/12-AOS1024 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org). arXiv admin note: text overlap with
arXiv:0903.525
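The RRCS ranking itself is simple: compute the Kendall tau correlation between the response and each predictor, then keep the predictors with the largest absolute values. A minimal numpy sketch follows; the O(n^2) pairwise computation and the tau-a normalization (which assumes no ties, as with continuous data) are simplifications of mine, and the function names are hypothetical.

```python
import numpy as np

def kendall_tau(x, y):
    # Kendall's tau-a via pairwise sign concordance. Each unordered pair is
    # counted twice in the double sum, and the diagonal contributes zero,
    # so dividing by n(n-1) yields (concordant - discordant) / C(n, 2).
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    n = len(x)
    return np.sum(dx * dy) / (n * (n - 1))

def rrcs_screen(X, y, d):
    # Rank predictors by |tau(X_j, y)| and keep the indices of the top d.
    taus = np.array([kendall_tau(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-np.abs(taus))[:d]
```

Because the statistic is built from bounded indicator (sign) functions of ranks, it is unaffected by monotone transformations of the data and by extreme observations, which is the source of the robustness and the moment-condition relaxation described above.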
Feature Screening via Distance Correlation Learning
This paper is concerned with screening features in ultrahigh dimensional data
analysis, which has become increasingly important in diverse scientific fields.
We develop a sure independence screening procedure based on the distance
correlation (DC-SIS, for short). The DC-SIS can be implemented as easily as the
sure independence screening procedure based on the Pearson correlation (SIS,
for short) proposed by Fan and Lv (2008). However, the DC-SIS can
significantly improve on the SIS. Fan and Lv (2008) established the sure
screening property of the SIS for linear models, whereas the sure screening
property of the DC-SIS holds in much more general settings, with linear
models included as a special case. Furthermore,
the implementation of the DC-SIS does not require model specification (e.g.,
linear model or generalized linear model) for responses or predictors. This is
a very appealing property in ultrahigh dimensional data analysis. Moreover, the
DC-SIS can be used directly to screen grouped predictor variables and for
multivariate response variables. We establish the sure screening property for
the DC-SIS, and conduct simulations to examine its finite sample performance.
Numerical comparison indicates that the DC-SIS performs much better than the
SIS in various models. We also illustrate the DC-SIS through a real data
example.
Comment: 32 pages, 5 tables and 1 figure. Wei Zhong is the corresponding
author.
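Since distance correlation is zero exactly when two variables are independent, ranking predictors by their sample distance correlation with the response gives a model-free screener. The sketch below is a minimal univariate illustration of that ranking, assuming the standard double-centering formula for the sample statistic; the definition extends to grouped predictors and multivariate responses, which the simplified code here does not cover.

```python
import numpy as np

def distance_correlation(x, y):
    # Sample distance correlation: double-center the pairwise distance
    # matrices, then normalize the resulting distance covariance by the
    # distance variances.
    def centered(a):
        d = np.abs(a[:, None] - a[None, :])
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    A, B = centered(x), centered(y)
    dcov2 = max((A * B).mean(), 0.0)   # guard against tiny negative rounding
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(dcov2 / denom) if denom > 0 else 0.0

def dc_sis(X, y, d):
    # Rank predictors by distance correlation with the response; keep top d.
    dcors = np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-dcors)[:d]
```

Unlike the Pearson-based SIS, this ranking detects nonlinear and non-monotone dependence (e.g., a purely quadratic signal), which is what allows the sure screening property to hold without specifying a model.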