The fused Kolmogorov filter: A nonparametric model-free screening method
A new model-free screening method called the fused Kolmogorov filter is
proposed for high-dimensional data analysis. This new method is fully
nonparametric and can work with many types of covariates and response
variables, including continuous, discrete and categorical variables. We apply
the fused Kolmogorov filter to deal with variable screening problems emerging
from a wide range of applications, such as multiclass classification,
nonparametric regression and Poisson regression, among others. It is shown that
the fused Kolmogorov filter enjoys the sure screening property under weak
regularity conditions that are much milder than those required for many
existing nonparametric screening methods. In particular, the fused Kolmogorov
filter can still be powerful when covariates are strongly dependent on each
other. We further demonstrate the superior performance of the fused Kolmogorov
filter over existing screening methods by simulations and real data examples.
Comment: Published at http://dx.doi.org/10.1214/14-AOS1303 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
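The abstract above does not spell out the screening statistic, so the following is only a minimal sketch under stated assumptions: the response is cut into quantile slices, each covariate is scored by the largest Kolmogorov-Smirnov distance between its empirical distributions in any two slices, and the scores from several slicing schemes are summed ("fused"). The function names, the default `slice_counts` and the use of NumPy are illustrative choices, not taken from the paper.

```python
import numpy as np

def ks_distance(a, b):
    """Sup-norm distance between the empirical CDFs of samples a and b."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

def fused_kolmogorov_scores(X, y, slice_counts=(3, 4, 5)):
    """Score each column of X; larger scores suggest stronger dependence on y.

    Assumed form: for each number of slices g, slice y at its quantiles,
    take the maximum pairwise KS distance across slices, then sum over g.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    scores = np.zeros(X.shape[1])
    for g in slice_counts:
        # Interior quantiles of y define g slices of roughly equal size.
        edges = np.quantile(y, np.linspace(0.0, 1.0, g + 1)[1:-1])
        labels = np.searchsorted(edges, y, side="right")
        for j in range(X.shape[1]):
            groups = [X[labels == s, j] for s in range(g)]
            groups = [grp for grp in groups if len(grp) > 0]
            scores[j] += max(
                (ks_distance(groups[a], groups[b])
                 for a in range(len(groups))
                 for b in range(a + 1, len(groups))),
                default=0.0,
            )
    return scores

def screen(X, y, keep, **kwargs):
    """Return the indices of the `keep` highest-scoring covariates."""
    scores = fused_kolmogorov_scores(X, y, **kwargs)
    return np.argsort(scores)[::-1][:keep]
```

Because the score depends only on empirical distribution functions, it needs no model for the response, which is the sense in which such a filter is model-free.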
Feature Augmentation via Nonparametrics and Selection (FANS) in High Dimensional Classification
We propose a high dimensional classification method that involves
nonparametric feature augmentation. Knowing that marginal density ratios are
the most powerful univariate classifiers, we use the ratio estimates to
transform the original feature measurements. Subsequently, penalized logistic
regression is invoked, taking as input the newly transformed or augmented
features. This procedure trains models equipped with local complexity and
global simplicity, thereby avoiding the curse of dimensionality while creating
a flexible nonlinear decision boundary. The resulting method is called Feature
Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by
generalizing the Naive Bayes model, writing the log ratio of joint densities as
a linear combination of those of marginal densities. It is related to
generalized additive models, but has better interpretability and computability.
Risk bounds are developed for FANS. In numerical analysis, FANS is compared
with competing methods, so as to provide a guideline on its best application
domain. Real data analysis demonstrates that FANS performs very competitively
on benchmark email spam and gene expression data sets. Moreover, FANS is
implemented by an extremely fast algorithm through parallel computing.
Comment: 30 pages, 2 figures
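The Naive Bayes generalization mentioned in the FANS abstract can be written out as follows. The notation is an assumption made for illustration: f and g denote the class-conditional joint densities, f_j and g_j their marginals for feature j, and the beta coefficients are the weights fitted by the penalized logistic regression.

```latex
% Naive Bayes (features independent within each class):
\log\frac{f(x)}{g(x)} = \sum_{j=1}^{p} \log\frac{f_j(x_j)}{g_j(x_j)}

% Assumed FANS form: the same marginal log ratios, with fitted weights
\operatorname{logit} P(Y = 1 \mid X = x)
  = \beta_0 + \sum_{j=1}^{p} \beta_j \log\frac{f_j(x_j)}{g_j(x_j)}
```

Under this reading, Naive Bayes is the special case in which every beta_j equals 1, and a sparsity penalty on the beta_j performs the selection step.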
Large-Scale Kernel Methods for Independence Testing
Representations of probability measures in reproducing kernel Hilbert spaces
provide a flexible framework for fully nonparametric hypothesis tests of
independence, which can capture any type of departure from independence,
including nonlinear associations and multivariate interactions. However, these
approaches come with an at least quadratic computational cost in the number of
observations, which can be prohibitive in many applications. Arguably, it is
exactly in such large-scale datasets that capturing any type of dependence is
of interest, so striking a favourable tradeoff between computational efficiency
and test performance for kernel independence tests would have a direct impact
on their applicability in practice. In this contribution, we provide an
extensive study of the use of large-scale kernel approximations in the context
of independence testing, contrasting block-based, Nyström and random Fourier
feature approaches. Through a variety of synthetic data experiments, it is
demonstrated that our novel large-scale methods give comparable performance
with existing methods whilst using significantly less computation time and
memory.
Comment: 29 pages, 6 figures
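As a rough illustration of the random Fourier feature approach named in the abstract above, the sketch below approximates a Gaussian-kernel dependence statistic by the squared Frobenius norm of the cross-covariance between two random feature maps, and calibrates it with a permutation test. The function names, the feature count, the bandwidth and the permutation calibration are all assumptions made for this example and are not claimed to match the paper's procedure.

```python
import numpy as np

def rff(x, n_features, bandwidth, rng):
    """Random Fourier features approximating a Gaussian kernel."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    w = rng.normal(scale=1.0 / bandwidth, size=(x.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

def rff_hsic(x, y, n_features=50, bandwidth=1.0, seed=0):
    """Squared Frobenius norm of the feature cross-covariance.

    Cost grows linearly in the sample size, rather than quadratically
    as with full kernel matrices.
    """
    rng = np.random.default_rng(seed)
    zx = rff(x, n_features, bandwidth, rng)
    zy = rff(y, n_features, bandwidth, rng)
    zx = zx - zx.mean(axis=0)
    zy = zy - zy.mean(axis=0)
    cross_cov = zx.T @ zy / len(zx)
    return float(np.sum(cross_cov ** 2))

def permutation_pvalue(x, y, n_perm=200, seed=0, **kwargs):
    """Permutation p-value; the same random features are reused each time."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    observed = rff_hsic(x, y, seed=seed, **kwargs)
    null = [
        rff_hsic(x, y[rng.permutation(len(y))], seed=seed, **kwargs)
        for _ in range(n_perm)
    ]
    exceed = sum(stat >= observed for stat in null)
    return (1 + exceed) / (1 + n_perm)
```

Permuting one sample breaks any dependence while preserving both marginals, so the permuted statistics stand in for the null distribution.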
Nonparametric estimation of extremal dependence
There is increasing interest in understanding the dependence structure of a
random vector not only in the center of its distribution but also in the tails.
Extreme-value theory tackles the problem of modelling the joint tail of a
multivariate distribution by modelling the marginal distributions and the
dependence structure separately. For estimating dependence at high levels, the
stable tail dependence function and the spectral measure are particularly
convenient. These objects also lie at the basis of nonparametric techniques for
modelling the dependence among extremes in the max-domain of attraction
setting. In the case of asymptotic independence, this setting is inadequate, and
more refined tail dependence coefficients exist, serving, among other purposes, to
discriminate between asymptotic dependence and independence. Throughout, the
methods are illustrated on financial data.
Comment: 22 pages, 9 figures