RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs
Power and reproducibility are key to enabling refined scientific discoveries
in contemporary big data applications with general high-dimensional nonlinear
models. In this paper, we provide theoretical foundations on the power and
robustness of the model-free knockoffs procedure introduced recently in
Cand\`{e}s, Fan, Janson and Lv (2016) in the high-dimensional setting where the
covariate distribution is characterized by a Gaussian graphical model. We
establish that, under mild regularity conditions, the power of the oracle
knockoffs procedure with known covariate distribution in high-dimensional
linear models is asymptotically one as the sample size goes to infinity. Moving
away from this ideal case, we suggest a modified model-free knockoffs
method, called graphical nonlinear knockoffs (RANK), to accommodate an unknown
covariate distribution. We provide theoretical justification of the robustness
of the modified procedure by showing that the false discovery rate (FDR) is
asymptotically controlled at the target level and the power is asymptotically
one with the estimated covariate distribution. To the best of our knowledge,
this is the first formal theoretical result on the power of the knockoffs
procedure. Simulation results demonstrate that our method performs
competitively with existing approaches in both FDR control and power. A real
data set is analyzed to further assess the performance of the suggested
knockoffs procedure.

Comment: 37 pages, 6 tables, 9 pages of supplementary material
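As a rough illustration of the model-X knockoffs machinery this abstract builds on (not the RANK procedure itself), the sketch below constructs equicorrelated Gaussian knockoffs for a known covariance, forms simple marginal-correlation knockoff statistics, and applies the knockoff filter threshold that targets FDR level q. All function names, the toy data, and the choice of statistic are illustrative assumptions.

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, rng):
    """Equicorrelated model-X knockoffs for rows X_i ~ N(0, Sigma) (a sketch)."""
    p = Sigma.shape[0]
    s = np.full(p, min(1.0, 2.0 * np.linalg.eigvalsh(Sigma).min()))
    Sigma_inv_S = np.linalg.solve(Sigma, np.diag(s))
    mu = X - X @ Sigma_inv_S                       # E[X_tilde | X]
    V = 2.0 * np.diag(s) - np.diag(s) @ Sigma_inv_S
    C = np.linalg.cholesky(V + 1e-10 * np.eye(p))  # Cov[X_tilde | X] = C C'
    return mu + rng.standard_normal(X.shape) @ C.T

def knockoff_threshold(W, q):
    """Knockoff filter: smallest t with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    for t in np.sort(np.abs(W[W != 0])):
        if (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t)) <= q:
            return t
    return np.inf

# Toy run with identity covariance and marginal statistics W_j = |X_j'y| - |X_tilde_j'y|.
rng = np.random.default_rng(1)
n, p = 500, 20
Sigma = np.eye(p)
X = rng.standard_normal((n, p))
y = X[:, :5] @ np.full(5, 2.0) + rng.standard_normal(n)  # 5 true signals
Xk = gaussian_knockoffs(X, Sigma, rng)
W = np.abs(X.T @ y) - np.abs(Xk.T @ y)
selected = np.flatnonzero(W >= knockoff_threshold(W, q=0.2))
```

In practice the statistic W would come from a fitted model (e.g. a lasso coefficient-difference statistic), and RANK additionally estimates the covariate distribution; the marginal statistic above is only the simplest valid choice.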
Robust rank correlation based screening
Independence screening is a variable selection method that uses a ranking
criterion to select significant variables, particularly for statistical models
with nonpolynomial dimensionality or "large p, small n" paradigms, where p can
be as large as an exponential of the sample size n. In this paper we propose a
robust rank correlation screening (RRCS) method for ultrahigh-dimensional
data. The new procedure is based on the Kendall \tau correlation coefficient
between response and predictor variables rather than the Pearson correlation
used by existing methods. The new method has four desirable features compared
with existing independence screening methods. First, the sure independence
screening property holds under only the existence of a second-order moment of
the predictor variables, rather than exponential tails or the like, even when
the number of predictor variables grows as fast as exponentially in the sample
size. Second, it can handle semiparametric models, such as transformation
regression models and single-index models with a monotonicity constraint on
the link function, without involving nonparametric estimation even when the
models contain nonparametric functions. Third, the procedure is robust against
outliers and influential points in the observations. Last, the use of
indicator functions in rank correlation screening greatly simplifies the
theoretical derivation due to the boundedness of the resulting statistics,
compared with previous studies on variable screening. Simulations are carried
out for comparison with existing methods, and a real data example is analyzed.

Comment: Published in the Annals of Statistics
(http://www.imstat.org/aos/) by the Institute of Mathematical Statistics
(http://www.imstat.org) at http://dx.doi.org/10.1214/12-AOS1024. arXiv admin
note: text overlap with arXiv:0903.525
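A minimal numpy sketch of rank-correlation screening in the spirit of RRCS: compute Kendall's \tau between each predictor and the response via pairwise sign indicators (the boundedness of these indicators is the point made in the abstract), then keep the top-ranked predictors. The O(n^2) implementation, the no-ties assumption, and the toy data are illustrative, not the paper's exact procedure.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau via pairwise sign indicators (O(n^2); assumes no ties)."""
    n = len(x)
    dx = np.sign(x[:, None] - x[None, :])
    dy = np.sign(y[:, None] - y[None, :])
    return np.sum(dx * dy) / (n * (n - 1))   # full matrix double-counts pairs

def rrcs_screen(X, y, d):
    """Keep the d predictors with the largest |tau(X_j, y)|."""
    taus = np.array([kendall_tau(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(-np.abs(taus))[:d]

# Toy example: X_0 is a monotone transform of y, the other columns are noise.
rng = np.random.default_rng(0)
y = rng.standard_normal(100)
X = rng.standard_normal((100, 10))
X[:, 0] = np.exp(y)                          # tau is invariant to monotone transforms
kept = rrcs_screen(X, y, d=3)
```

The monotone-transform invariance is why such a screener handles transformation and single-index models without estimating the link, as the abstract notes.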
The fused Kolmogorov filter: A nonparametric model-free screening method
A new model-free screening method called the fused Kolmogorov filter is
proposed for high-dimensional data analysis. This new method is fully
nonparametric and can work with many types of covariates and response
variables, including continuous, discrete and categorical variables. We apply
the fused Kolmogorov filter to deal with variable screening problems emerging
from a wide range of applications, such as multiclass classification,
nonparametric regression and Poisson regression, among others. It is shown that
the fused Kolmogorov filter enjoys the sure screening property under weak
regularity conditions that are much milder than those required for many
existing nonparametric screening methods. In particular, the fused Kolmogorov
filter can still be powerful when covariates are strongly dependent on each
other. We further demonstrate the superior performance of the fused Kolmogorov
filter over existing screening methods by simulations and real data examples.

Comment: Published at http://dx.doi.org/10.1214/14-AOS1303 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
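To make the idea concrete, here is an illustrative numpy sketch of a Kolmogorov-type screening statistic: for each candidate slicing of the response, compute the largest Kolmogorov–Smirnov distance between the conditional empirical distributions of a covariate across response slices, and fuse by summing over slicings. The slicing scheme and function names are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def ks_between(a, b):
    """Sup distance between the empirical CDFs of samples a and b."""
    grid = np.sort(np.concatenate([a, b]))
    Fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    Fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(Fa - Fb))

def fused_kolmogorov_filter(X, y, slice_counts=(2, 4, 8)):
    """Per covariate: sum over slicings of the max pairwise KS distance
    between conditional distributions of X_j across response slices."""
    n, p = X.shape
    stats = np.zeros(p)
    for G in slice_counts:
        qs = np.quantile(y, np.linspace(0, 1, G + 1))
        labels = np.clip(np.searchsorted(qs, y, side="right") - 1, 0, G - 1)
        for j in range(p):
            groups = [X[labels == g, j] for g in range(G)]
            groups = [g for g in groups if len(g) > 0]
            stats[j] += max(ks_between(a, b)
                            for a in groups for b in groups if a is not b)
    return stats

# Toy check: column 0 depends strongly on y, column 1 is independent noise.
rng = np.random.default_rng(0)
y = rng.standard_normal(300)
X = rng.standard_normal((300, 2))
X[:, 0] += 3.0 * y
stats = fused_kolmogorov_filter(X, y)
```

Because only empirical CDFs enter the statistic, the screener is fully nonparametric and works unchanged for discrete or categorical responses (slice by level instead of by quantile).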
Propriety of Posteriors in Structured Additive Regression Models: Theory and Empirical Evidence
Structured additive regression comprises many semiparametric regression models, such as generalized additive (mixed) models, geoadditive models, and hazard regression models, within a unified framework. In a Bayesian formulation, nonparametric functions, spatial effects and further model components are specified in terms of multivariate Gaussian priors for high-dimensional vectors of regression coefficients. For several model terms, such as penalised splines or Markov random fields, these Gaussian prior distributions involve rank-deficient precision matrices, yielding partially improper priors. Moreover, hyperpriors for the variances (corresponding to inverse smoothing parameters) may also be specified as improper, e.g. corresponding to Jeffreys' prior or a flat prior for the standard deviation. Hence, propriety of the joint posterior is a crucial issue for full Bayesian inference, in particular when it is based on Markov chain Monte Carlo simulations. We establish theoretical results providing sufficient (and sometimes necessary) conditions for propriety and provide empirical evidence through several accompanying simulation studies.
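A small numpy example of the rank deficiency the abstract refers to: the precision matrix K = D'D of a first-order random walk prior (a standard building block of penalised splines) has rank d - 1, so the Gaussian prior is flat along constant shifts of the coefficient vector and hence partially improper. The dimension d = 6 is arbitrary.

```python
import numpy as np

# RW1 (first-order random walk) prior on d coefficients: the precision
# matrix K = D'D built from first differences is rank-deficient.
d = 6
D = np.eye(d - 1, d, k=1) - np.eye(d - 1, d)   # D @ beta = (beta_2 - beta_1, ...)
K = D.T @ D                                     # prior precision matrix

rank_K = np.linalg.matrix_rank(K)               # d - 1, not d
null_dir = np.ones(d)                           # constant shifts are unpenalised:
                                                # K @ 1 = 0, so the prior density
                                                # exp(-0.5 * beta' K beta / tau^2)
                                                # does not integrate over this direction
```

Whether the posterior is nonetheless proper then hinges on the likelihood and the variance hyperpriors, which is exactly the question the paper addresses.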
High-dimensional Structured Additive Regression Models: Bayesian Regularisation, Smoothing and Predictive Performance
Data structures in modern applications frequently combine the necessity of flexible regression techniques such as nonlinear and spatial effects with high-dimensional covariate vectors. While estimation of the former is typically achieved by supplementing the likelihood with a suitable smoothness penalty, the latter are usually assigned shrinkage penalties that enforce sparse models.
In this paper, we consider a Bayesian unifying perspective, where conditionally Gaussian priors can be assigned to all types of regression effects. Suitable hyperprior assumptions on the variances of the Gaussian distributions then induce the desired smoothness or sparseness properties. As a major advantage, general Markov chain Monte Carlo simulation algorithms can be developed that allow for the joint estimation of smooth and spatial effects
and regularised coefficient vectors. Two applications demonstrate the usefulness of the proposed procedure: a geoadditive regression model for data from the Munich rental guide and an additive probit model for the prediction of consumer credit defaults. In both cases, high-dimensional vectors of categorical covariates are included in the regression models. The predictive ability of the resulting high-dimensional structured additive regression models, compared to expert models, is of particular relevance and is evaluated on cross-validation test data.
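The key computational ingredient described above, conditionally Gaussian priors whose variances get their own hyperpriors, leads to simple conjugate Gibbs updates for the smoothing variances. A hedged sketch of one such update, assuming an inverse-gamma IG(a, b) hyperprior (the hyperparameter values and function name are illustrative):

```python
import numpy as np

def draw_smoothing_variance(beta, K, a, b, rng):
    """One Gibbs draw of a smoothing variance tau^2 under an IG(a, b) hyperprior:
    tau^2 | beta ~ IG(a + rk(K)/2, b + beta' K beta / 2), the standard
    conjugate update for a Gaussian prior with precision K / tau^2."""
    shape = a + 0.5 * np.linalg.matrix_rank(K)
    rate = b + 0.5 * float(beta @ K @ beta)
    return 1.0 / rng.gamma(shape, 1.0 / rate)   # inverse gamma via a gamma draw

# Toy check with a full-rank K and fixed coefficients.
rng = np.random.default_rng(0)
K = np.eye(4)
beta = np.ones(4)
draws = np.array([draw_smoothing_variance(beta, K, 1.0, 0.0005, rng)
                  for _ in range(20000)])
```

With different (a, b) choices, or limits of them, the same update covers the smoothness-inducing and sparseness-inducing hyperpriors the abstract mentions, which is what makes a single MCMC scheme possible for smooth, spatial and regularised effects jointly.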