126 research outputs found
Score Test Variable Screening
Variable screening has emerged as a crucial first step in the analysis of high-throughput data, but existing procedures can be computationally cumbersome, difficult to justify theoretically, or inapplicable to certain types of analyses. Motivated by a high-dimensional censored quantile regression problem in multiple myeloma genomics, this paper makes three contributions. First, we establish a score test-based screening framework, which is widely applicable, extremely computationally efficient, and relatively simple to justify. Second, we propose a resampling-based procedure for selecting the number of variables to retain after screening according to the principle of reproducibility. Finally, we propose a new iterative score test screening method which is closely related to sparse regression. In simulations, we apply our methods to four different regression models and show that they can outperform existing procedures. We also apply score test screening to an analysis of gene expression data from multiple myeloma patients using a censored quantile regression model to identify high-risk genes.
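As a rough illustration of marginal score test screening, the sketch below ranks predictors by a score statistic and keeps the top few. It assumes an ordinary linear model with an intercept rather than the censored quantile regression model of the paper, and it does not implement the resampling-based choice of how many variables to retain or the iterative variant. The names `score_statistic`, `score_screen`, and `keep`, as well as the simulated data, are hypothetical and used only for illustration.

```python
import random


def score_statistic(x, y):
    """Score statistic for H0: slope = 0 in y = intercept + slope * x + noise."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Score of the slope, evaluated at the intercept-only (null) fit.
    u = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sigma2 = sum((yi - y_bar) ** 2 for yi in y) / n  # variance estimate under the null
    return u * u / (sigma2 * sxx)


def score_screen(columns, y, keep):
    """Return the indices of the `keep` predictors with the largest statistics."""
    stats = [score_statistic(col, y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: stats[j], reverse=True)
    return ranked[:keep]


# Simulated data (an assumption for illustration): only predictors 2 and 7 matter.
rng = random.Random(1)
n, p = 80, 40
columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [1.5 * columns[2][i] - 1.5 * columns[7][i] + rng.gauss(0.0, 1.0)
     for i in range(n)]

kept = score_screen(columns, y, keep=5)
print(sorted(kept))
```

Each statistic needs only the null (intercept-only) fit, so no per-predictor model is estimated; this is the source of the computational efficiency that score-based screening aims for.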
RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs
Power and reproducibility are key to enabling refined scientific discoveries
in contemporary big data applications with general high-dimensional nonlinear
models. In this paper, we provide theoretical foundations on the power and
robustness for the model-free knockoffs procedure introduced recently in
Candès, Fan, Janson and Lv (2016) in the high-dimensional setting when the
covariate distribution is characterized by a Gaussian graphical model. We
establish that under mild regularity conditions, the power of the oracle
knockoffs procedure with known covariate distribution in high-dimensional
linear models is asymptotically one as sample size goes to infinity. When
moving away from the ideal case, we suggest the modified model-free knockoffs
method called graphical nonlinear knockoffs (RANK) to accommodate the unknown
covariate distribution. We provide theoretical justifications on the robustness
of our modified procedure by showing that the false discovery rate (FDR) is
asymptotically controlled at the target level and the power is asymptotically
one with the estimated covariate distribution. To the best of our knowledge,
this is the first formal theoretical result on the power for the knockoffs
procedure. Simulation results demonstrate that compared to existing approaches,
our method performs competitively in both FDR control and power. A real data
set is analyzed to further assess the performance of the suggested knockoffs
procedure. Comment: 37 pages, 6 tables, 9 pages supplementary material
Semiparametric Ultra-High Dimensional Model Averaging of Nonlinear Dynamic Time Series
We propose two semiparametric model averaging schemes for nonlinear dynamic time series
regression models with a very large number of covariates including exogenous regressors and autoregressive
lags. Our objective is to obtain more accurate estimates and forecasts of time series by using
a large number of conditioning variables in a nonparametric way. In the first scheme, we introduce a
Kernel Sure Independence Screening (KSIS) technique to screen out the regressors whose marginal
regression (or auto-regression) functions do not make a significant contribution to estimating the
joint multivariate regression function; we then propose a semiparametric penalized method of Model
Averaging MArginal Regression (MAMAR) for the regressors and auto-regressors that survive the
screening procedure, to further select the regressors that have significant effects on estimating the
multivariate regression function and predicting the future values of the response variable. In the
second scheme, we impose an approximate factor modelling structure on the ultra-high dimensional
exogenous regressors and use the principal component analysis to estimate the latent common factors;
we then apply the penalized MAMAR method to select the estimated common factors and the
lags of the response variable that are significant. In each of the two schemes, we construct the
optimal combination of the significant marginal regression and auto-regression functions. Asymptotic
properties for these two schemes are derived under some regularity conditions. Numerical studies
including both simulation and an empirical application to forecasting inflation are given to illustrate
the proposed methodology.
Robust rank correlation based screening
Independence screening is a variable selection method that uses a ranking
criterion to select significant variables, particularly for statistical models
with nonpolynomial dimensionality or "large p, small n" paradigms when p can be
as large as an exponential of the sample size n. In this paper we propose a
robust rank correlation screening (RRCS) method to deal with ultra-high
dimensional data. The new procedure is based on the Kendall τ correlation
coefficient between response and predictor variables rather than the Pearson
correlation of existing methods. The new method has four desirable features
compared with existing independence screening methods. First, the sure
independence screening property holds under only the existence of a
second-order moment of the predictor variables, rather than exponential tails
or similar conditions, even when the number of predictor variables grows
exponentially with the sample size. Second, it can be used to deal with
semiparametric models such as transformation regression models and single-index
models under monotonic constraint to the link function without involving
nonparametric estimation even when there are nonparametric functions in the
models. Third, the procedure is largely robust against outliers and influential
points in the observations. Last, the use of indicator functions in rank
correlation screening greatly simplifies the theoretical derivation due to the
boundedness of the resulting statistics, compared with previous studies on
variable screening. Simulations are carried out for comparisons with existing
methods and a real data example is analyzed. Comment: Published at http://dx.doi.org/10.1214/12-AOS1024 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org). arXiv admin note: text overlap with
arXiv:0903.525
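A minimal sketch of rank correlation screening, assuming continuous data without ties: each predictor is scored by the absolute Kendall τ between it and the response, and the highest-scoring predictors are retained. The function names, the retained count `keep`, and the simulated transformation model are hypothetical choices for illustration, not the authors' implementation; in particular, no thresholding rule from the paper is reproduced here.

```python
import math
import random


def kendall_tau(x, y):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    total = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            if prod > 0:
                total += 1  # concordant pair
            elif prod < 0:
                total -= 1  # discordant pair
    return total / (n * (n - 1) / 2)


def rank_screen(columns, y, keep):
    """Return the indices of the `keep` predictors with the largest |tau|."""
    scores = [abs(kendall_tau(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep]


# Simulated transformation model (an assumption for illustration): the response
# is a monotone function, exp, of a linear index in predictors 0 and 3.
rng = random.Random(0)
n, p = 60, 30
columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [math.exp(2.0 * columns[0][i] - 2.0 * columns[3][i] + 0.3 * rng.gauss(0.0, 1.0))
     for i in range(n)]

selected = rank_screen(columns, y, keep=5)
print(sorted(selected))
```

Because the statistic depends only on the signs of pairwise differences, applying any strictly increasing transformation to the response leaves the ranking unchanged, which is why no estimate of the link function is needed in this sketch.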