255 research outputs found
Adaptive kernel estimation of the baseline function in the Cox model, with high-dimensional covariates
The aim of this article is to propose a novel kernel estimator of the
baseline function in a general high-dimensional Cox model, for which we derive
non-asymptotic rates of convergence. To construct our estimator, we first
estimate the regression parameter in the Cox model via a Lasso procedure. We
then plug this estimator into the classical kernel estimator of the baseline
function, obtained by smoothing the so-called Breslow estimator of the
cumulative baseline function. We propose and study an adaptive procedure for
selecting the bandwidth, in the spirit of Goldenshluger and Lepski (2011). We
state non-asymptotic oracle inequalities for the final estimator, which reveal
the reduction of the rates of convergence when the dimension of the covariates
grows.
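
The plug-in construction described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the regression parameter `beta` has already been estimated (the abstract uses a Lasso step, omitted here), takes the bandwidth `h` as given rather than selecting it adaptively, and uses an Epanechnikov kernel as an arbitrary choice. All function names are hypothetical.

```python
import numpy as np

def breslow_increments(time, event, Z, beta):
    """Jumps of the Breslow cumulative baseline hazard at the sorted observed times."""
    order = np.argsort(time)
    t, d = time[order], event[order]
    risk = np.exp(Z[order] @ beta)
    # For sorted index i, sum exp(beta'Z_j) over subjects still at risk (j >= i).
    s0 = np.cumsum(risk[::-1])[::-1]
    return t, d / s0

def epanechnikov(u):
    """Epanechnikov kernel, zero outside [-1, 1]."""
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)

def kernel_baseline_hazard(grid, time, event, Z, beta, h):
    """Smooth the Breslow jumps with kernel K_h(u) = K(u / h) / h (h is assumed given)."""
    t, jumps = breslow_increments(time, event, Z, beta)
    u = (np.asarray(grid, dtype=float)[:, None] - t[None, :]) / h
    return (epanechnikov(u) * jumps[None, :]).sum(axis=1) / h
```

Calling `kernel_baseline_hazard(grid, time, event, Z, beta_hat, h)` returns the smoothed baseline hazard on `grid`; estimates within `h` of the boundary of the observation window are subject to the usual kernel edge bias.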
Regularization for Cox's proportional hazards model with NP-dimensionality
High-throughput genetic sequencing arrays with thousands of measurements per
sample and a great amount of related censored clinical data have increased
the demand for better measurement-specific model selection. In this paper
we establish strong oracle properties of nonconcave penalized methods for
nonpolynomial (NP) dimensional data with censoring in the framework of Cox's
proportional hazards model. A class of folded-concave penalties is employed
and both LASSO and SCAD are discussed specifically. We address the question
of under which dimensionality and correlation restrictions an oracle estimator
can be constructed. It is demonstrated that nonconcave penalties lead
to significant reduction of the "irrepresentable condition" needed for LASSO
model selection consistency. The large deviation result for martingales,
bearing interests of its own, is developed for characterizing the strong oracle
property. Moreover, the nonconcave regularized estimator is shown to achieve
asymptotically the information bound of the oracle estimator. A coordinate-wise
algorithm is developed for finding the grid of solution paths for penalized
hazard regression problems, and its performance is evaluated on simulated and
gene association study examples.
Comment: Published at http://dx.doi.org/10.1214/11-AOS911 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
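
To make the contrast between the LASSO and SCAD penalties mentioned in this abstract concrete, the sketch below compares their univariate thresholding rules. It is an illustration under simplifying assumptions: the closed forms apply to a least-squares problem with an orthonormal design, not to the Cox partial likelihood studied in the paper, and the value `a = 3.7` is the conventional default rather than something stated in the abstract.

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO rule: shrink every coefficient toward zero by lam."""
    z = np.asarray(z, dtype=float)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def scad_threshold(z, lam, a=3.7):
    """SCAD rule: soft-threshold small inputs, leave large inputs unshrunk."""
    z = np.asarray(z, dtype=float)
    absz = np.abs(z)
    # Linear interpolation zone between the soft-thresholding and identity regions.
    middle = ((a - 1.0) * z - np.sign(z) * a * lam) / (a - 2.0)
    return np.where(absz <= 2.0 * lam, soft_threshold(z, lam),
                    np.where(absz <= a * lam, middle, z))
```

For an input well above `a * lam`, the SCAD rule returns the input unchanged while the LASSO rule still subtracts `lam`, which is the source of the reduced bias associated with folded-concave penalties.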
Oracle inequalities for the Lasso in the high-dimensional Aalen multiplicative intensity model
In a general counting process setting, we consider the problem of obtaining a
prognosis for the survival time adjusted for high-dimensional covariates.
Towards this end, we construct an estimator of the whole conditional intensity.
We estimate it by the best Cox proportional hazards model given two
dictionaries of functions. The first dictionary is used to construct an
approximation of the logarithm of the baseline hazard function and the second
to approximate the relative risk. We introduce a new data-driven weighted Lasso
procedure to estimate the unknown parameters of the best Cox model
approximating the intensity. We provide non-asymptotic oracle inequalities for
our procedure in terms of an appropriate empirical Kullback divergence. Our
results rely on an empirical Bernstein's inequality for martingales with jumps
and properties of modified self-concordant functions.
Variance Estimation Using Refitted Cross-validation in Ultrahigh Dimensional Regression
Variance estimation is a fundamental problem in statistical modeling. In
ultrahigh dimensional linear regressions where the dimensionality is much
larger than sample size, traditional variance estimation techniques are not
applicable. Recent advances on variable selection in ultrahigh dimensional
linear regressions make this problem accessible. One of the major problems in
ultrahigh dimensional regression is the high spurious correlation between the
unobserved realized noise and some of the predictors. As a result, the realized
noises are actually predicted when extra irrelevant variables are selected,
leading to a serious underestimate of the noise level. In this paper, we propose
a two-stage refitted procedure via a data splitting technique, called refitted
cross-validation (RCV), to attenuate the influence of irrelevant variables with
high spurious correlations. Our asymptotic results show that the resulting
procedure performs as well as the oracle estimator, which knows in advance the
mean regression function. The simulation studies lend further support to our
theoretical claims. The naive two-stage estimator, which fits the selected
variables in the first stage, and the plug-in one-stage estimators using LASSO
and SCAD are also studied and compared. Their performance can be improved by
the proposed RCV method.
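
The data-splitting idea behind refitted cross-validation can be sketched as below. This is a minimal illustration under stated assumptions: variables are selected by marginal correlation screening with a fixed model size `k` (a stand-in for the selection methods used in the paper), and the function names are hypothetical.

```python
import numpy as np

def screen_top_k(X, y, k):
    """Assumed selector: the k columns with largest absolute marginal correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) + 1e-12)
    return np.sort(np.argsort(score)[-k:])

def ols_variance(X, y, cols):
    """Residual variance from an OLS fit (with intercept) on the given columns."""
    D = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    return resid @ resid / (len(y) - D.shape[1])

def rcv_variance(X, y, k, rng):
    """Select on one half, refit on the other half, swap roles, and average."""
    perm = rng.permutation(len(y))
    a, b = perm[: len(y) // 2], perm[len(y) // 2:]
    v1 = ols_variance(X[b], y[b], screen_top_k(X[a], y[a], k))
    v2 = ols_variance(X[a], y[a], screen_top_k(X[b], y[b], k))
    return 0.5 * (v1 + v2)

def naive_variance(X, y, k):
    """Naive two-stage estimator: select and refit on the same data."""
    return ols_variance(X, y, screen_top_k(X, y, k))
```

Because the variables selected on one half are refitted on data whose noise they have not seen, spuriously correlated predictors no longer absorb the realized noise, which is the mechanism the abstract credits for removing the downward bias.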
Scalable Sparse Cox's Regression for Large-Scale Survival Data via Broken Adaptive Ridge
This paper develops a new scalable sparse Cox regression tool for sparse
high-dimensional massive sample size (sHDMSS) survival data. The method is a
local $L_0$-penalized Cox regression via repeatedly performing reweighted
$L_2$-penalized Cox regression. We show that the resulting estimator enjoys the
best of $L_0$- and $L_2$-penalized Cox regressions while overcoming their
limitations. Specifically, the estimator is selection consistent, oracle for
parameter estimation, and possesses a grouping property for highly correlated
covariates. Simulation results suggest that when the sample size is large, the
proposed method with pre-specified tuning parameters has a comparable or better
performance than some popular penalized regression methods. More importantly,
because the method naturally enables adaptation of efficient algorithms for
massive $L_2$-penalized optimization and does not require costly data driven
tuning parameter selection, it has a significant computational advantage for
sHDMSS data, offering an average of 5-fold speedup over its closest competitor
in empirical studies.
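
The iteratively reweighted ridge scheme that this abstract describes can be sketched as follows. Two assumptions are made loudly here: the Cox partial likelihood is replaced by a least-squares loss so that each ridge step has a closed form, and the ridge weight for coefficient j is taken to be 1 / beta_j^2 from the previous iterate, which is the usual broken adaptive ridge form but is not spelled out in the abstract. The function name and defaults are hypothetical.

```python
import numpy as np

def broken_adaptive_ridge(X, y, lam=1.0, ridge0=1.0, n_iter=100, tol=1e-10):
    """Reweighted ridge iterations for least squares (swapped in for the Cox loss)."""
    p = X.shape[1]
    G = X.T @ X
    b = X.T @ y
    # Start from an ordinary ridge estimate.
    beta = np.linalg.solve(G + ridge0 * np.eye(p), b)
    for _ in range(n_iter):
        g = np.abs(beta)
        # Equivalent to solving (G + lam * diag(1 / beta^2)) x = b, rewritten
        # so that coefficients that have reached zero cause no division by zero.
        A = (g[:, None] * G) * g[None, :] + lam * np.eye(p)
        new = g * np.linalg.solve(A, g * b)
        converged = np.max(np.abs(new - beta)) < tol
        beta = new
        if converged:
            break
    return beta
```

Each pass is a ridge solve, so the loop can reuse any efficient ridge solver, while the reweighting drives small coefficients toward zero and leaves large ones nearly unpenalized.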