Scalable Sparse Cox's Regression for Large-Scale Survival Data via Broken Adaptive Ridge
This paper develops a new scalable sparse Cox regression tool for sparse high-dimensional massive sample size (sHDMSS) survival data. The method is a local L0-penalized Cox regression obtained by repeatedly performing reweighted L2-penalized Cox regression. We show that the resulting estimator enjoys the best of L0- and L2-penalized Cox regressions while overcoming their limitations. Specifically, the estimator is selection consistent, oracle for parameter estimation, and possesses a grouping property for highly correlated covariates. Simulation results suggest that when the sample size is large, the proposed method with pre-specified tuning parameters performs comparably to or better than some popular penalized regression methods. More importantly, because the method naturally adapts efficient algorithms for massive L2-penalized optimization and does not require costly data-driven tuning parameter selection, it has a significant computational advantage for sHDMSS data, offering an average 5-fold speedup over its closest competitor in empirical studies.
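The broken-adaptive-ridge iteration described above can be sketched in a few lines. The sketch below is a simplified illustration for linear least squares rather than the Cox partial likelihood used in the paper; the function name, the choices of `lam`, `xi`, and the numerical thresholds are assumptions for illustration, not the authors' implementation. Each step solves a ridge problem reweighted by the inverse squared coefficients from the previous step, which drives small coefficients toward exactly zero (an L0-like surrogate) while leaving large coefficients essentially unpenalized.

```python
import numpy as np

def broken_adaptive_ridge(X, y, lam=1.0, xi=1.0, n_iter=50, tol=1e-8):
    """Illustrative broken-adaptive-ridge iteration for linear least squares.

    Each step solves a ridge problem whose per-coefficient weights are the
    inverse squared coefficients from the previous step, so coefficients that
    are already small get an ever larger penalty and collapse toward zero.
    """
    n, p = X.shape
    # Initialize with an ordinary ridge estimate.
    beta = np.linalg.solve(X.T @ X + xi * np.eye(p), X.T @ y)
    for _ in range(n_iter):
        w = 1.0 / np.maximum(beta**2, 1e-12)   # reweighting by 1 / beta_j^2
        beta_new = np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-6] = 0.0            # zero out numerically-dead entries
    return beta
```

Because every step is just a weighted ridge solve, the scheme can reuse any fast solver for L2-penalized problems, which is the computational point made in the abstract.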
Statistical Inference for Diverging Number of Parameters beyond Linear Regression
In the big data era, regression models with a large number of covariates have emerged as a common tool to tackle problems arising from business, engineering, genomics, neuroimaging, and epidemiological studies. Drawing statistical inference for these models has sparked much interest over the past few years. While successful for high dimensional linear models, high dimensional inference approaches beyond linear regression remain limited and can perform unsatisfactorily, theoretically or numerically. In this dissertation, we focus on de-biased lasso, which has been one of the most popular methods for high dimensional inference. We propose procedures that provide better bias correction and confidence interval coverage, and draw reliable inference for regression parameters in the "large n, diverging p" scenario. In general, we caution against applying de-biased lasso and its variants to models beyond linear regression when parameters outnumber the sample size.
Following an overview outlined in Chapter I, we focus on generalized linear models (GLMs) in Chapter II. Extensive numerical simulations indicate that de-biased lasso may not adequately remove biases for high dimensional GLMs, and thus yields unreliable confidence intervals. We further find that several key assumptions, especially the sparsity condition on the inverse Hessian matrix, may not hold for GLMs. In a "large n, diverging p" scenario, we consider an alternative de-biased lasso approach that inverts the Hessian matrix of the model in question without requiring matrix sparsity, and establish the asymptotic distributions of linear combinations of the estimates. Simulations show that our proposed de-biased estimator performs better in bias correction and confidence interval coverage for a wide range of p/n ratios. We apply our method to the Boston Lung Cancer Study, an epidemiology study on the mechanisms underlying lung cancer, and investigate the joint effects of genetic variants on overall lung cancer risk.
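To make the Hessian-inversion idea concrete, here is a minimal sketch for a logistic model: obtain an L1-penalized fit by proximal gradient, then apply a one-step correction that inverts the full sample Hessian at that fit rather than assuming its inverse is sparse. All function names and tuning choices are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lasso_logistic(X, y, lam, n_iter=2000):
    """L1-penalized logistic regression via proximal gradient (ISTA)."""
    n, p = X.shape
    step = 4.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1/L for the logistic loss
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta) - y) / n
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

def debias(X, y, beta_lasso):
    """One-step correction inverting the full sample Hessian at the lasso fit
    (no sparsity assumption on the inverse), for the 'large n, diverging p' regime."""
    n = X.shape[0]
    mu = sigmoid(X @ beta_lasso)
    W = mu * (1 - mu)
    H = (X * W[:, None]).T @ X / n    # Hessian of the average negative log-likelihood
    score = X.T @ (y - mu) / n        # score (gradient of log-likelihood) at the lasso fit
    return beta_lasso + np.linalg.solve(H, score)
```

The corrected estimator is the lasso fit plus a Newton-type step, which removes most of the shrinkage bias and makes Wald-type intervals meaningful when p grows with but stays below n.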
In Chapter III, we draw inference based on the Cox proportional hazards model with a diverging number of covariates. As the existing methods assume sparsity on the inverse of the Fisher information matrix, which may not hold for Cox models, they typically generate biased estimates and under-covered confidence intervals. We modify de-biased lasso by using quadratic programming to approximate the inverse of the information matrix, without imposing matrix sparsity assumptions. We establish the asymptotic theory for the estimated regression coefficients when the covariate dimension diverges with the sample size. Extensive simulations show that our proposed method provides consistent estimates and confidence intervals with improved coverage probabilities. We apply the proposed method to assess the effects of genetic markers on overall survival of non-small cell lung cancer patients in the aforementioned Boston Lung Cancer Study.
Stratified Cox proportional hazards models, with extensive applications in large scale cohort studies, are useful when some covariates violate the proportional hazards assumption or when data are stratified by factors such as transplant centers. In Chapter IV, we extend the de-biased lasso approach proposed in Chapter III to draw inference for the stratified Cox model with potentially many covariates. We provide asymptotic results useful for inference on linear combinations of the regression parameters, and demonstrate the method's utility via simulation studies. We apply the method to analyze national kidney transplantation data stratified by transplant center, and assess the effects of many factors on graft survival.
PhD dissertation, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162934/1/luxia_1.pd
VARIABLE SELECTION FOR CASE-COHORT STUDIES WITH FAILURE TIME OUTCOME
The case-cohort design is widely used in large cohort studies with failure time data to reduce the cost associated with covariate measurement. Many such studies collect a large number of covariates, so an efficient variable selection method is needed for the case-cohort design. In this dissertation, we study the properties of the Smoothly Clipped Absolute Deviation (SCAD) penalty based variable selection procedure in the Cox proportional hazards model and the additive hazards model under a case-cohort design with a diverging number of parameters. We prove that the SCAD-penalized variable selection procedure can identify the true model with probability tending to one under the Cox proportional hazards model. We then establish the consistency and asymptotic normality of the penalized estimator. We show via simulation that the BIC-based tuning parameter selection method outperforms the AIC-based method under typical case-cohort study settings. The proposed procedure is applied to the Busselton Health Study (Cullen, 1972; Knuiman et al., 2003). The additive hazards model is a useful alternative to the Cox model for analyzing failure time data. In the second part of the dissertation, we extend the SCAD-penalized variable selection procedure to the additive hazards model with a stratified case-cohort design and a diverging number of parameters. We again establish variable selection consistency, estimation consistency, and asymptotic normality of the penalized estimator in this setting. We propose a new tuning parameter selection method and evaluate its performance via simulation, showing that it outperforms the conventional k-fold cross-validation method. The proposed procedure is applied to the Atherosclerosis Risk in Communities (ARIC) study (ARIC, 2004). Tuning parameter selection is critical to the success of a regularized variable selection method.
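For reference, the SCAD penalty of Fan and Li (2001) used throughout this work has a closed piecewise form: linear (lasso-like) near zero, quadratic in a middle band, and constant beyond a*lam, so large coefficients are not shrunk (near-unbiasedness). A direct transcription with the conventional choice a = 3.7:

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001); a = 3.7 is the conventional choice.

    p(t) = lam*|t|                               for |t| <= lam
         = (2*a*lam*|t| - t^2 - lam^2)/(2(a-1))  for lam < |t| <= a*lam
         = lam^2*(a+1)/2                         for |t| > a*lam
    """
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
            lam**2 * (a + 1) / 2,
        ),
    )
```

The three pieces match at the knots (p(lam) = lam^2 and p(a*lam) = lam^2*(a+1)/2 from either side), which is what makes the resulting estimator continuous in the data.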
A consistent tuning parameter selection method has not been established for the SCAD-penalized Cox model with a diverging dimension. In the last part of the dissertation, we propose a generalized information criterion (GIC) for tuning parameter selection and establish the conditions required for its variable selection consistency in this setting. A simulation study shows that GIC performs well under the required conditions with finite sample sizes. It is then applied to the Framingham Heart Study.
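A generalized information criterion of this flavor trades goodness of fit against a penalty a_n times the model size. The sketch below is only a stand-in: it uses the lasso on a linear model rather than the SCAD-penalized Cox model studied in the dissertation, and a_n = log(log n) * log p is one common choice for a diverging-dimension criterion; the function names, grid, and data are all assumptions.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=1000):
    """Plain lasso via proximal gradient (ISTA), as a stand-in fitter."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)   # 1/L for the least-squares loss
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y)) / n
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

def gic_select(X, y, lam_grid):
    """Pick lambda minimizing GIC(lam) = n*log(RSS/n) + a_n * df,
    with a_n = log(log n) * log p (a common diverging-dimension choice)."""
    n, p = X.shape
    a_n = np.log(np.log(n)) * np.log(p)
    best = None
    for lam in lam_grid:
        beta = lasso_ista(X, y, lam)
        rss = np.sum((y - X @ beta) ** 2)
        df = np.count_nonzero(beta)          # model size of the fitted support
        gic = n * np.log(rss / n) + a_n * df
        if best is None or gic < best[0]:
            best = (gic, lam, beta)
    return best[1], best[2]
```

The point of letting a_n grow with n and p is exactly the consistency issue raised above: a fixed penalty (AIC-like) tends to overselect when the dimension diverges.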
The Lasso for High-Dimensional Regression with a Possible Change-Point
We consider a high-dimensional regression model with a possible change-point
due to a covariate threshold and develop the Lasso estimator of regression
coefficients as well as the threshold parameter. Our Lasso estimator not only
selects covariates but also selects a model between linear and threshold
regression models. Under a sparsity assumption, we derive non-asymptotic oracle
inequalities for both the prediction risk and the estimation loss for
regression coefficients. Since the Lasso estimator selects variables
simultaneously, we show that oracle inequalities can be established without
pretesting the existence of the threshold effect. Furthermore, we establish
conditions under which the estimation error of the unknown threshold parameter
can be bounded by a nearly n^{-1} factor even when the number of regressors
can be much larger than the sample size (p >> n). We illustrate the usefulness of
our proposed estimation method via Monte Carlo simulations and an application
to real data.
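The estimator can be illustrated by profiling a lasso objective over a grid of candidate thresholds: for each candidate tau, fit the lasso on the design augmented with the thresholded covariates, then keep the tau attaining the smallest penalized objective. A minimal sketch under assumed names and tuning values, not the authors' code:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=1000):
    """Plain lasso via proximal gradient (ISTA)."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y)) / n
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

def threshold_lasso(X, q, y, lam, tau_grid):
    """Profile the penalized least-squares objective over candidate thresholds.

    For each tau, fit the lasso on the augmented design [X, X*1{q > tau}];
    the returned (tau, beta, delta) minimize the penalized objective, and
    delta = 0 corresponds to selecting the plain linear model."""
    best = None
    n = len(y)
    for tau in tau_grid:
        Z = np.hstack([X, X * (q > tau)[:, None]])
        theta = lasso_ista(Z, y, lam)
        obj = np.sum((y - Z @ theta) ** 2) / (2 * n) + lam * np.sum(np.abs(theta))
        if best is None or obj < best[0]:
            best = (obj, tau, theta)
    _, tau_hat, theta_hat = best
    p = X.shape[1]
    return tau_hat, theta_hat[:p], theta_hat[p:]   # threshold, baseline, jump
```

Because the jump coefficients delta are penalized like any other coefficients, the lasso can shrink them to exactly zero, which is how the procedure "selects a model between linear and threshold regression" as described above.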
Statistical Analysis of Complex Data in Survival and Event History Analysis
This thesis studies two aspects of the statistical analysis of complex data in survival and event history analysis. After a short introduction to survival and event history analysis in Chapter 1, we propose a multivariate proportional intensity factor model for multivariate counting processes in Chapter 2. In an exploratory analysis of process data, a large number of possibly time-varying covariates may be included. These covariates, along with the high-dimensional counting processes, often exhibit a low-dimensional structure that has meaningful interpretation. We explore such structure by specifying random coefficients in a low-dimensional space through a factor model. For the estimation of the resulting model, we establish the asymptotic theory of the nonparametric maximum likelihood estimator (NPMLE). In particular, the NPMLE is consistent, asymptotically normal, and asymptotically efficient, with a covariance matrix that can be consistently estimated by the inverse information matrix or the profile likelihood method under suitable regularity conditions. Furthermore, to obtain a parsimonious model and to improve the interpretation of its parameters, variable selection and estimation for both fixed and random effects are developed via penalized likelihood. We illustrate the method using simulation studies as well as a real data application from the Programme for the International Assessment of Adult Competencies (PIAAC). Chapter 3 concerns rare events and sparse covariates in event history analysis. In large-scale longitudinal observational databases, the majority of subjects may not experience a particular event of interest. Furthermore, the associated covariate processes may also be zero for most subjects at any given time.
We formulate this setting of rare events and sparse covariates under the proportional intensity model and establish the validity of using the partial likelihood estimator and the observed information matrix for inference in this framework.
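The chapters above rest on the partial-likelihood machinery of the proportional intensity (Cox-type) model. As a minimal reference point, the following sketch fits a Cox model by Newton's method on the partial likelihood, assuming continuous (untied) event times; it is an illustration of the standard estimator, not the thesis's method.

```python
import numpy as np

def fit_cox(X, time, event, n_iter=25):
    """Newton's method on the Cox partial likelihood (no ties assumed).

    Uses suffix sums over subjects sorted by time to form the risk-set
    quantities S0, S1, S2 at each observed time."""
    n, p = X.shape
    order = np.argsort(time)
    Xs, es = X[order], event[order].astype(float)
    beta = np.zeros(p)
    for _ in range(n_iter):
        w = np.exp(Xs @ beta)
        # Risk set at time t_i = subjects with time >= t_i, hence suffix sums.
        S0 = np.cumsum(w[::-1])[::-1]
        S1 = np.cumsum((w[:, None] * Xs)[::-1], axis=0)[::-1]
        outer = w[:, None, None] * (Xs[:, :, None] * Xs[:, None, :])
        S2 = np.cumsum(outer[::-1], axis=0)[::-1]
        mu = S1 / S0[:, None]                       # risk-set covariate mean
        grad = (es[:, None] * (Xs - mu)).sum(axis=0)
        V = S2 / S0[:, None, None] - mu[:, :, None] * mu[:, None, :]
        H = (es[:, None, None] * V).sum(axis=0)     # observed information
        beta = beta + np.linalg.solve(H, grad)      # Newton ascent step
    return beta
```

The same score and observed information appear in the rare-event analysis above; the validity question there is whether these quantities remain reliable when events are rare and covariates are mostly zero.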