
    Scalable Sparse Cox's Regression for Large-Scale Survival Data via Broken Adaptive Ridge

    This paper develops a new scalable sparse Cox regression tool for sparse high-dimensional massive sample size (sHDMSS) survival data. The method is a local L_0-penalized Cox regression carried out by repeatedly performing reweighted L_2-penalized Cox regression. We show that the resulting estimator enjoys the best of L_0- and L_2-penalized Cox regressions while overcoming their limitations. Specifically, the estimator is selection consistent, oracle for parameter estimation, and possesses a grouping property for highly correlated covariates. Simulation results suggest that when the sample size is large, the proposed method with pre-specified tuning parameters performs comparably to or better than some popular penalized regression methods. More importantly, because the method naturally enables adaptation of efficient algorithms for massive L_2-penalized optimization and does not require costly data-driven tuning parameter selection, it has a significant computational advantage for sHDMSS data, offering an average 5-fold speedup over its closest competitor in empirical studies.
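
The reweighted-ridge idea behind broken adaptive ridge can be sketched in a few lines. The snippet below is a minimal illustration using plain least squares in place of the Cox partial likelihood (the paper works with the latter); each iteration solves a ridge problem whose per-coefficient weights are the inverse squares of the current estimates, so small coefficients are driven to zero while large ones are barely shrunk. The function name, tolerances, and data are illustrative, not the authors' implementation.

```python
import numpy as np

def bar_linear(X, y, lam=1.0, n_iter=50, eps=1e-8):
    """Broken-adaptive-ridge sketch on a linear model:
    L0-like selection via iteratively reweighted L2 (ridge) fits."""
    n, p = X.shape
    # plain ridge fit as the starting value
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    for _ in range(n_iter):
        w = 1.0 / (beta ** 2 + eps)          # adaptive ridge weights
        beta = np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ y)
    beta[np.abs(beta) < 1e-6] = 0.0          # numerically-zero entries
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true = np.zeros(10)
true[:3] = [2.0, -1.5, 1.0]                  # 3 signal, 7 noise covariates
y = X @ true + 0.1 * rng.standard_normal(200)
beta = bar_linear(X, y)
```

Because the weight on a near-zero coefficient explodes, its next ridge update shrinks it further, converging to an exact zero; the signal coefficients see only a negligible penalty. This is why no data-driven tuning is strictly required for the method to select variables.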

    Statistical Inference for Diverging Number of Parameters beyond Linear Regression

    In the big data era, regression models with a large number of covariates have emerged as a common tool to tackle problems arising from business, engineering, genomics, neuroimaging, and epidemiological studies. Drawing statistical inference for these models has sparked much interest over the past few years. Though successful for high-dimensional linear models, high-dimensional inference approaches beyond linear regression are limited and present unsatisfactory performance, theoretically or numerically. In this dissertation, we focus on the de-biased lasso, which has been one of the most popular methods for high-dimensional inference. We propose procedures that provide better bias correction and confidence interval coverage, and draw reliable inference for regression parameters in the "large n, diverging p" scenario. In general, we caution against applying the de-biased lasso and its variants to models beyond linear regression when parameters outnumber the sample size. Following an overview in Chapter I, we focus on generalized linear models (GLMs) in Chapter II. Extensive numerical simulations indicate that the de-biased lasso may not adequately remove biases for high-dimensional GLMs, and thus yields unreliable confidence intervals. We further find that several key assumptions, especially the sparsity condition on the inverse Hessian matrix, may not hold for GLMs. In a "large n, diverging p" scenario, we consider an alternative de-biased lasso approach that inverts the Hessian matrix of the model of interest without requiring matrix sparsity, and establish the asymptotic distributions of linear combinations of the estimates. Simulations show that our proposed de-biased estimator performs better in bias correction and confidence interval coverage for a wide range of p/n ratios.
We apply our method to the Boston Lung Cancer Study, an epidemiology study on the mechanisms underlying lung cancer, and investigate the joint effects of genetic variants on overall lung cancer risk. In Chapter III, we draw inference based on the Cox proportional hazards model with a diverging number of covariates. As the existing methods assume sparsity on the inverse of the Fisher information matrix, which may not hold for Cox models, they typically generate biased estimates and under-covered confidence intervals. We modify the de-biased lasso by using quadratic programming to approximate the inverse of the information matrix, without imposing matrix sparsity assumptions. We establish the asymptotic theory for the estimated regression coefficients when the covariate dimension diverges with the sample size. In extensive simulations, our proposed method provides consistent estimates and confidence intervals with improved coverage probabilities. We apply the proposed method to assess the effects of genetic markers on overall survival of non-small cell lung cancer patients in the aforementioned Boston Lung Cancer Study. The stratified Cox proportional hazards model, with extensive applications in large-scale cohort studies, is useful when some covariates violate the proportional hazards assumption or when data are stratified by factors such as transplant centers. In Chapter IV, we extend the de-biased lasso approach proposed in Chapter III to draw inference for the stratified Cox model with potentially many covariates. We provide asymptotic results useful for inference on linear combinations of the regression parameters, and demonstrate its utility via simulation studies. We apply the method to analyze national kidney transplantation data stratified by transplant center, and assess the effects of many factors on graft survival.

PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/162934/1/luxia_1.pd
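
The one-step bias correction at the heart of the de-biased lasso can be sketched for the linear model in the "large n, diverging p" regime (p < n), where the Hessian (here X'X/n) can be inverted directly rather than approximated under a sparsity assumption, mirroring in spirit the dissertation's direct-inversion approach. The lasso solver below is a plain ISTA loop included only to keep the example self-contained; all names are illustrative.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=2000):
    """Minimal proximal-gradient (ISTA) lasso solver."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n            # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - grad / L
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta

def debiased_lasso(X, y, lam):
    """Lasso fit plus a one-step correction using the inverted Hessian."""
    n = X.shape[0]
    b = lasso_ista(X, y, lam)
    theta = np.linalg.inv(X.T @ X / n)           # direct inverse, no sparsity assumed
    return b + theta @ (X.T @ (y - X @ b)) / n

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 20))
true = np.zeros(20)
true[:2] = [1.0, -1.0]
y = X @ true + 0.5 * rng.standard_normal(500)
b_db = debiased_lasso(X, y, lam=0.1)
```

A useful sanity check on the algebra: in the linear model with the exact Hessian inverse, the corrected estimator reduces exactly to ordinary least squares, which is unbiased; for GLMs and Cox models the correction is only approximate, which is where the asymptotic theory is needed.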

    VARIABLE SELECTION FOR CASE-COHORT STUDIES WITH FAILURE TIME OUTCOME

    The case-cohort design is widely used in large cohort studies with failure time data to reduce the cost associated with covariate measurement. Many of those studies collect a large number of covariates, so an efficient variable selection method is needed for the case-cohort design. In this dissertation, we study the properties of the Smoothly Clipped Absolute Deviation (SCAD) penalty based variable selection procedure for the Cox proportional hazards model and the additive hazards model in a case-cohort design with a diverging number of parameters. We prove that the SCAD-penalized variable selection procedure can identify the true model with probability tending to one under the Cox proportional hazards model. We then establish the consistency and asymptotic normality of the penalized estimator. We show via simulation that the BIC-based tuning parameter selection method outperforms the AIC-based method under typical case-cohort study settings. The proposed procedure is applied to the Busselton Health Study (Cullen 1972; Knuiman et al. 2003). The additive hazards model is a useful alternative to the Cox model for analyzing failure time data. In the second part of the dissertation, we extend the SCAD-penalized variable selection procedure to the additive hazards model with a stratified case-cohort design and a diverging number of parameters. We again establish variable selection consistency, estimation consistency, and asymptotic normality of the penalized estimator under this setting. We propose a new tuning parameter selection method and evaluate its performance via simulation, showing that it outperforms the conventional k-fold cross-validation method. The proposed procedure is applied to the Atherosclerosis Risk in Communities (ARIC) study (ARIC 2004). Tuning parameter selection is critical to the success of a regularized variable selection method. A consistent tuning parameter selection method has not been established for the SCAD-penalized Cox model with a diverging dimension. In the last part of the dissertation, we propose a generalized information criterion (GIC) for tuning parameter selection and establish conditions required for its variable selection consistency under this setting. A simulation study shows that GIC performs well under the required conditions with finite sample size. It is then applied to the Framingham Heart Study (Framingham).

Doctor of Philosophy
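
For readers unfamiliar with the SCAD penalty used throughout, it behaves like the lasso (slope lam) near zero, tapers quadratically on (lam, a*lam], and is constant beyond a*lam, so large coefficients escape shrinkage; a = 3.7 is the conventional default of Fan and Li (2001). A direct transcription of the piecewise formula:

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty p_lam(|t|), vectorized over t.
    Piecewise: linear on [0, lam], quadratic taper on (lam, a*lam],
    constant lam^2*(a+1)/2 beyond a*lam."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t ** 2 - lam ** 2) / (2 * (a - 1)),
            lam ** 2 * (a + 1) / 2,
        ),
    )

# with lam = 1: linear at 0.5, kink at 1, flat at 2.35 past a*lam = 3.7
values = scad_penalty(np.array([0.5, 1.0, 10.0]), lam=1.0)
```

The flat tail is what gives SCAD its oracle property (nearly unbiased estimates of large effects), and it is also why tuning parameter selection, the subject of the last chapter, matters so much: lam controls where the linear-selection region ends.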

    The Lasso for High-Dimensional Regression with a Possible Change-Point

    We consider a high-dimensional regression model with a possible change-point due to a covariate threshold and develop the Lasso estimator of the regression coefficients as well as the threshold parameter. Our Lasso estimator not only selects covariates but also selects a model between linear and threshold regression models. Under a sparsity assumption, we derive non-asymptotic oracle inequalities for both the prediction risk and the ℓ_1 estimation loss for regression coefficients. Since the Lasso estimator selects variables simultaneously, we show that oracle inequalities can be established without pretesting the existence of the threshold effect. Furthermore, we establish conditions under which the estimation error of the unknown threshold parameter can be bounded by a nearly n^{-1} factor even when the number of regressors can be much larger than the sample size (n). We illustrate the usefulness of our proposed estimation method via Monte Carlo simulations and an application to real data.
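
The estimation idea can be sketched as a profiled grid search: for each candidate threshold tau, fit the lasso to the augmented design [X, X·1{Q > tau}] and keep the tau minimizing the penalized objective; a nonzero second block then signals the threshold model over the plain linear one. The short ISTA solver and all names below are illustrative, not the authors' implementation.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=1000):
    """Minimal proximal-gradient (ISTA) lasso solver."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - X.T @ (X @ beta - y) / (n * L)
        beta = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return beta

def threshold_lasso(X, y, Q, lam, taus):
    """Profile the penalized objective over a grid of thresholds."""
    n = X.shape[0]
    best = None
    for tau in taus:
        Z = np.hstack([X, X * (Q > tau)[:, None]])   # [X, X * indicator]
        b = lasso_ista(Z, y, lam)
        obj = np.sum((y - Z @ b) ** 2) / (2 * n) + lam * np.abs(b).sum()
        if best is None or obj < best[0]:
            best = (obj, tau, b)
    return best[1], best[2]                          # (tau_hat, coefficients)

rng = np.random.default_rng(2)
n = 400
X = rng.standard_normal((n, 5))
Q = rng.uniform(size=n)                              # threshold covariate
true_tau = 0.5
# coefficient on X1 jumps by 2 once Q crosses the threshold
y = X[:, 0] + 2.0 * X[:, 1] * (Q > true_tau) + 0.1 * rng.standard_normal(n)
tau_hat, b_hat = threshold_lasso(X, y, Q, lam=0.05,
                                 taus=np.linspace(0.2, 0.8, 13))
```

Because the lasso is refit at every grid point and the objective is compared across points, no pretest for the existence of a threshold effect is needed, which is the point the oracle inequalities make rigorous.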