Scalable Sparse Cox's Regression for Large-Scale Survival Data via Broken Adaptive Ridge
This paper develops a new scalable sparse Cox regression tool for sparse high-dimensional massive sample size (sHDMSS) survival data. The method is a local L0-penalized Cox regression obtained by repeatedly performing reweighted L2-penalized Cox regression. We show that the resulting estimator enjoys the best of L0- and L2-penalized Cox regressions while overcoming their limitations. Specifically, the estimator is selection consistent, oracle for parameter estimation, and possesses a grouping property for highly correlated covariates. Simulation results suggest that when the sample size is large, the proposed method with pre-specified tuning parameters performs comparably to or better than some popular penalized regression methods. More importantly, because the method naturally adapts efficient algorithms for massive L2-penalized optimization and does not require costly data-driven tuning parameter selection, it has a significant computational advantage for sHDMSS data, offering an average 5-fold speedup over its closest competitor in empirical studies.
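The broken-adaptive-ridge iteration described above can be sketched in a few lines. The sketch below is a simplified illustration for linear least squares rather than the Cox partial likelihood used in the paper; the function name, the choices of `lam`, `xi`, and the numerical thresholds are assumptions for illustration, not the authors' implementation. Each step solves a ridge problem reweighted by the inverse squared coefficients from the previous step, which drives small coefficients toward exactly zero (an L0-like surrogate) while leaving large coefficients essentially unpenalized.

```python
import numpy as np

def broken_adaptive_ridge(X, y, lam=1.0, xi=1.0, n_iter=50, tol=1e-8):
    """Illustrative broken-adaptive-ridge iteration for linear least squares.

    Each step solves a ridge problem whose per-coefficient weights are the
    inverse squared coefficients from the previous step, so coefficients that
    are already small get an ever larger penalty and collapse toward zero.
    """
    n, p = X.shape
    # Initialize with an ordinary ridge estimate.
    beta = np.linalg.solve(X.T @ X + xi * np.eye(p), X.T @ y)
    for _ in range(n_iter):
        w = 1.0 / np.maximum(beta**2, 1e-12)   # reweighting by 1 / beta_j^2
        beta_new = np.linalg.solve(X.T @ X + lam * np.diag(w), X.T @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    beta[np.abs(beta) < 1e-6] = 0.0            # zero out numerically-dead entries
    return beta
```

Because every step is just a weighted ridge solve, the scheme can reuse any fast solver for L2-penalized problems, which is the computational point made in the abstract.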
Statistical Inference for Diverging Number of Parameters beyond Linear Regression
In the big data era, regression models with a large number of covariates have emerged as a common tool to tackle problems arising from business, engineering, genomics, neuroimaging, and epidemiological studies. Drawing statistical inference for these models has sparked much interest over the past few years. While successful for high dimensional linear models, high dimensional inference approaches beyond linear regression remain limited and can perform unsatisfactorily, theoretically or numerically. In this dissertation, we focus on de-biased lasso, which has been one of the most popular methods for high dimensional inference. We propose procedures that provide better bias correction and confidence interval coverage, and draw reliable inference for regression parameters in the "large n, diverging p" scenario. In general, we caution against applying de-biased lasso and its variants to models beyond linear regression when parameters outnumber the sample size.
Following an overview outlined in Chapter I, we focus on generalized linear models (GLMs) in Chapter II. Extensive numerical simulations indicate that de-biased lasso may not adequately remove biases for high dimensional GLMs, and thus yields unreliable confidence intervals. We further find that several key assumptions, especially the sparsity condition on the inverse Hessian matrix, may not hold for GLMs. In a "large n, diverging p" scenario, we consider an alternative de-biased lasso approach that inverts the Hessian matrix of the model in question without requiring matrix sparsity, and establish the asymptotic distributions of linear combinations of the estimates. Simulations show that our proposed de-biased estimator performs better in bias correction and confidence interval coverage for a wide range of p/n ratios. We apply our method to the Boston Lung Cancer Study, an epidemiology study on the mechanisms underlying lung cancer, and investigate the joint effects of genetic variants on overall lung cancer risk.
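To make the Hessian-inversion idea concrete, here is a minimal sketch for a logistic model: obtain an L1-penalized fit by proximal gradient, then apply a one-step correction that inverts the full sample Hessian at that fit rather than assuming its inverse is sparse. All function names and tuning choices are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lasso_logistic(X, y, lam, n_iter=2000):
    """L1-penalized logistic regression via proximal gradient (ISTA)."""
    n, p = X.shape
    step = 4.0 / (np.linalg.norm(X, 2) ** 2 / n)   # 1/L for the logistic loss
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ beta) - y) / n
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

def debias(X, y, beta_lasso):
    """One-step correction inverting the full sample Hessian at the lasso fit
    (no sparsity assumption on the inverse), for the 'large n, diverging p' regime."""
    n = X.shape[0]
    mu = sigmoid(X @ beta_lasso)
    W = mu * (1 - mu)
    H = (X * W[:, None]).T @ X / n    # Hessian of the average negative log-likelihood
    score = X.T @ (y - mu) / n        # score (gradient of log-likelihood) at the lasso fit
    return beta_lasso + np.linalg.solve(H, score)
```

The corrected estimator is the lasso fit plus a Newton-type step, which removes most of the shrinkage bias and makes Wald-type intervals meaningful when p grows with but stays below n.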
In Chapter III, we draw inference based on the Cox proportional hazards model with a diverging number of covariates. As the existing methods assume sparsity on the inverse of the Fisher information matrix, which may not hold for Cox models, they typically generate biased estimates and under-covered confidence intervals. We modify de-biased lasso by using quadratic programming to approximate the inverse of the information matrix, without imposing matrix sparsity assumptions. We establish the asymptotic theory for the estimated regression coefficients when the covariate dimension diverges with the sample size. Extensive simulations show that our proposed method provides consistent estimates and confidence intervals with improved coverage probabilities. We apply the proposed method to assess the effects of genetic markers on overall survival of non-small cell lung cancer patients in the aforementioned Boston Lung Cancer Study.
Stratified Cox proportional hazards models, with extensive applications in large scale cohort studies, are useful when some covariates violate the proportional hazards assumption or when data are stratified by factors such as transplant centers. In Chapter IV, we extend the de-biased lasso approach proposed in Chapter III to draw inference for the stratified Cox model with potentially many covariates. We provide asymptotic results useful for inference on linear combinations of the regression parameters, and demonstrate the method's utility via simulation studies. We apply the method to analyze national kidney transplantation data stratified by transplant center, and assess the effects of many factors on graft survival.
PhD dissertation, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162934/1/luxia_1.pd
VARIABLE SELECTION FOR CASE-COHORT STUDIES WITH FAILURE TIME OUTCOME
The case-cohort design is widely used in large cohort studies with failure time data to reduce the cost associated with covariate measurement. Many such studies collect a large number of covariates, so an efficient variable selection method is needed for the case-cohort design. In this dissertation, we study the properties of the Smoothly Clipped Absolute Deviation (SCAD) penalty based variable selection procedure in the Cox proportional hazards model and the additive hazards model under a case-cohort design with a diverging number of parameters. We prove that the SCAD-penalized variable selection procedure can identify the true model with probability tending to one under the Cox proportional hazards model. We then establish the consistency and asymptotic normality of the penalized estimator. We show via simulation that the BIC-based tuning parameter selection method outperforms the AIC-based method under typical case-cohort study settings. The proposed procedure is applied to the Busselton Health Study (Cullen, 1972; Knuiman et al., 2003). The additive hazards model is a useful alternative to the Cox model for analyzing failure time data. In the second part of the dissertation, we extend the SCAD-penalized variable selection procedure to the additive hazards model with a stratified case-cohort design and a diverging number of parameters. We again establish variable selection consistency, estimation consistency, and asymptotic normality of the penalized estimator in this setting. We propose a new tuning parameter selection method and evaluate its performance via simulation, showing that it outperforms the conventional k-fold cross-validation method. The proposed procedure is applied to the Atherosclerosis Risk in Communities (ARIC) study (ARIC, 2004). Tuning parameter selection is critical to the success of a regularized variable selection method.
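For reference, the SCAD penalty of Fan and Li (2001) used throughout this work has a closed piecewise form: linear (lasso-like) near zero, quadratic in a middle band, and constant beyond a*lam, so large coefficients are not shrunk (near-unbiasedness). A direct transcription with the conventional choice a = 3.7:

```python
import numpy as np

def scad_penalty(t, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001); a = 3.7 is the conventional choice.

    p(t) = lam*|t|                               for |t| <= lam
         = (2*a*lam*|t| - t^2 - lam^2)/(2(a-1))  for lam < |t| <= a*lam
         = lam^2*(a+1)/2                         for |t| > a*lam
    """
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(
        t <= lam,
        lam * t,
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),
            lam**2 * (a + 1) / 2,
        ),
    )
```

The three pieces match at the knots (p(lam) = lam^2 and p(a*lam) = lam^2*(a+1)/2 from either side), which is what makes the resulting estimator continuous in the data.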
A consistent tuning parameter selection method has not been established for the SCAD-penalized Cox model with a diverging dimension. In the last part of the dissertation, we propose a generalized information criterion (GIC) for tuning parameter selection and establish the conditions required for its variable selection consistency in this setting. A simulation study shows that GIC performs well under the required conditions with finite sample sizes. It is then applied to the Framingham Heart Study.
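A generalized information criterion of this flavor trades goodness of fit against a penalty a_n times the model size. The sketch below is only a stand-in: it uses the lasso on a linear model rather than the SCAD-penalized Cox model studied in the dissertation, and a_n = log(log n) * log p is one common choice for a diverging-dimension criterion; the function names, grid, and data are all assumptions.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=1000):
    """Plain lasso via proximal gradient (ISTA), as a stand-in fitter."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)   # 1/L for the least-squares loss
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y)) / n
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

def gic_select(X, y, lam_grid):
    """Pick lambda minimizing GIC(lam) = n*log(RSS/n) + a_n * df,
    with a_n = log(log n) * log p (a common diverging-dimension choice)."""
    n, p = X.shape
    a_n = np.log(np.log(n)) * np.log(p)
    best = None
    for lam in lam_grid:
        beta = lasso_ista(X, y, lam)
        rss = np.sum((y - X @ beta) ** 2)
        df = np.count_nonzero(beta)          # model size of the fitted support
        gic = n * np.log(rss / n) + a_n * df
        if best is None or gic < best[0]:
            best = (gic, lam, beta)
    return best[1], best[2]
```

The point of letting a_n grow with n and p is exactly the consistency issue raised above: a fixed penalty (AIC-like) tends to overselect when the dimension diverges.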
The Lasso for High-Dimensional Regression with a Possible Change-Point
We consider a high-dimensional regression model with a possible change-point
due to a covariate threshold and develop the Lasso estimator of regression
coefficients as well as the threshold parameter. Our Lasso estimator not only
selects covariates but also selects a model between linear and threshold
regression models. Under a sparsity assumption, we derive non-asymptotic oracle
inequalities for both the prediction risk and the estimation loss for
regression coefficients. Since the Lasso estimator selects variables
simultaneously, we show that oracle inequalities can be established without
pretesting the existence of the threshold effect. Furthermore, we establish
conditions under which the estimation error of the unknown threshold parameter
can be bounded by a nearly n^{-1} factor even when the number of regressors
can be much larger than the sample size (p >> n). We illustrate the usefulness of
our proposed estimation method via Monte Carlo simulations and an application
to real data.
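The estimator can be illustrated by profiling a lasso objective over a grid of candidate thresholds: for each candidate tau, fit the lasso on the design augmented with the thresholded covariates, then keep the tau attaining the smallest penalized objective. A minimal sketch under assumed names and tuning values, not the authors' code:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=1000):
    """Plain lasso via proximal gradient (ISTA)."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y)) / n
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return beta

def threshold_lasso(X, q, y, lam, tau_grid):
    """Profile the penalized least-squares objective over candidate thresholds.

    For each tau, fit the lasso on the augmented design [X, X*1{q > tau}];
    the returned (tau, beta, delta) minimize the penalized objective, and
    delta = 0 corresponds to selecting the plain linear model."""
    best = None
    n = len(y)
    for tau in tau_grid:
        Z = np.hstack([X, X * (q > tau)[:, None]])
        theta = lasso_ista(Z, y, lam)
        obj = np.sum((y - Z @ theta) ** 2) / (2 * n) + lam * np.sum(np.abs(theta))
        if best is None or obj < best[0]:
            best = (obj, tau, theta)
    _, tau_hat, theta_hat = best
    p = X.shape[1]
    return tau_hat, theta_hat[:p], theta_hat[p:]   # threshold, baseline, jump
```

Because the jump coefficients delta are penalized like any other coefficients, the lasso can shrink them to exactly zero, which is how the procedure "selects a model between linear and threshold regression" as described above.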
Statistical Analysis of Complex Data in Survival and Event History Analysis
This thesis studies two aspects of the statistical analysis of complex data in survival and event history analysis. After a short introduction to survival and event history analysis in Chapter 1, we propose a multivariate proportional intensity factor model for multivariate counting processes in Chapter 2. In an exploratory analysis of process data, a large number of possibly time-varying covariates may be included. These covariates, along with the high-dimensional counting processes, often exhibit a low-dimensional structure that has meaningful interpretation. We explore such structure by specifying random coefficients in a low-dimensional space through a factor model. For the estimation of the resulting model, we establish the asymptotic theory of the nonparametric maximum likelihood estimator (NPMLE). In particular, the NPMLE is consistent, asymptotically normal, and asymptotically efficient, with a covariance matrix that can be consistently estimated by the inverse information matrix or the profile likelihood method under suitable regularity conditions. Furthermore, to obtain a parsimonious model and to improve the interpretation of its parameters, variable selection and estimation for both fixed and random effects are developed via penalized likelihood. We illustrate the method using simulation studies as well as a real data application from the Programme for the International Assessment of Adult Competencies (PIAAC). Chapter 3 concerns rare events and sparse covariates in event history analysis. In large-scale longitudinal observational databases, the majority of subjects may not experience a particular event of interest. Furthermore, the associated covariate processes may also be zero for most subjects at any given time.
We formulate this setting of rare events and sparse covariates under the proportional intensity model and establish the validity of using the partial likelihood estimator and the observed information matrix for inference in this framework.
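The chapters above rest on the partial-likelihood machinery of the proportional intensity (Cox-type) model. As a minimal reference point, the following sketch fits a Cox model by Newton's method on the partial likelihood, assuming continuous (untied) event times; it is an illustration of the standard estimator, not the thesis's method.

```python
import numpy as np

def fit_cox(X, time, event, n_iter=25):
    """Newton's method on the Cox partial likelihood (no ties assumed).

    Uses suffix sums over subjects sorted by time to form the risk-set
    quantities S0, S1, S2 at each observed time."""
    n, p = X.shape
    order = np.argsort(time)
    Xs, es = X[order], event[order].astype(float)
    beta = np.zeros(p)
    for _ in range(n_iter):
        w = np.exp(Xs @ beta)
        # Risk set at time t_i = subjects with time >= t_i, hence suffix sums.
        S0 = np.cumsum(w[::-1])[::-1]
        S1 = np.cumsum((w[:, None] * Xs)[::-1], axis=0)[::-1]
        outer = w[:, None, None] * (Xs[:, :, None] * Xs[:, None, :])
        S2 = np.cumsum(outer[::-1], axis=0)[::-1]
        mu = S1 / S0[:, None]                       # risk-set covariate mean
        grad = (es[:, None] * (Xs - mu)).sum(axis=0)
        V = S2 / S0[:, None, None] - mu[:, :, None] * mu[:, None, :]
        H = (es[:, None, None] * V).sum(axis=0)     # observed information
        beta = beta + np.linalg.solve(H, grad)      # Newton ascent step
    return beta
```

The same score and observed information appear in the rare-event analysis above; the validity question there is whether these quantities remain reliable when events are rare and covariates are mostly zero.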