
    Scalable Sparse Cox's Regression for Large-Scale Survival Data via Broken Adaptive Ridge

    This paper develops a new scalable sparse Cox regression tool for sparse high-dimensional massive sample size (sHDMSS) survival data. The method is a local $L_0$-penalized Cox regression via repeatedly performing reweighted $L_2$-penalized Cox regression. We show that the resulting estimator enjoys the best of $L_0$- and $L_2$-penalized Cox regressions while overcoming their limitations. Specifically, the estimator is selection consistent, oracle for parameter estimation, and possesses a grouping property for highly correlated covariates. Simulation results suggest that when the sample size is large, the proposed method with pre-specified tuning parameters has a comparable or better performance than some popular penalized regression methods. More importantly, because the method naturally enables adaptation of efficient algorithms for massive $L_2$-penalized optimization and does not require costly data driven tuning parameter selection, it has a significant computational advantage for sHDMSS data, offering an average of 5-fold speedup over its closest competitor in empirical studies.
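
As a rough illustration of the reweighting idea described in this abstract, the sketch below runs iteratively reweighted ridge regression on an ordinary least-squares problem rather than on the Cox partial likelihood. The function name `bar_linear`, its parameters, and the choice of a least-squares loss are assumptions made for illustration; this is not the paper's implementation.

```python
import numpy as np

def bar_linear(X, y, lam=10.0, ridge_init=1.0, n_iter=100, tol=1e-10):
    """Hypothetical sketch: reweighted ridge iterations for least squares.

    Each step solves  min_b ||y - X b||^2 + lam * sum_j b_j^2 / beta_j^2,
    where beta is the previous iterate, so small coefficients receive
    ever larger ridge weights and are pushed toward zero.
    """
    p = X.shape[1]
    gram = X.T @ X
    xty = X.T @ y
    # Start from a plain ridge estimate (assumed initialisation).
    beta = np.linalg.solve(gram + ridge_init * np.eye(p), xty)
    for _ in range(n_iter):
        # Rescaled form Gamma (Gamma G Gamma + lam I)^{-1} Gamma X'y with
        # Gamma = diag(beta); it avoids dividing by near-zero coefficients.
        A = gram * np.outer(beta, beta) + lam * np.eye(p)
        new_beta = beta * np.linalg.solve(A, beta * xty)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    # Treat numerically negligible coefficients as exact zeros.
    beta[np.abs(beta) < 1e-8] = 0.0
    return beta
```

In this toy setting the penalty weight on a coefficient grows as the coefficient shrinks, which is how a sequence of $L_2$-penalized fits can mimic an $L_0$-type selection.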

    Efficient Inferential Methods in Regression Models with Change Points or High Dimensional Covariates.

    We present a fast approach for estimating change-point(s) in the broken-stick model in both cross-sectional and longitudinal settings. Our method, based on local smoothing in a shrinking neighborhood of each change-point, is shown via simulations to be computationally more viable than existing methods that rely on search procedures. The proposed estimates are shown to have root-n consistency and asymptotic normality. As our motivating application, we study the MBHMS cohort data to describe patterns of change in log estradiol levels around the final menstrual period, for which a broken-stick model with two change-points is a good fit. Although considerable work has been done on the effects of coarse and fine ambient particles, how the constituent pollutants affect cardiovascular functioning is not clearly understood. We propose using multivariate adaptive elastic-net to capture these effects in an autoregressive model for time series data. Because of the large number of highly correlated pollutants, a reliable method must take into account the high dimensionality as well as multicollinearity issues. This is accomplished by using adaptive elastic-net, which deals effectively with the correlated nature of the data. Furthermore, the selection consistency and asymptotic normality properties allow us to provide meaningful statistical inference. As our motivating example, we study the effects of multiple pollutants on several cardiovascular end-points in a rat-study based in Dearborn, Michigan, conducted by the GLACIER. Finally, we look at problems where there are several covariates and some of the covariates have a simple linear effect while some have a broken-stick effect.
We are looking at two separate but similar types of problems: one where we have several covariates each with possibly a broken-stick effect but with only a single change-point and the other where the broken-stick effect is exhibited in a single covariate but the number of change-points is unknown. In both settings we strive for a parsimonious yet accurate model, which necessitates an effective variable selection procedure. In a sparse setting, we illustrate the difficulty in using the popular variable selection methods and propose post local-smoothing thresholded ridge regression as an effective variable selection method.
PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/113538/1/ritob_1.pd
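
For readers unfamiliar with the broken-stick model, the sketch below fits a single change-point by brute-force grid search, that is, the kind of search procedure the abstract contrasts its method with, not the local-smoothing estimator itself. The model form `y = b0 + b1*x + b2*max(x - tau, 0)` and the names `fit_broken_stick` and `candidates` are assumptions for illustration.

```python
import numpy as np

def fit_broken_stick(x, y, candidates):
    """Hypothetical sketch: grid search for one change-point tau.

    For each candidate tau, fit y ~ b0 + b1*x + b2*max(x - tau, 0) by
    least squares and keep the tau with the smallest residual sum of squares.
    """
    best = None
    for tau in candidates:
        design = np.column_stack(
            [np.ones_like(x), x, np.maximum(x - tau, 0.0)]
        )
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        rss = float(np.sum((y - design @ coef) ** 2))
        if best is None or rss < best[0]:
            best = (rss, float(tau), coef)
    _, tau_hat, coef_hat = best
    return tau_hat, coef_hat
```

The cost grows with the number of candidates (and multiplicatively with the number of change-points), which gives some intuition for why a search-free estimator can be computationally attractive.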

    Variable Selection in General Multinomial Logit Models

    The use of the multinomial logit model is typically restricted to applications with few predictors, because in high-dimensional settings maximum likelihood estimates tend to deteriorate. In this paper we propose a sparsity-inducing penalty that accounts for the special structure of multinomial models. In contrast to existing methods, it penalizes the parameters that are linked to one variable in a grouped way and thus yields variable selection instead of parameter selection. We develop a proximal gradient method that is able to efficiently compute stable estimates. In addition, the penalization is extended to the important case of predictors that vary across response categories. We apply our estimator to the modeling of party choice of voters in Germany, including voter-specific variables like age and gender but also party-specific features like stance on nuclear energy and immigration.
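
To make the grouped penalization concrete, the sketch below shows the proximal operator of a group-lasso-type penalty applied row-wise to a coefficient matrix, where each row is assumed to hold one variable's parameters across response categories. The function name `group_soft_threshold` and this matrix layout are assumptions; the paper's actual penalty and algorithm may differ in detail.

```python
import numpy as np

def group_soft_threshold(B, threshold):
    """Hypothetical sketch: block soft-thresholding of each row of B.

    A row with Euclidean norm <= threshold is set to zero as a whole
    (the variable is dropped); otherwise the row is shrunk toward zero
    by the factor (1 - threshold / norm).
    """
    norms = np.linalg.norm(B, axis=1, keepdims=True)
    # Guard against division by zero for rows that are already zero.
    scale = np.maximum(1.0 - threshold / np.maximum(norms, 1e-12), 0.0)
    return scale * B
```

In a proximal gradient scheme, a step of this kind would follow each gradient step on the multinomial log-likelihood, with `threshold` equal to the step size times the penalty parameter.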

    Penalized Likelihood and Bayesian Function Selection in Regression Models

    Challenging research in various fields has driven a wide range of methodological advances in variable selection for regression models with high-dimensional predictors. In comparison, selection of nonlinear functions in models with additive predictors has been considered only more recently. Several competing suggestions have been developed at about the same time and often do not refer to each other. This article provides a state-of-the-art review of function selection, focusing on penalized likelihood and Bayesian concepts, relating various approaches to each other in a unified framework. In an empirical comparison, also including boosting, we evaluate several methods through applications to simulated and real data, thereby providing some guidance on their performance in practice.

    Semiparametric analysis of complex longitudinal data

    Event history data consist of the longitudinal records of event occurrence times. Recurrent event data and panel count data are two common types of event history data that occur in many areas, such as medical studies and social sciences. A large literature exists on their analysis. Nevertheless, only limited research exists on variable selection for recurrent event data and panel count data. The existing methods can be seen as direct generalizations of the available penalized procedures for linear models, but may not perform as well as expected due to the complex structure of event history data. The first and second parts of this dissertation discuss simultaneous parameter estimation and variable selection for event history data. We present a new variable selection method with a new penalty function, which will be referred to as the broken adaptive ridge regression approach. In addition to establishing the oracle property, we show that the proposed variable selection method has the clustering or grouping effect when covariates are highly correlated. Furthermore, numerical studies indicate that the method works well in practical situations and can outperform the existing methods. Applications to real data are provided. Most of the existing studies of longitudinal data assume that covariates are observed at the same observation times as the response variable, and that the observation process is independent of the response variable, either completely or given covariates. In practice, the response variables and covariates are sometimes observed intermittently at different time points, leading to sparse asynchronous longitudinal data. The observation process may also be related to the response variable even given covariates, and sometimes both issues occur at the same time.
Although each of the two issues has been addressed in the literature, there does not seem to be an established approach that can deal with both together. To address both issues simultaneously, the third part of this dissertation proposes a flexible semiparametric transformation conditional model and a kernel-weighted estimating equation based approach. The proposed estimators of regression parameters are shown to be consistent and asymptotically normal. To assess the finite sample performance of the proposed method, an extensive simulation study is carried out; it suggests that the method performs well in practical situations. The approach is applied to a prospective HIV study that motivated this investigation.
Includes bibliographical references.