Sparse integrative clustering of multiple omics data sets
High resolution microarrays and second-generation sequencing platforms are
powerful tools to investigate genome-wide alterations in DNA copy number,
methylation and gene expression associated with a disease. An integrated
genomic profiling approach measures multiple omics data types simultaneously in
the same set of biological samples. Such an approach yields an integrated view
of the data that would not be available from any single data type. In this
study, we use penalized latent variable regression methods for joint modeling
of multiple omics data types to identify common latent variables that can be
used to cluster patient samples into biologically and clinically relevant
disease subtypes. We consider lasso [J. Roy. Statist. Soc. Ser. B 58 (1996)
267-288], elastic net [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
301-320] and fused lasso [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
91-108] methods to induce sparsity in the coefficient vectors, revealing
important genomic features that have significant contributions to the latent
variables. An iterative ridge regression is used to compute the sparse
coefficient vectors. In model selection, a uniform design [Monographs on
Statistics and Applied Probability (1994) Chapman & Hall] is used to seek
"experimental" points that scattered uniformly across the search domain for
efficient sampling of tuning parameter combinations. We compared our method to
sparse singular value decomposition (SVD) and penalized Gaussian mixture model
(GMM) using both real and simulated data sets. The proposed method is applied
to integrate genomic, epigenomic and transcriptomic data for subtype analysis
in breast and lung cancer data sets.
Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/)
by the Institute of Mathematical Statistics, http://dx.doi.org/10.1214/12-AOAS578.
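A minimal sketch of the core idea (not the authors' exact algorithm): stack
the omics blocks column-wise, extract a sparse latent variable by
soft-thresholded power iterations, where the soft-thresholding plays the role
of a lasso penalty on the loading vector, and cluster the samples on the
latent scores. The function names and the penalty level lam below are
illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def soft_threshold(v, lam):
    """Lasso-style soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def sparse_latent_factor(X, lam=0.5, n_iter=100, seed=0):
    """One sparse latent variable via penalized power iterations.

    X: samples x features matrix (omics blocks column-stacked).
    Returns sample scores (X @ w) and a sparse loading vector w.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        u = X @ w                         # latent scores per sample
        u /= np.linalg.norm(u)
        w = soft_threshold(X.T @ u, lam)  # sparse loadings
        if not np.any(w):
            break                         # lam was too aggressive
        w /= np.linalg.norm(w)
    return X @ w, w

# Toy stand-in for copy-number, methylation and expression blocks
# measured on the same 60 samples.
X = np.hstack([np.random.randn(60, 200) for _ in range(3)])
scores, loadings = sparse_latent_factor(X)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(scores.reshape(-1, 1))
```

With elastic-net or fused-lasso penalties, only the thresholding step would
change; the power-iteration and clustering structure stays the same.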
Random lasso
We propose a computationally intensive method, the random lasso method, for
variable selection in linear models. The method consists of two major steps. In
step 1, the lasso method is applied to many bootstrap samples, each using a set
of randomly selected covariates. A measure of importance is yielded from this
step for each covariate. In step 2, a similar procedure to the first step is
implemented with the exception that for each bootstrap sample, a subset of
covariates is randomly selected with unequal selection probabilities determined
by the covariates' importance. Adaptive lasso may be used in the second step
with weights determined by the importance measures. The final set of covariates
and their coefficients are determined by averaging bootstrap results obtained
from step 2. The proposed method alleviates some of the limitations of lasso,
elastic-net and related methods noted especially in the context of microarray
data analysis: it tends to remove highly correlated variables altogether or
select them all, and maintains maximal flexibility in estimating their
coefficients, particularly with different signs; the number of selected
variables is no longer limited by the sample size; and the resulting prediction
accuracy is competitive with or superior to that of the alternatives. We illustrate
the proposed method by extensive simulation studies. The proposed method is
also applied to a glioblastoma microarray data set.
Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/)
by the Institute of Mathematical Statistics, http://dx.doi.org/10.1214/10-AOAS377.
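As a concrete reading of the two steps, the sketch below draws B bootstrap
samples, fits a lasso on a random subset of q1 covariates each time, averages
the absolute coefficients into importance scores, and then repeats the
procedure with covariates drawn proportionally to importance. sklearn's Lasso
and the fixed alpha are stand-ins for whatever fitting routine and tuning the
paper actually uses.

```python
import numpy as np
from sklearn.linear_model import Lasso

def random_lasso(X, y, B=200, q1=20, q2=20, alpha=0.1, seed=0):
    """Two-step random lasso (sketch): bootstrap + random covariate subsets."""
    rng = np.random.default_rng(seed)
    n, p = X.shape

    def bootstrap_coefs(probs):
        coefs = np.zeros((B, p))
        for b in range(B):
            rows = rng.integers(0, n, size=n)   # bootstrap sample
            q = q1 if probs is None else q2
            # probs=None means equal selection probabilities (step 1);
            # step 2 assumes at least q2 covariates have nonzero importance.
            cols = rng.choice(p, size=q, replace=False, p=probs)
            fit = Lasso(alpha=alpha).fit(X[np.ix_(rows, cols)], y[rows])
            coefs[b, cols] = fit.coef_
        return coefs

    # Step 1: equal selection probabilities -> importance measures.
    importance = np.abs(bootstrap_coefs(None).mean(axis=0))
    probs = importance / importance.sum()

    # Step 2: covariates drawn with probability proportional to importance;
    # final coefficients are the average over the bootstrap fits.
    return bootstrap_coefs(probs).mean(axis=0)
```

The paper's adaptive-lasso variant of step 2 would additionally weight the
penalty by the step-1 importance measures.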
Sparse linear discriminant analysis by thresholding for high dimensional data
In many social, economic, biological and medical studies, one objective is
to classify a subject into one of several classes based on a set of variables
observed from the subject. Because the probability distribution of the
variables is usually unknown, the rule of classification is constructed using a
training sample. The well-known linear discriminant analysis (LDA) works well
for the situation where the number of variables used for classification is much
smaller than the training sample size. Because of the advance in technologies,
modern statistical studies often face classification problems with the number
of variables much larger than the sample size, and the LDA may perform poorly.
We explore when and why the LDA has poor performance and propose a sparse LDA
that is asymptotically optimal under some sparsity conditions on the unknown
parameters. For illustration of application, we discuss an example of
classifying human cancer into two classes of leukemia based on a set of 7,129
genes and a training sample of size 72. A simulation is also conducted to check
the performance of the proposed method.
Published in the Annals of Statistics (http://www.imstat.org/aos/) by the
Institute of Mathematical Statistics, http://dx.doi.org/10.1214/10-AOS870.
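A stripped-down illustration of the thresholding idea, assuming two classes
and, as a simplification of the paper's covariance thresholding, a diagonal
variance estimate; the threshold t would be chosen by cross-validation in
practice.

```python
import numpy as np

def sparse_lda_fit(X0, X1, t):
    """Threshold-based sparse LDA for two classes (sketch).

    X0, X1: samples x genes matrices for the two classes.
    t: hard-threshold level for the mean-difference vector.
    """
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    delta = mu1 - mu0
    delta[np.abs(delta) < t] = 0.0        # hard thresholding -> sparsity
    # Pooled per-gene variances; the small ridge avoids division by zero.
    var = np.concatenate([X0 - mu0, X1 - mu1]).var(axis=0) + 1e-8
    w = delta / var                       # diagonal-covariance LDA weights
    b = -w @ (mu0 + mu1) / 2.0
    return w, b

def sparse_lda_predict(X, w, b):
    """Label 1 corresponds to the class of X1."""
    return (X @ w + b > 0).astype(int)
```

Only genes surviving the threshold enter the discriminant, which is what makes
the rule usable when the gene count (e.g., 7,129) far exceeds the sample size.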
Doubly Regularized REML for Estimation and Selection of Fixed and Random Effects in Linear Mixed-Effects Models
The linear mixed-effects model (LMM) is widely used in the analysis of clustered or longitudinal data. In practice, inference on the structure of the random-effects component is of great importance, not only to yield proper interpretation of subject-specific effects but also to draw valid statistical conclusions. This inference task becomes significantly challenging when a large number of fixed effects and random effects are involved in the analysis. The difficulty of variable selection arises from the need to regularize the mean model and the covariance structure simultaneously, with possible parameter constraints between the two. In this paper, we propose a novel doubly regularized restricted maximum likelihood method to select fixed and random effects simultaneously in the LMM. The Cholesky decomposition is invoked to ensure the positive-definiteness of the selected covariance matrix of random effects, and the selected random effects are invariant with respect to the ordering of predictors in the Cholesky decomposition. We then develop a new algorithm that solves the resulting optimization problem efficiently, with a computational cost comparable to that of the Newton-Raphson algorithm for MLE or REML in the LMM. We also investigate large-sample properties of the proposed method, including the oracle property. Both simulation studies and a data analysis are included for illustration.
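In symbols (notation ours, so only a plausible reading of the construction just described): writing the random-effects covariance as D = ΓΓᵀ with Cholesky factor Γ whose rows are γ_k, the doubly regularized REML criterion can be sketched as

\[
\min_{\beta,\,\Gamma} \; -\ell_R(\beta,\Gamma)
  + \lambda_1 \sum_{j=1}^{p} |\beta_j|
  + \lambda_2 \sum_{k=1}^{q} \lVert \gamma_k \rVert_2 ,
\]

where \(\ell_R\) is the restricted log-likelihood. Shrinking an entire group γ_k to zero removes the k-th random effect, while D = ΓΓᵀ remains positive semidefinite by construction and the selection does not depend on the predictor ordering.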
Doubly Penalized Buckley-James Method for Survival Data with High-Dimensional Covariates
Recent interest in cancer research focuses on predicting patients' survival by investigating gene expression profiles based on microarray analysis. We propose a doubly penalized Buckley-James method for the semiparametric accelerated failure time model to relate high-dimensional genomic data to censored survival outcomes, using a mixture of L1-norm and L2-norm penalties. Similar to the elastic-net method for the linear regression model with uncensored data, the proposed method performs automatic gene selection and parameter estimation, where highly correlated genes can be selected (or removed) together. The two-dimensional tuning parameter is determined by cross-validation and uniform design. The proposed method is evaluated by simulations and applied to the Michigan squamous cell lung carcinoma study.
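A sketch of the Buckley-James iteration under simplifying assumptions: outcomes on the log scale, sklearn's ElasticNet standing in for the doubly penalized fit, and a hand-rolled Kaplan-Meier estimator of the residual distribution for the imputation step.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def km_jumps(e, delta):
    """Kaplan-Meier jump masses of the (censored) residual distribution."""
    order = np.argsort(e)
    e, delta = e[order], delta[order]
    n = len(e)
    at_risk = n - np.arange(n)
    surv, jumps = 1.0, np.zeros(n)
    for i in range(n):
        if delta[i]:                      # mass only at event residuals
            jumps[i] = surv / at_risk[i]
            surv *= 1.0 - 1.0 / at_risk[i]
    return e, jumps

def buckley_james_enet(X, y, delta, alpha=0.1, l1_ratio=0.5, n_iter=20):
    """Doubly penalized Buckley-James iteration (sketch).

    y: observed (log) times; delta: 1 = event, 0 = censored.
    """
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    y_star = y.copy()
    for _ in range(n_iter):
        model.fit(X, y_star)
        pred = model.predict(X)
        e = y - pred                      # censored residuals
        es, w = km_jumps(e, delta)
        y_star = y.copy()
        for i in np.where(delta == 0)[0]:
            mask = es > e[i]
            tail = w[mask].sum()
            if tail > 0:                  # E[eps | eps > e_i] under KM
                y_star[i] = pred[i] + (es[mask] * w[mask]).sum() / tail
    return model
```

Each pass imputes censored outcomes by their conditional expectation given the current fit, then refits the penalized regression, so correlated genes are shrunk together exactly as in the uncensored elastic net.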
Simultaneous variable selection for joint models of longitudinal and survival outcomes
Joint models of longitudinal and survival outcomes have been used with increasing frequency in clinical investigations. Correct specification of fixed and random effects is essential for practical data analysis, and simultaneous selection of variables in both the longitudinal and survival components serves as a necessary safeguard against model misspecification. However, variable selection in such models has not been studied, and, to the best of our knowledge, no computational tools have been made available to practitioners. In this article, we describe a penalized likelihood method with adaptive least absolute shrinkage and selection operator (ALASSO) penalty functions for simultaneous selection of fixed and random effects in joint models. To perform selection in the variance components of the random effects, we reparameterize them using a Cholesky decomposition, which introduces a group-shrinkage penalty. To reduce the estimation bias resulting from penalization, we propose a two-stage selection procedure in which the magnitude of the bias is ameliorated in the second stage. The penalized likelihood is approximated by Gaussian quadrature and optimized by an EM algorithm. A simulation study showed excellent selection results in the first stage and small estimation biases in the second stage. To illustrate, we analyzed a longitudinally observed clinical marker and patient survival in a cohort of patients with heart failure.
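The ALASSO penalty on its own reduces to a weighted lasso, which the sketch below implements via the standard column-rescaling trick. This illustrates the penalty, not the paper's full joint-model EM; the OLS initial fit and the 1e-8 guard are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def adaptive_lasso(X, y, alpha=0.1, gamma=1.0):
    """Adaptive lasso via column rescaling (sketch).

    Weights w_j = 1 / |beta_init_j|**gamma come from an initial OLS fit
    (a ridge fit would be safer when p is large). Penalizing
    alpha * sum_j w_j * |beta_j| is equivalent to an ordinary lasso on
    the rescaled design X_j / w_j, with coefficients mapped back after.
    """
    beta_init = LinearRegression().fit(X, y).coef_
    w = 1.0 / (np.abs(beta_init) ** gamma + 1e-8)
    fit = Lasso(alpha=alpha).fit(X / w, y)   # lasso on rescaled columns
    return fit.coef_ / w                     # undo the rescaling
```

In the joint-model setting the same weighting idea is applied to both the longitudinal and survival coefficients, with a grouped version of the penalty acting on the Cholesky rows of the random-effects covariance.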