13,182 research outputs found
Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables
Clustering analysis is one of the most widely used statistical tools in many
emerging areas such as microarray data analysis. For microarray and other
high-dimensional data, the presence of many noise variables may mask underlying
clustering structures. Hence removing noise variables via variable selection is
necessary. For simultaneous variable selection and parameter estimation,
existing penalized likelihood approaches in model-based clustering analysis all
assume a common diagonal covariance matrix across clusters, which however may
not hold in practice. To analyze high-dimensional data, particularly those with
relatively low sample sizes, this article introduces a novel approach that
shrinks the variances together with means, in a more general situation with
cluster-specific (diagonal) covariance matrices. Furthermore, selection of
grouped variables via inclusion or exclusion of a group of variables altogether
is permitted by a specific form of penalty, which facilitates incorporating
subject-matter knowledge, such as gene functions in clustering microarray
samples for disease subtype discovery. For implementation, EM algorithms are
derived for parameter estimation, in which the M-steps clearly demonstrate the
effects of shrinkage and thresholding. Numerical examples, including an
application to acute leukemia subtype discovery with microarray gene expression
data, are provided to demonstrate the utility and advantage of the proposed
method.Comment: Published in at http://dx.doi.org/10.1214/08-EJS194 the Electronic
Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Discriminative variable selection for clustering with the sparse Fisher-EM algorithm
The interest in variable selection for clustering has increased recently due
to the growing need in clustering high-dimensional data. Variable selection
allows in particular to ease both the clustering and the interpretation of the
results. Existing approaches have demonstrated the efficiency of variable
selection for clustering but turn out to be either very time consuming or not
sparse enough in high-dimensional spaces. This work proposes to perform a
selection of the discriminative variables by introducing sparsity in the
loading matrix of the Fisher-EM algorithm. This clustering method has been
recently proposed for the simultaneous visualization and clustering of
high-dimensional data. It is based on a latent mixture model which fits the
data into a low-dimensional discriminative subspace. Three different approaches
are proposed in this work to introduce sparsity in the orientation matrix of
the discriminative subspace through -type penalizations. Experimental
comparisons with existing approaches on simulated and real-world data sets
demonstrate the interest of the proposed methodology. An application to the
segmentation of hyperspectral images of the planet Mars is also presented
Regularized Maximum Likelihood Estimation and Feature Selection in Mixtures-of-Experts Models
Mixture of Experts (MoE) are successful models for modeling heterogeneous
data in many statistical learning problems including regression, clustering and
classification. Generally fitted by maximum likelihood estimation via the
well-known EM algorithm, their application to high-dimensional problems is
still therefore challenging. We consider the problem of fitting and feature
selection in MoE models, and propose a regularized maximum likelihood
estimation approach that encourages sparse solutions for heterogeneous
regression data models with potentially high-dimensional predictors. Unlike
state-of-the art regularized MLE for MoE, the proposed modelings do not require
an approximate of the penalty function. We develop two hybrid EM algorithms: an
Expectation-Majorization-Maximization (EM/MM) algorithm, and an EM algorithm
with coordinate ascent algorithm. The proposed algorithms allow to
automatically obtaining sparse solutions without thresholding, and avoid matrix
inversion by allowing univariate parameter updates. An experimental study shows
the good performance of the algorithms in terms of recovering the actual sparse
solutions, parameter estimation, and clustering of heterogeneous regression
data
High dimensional Sparse Gaussian Graphical Mixture Model
This paper considers the problem of networks reconstruction from
heterogeneous data using a Gaussian Graphical Mixture Model (GGMM). It is well
known that parameter estimation in this context is challenging due to large
numbers of variables coupled with the degeneracy of the likelihood. We propose
as a solution a penalized maximum likelihood technique by imposing an
penalty on the precision matrix. Our approach shrinks the parameters thereby
resulting in better identifiability and variable selection. We use the
Expectation Maximization (EM) algorithm which involves the graphical LASSO to
estimate the mixing coefficients and the precision matrices. We show that under
certain regularity conditions the Penalized Maximum Likelihood (PML) estimates
are consistent. We demonstrate the performance of the PML estimator through
simulations and we show the utility of our method for high dimensional data
analysis in a genomic application
Sparse integrative clustering of multiple omics data sets
High resolution microarrays and second-generation sequencing platforms are
powerful tools to investigate genome-wide alterations in DNA copy number,
methylation and gene expression associated with a disease. An integrated
genomic profiling approach measures multiple omics data types simultaneously in
the same set of biological samples. Such approach renders an integrated data
resolution that would not be available with any single data type. In this
study, we use penalized latent variable regression methods for joint modeling
of multiple omics data types to identify common latent variables that can be
used to cluster patient samples into biologically and clinically relevant
disease subtypes. We consider lasso [J. Roy. Statist. Soc. Ser. B 58 (1996)
267-288], elastic net [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
301-320] and fused lasso [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
91-108] methods to induce sparsity in the coefficient vectors, revealing
important genomic features that have significant contributions to the latent
variables. An iterative ridge regression is used to compute the sparse
coefficient vectors. In model selection, a uniform design [Monographs on
Statistics and Applied Probability (1994) Chapman & Hall] is used to seek
"experimental" points that scattered uniformly across the search domain for
efficient sampling of tuning parameter combinations. We compared our method to
sparse singular value decomposition (SVD) and penalized Gaussian mixture model
(GMM) using both real and simulated data sets. The proposed method is applied
to integrate genomic, epigenomic and transcriptomic data for subtype analysis
in breast and lung cancer data sets.Comment: Published in at http://dx.doi.org/10.1214/12-AOAS578 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Sparse modeling of categorial explanatory variables
Shrinking methods in regression analysis are usually designed for metric
predictors. In this article, however, shrinkage methods for categorial
predictors are proposed. As an application we consider data from the Munich
rent standard, where, for example, urban districts are treated as a categorial
predictor. If independent variables are categorial, some modifications to usual
shrinking procedures are necessary. Two -penalty based methods for factor
selection and clustering of categories are presented and investigated. The
first approach is designed for nominal scale levels, the second one for ordinal
predictors. Besides applying them to the Munich rent standard, methods are
illustrated and compared in simulation studies.Comment: Published in at http://dx.doi.org/10.1214/10-AOAS355 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
- …