Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures
We consider the problem of clustering data points in high dimensions, i.e.
when the number of data points may be much smaller than the number of
dimensions. Specifically, we consider a Gaussian mixture model (GMM) with
two non-spherical Gaussian components, where the clusters are distinguished by only
a few relevant dimensions. The method we propose is a combination of a recent
approach for learning parameters of a Gaussian mixture model and sparse linear
discriminant analysis (LDA). In addition to cluster assignments, the method
returns an estimate of the set of features relevant for clustering. Our results
indicate that the sample complexity of clustering depends on the sparsity of
the relevant feature set, while only scaling logarithmically with the ambient
dimension. Additionally, we require much milder assumptions than existing work
on clustering in high dimensions. In particular, we require neither spherical
clusters nor mean separation along relevant dimensions.
Comment: 11 pages, 1 figure
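The two-stage recipe the abstract describes, mixture parameter estimation followed by sparse LDA, can be sketched on synthetic data. This is an illustrative stand-in, not the paper's algorithm: for brevity the mixture-estimation step is replaced by class-conditional estimates from known labels, and the soft-threshold level `lam` is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-component mixture in d = 50 dimensions; only the first s = 3
# coordinates carry mean separation (the "relevant" features).
n, d, s = 200, 50, 3
labels = rng.integers(0, 2, n)
X = rng.normal(size=(n, d))
X[labels == 1, :s] += 3.0

# Stand-in for the GMM parameter-learning step: component means and a
# pooled within-class covariance (here estimated with the true labels;
# the paper's method works from unlabeled data).
mu0, mu1 = X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)
Sigma = (np.cov(X[labels == 0].T) + np.cov(X[labels == 1].T)) / 2.0
Sigma += 0.1 * np.eye(d)                       # ridge term for stability

# Sparse-LDA step: soft-threshold the discriminant direction
# Sigma^{-1} (mu1 - mu0) to zero out irrelevant coordinates.
beta = np.linalg.solve(Sigma, mu1 - mu0)
lam = 0.5 * np.max(np.abs(beta))               # hypothetical threshold choice
beta_sparse = np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

selected = np.flatnonzero(beta_sparse)          # estimated relevant features
midpoint = beta_sparse @ (mu0 + mu1) / 2.0
pred = (X @ beta_sparse > midpoint).astype(int)
```

On this toy example the soft-thresholded direction is supported on exactly the planted coordinates, so the method returns both the cluster assignment and the relevant-feature set, mirroring the two outputs described in the abstract.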
Adaptive Clustering through Semidefinite Programming
We analyze the clustering problem through a flexible probabilistic model that
aims to identify an optimal partition of the sample X_1, ..., X_n. We perform
exact clustering with high probability using a convex semidefinite estimator
that can be interpreted as a corrected, relaxed version of K-means. The
estimator is analyzed in a non-asymptotic framework and shown to be optimal
or near-optimal in recovering the partition. Furthermore, its performance is
shown to be adaptive to the problem's effective dimension, as well as to K,
the unknown number of groups in this partition. We illustrate the method's
performance in comparison with other classical clustering algorithms in
numerical experiments on simulated data.
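The relaxed K-means estimator referred to above is a semidefinite program: maximize <A, Z> over matrices Z that are PSD, entrywise nonnegative, satisfy Z 1 = 1, and have trace(Z) = K. A minimal sketch on easy data, replacing the SDP solver with a cheap spectral surrogate (an assumption made for brevity, not the paper's estimator):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two well-separated planar groups.
n = 100
X = np.vstack([rng.normal(0.0, 0.5, size=(n // 2, 2)),
               rng.normal(4.0, 0.5, size=(n // 2, 2))])
truth = np.repeat([0, 1], n // 2)

# The SDP maximizes <A, Z> over {Z PSD, Z >= 0 entrywise, Z @ 1 = 1,
# trace(Z) = K}, with A the Gram matrix of the centered data. Solving it
# needs a dedicated solver; here we round the leading eigenvector of A
# instead, which recovers the same partition in well-separated cases.
Xc = X - X.mean(axis=0)
A = Xc @ Xc.T
eigvals, eigvecs = np.linalg.eigh(A)            # ascending eigenvalues
pred = (eigvecs[:, -1] > 0).astype(int)

# Compare to the truth up to the eigenvector's sign ambiguity.
agreement = max((pred == truth).mean(), (pred != truth).mean())
```

The interest of the actual SDP over this spectral shortcut is exactly what the abstract claims: exact recovery guarantees and adaptivity to the effective dimension and to an unknown K.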
Detection and Feature Selection in Sparse Mixture Models
We consider Gaussian mixture models in high dimensions and concentrate on the
twin tasks of detection and feature selection. Under sparsity assumptions on
the difference in means, we derive information bounds and establish the
performance of various procedures, including the top sparse eigenvalue of the
sample covariance matrix and other projection tests based on moments, such as
the skewness and kurtosis tests of Malkovich and Afifi (1973), as well as
other variants that we were better able to control under the null.
Comment: 70 pages
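One of the statistics mentioned above, the top sparse eigenvalue of the sample covariance, can be approximated by truncated power iteration. A minimal sketch; the helper `top_sparse_eig` and its parameters are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sparse mixture: a mean shift confined to s = 5 of d = 100 coordinates.
n, d, s = 400, 100, 5
z = rng.integers(0, 2, n)
X = rng.normal(size=(n, d))
X[z == 1, :s] += 2.0
X -= X.mean(axis=0)

S = X.T @ X / n                                 # sample covariance

def top_sparse_eig(S, s, iters=100):
    """Truncated power iteration: a simple surrogate for the top
    s-sparse eigenvalue of S (hypothetical helper)."""
    v = np.ones(S.shape[0]) / np.sqrt(S.shape[0])
    for _ in range(iters):
        v = S @ v
        keep = np.argsort(np.abs(v))[-s:]        # retain s largest entries
        mask = np.zeros_like(v)
        mask[keep] = 1.0
        v *= mask
        v /= np.linalg.norm(v)
    return v @ S @ v, np.sort(np.flatnonzero(v))

lam, support = top_sparse_eig(S, s)
# Under the null (a single Gaussian) the statistic concentrates near 1;
# a markedly larger value flags a sparse mixture (detection), and the
# recovered support estimates the relevant coordinates (feature selection).
```

This illustrates the twin tasks in the abstract: the size of `lam` serves as a detection statistic, while `support` performs feature selection.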
Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning
We propose a new method for high-dimensional semi-supervised learning
problems based on the careful aggregation of the results of a low-dimensional
procedure applied to many axis-aligned random projections of the data. Our
primary goal is to identify important variables for distinguishing between the
classes; existing low-dimensional methods can then be applied for final class
assignment. Motivated by a generalized Rayleigh quotient, we score projections
according to the traces of the estimated whitened between-class covariance
matrices on the projected data. This enables us to assign an importance weight
to each variable for a given projection, and to select our signal variables by
aggregating these weights over high-scoring projections. Our theory shows that
the resulting Sharp-SSL algorithm is able to recover the signal coordinates
with high probability when we aggregate over sufficiently many random
projections and when the base procedure estimates the whitened between-class
covariance matrix sufficiently well. The Gaussian EM algorithm is a natural
choice as a base procedure, and we provide a new analysis of its performance in
semi-supervised settings that controls the parameter estimation error in terms
of the proportion of labeled data in the sample. Numerical results on both
simulated data and a real colon tumor dataset support the excellent empirical
performance of the method.
Comment: 49 pages, 4 figures
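The score-and-aggregate loop described above can be sketched as follows. For brevity this sketch scores projections with fully labeled data (a stand-in for the semi-supervised EM base procedure the paper analyzes), and the projection count `B`, projection size `p`, and top-50 aggregation rule are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two classes in d = 60 dimensions; coordinates {0, 1, 2} carry signal.
n, d, s = 300, 60, 3
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, d))
X[y == 1, :s] += 2.0

def project_and_score(X, y, coords):
    """Score one axis-aligned projection by the (whitened) between-class
    separation of the projected data, a Rayleigh-quotient-style score."""
    Z = X[:, coords]
    m0, m1 = Z[y == 0].mean(axis=0), Z[y == 1].mean(axis=0)
    W = np.cov(Z.T) + 1e-6 * np.eye(len(coords))   # covariance + ridge
    delta = m1 - m0
    return delta @ np.linalg.solve(W, delta)

B, p = 500, 5                  # number of projections, projection size
scores, projs = [], []
for _ in range(B):
    coords = rng.choice(d, size=p, replace=False)  # axis-aligned projection
    projs.append(coords)
    scores.append(project_and_score(X, y, coords))

# Aggregate: each variable accumulates the scores of the top-scoring
# projections that contain it.
weights = np.zeros(d)
for b in np.argsort(scores)[-50:]:
    weights[projs[b]] += scores[b]

selected = np.argsort(weights)[-s:]    # highest aggregated importance
```

Projections containing a signal coordinate score much higher than pure-noise projections, so after aggregation the planted variables dominate the importance weights, which is the variable-selection behavior the abstract's theory guarantees for Sharp-SSL.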