Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation
While several papers have investigated computationally and statistically
efficient methods for learning Gaussian mixtures, precise minimax bounds for
their statistical performance as well as fundamental limits in high-dimensional
settings are not well understood. In this paper, we provide precise
information-theoretic bounds on the clustering accuracy and sample complexity of learning a
mixture of two isotropic Gaussians in high dimensions under small mean
separation. If there is a sparse subset of relevant dimensions that determine
the mean separation, then the sample complexity only depends on the number of
relevant dimensions and mean separation, and can be achieved by a simple
computationally efficient procedure. Our results provide a first step toward a
theoretical basis for recent methods that combine feature selection and
clustering.
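To make the flavor of such a procedure concrete, here is a minimal Python sketch of one simple approach consistent with the abstract: screen for the sparse set of relevant coordinates (coordinates carrying mean separation inflate the marginal variance of the mixture), then cluster on those coordinates alone. The variance threshold and the 2-means step are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: under a balanced two-component isotropic mixture, a coordinate j
# where the means differ by Delta_j has marginal variance sigma^2 + Delta_j^2/4,
# so a variance screen can recover the sparse relevant set before clustering.
import numpy as np
from sklearn.cluster import KMeans

def sparse_two_gaussian_cluster(X, sigma2, delta=0.05):
    """X: (n, d) data; sigma2: known isotropic noise variance (assumed known here)."""
    n, d = X.shape
    variances = X.var(axis=0)
    # Illustrative deviation threshold; the paper's constants differ.
    threshold = sigma2 * (1.0 + np.sqrt(8.0 * np.log(d / delta) / n))
    relevant = np.flatnonzero(variances > threshold)
    if relevant.size == 0:
        relevant = np.arange(d)  # fallback: no separation detected, use all coordinates
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[:, relevant])
    return labels, relevant
```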
Feature Selection For High-Dimensional Clustering
We present a nonparametric method for selecting informative features in
high-dimensional clustering problems. We start with a screening step that uses
a test for multimodality. Then we apply kernel density estimation and mode
clustering to the selected features. The output of the method consists of a
list of relevant features and cluster assignments. We provide explicit bounds
on the error rate of the resulting clustering. In addition, we provide the
first error bounds on mode-based clustering.
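As a rough illustration of this screen-then-cluster pipeline, the sketch below counts the modes of a one-dimensional kernel density estimate per feature as a stand-in for a formal multimodality test, and approximates mode clustering with mean shift; both substitutions are assumptions for illustration, not the paper's method.

```python
# Hedged sketch of a multimodality screen followed by mode clustering.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.cluster import MeanShift

def count_kde_modes(x, grid_size=256):
    """Count local maxima of a 1-D kernel density estimate of x."""
    grid = np.linspace(x.min(), x.max(), grid_size)
    density = gaussian_kde(x)(grid)
    # Interior grid points larger than both neighbors are local maxima (modes).
    return int(np.sum((density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])))

def screen_and_mode_cluster(X):
    """X: (n, d). Returns (selected feature indices, cluster assignments)."""
    selected = [j for j in range(X.shape[1]) if count_kde_modes(X[:, j]) > 1]
    if not selected:
        raise ValueError("no multimodal features found")
    # Mean shift as a proxy for the paper's kernel mode clustering.
    labels = MeanShift().fit_predict(X[:, selected])
    return np.array(selected), labels
```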
Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures
We consider the problem of clustering data points in high dimensions, i.e.
when the number of data points may be much smaller than the number of
dimensions. Specifically, we consider a Gaussian mixture model (GMM) with two
non-spherical Gaussian components, where the clusters are distinguished by only
a few relevant dimensions. The method we propose is a combination of a recent
approach for learning parameters of a Gaussian mixture model and sparse linear
discriminant analysis (LDA). In addition to cluster assignments, the method
returns an estimate of the set of features relevant for clustering. Our results
indicate that the sample complexity of clustering depends on the sparsity of
the relevant feature set, while only scaling logarithmically with the ambient
dimension. Additionally, we require much milder assumptions than existing work
on clustering in high dimensions. In particular, we require neither spherical
clusters nor mean separation along the relevant dimensions.
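A hedged sketch of the two-stage idea follows: estimate the mixture parameters, then sparsify the resulting Fisher discriminant direction by soft-thresholding. sklearn's EM-based GaussianMixture and the thresholding rule below stand in for the specific parameter-learning method and sparse LDA formulation used in the paper.

```python
# Illustrative two-stage sketch: GMM parameter estimation + sparsified LDA direction.
import numpy as np
from sklearn.mixture import GaussianMixture

def sparse_gmm_cluster(X, threshold=0.1):
    gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
    mu0, mu1 = gmm.means_
    # Pooled within-class covariance, weighted by the mixture proportions.
    Sigma = gmm.weights_[0] * gmm.covariances_[0] + gmm.weights_[1] * gmm.covariances_[1]
    # Fisher discriminant direction, regularized for stability.
    w = np.linalg.solve(Sigma + 1e-6 * np.eye(X.shape[1]), mu1 - mu0)
    # Soft-threshold to obtain a sparse direction and the relevant feature set.
    w_sparse = np.sign(w) * np.maximum(np.abs(w) - threshold * np.abs(w).max(), 0.0)
    relevant = np.flatnonzero(w_sparse)
    # Assign clusters by projecting onto the sparse direction, split at the midpoint.
    labels = (X @ w_sparse > ((mu0 + mu1) / 2) @ w_sparse).astype(int)
    return labels, relevant
```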
Density-sensitive semisupervised inference
Semisupervised methods are techniques for using labeled data
together with unlabeled data
to make predictions. These methods invoke some assumptions that link the
marginal distribution of X to the regression function f(x). For example,
it is common to assume that f is very smooth over high-density regions of
the marginal distribution. Many of the methods are ad hoc and have been shown
to work in specific examples but lack a theoretical foundation. We provide a minimax
framework for analyzing semisupervised methods. In particular, we study methods
based on metrics that are sensitive to the marginal distribution of X. Our model
includes a parameter that controls the strength of the semisupervised
assumption. We then use the data to adapt to this parameter. (Published in the
Annals of Statistics, http://dx.doi.org/10.1214/13-AOS1092, by the Institute of
Mathematical Statistics.)
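One concrete family of density-sensitive metrics is power-weighted shortest paths on a neighborhood graph, sketched below: raising edge lengths to an exponent q > 1 makes paths that cross low-density gaps expensive, so predictions follow high-density regions. Here q plays the role of a knob like the strength parameter described above; the paper's exact metric and adaptation scheme differ from this illustration.

```python
# Hedged sketch of density-sensitive semisupervised prediction via
# power-weighted shortest paths on a k-NN graph.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def density_sensitive_predict(X_labeled, y_labeled, X_unlabeled, q=2.0, k=10):
    X = np.vstack([X_labeled, X_unlabeled])
    n_lab = len(X_labeled)
    # k-NN graph with Euclidean edge lengths, then power-weight the edges.
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    G.data = G.data ** q
    # Density-sensitive distance = shortest path in the reweighted graph
    # (assumes the k-NN graph is connected).
    D = shortest_path(G, directed=False)
    # Predict each unlabeled point from its nearest labeled neighbor
    # under the new metric.
    nearest = np.argmin(D[n_lab:, :n_lab], axis=1)
    return y_labeled[nearest]
```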
Subspace Detection of High-Dimensional Vectors using Compressive Sampling
We consider the problem of detecting whether a high-dimensional vector β ∈ ℝ^n lies in an r-dimensional subspace S, where r ≪ n, given only a few compressive measurements of the vector. This problem arises in several applications such as detecting anomalies, targets, interference, and brain activations. In these applications, the object of interest is described by a large number of features, and the ability to detect it using only linear combinations of the features (without the need to measure, store, or compute the entire feature vector) is desirable. We present a test statistic for subspace detection using compressive samples and demonstrate that the probability of error of the proposed detector decreases exponentially in the number of compressive samples, provided that the energy off the subspace scales as n. Using information-theoretic lower bounds, we demonstrate that no other detector can achieve the same probability of error for weaker signals. Simulation results also indicate that this scaling is near-optimal.
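The sketch below illustrates the measurement model and a natural statistic for this setup: the energy of the compressive measurements y = Aβ that lies outside the compressed subspace A·S. The statistic and its threshold are assumptions for illustration; the paper's exact test and constants may differ.

```python
# Hedged sketch of compressive subspace detection via off-subspace energy.
import numpy as np

def off_subspace_energy(y, A, S_basis):
    """y: (m,) measurements; A: (m, n) sensing matrix; S_basis: (n, r) orthonormal basis of S."""
    # Orthonormal basis for the compressed subspace A @ S.
    Q, _ = np.linalg.qr(A @ S_basis)
    residual = y - Q @ (Q.T @ y)  # component of y off the compressed subspace
    return float(residual @ residual)

# Toy usage: a vector inside S yields (numerically) zero off-subspace energy.
rng = np.random.default_rng(0)
n, r, m = 1000, 5, 50
S_basis = np.linalg.qr(rng.standard_normal((n, r)))[0]   # random r-dim subspace
A = rng.standard_normal((m, n)) / np.sqrt(m)             # m compressive measurements
beta_in = S_basis @ rng.standard_normal(r)               # vector lying in S
print(off_subspace_energy(A @ beta_in, A, S_basis))      # ~0; large when beta is off S
```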