24 research outputs found

    Minimax Theory for High-dimensional Gaussian Mixtures with Sparse Mean Separation

    Full text link
    While several papers have investigated computationally and statistically efficient methods for learning Gaussian mixtures, precise minimax bounds on their statistical performance, as well as fundamental limits in high-dimensional settings, are not well understood. In this paper, we provide precise information-theoretic bounds on the clustering accuracy and sample complexity of learning a mixture of two isotropic Gaussians in high dimensions under small mean separation. If there is a sparse subset of relevant dimensions that determine the mean separation, then the sample complexity depends only on the number of relevant dimensions and the mean separation, and can be achieved by a simple, computationally efficient procedure. Our results provide a first step toward a theoretical basis for recent methods that combine feature selection and clustering.
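
    As a rough illustration of the kind of simple procedure the abstract alludes to, the sketch below screens coordinates by marginal variance (in a two-component isotropic mixture, coordinates carrying mean separation have inflated variance) and then runs 2-means on the survivors. The variance screen, the sparsity level n_keep, and the use of KMeans are illustrative assumptions, not the paper's exact method.

```python
import numpy as np
from sklearn.cluster import KMeans

def sparse_two_gaussian_cluster(X, n_keep):
    """X: (n_samples, n_dims) data; n_keep: assumed number of relevant dims."""
    variances = X.var(axis=0)
    relevant = np.argsort(variances)[-n_keep:]   # top-variance coordinates
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[:, relevant])
    return labels, relevant

# Toy usage: mean separation confined to the first 5 of 1000 dimensions.
rng = np.random.default_rng(0)
mu = np.zeros(1000)
mu[:5] = 2.0
X = np.vstack([rng.normal(mu, 1.0, size=(100, 1000)),
               rng.normal(-mu, 1.0, size=(100, 1000))])
labels, dims = sparse_two_gaussian_cluster(X, n_keep=5)
```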

    Feature Selection For High-Dimensional Clustering

    Full text link
    We present a nonparametric method for selecting informative features in high-dimensional clustering problems. We start with a screening step that uses a test for multimodality. Then we apply kernel density estimation and mode clustering to the selected features. The output of the method consists of a list of relevant features and cluster assignments. We provide explicit bounds on the error rate of the resulting clustering. In addition, we provide the first error bounds on mode-based clustering.
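
    A minimal sketch of this pipeline, with a crude mode count standing in for the paper's formal multimodality test: a feature is kept if its one-dimensional kernel density estimate has at least two local maxima on a grid, and mean shift (a mode-clustering algorithm) is then run on the kept features. The grid size and the default KDE and mean-shift bandwidths are assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.cluster import MeanShift

def n_modes(x, grid_size=200):
    """Count local maxima of a 1-D KDE of feature values x on a grid."""
    grid = np.linspace(x.min(), x.max(), grid_size)
    dens = gaussian_kde(x)(grid)
    mid = dens[1:-1]
    return int(np.sum((mid > dens[:-2]) & (mid > dens[2:])))

def screen_and_cluster(X):
    """Keep features that look multimodal, then mode-cluster on them."""
    relevant = [j for j in range(X.shape[1]) if n_modes(X[:, j]) >= 2]
    labels = MeanShift().fit_predict(X[:, relevant])
    return relevant, labels
```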

    Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures

    Full text link
    We consider the problem of clustering data points in high dimensions, i.e., when the number of data points may be much smaller than the number of dimensions. Specifically, we consider a Gaussian mixture model (GMM) with non-spherical Gaussian components, where the clusters are distinguished by only a few relevant dimensions. The method we propose is a combination of a recent approach for learning parameters of a Gaussian mixture model and sparse linear discriminant analysis (LDA). In addition to cluster assignments, the method returns an estimate of the set of features relevant for clustering. Our results indicate that the sample complexity of clustering depends on the sparsity of the relevant feature set, while only scaling logarithmically with the ambient dimension. Additionally, we require much milder assumptions than existing work on clustering in high dimensions. In particular, we require neither spherical clusters nor mean separation along relevant dimensions.
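
    A sketch under loose assumptions: fit a two-component GMM, then use l1-penalized logistic regression on the inferred labels as a stand-in for sparse LDA, reading the relevant features off the nonzero coefficients. The penalty level C, the covariance regularization, and the logistic stand-in are illustrative choices; the paper's actual estimator pairs a specific GMM parameter-learning method with sparse LDA.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

def sparse_gmm_cluster(X, C=0.1):
    """Cluster with a 2-component GMM, then extract a sparse feature set."""
    labels = GaussianMixture(n_components=2, covariance_type="full",
                             reg_covar=1e-3, random_state=0).fit_predict(X)
    # l1-penalized logistic regression as a sparse-LDA stand-in (assumption):
    # its nonzero coefficients play the role of the relevant-feature estimate.
    lda = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    lda.fit(X, labels)
    relevant = np.flatnonzero(lda.coef_.ravel())
    return labels, relevant
```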

    Density-sensitive semisupervised inference

    Full text link
    Semisupervised methods are techniques for using labeled data $(X_1,Y_1),\ldots,(X_n,Y_n)$ together with unlabeled data $X_{n+1},\ldots,X_N$ to make predictions. These methods invoke some assumptions that link the marginal distribution $P_X$ of $X$ to the regression function $f(x)$. For example, it is common to assume that $f$ is very smooth over high-density regions of $P_X$. Many of the methods are ad hoc and have been shown to work in specific examples but lack a theoretical foundation. We provide a minimax framework for analyzing semisupervised methods. In particular, we study methods based on metrics that are sensitive to the distribution $P_X$. Our model includes a parameter $\alpha$ that controls the strength of the semisupervised assumption. We then use the data to adapt to $\alpha$. Published in the Annals of Statistics (http://dx.doi.org/10.1214/13-AOS1092) by the Institute of Mathematical Statistics.
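
    One concrete density-sensitive metric, a power-weighted shortest-path distance, is a standard construction in this literature, though not necessarily the paper's exact choice. The sketch below uses the exponent alpha in the role of the strength parameter $\alpha$: larger values make paths of many short hops through dense, well-sampled regions cheap, encoding the semisupervised assumption; alpha = 1 behaves like plain Euclidean distance.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def density_sensitive_1nn(X_lab, y_lab, X_unlab, alpha=2.0):
    """1-NN prediction of labels under a power-weighted shortest-path metric.

    X_lab: (n, d) labeled points; y_lab: (n,) labels; X_unlab: (N - n, d).
    """
    X = np.vstack([X_lab, X_unlab])
    # Edge weight = Euclidean distance ** alpha over the complete graph on
    # all points (labeled and unlabeled); shortest paths then prefer routes
    # through dense regions whenever alpha > 1.
    D = shortest_path(cdist(X, X) ** alpha, method="D", directed=False)
    n = len(X_lab)
    nearest = D[n:, :n].argmin(axis=1)  # closest labeled point, graph metric
    return np.asarray(y_lab)[nearest]
```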

    Subspace Detection of High-Dimensional Vectors using Compressive Sampling

    No full text
    We consider the problem of detecting whether a high-dimensional vector in $\mathbb{R}^n$ lies in an $r$-dimensional subspace $S$, where $r \ll n$, given few compressive measurements of the vector. This problem arises in several applications such as detecting anomalies, targets, interference, and brain activations. In these applications, the object of interest is described by a large number of features, and the ability to detect it using only linear combinations of the features (without the need to measure, store, or compute the entire feature vector) is desirable. We present a test statistic for subspace detection from compressive samples and demonstrate that the probability of error of the proposed detector decreases exponentially in the number of compressive samples, provided that the energy off the subspace scales as $n$. Using information-theoretic lower bounds, we demonstrate that no other detector can achieve the same probability of error for weaker signals. Simulation results also indicate that this scaling is near-optimal.
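
    A natural statistic for this problem, and a plausible reading of the abstract (the paper's exact statistic and threshold calibration may differ), is the energy of the compressed observation outside the compressed subspace. Here A is the sensing matrix, U an orthonormal basis for $S$, and y = A x the compressive measurements; all names are illustrative.

```python
import numpy as np

def subspace_residual_energy(y, A, U):
    """y: (m,) compressive samples; A: (m, n) sensing matrix; U: (n, r) basis of S."""
    Q, _ = np.linalg.qr(A @ U)       # orthonormal basis for the compressed subspace
    residual = y - Q @ (Q.T @ y)     # component of y off range(A @ U)
    return residual @ residual       # declare "x not in S" when this is large
```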