57 research outputs found

    Learning Mixtures of Distributions over Large Discrete Domains

    Get PDF
    We discuss recent results giving algorithms for learning mixtures of unstructured distributions

    Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures

    Full text link
    We consider the problem of clustering data points in high dimensions, i.e. when the number of data points may be much smaller than the number of dimensions. Specifically, we consider a Gaussian mixture model (GMM) with non-spherical Gaussian components, where the clusters are distinguished by only a few relevant dimensions. The method we propose is a combination of a recent approach for learning parameters of a Gaussian mixture model and sparse linear discriminant analysis (LDA). In addition to cluster assignments, the method returns an estimate of the set of features relevant for clustering. Our results indicate that the sample complexity of clustering depends on the sparsity of the relevant feature set, while only scaling logarithmically with the ambient dimension. Additionally, we require much milder assumptions than existing work on clustering in high dimensions. In particular, we do not require spherical clusters nor necessitate mean separation along relevant dimensions.Comment: 11 pages, 1 figur

    On Convergence of Epanechnikov Mean Shift

    Full text link
    Epanechnikov Mean Shift is a simple yet empirically very effective algorithm for clustering. It localizes the centroids of data clusters via estimating modes of the probability distribution that generates the data points, using the `optimal' Epanechnikov kernel density estimator. However, since the procedure involves non-smooth kernel density functions, the convergence behavior of Epanechnikov mean shift lacks theoretical support as of this writing---most of the existing analyses are based on smooth functions and thus cannot be applied to Epanechnikov Mean Shift. In this work, we first show that the original Epanechnikov Mean Shift may indeed terminate at a non-critical point, due to the non-smoothness nature. Based on our analysis, we propose a simple remedy to fix it. The modified Epanechnikov Mean Shift is guaranteed to terminate at a local maximum of the estimated density, which corresponds to a cluster centroid, within a finite number of iterations. We also propose a way to avoid running the Mean Shift iterates from every data point, while maintaining good clustering accuracies under non-overlapping spherical Gaussian mixture models. This further pushes Epanechnikov Mean Shift to handle very large and high-dimensional data sets. Experiments show surprisingly good performance compared to the Lloyd's K-means algorithm and the EM algorithm.Comment: AAAI 201

    A Spectral Algorithm for Latent Dirichlet Allocation

    Get PDF
    Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. The increased representational power comes at the cost of a more challenging unsupervised learning problem for estimating the topic-word distributions when only words are observed, and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space

    Multi-View Clustering via Canonical Correlation Analysis

    Get PDF
    Clustering data in high-dimensions is believed to be a hard problem in general. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via Principal Components Analysis (PCA) or random projections, before clustering. Such techniques typically require stringent requirements on the separation between the cluster means (in order for the algorithm to be be successful). Here, we show how using multiple views of the data can relax these stringent requirements. We use Canonical Correlation Analysis (CCA) to project the data in each view to a lower-dimensional subspace. Under the assumption that conditioned on the cluster label the views are uncorrelated, we show that the separation conditions required for the algorithm to be successful are rather mild (significantly weaker than those of prior results in the literature). We provide results for mixture of Gaussians, mixtures of log concave distributions, and mixtures of product distributions

    How to Round Subspaces: A New Spectral Clustering Algorithm

    Full text link
    A basic problem in spectral clustering is the following. If a solution obtained from the spectral relaxation is close to an integral solution, is it possible to find this integral solution even though they might be in completely different basis? In this paper, we propose a new spectral clustering algorithm. It can recover a kk-partition such that the subspace corresponding to the span of its indicator vectors is O(opt)O(\sqrt{opt}) close to the original subspace in spectral norm with optopt being the minimum possible (opt≤1opt \le 1 always). Moreover our algorithm does not impose any restriction on the cluster sizes. Previously, no algorithm was known which could find a kk-partition closer than o(k⋅opt)o(k \cdot opt). We present two applications for our algorithm. First one finds a disjoint union of bounded degree expanders which approximate a given graph in spectral norm. The second one is for approximating the sparsest kk-partition in a graph where each cluster have expansion at most ϕk\phi_k provided ϕk≤O(λk+1)\phi_k \le O(\lambda_{k+1}) where λk+1\lambda_{k+1} is the (k+1)st(k+1)^{st} eigenvalue of Laplacian matrix. This significantly improves upon the previous algorithms, which required ϕk≤O(λk+1/k)\phi_k \le O(\lambda_{k+1}/k).Comment: Appeared in SODA 201
    • …
    corecore