Learning Mixtures of Distributions over Large Discrete Domains
We discuss recent results giving algorithms for learning mixtures of unstructured distributions.
Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures
We consider the problem of clustering data points in high dimensions, i.e.
when the number of data points may be much smaller than the number of
dimensions. Specifically, we consider a Gaussian mixture model (GMM) with
non-spherical Gaussian components, where the clusters are distinguished by only
a few relevant dimensions. The method we propose is a combination of a recent
approach for learning parameters of a Gaussian mixture model and sparse linear
discriminant analysis (LDA). In addition to cluster assignments, the method
returns an estimate of the set of features relevant for clustering. Our results
indicate that the sample complexity of clustering depends on the sparsity of
the relevant feature set, while only scaling logarithmically with the ambient
dimension. Additionally, we require much milder assumptions than existing work
on clustering in high dimensions: in particular, we require neither spherical
clusters nor mean separation along the relevant dimensions.
Comment: 11 pages, 1 figure
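The pipeline the abstract describes (cluster, select the relevant features, re-cluster on them) can be illustrated with a deliberately simplified sketch. Everything below is an illustrative assumption, not the paper's method: plain 2-means stands in for the GMM parameter-learning step and sparse LDA, and the feature score is a simple between-cluster mean gap.

```python
import numpy as np

def kmeans2(X, iters=50):
    """Plain 2-means (Lloyd's algorithm) with farthest-point initialization."""
    centers = np.stack([X[0], X[np.argmax(np.linalg.norm(X - X[0], axis=1))]])
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(1)
        centers = np.stack([X[labels == k].mean(0) for k in (0, 1)])
    return labels

def sparse_cluster(X, n_keep):
    """Cluster, rank features by the between-cluster mean gap, then
    re-cluster using only the top n_keep features."""
    labels = kmeans2(X)
    gap = np.abs(X[labels == 0].mean(0) - X[labels == 1].mean(0))
    keep = np.sort(np.argsort(gap)[-n_keep:])
    return kmeans2(X[:, keep]), keep

# Toy data (illustrative): 200 points in 50 dimensions; only the first
# 3 features carry cluster structure, the rest are pure noise.
rng = np.random.default_rng(1)
n, d, s = 200, 50, 3
y = rng.integers(0, 2, n)
X = rng.normal(size=(n, d))
X[:, :s] += 6.0 * y[:, None]          # mean separation on relevant dims only
labels, keep = sparse_cluster(X, s)
acc = max(np.mean(labels == y), np.mean(labels != y))
print("relevant features:", keep, "accuracy:", acc)
```

On this toy data the mean-gap score isolates the three relevant coordinates, matching the abstract's point that the returned feature set, not just the cluster assignment, is informative.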
On Convergence of Epanechnikov Mean Shift
Epanechnikov Mean Shift is a simple yet empirically very effective algorithm
for clustering. It localizes the centroids of data clusters via estimating
modes of the probability distribution that generates the data points, using the
'optimal' Epanechnikov kernel density estimator. However, since the procedure
involves non-smooth kernel density functions, the convergence behavior of
Epanechnikov mean shift lacks theoretical support as of this writing---most of
the existing analyses are based on smooth functions and thus cannot be applied
to Epanechnikov Mean Shift. In this work, we first show that the original
Epanechnikov Mean Shift may indeed terminate at a non-critical point, due to
the non-smoothness nature. Based on our analysis, we propose a simple remedy to
fix it. The modified Epanechnikov Mean Shift is guaranteed to terminate at a
local maximum of the estimated density, which corresponds to a cluster
centroid, within a finite number of iterations. We also propose a way to avoid
running the Mean Shift iterates from every data point, while maintaining good
clustering accuracies under non-overlapping spherical Gaussian mixture models.
This further pushes Epanechnikov Mean Shift to handle very large and
high-dimensional data sets. Experiments show surprisingly good performance
compared to Lloyd's K-means algorithm and the EM algorithm.
Comment: AAAI 201
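For intuition, here is a minimal sketch of the basic Epanechnikov mean-shift iterate the abstract analyzes: with the Epanechnikov kernel, each update moves the current point to the mean of the samples within the bandwidth. This is the original unmodified procedure, not the paper's fixed variant, and the data and bandwidth are illustrative assumptions.

```python
import numpy as np

def epanechnikov_mean_shift(X, x, h, max_iter=100):
    """Mean-shift iterates under the Epanechnikov kernel: each step
    moves x to the mean of the sample points within bandwidth h."""
    for _ in range(max_iter):
        within = np.linalg.norm(X - x, axis=1) < h
        new_x = X[within].mean(axis=0)
        if np.allclose(new_x, x):     # fixed point reached
            break
        x = new_x
    return x

# Toy data: two well-separated 2-D Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)),
               rng.normal(5.0, 0.3, (100, 2))])
m0 = epanechnikov_mean_shift(X, X[0], h=1.5)    # starts in blob 0
m1 = epanechnikov_mean_shift(X, X[150], h=1.5)  # starts in blob 1
print(m0, m1)
```

Each run converges to the mean of its blob in a handful of iterations; the paper's contribution is showing when such iterates can instead stall at a non-critical point, and how to modify the update so termination at a local density maximum is guaranteed.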
A Spectral Algorithm for Latent Dirichlet Allocation
Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. The increased representational power comes at the cost of a more challenging unsupervised learning problem for estimating the topic-word distributions when only words are observed, and the topics are hidden.
This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of topic models, including Latent Dirichlet Allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third-order moments, which may be estimated with documents containing just three words). The method, called Excess Correlation Analysis, is based on a spectral decomposition of low-order moments via two singular value decompositions (SVDs). Moreover, the algorithm is scalable, since the SVDs are carried out only on k × k matrices, where k is the number of latent factors (topics) and is typically much smaller than the dimension of the observation (word) space.
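The two-SVD skeleton can be illustrated on exact population moments of a simplified single-topic model (each document generated by one topic), which shares the algebraic structure the method exploits: whiten the pairwise moment, then diagonalize a projected third moment. This toy sketch is not the paper's full finite-sample LDA estimator; the vocabulary size, weights, and probe direction are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, k = 8, 3                                   # vocabulary size, number of topics
mu = rng.dirichlet(np.ones(V), size=k).T      # columns = topic-word distributions
w = np.array([0.5, 0.3, 0.2])                 # topic weights

# Exact population moments of the single-topic model:
#   Pairs        = sum_i w_i mu_i mu_i^T
#   Triples(eta) = sum_i w_i (eta . mu_i) mu_i mu_i^T
pairs = (mu * w) @ mu.T
eta = rng.normal(size=V)                      # random probe direction
triples = (mu * (w * (eta @ mu))) @ mu.T

# SVD 1: whiten Pairs down to the k-dimensional topic space.
U, s, _ = np.linalg.svd(pairs)
W = U[:, :k] / np.sqrt(s[:k])                 # whitener: W^T Pairs W = I_k
# Step 2: in the whitened basis, Triples(eta) is diagonalized by the
# (now orthogonal) whitened topics, so an eigendecomposition reveals them.
evals, evecs = np.linalg.eigh(W.T @ triples @ W)
rec = np.linalg.pinv(W.T) @ evecs             # un-whiten the eigenvectors
rec = rec / rec.sum(axis=0)                   # rescale back to distributions
print(np.round(rec, 3))
```

Up to column permutation, the recovered columns match the true topic-word distributions exactly, since the moments here are population quantities rather than trigram estimates.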
Multi-View Clustering via Canonical Correlation Analysis
Clustering data in high dimensions is believed to be a hard problem in general. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via Principal Components Analysis (PCA) or random projections, before clustering. Such techniques typically impose stringent requirements on the separation between the cluster means in order for the algorithm to be successful.
Here, we show how using multiple views of the data can relax these stringent requirements. We use Canonical Correlation Analysis (CCA) to project the data in each view to a lower-dimensional subspace. Under the assumption that, conditioned on the cluster label, the views are uncorrelated, we show that the separation conditions required for the algorithm to be successful are rather mild (significantly weaker than those of prior results in the literature). We provide results for mixtures of Gaussians, mixtures of log-concave distributions, and mixtures of product distributions.
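A minimal sketch of the pipeline under the stated assumption (views uncorrelated given the cluster label): compute the leading canonical direction between the two views, project one view onto it, and split. The synthetic data, dimensions, and the simple sign-thresholding step are illustrative assumptions, not the paper's exact algorithm or separation conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 400, 20
y = rng.integers(0, 2, n)
# Two views share the cluster label but have independent per-view noise,
# so they are uncorrelated conditioned on the label (the CCA assumption).
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
X1 = np.outer(y - 0.5, mu1) * 2 + rng.normal(size=(n, d))
X2 = np.outer(y - 0.5, mu2) * 2 + rng.normal(size=(n, d))

def top_cca_direction(A, B):
    """Leading canonical direction for view A against view B."""
    A = A - A.mean(0)
    B = B - B.mean(0)
    Caa, Cbb, Cab = A.T @ A / len(A), B.T @ B / len(B), A.T @ B / len(A)
    # Symmetric inverse square roots of the within-view covariances.
    Ea, Va = np.linalg.eigh(Caa)
    Eb, Vb = np.linalg.eigh(Cbb)
    Wa = (Va / np.sqrt(Ea)) @ Va.T
    Wb = (Vb / np.sqrt(Eb)) @ Vb.T
    # Top left singular vector of the whitened cross-covariance, pulled
    # back through the whitening, gives the first canonical direction.
    U, s, Vt = np.linalg.svd(Wa @ Cab @ Wb)
    return Wa @ U[:, 0]

u = top_cca_direction(X1, X2)
proj = (X1 - X1.mean(0)) @ u
labels = (proj > 0).astype(int)
acc = max(np.mean(labels == y), np.mean(labels != y))
print("clustering accuracy:", acc)
```

Because only the label-driven component is correlated across views, the top canonical direction aligns with the cluster-mean direction, and a one-dimensional projection already separates the clusters.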
How to Round Subspaces: A New Spectral Clustering Algorithm
A basic problem in spectral clustering is the following: if a solution
obtained from the spectral relaxation is close to an integral solution, is it
possible to find this integral solution even though the two might be expressed
in completely different bases? In this paper, we propose a new spectral
clustering algorithm. It can recover a k-partition such that the subspace
corresponding to the span of its indicator vectors is O(sqrt(OPT))-close to
the original subspace in spectral norm, with OPT being the minimum possible
distance (OPT <= 1 always).
Moreover, our algorithm does not impose any restriction on the cluster sizes.
Previously, no algorithm was known that could find a k-partition with a
comparable closeness guarantee.
We present two applications of our algorithm. The first finds a disjoint
union of bounded-degree expanders which approximates a given graph in spectral
norm. The second approximates the sparsest k-partition in a graph,
guaranteeing that each cluster has small expansion provided a spectral-gap
condition on the eigenvalues of the Laplacian matrix holds. This significantly
improves upon previous algorithms, which required a stronger spectral-gap
condition.
Comment: Appeared in SODA 201
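For context, the rounding problem arises in the standard spectral clustering pipeline sketched below: embed the vertices via the bottom eigenvectors of the normalized Laplacian, then round the rows of the embedding to a k-partition. The toy two-block graph and the plain 2-means rounding are illustrative assumptions; the paper's contribution is replacing this heuristic rounding step with one carrying spectral-norm guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two-block random graph: dense within blocks, sparse across.
half = 30
y = np.repeat([0, 1], half)
P = np.where(y[:, None] == y[None, :], 0.5, 0.02)
A = (rng.random((2 * half, 2 * half)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T                                    # symmetric adjacency, no self-loops

deg = A.sum(1)
L = np.eye(len(A)) - A / np.sqrt(np.outer(deg, deg))   # normalized Laplacian
evals, evecs = np.linalg.eigh(L)
emb = evecs[:, :2]                             # span of the two bottom eigenvectors

# Rounding step: simple 2-means with farthest-point init on the rows of
# the spectral embedding (the step the paper's algorithm makes rigorous).
c = np.stack([emb[0], emb[np.argmax(np.linalg.norm(emb - emb[0], axis=1))]])
for _ in range(30):
    labels = np.linalg.norm(emb[:, None] - c[None], axis=2).argmin(1)
    c = np.stack([emb[labels == k].mean(0) for k in (0, 1)])
acc = max(np.mean(labels == y), np.mean(labels != y))
print("block recovery accuracy:", acc)
```

On this well-separated toy graph even heuristic rounding recovers the blocks; the difficulty the paper addresses is doing so with provable guarantees when the relaxed subspace is only close to, and in a different basis from, the integral one.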