On Learning Mixtures of Well-Separated Gaussians
We consider the problem of efficiently learning mixtures of a large number k of
spherical Gaussians, when the components of the mixture are well separated. In
the most basic form of this problem, we are given samples from a uniform
mixture of k standard spherical Gaussians, and the goal is to estimate the
means up to accuracy δ using polynomially many samples.
In this work, we study the following question: what is the minimum separation
needed between the means for solving this task? The best known algorithm due to
Vempala and Wang [JCSS 2004] requires a separation of roughly min(k, d)^{1/4}.
On the other hand, Moitra and Valiant [FOCS 2010] showed that with separation
o(1), exponentially many samples are required. We address the significant gap
between these two bounds by showing the following results.
1. We show that with separation o(√log k), super-polynomially many
samples are required. In fact, this holds even when the means of the
Gaussians are picked at random in d = O(log k) dimensions.
2. We show that with separation Ω(√log k), poly(k, 1/δ)
samples suffice. Note that the bound on the separation is independent of
δ. This result is based on a new and efficient "accuracy boosting"
algorithm that takes as input coarse estimates of the true means and, in time
poly(k, d, 1/δ), outputs estimates of the means up to arbitrary accuracy δ,
assuming the separation between the means is Ω(√log k) (independently of δ).
We also present a computationally efficient algorithm in d = O(1) dimensions
that requires only Ω(√d) separation. These results together essentially
characterize the optimal order of separation between components that is needed
to learn a mixture of spherical Gaussians with polynomially many samples.
Comment: Appeared in FOCS 2017. 55 pages, 1 figure.
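The "accuracy boosting" step in the second result above can be caricatured in a few lines: given coarse estimates of well-separated means, one nearest-mean assignment followed by per-cluster averaging already sharpens the estimates considerably. The following is a minimal numpy sketch, not the paper's algorithm; the constants (k = 5, separation 4√log k, the coarse-noise level) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 5, 10, 500                 # components, dimension, samples per component
sep = 4 * np.sqrt(np.log(k))         # separation of order sqrt(log k)

# True means: scaled standard basis vectors, so every pair of means
# is at distance sep * sqrt(2)
true_means = sep * np.eye(k, d)
samples = np.concatenate([m + rng.standard_normal((n, d)) for m in true_means])

# Coarse estimates: the true means corrupted by noise of norm about 1
coarse = true_means + rng.standard_normal((k, d)) / np.sqrt(d)

# One boosting step: assign each sample to its nearest coarse mean,
# then re-estimate each mean as the average of its assigned samples
dists = np.linalg.norm(samples[:, None, :] - coarse[None, :, :], axis=2)
labels = dists.argmin(axis=1)
boosted = np.stack([samples[labels == j].mean(axis=0) for j in range(k)])

def err(est):
    # worst-case Euclidean error over the k mean estimates
    return np.linalg.norm(est - true_means, axis=1).max()

print(f"coarse error {err(coarse):.3f} -> boosted error {err(boosted):.3f}")
```

With this much separation, essentially every sample lands with its own component, so the boosted estimate inherits the 1/√n accuracy of a sample mean rather than the accuracy of the coarse input.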
Efficient Sparse Clustering of High-Dimensional Non-spherical Gaussian Mixtures
We consider the problem of clustering data points in high dimensions, i.e.
when the number of data points may be much smaller than the number of
dimensions. Specifically, we consider a Gaussian mixture model (GMM) with
non-spherical Gaussian components, where the clusters are distinguished by only
a few relevant dimensions. The method we propose is a combination of a recent
approach for learning parameters of a Gaussian mixture model and sparse linear
discriminant analysis (LDA). In addition to cluster assignments, the method
returns an estimate of the set of features relevant for clustering. Our results
indicate that the sample complexity of clustering depends on the sparsity of
the relevant feature set, while only scaling logarithmically with the ambient
dimension. Additionally, we require much milder assumptions than existing work
on clustering in high dimensions. In particular, we do not require spherical
clusters, nor do we require mean separation along relevant dimensions.
Comment: 11 pages, 1 figure.
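The feature-selection idea above can be illustrated with a toy sketch. This is not the paper's method (which combines a GMM parameter-learning procedure with sparse LDA); it is a minimal variance-screening example, and for simplicity it does use mean separation along the relevant coordinates, which the paper itself does not require. All constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, D, s = 200, 200, 5                # points per cluster, ambient dim, sparsity

# Two clusters that differ only in the first s coordinates
shift = np.zeros(D)
shift[:s] = 2.0
X = np.concatenate([rng.standard_normal((n, D)) + shift,
                    rng.standard_normal((n, D)) - shift])
y = np.array([0] * n + [1] * n)

# Screening: a coordinate that carries cluster structure has inflated
# marginal variance (about 1 + shift**2 instead of 1), so keep the
# s coordinates with the largest sample variance
selected = np.sort(np.argsort(X.var(axis=0))[-s:])

# Cluster using only the selected coordinates; here the two means are
# symmetric about the origin, so the sign of the coordinate sum suffices
pred = (X[:, selected].sum(axis=1) < 0).astype(int)
acc = max((pred == y).mean(), 1 - (pred == y).mean())
print(f"selected coordinates {selected.tolist()}, accuracy {acc:.3f}")
```

The point of the sketch is the scaling: picking the s relevant coordinates out of D only requires the variance gap to beat the maximum of D noise variances, which grows logarithmically in D, matching the logarithmic ambient-dimension dependence claimed above.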
Learning mixtures of separated nonspherical Gaussians
Mixtures of Gaussian (or normal) distributions arise in a variety of
application areas. Many heuristics have been proposed for the task of finding
the component Gaussians given samples from the mixture, such as the EM
algorithm, a local-search heuristic from Dempster, Laird and Rubin [J. Roy.
Statist. Soc. Ser. B 39 (1977) 1-38]. These do not provably run in polynomial
time. We present the first algorithm that provably learns the component
Gaussians in time that is polynomial in the dimension. The Gaussians may have
arbitrary shape, but they must satisfy a "separation condition" which places
a lower bound on the distance between the centers of any two component
Gaussians. The mathematical results at the heart of our proof are "distance
concentration" results (proved using isoperimetric inequalities) which
establish bounds on the probability distribution of the distance between a pair
of points generated according to the mixture. We also formalize the more
general problem of max-likelihood fit of a Gaussian mixture to unstructured
data.
Comment: Published at http://dx.doi.org/10.1214/105051604000000512 in the
Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute
of Mathematical Statistics (http://www.imstat.org).
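The EM heuristic named above fits in a dozen lines for a uniform two-component mixture in one dimension. A minimal numpy sketch follows; the fixed initialization and iteration count are illustrative assumptions, and EM carries no polynomial-time guarantee in general, which is precisely the gap the abstract's provable algorithm addresses.

```python
import numpy as np

rng = np.random.default_rng(2)
# Samples from a uniform mixture of two well-separated 1-D Gaussians
x = np.concatenate([rng.normal(-3.0, 1.0, 400), rng.normal(3.0, 1.0, 400)])

w = np.array([0.5, 0.5])             # mixing weights
mu = np.array([-1.0, 1.0])           # crude initial centers
sigma = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior responsibility of each component for each point
    # (the 1/sqrt(2*pi) factor cancels in the normalization)
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: maximum-likelihood updates given the responsibilities
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)

print(f"weights {np.round(w, 2)}, means {np.round(np.sort(mu), 2)}")
```

On well-separated data like this, EM converges quickly from a reasonable start; the hard instances are those where components overlap or the initialization is poor.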