On the Behavior of the Expectation-Maximization Algorithm for Mixture Models
Finite mixture models are among the most popular statistical models used in
different data science disciplines. Despite their broad applicability,
inference under these models typically leads to computationally challenging
non-convex problems. While the Expectation-Maximization (EM) algorithm is the
most popular approach for solving these non-convex problems, the behavior of
this algorithm is not well understood. In this work, we focus on the case of
mixtures of Laplacian (or Gaussian) distributions. We start by analyzing a simple
equally weighted mixture of two one-dimensional Laplacian distributions and
show that every local optimum of the population maximum likelihood estimation
problem is globally optimal. Then, we prove that the EM algorithm converges to
the ground truth parameters almost surely with random initialization. Our
result extends the existing results for Gaussian distributions to Laplacian
distributions. We then numerically study the behavior of mixture models with
more than two components. Motivated by our extensive numerical experiments, we
propose a novel stochastic method for estimating the means of the components of
a mixture model. Our numerical experiments show that our algorithm outperforms
the naive EM algorithm in almost all scenarios.
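To make the iteration concrete, here is a minimal sketch (not the authors' code) of sample EM for the equally weighted mixture of two one-dimensional Laplacians with known scale $b$: the E-step computes component responsibilities, and the M-step for a Laplacian location parameter reduces to a weighted median rather than a weighted mean.

```python
import numpy as np

def weighted_median(z, w):
    """Minimizer of sum_i w_i * |z_i - m| over m."""
    order = np.argsort(z)
    z, w = z[order], w[order]
    cdf = np.cumsum(w)
    return z[np.searchsorted(cdf, 0.5 * cdf[-1])]

def em_two_laplacians(x, mu0, b=1.0, iters=200):
    """EM for 0.5*Laplace(mu, b) + 0.5*Laplace(-mu, b), known scale b.
    Note: by symmetry, EM may return -mu* instead of mu*; both
    parameters describe the same mixture."""
    mu = mu0
    for _ in range(iters):
        # E-step: responsibility of the +mu component.
        w = 1.0 / (1.0 + np.exp((np.abs(x - mu) - np.abs(x + mu)) / b))
        # M-step: weighted median of x (weight w) pooled with -x (weight 1-w).
        mu = weighted_median(np.concatenate([x, -x]),
                             np.concatenate([w, 1.0 - w]))
    return mu

# Hypothetical usage: draw from the model, run EM from a random start.
rng = np.random.default_rng(0)
x = rng.laplace(loc=rng.choice([-2.0, 2.0], size=5000), scale=1.0)
print(em_two_laplacians(x, mu0=rng.normal()))
```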
Estimating the Coefficients of a Mixture of Two Linear Regressions by Expectation Maximization
We give convergence guarantees for estimating the coefficients of a symmetric
mixture of two linear regressions by expectation maximization (EM). In
particular, we show that the empirical EM iterates converge to the target
parameter vector at the parametric rate, provided the algorithm is initialized
in an unbounded cone. Specifically, if the initial guess has a sufficiently
large cosine angle with the target parameter vector, a sample-splitting version
of the EM algorithm converges to the true coefficient vector with high
probability. Interestingly, our analysis borrows from tools used in the problem
of estimating the centers of a symmetric mixture of two Gaussians by EM. We
also show that the population EM operator for mixtures of two regressions is
anti-contractive from the target parameter vector if the cosine angle between
the input vector and the target parameter vector is too small, thereby
establishing the necessity of our conic condition. Finally, we give empirical
evidence supporting this theoretical observation, which suggests that the
sample-based EM algorithm performs poorly when initial guesses are drawn
accordingly. Our simulation study also suggests that the EM algorithm performs
well even under model misspecification (i.e., when the covariate and error
distributions violate the model assumptions).
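The EM operator described above has a simple sample form under the stated model (Gaussian noise with known level); the following is a hedged sketch, with the initialization chosen to mimic the cone condition rather than the paper's exact procedure.

```python
import numpy as np

def em_mixed_regression(X, y, theta, sigma=1.0, iters=100):
    """Sample EM for y = s * <x, theta*> + N(0, sigma^2), s uniform on {-1, +1}."""
    for _ in range(iters):
        w = np.tanh(y * (X @ theta) / sigma**2)        # E-step: E[s | x, y]
        # M-step: since s^2 = 1, the weighted least squares has Gram X^T X.
        theta = np.linalg.solve(X.T @ X, X.T @ (w * y))
    return theta

rng = np.random.default_rng(1)
n, d = 4000, 5
theta_star = np.ones(d)
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n) * (X @ theta_star) + rng.normal(size=n)
theta0 = theta_star + 0.5 * rng.normal(size=d)         # inside the "good" cone
print(em_mixed_regression(X, y, theta0))
```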
On the Analysis of EM for truncated mixtures of two Gaussians
Motivated by a recent result of Daskalakis et al. 2018, we analyze the
population version of Expectation-Maximization (EM) algorithm for the case of
\textit{truncated} mixtures of two Gaussians. Truncated samples from a
$d$-dimensional mixture of two Gaussians means that a sample is only revealed
if it falls in some subset $S \subseteq \mathbb{R}^d$ of positive (Lebesgue)
measure. We show that for $d = 1$, EM converges almost surely (under random
initialization) to the true mean $\mu^*$ (the variance is known) for any
measurable set $S$. Moreover, for $d > 1$ we show EM almost surely converges
to the true mean for any measurable set $S$ when the EM map has only three
fixed points, namely $-\mu^*$, $0$, $\mu^*$ (the covariance matrix is known),
and prove local convergence if there are more than three fixed points. We also
provide convergence rates for our findings. Our
techniques deviate from those of Daskalakis et al. 2017, which heavily depend
on the symmetry that the untruncated problem exhibits. For example, for an
arbitrary measurable set $S$, it is impossible to compute a closed form of the
EM update rule. Moreover, arbitrarily truncating the mixture induces further
correlations among the variables. We circumvent these challenges by using
techniques from dynamical systems, probability, and statistics: the implicit
function theorem, stability analysis around the fixed points of the EM update
rule, and correlation inequalities (FKG).
Comment: Appeared in ALT 2020. The last version fixes a statement about rates
for the single-dimensional case.
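Since the truncated update rule has no closed form, a population EM step can still be carried out numerically: maximize the Q-function by quadrature over the truncation set. The sketch below does this for $d = 1$ with a hypothetical truncation window $S = [A, B]$ and known unit variance; it illustrates the setup, not the paper's analysis.

```python
import numpy as np
from scipy import integrate, optimize
from scipy.stats import norm

# Truncated, equally weighted mixture 0.5*N(mu*, 1) + 0.5*N(-mu*, 1),
# observed only on a hypothetical window S = [A, B]; unit variance known.
MU_STAR, A, B = 1.0, -0.5, 3.0

def mix_pdf(x, mu):
    return 0.5 * norm.pdf(x, mu, 1.0) + 0.5 * norm.pdf(x, -mu, 1.0)

Z_STAR, _ = integrate.quad(lambda x: mix_pdf(x, MU_STAR), A, B)

def em_step(mu):
    """One population EM step: maximize Q(. | mu) numerically, since the
    truncated M-step has no closed form."""
    def neg_q(mu_next):
        def integrand(x):
            w = norm.pdf(x, mu, 1.0) / (norm.pdf(x, mu, 1.0) + norm.pdf(x, -mu, 1.0))
            cll = (w * norm.logpdf(x, mu_next, 1.0)
                   + (1 - w) * norm.logpdf(x, -mu_next, 1.0))
            return cll * mix_pdf(x, MU_STAR) / Z_STAR  # expectation over truncated truth
        val, _ = integrate.quad(integrand, A, B)
        z_next, _ = integrate.quad(lambda x: mix_pdf(x, mu_next), A, B)
        return -(val - np.log(z_next))                 # Q includes -log Z(mu_next)
    return optimize.minimize_scalar(neg_q, bounds=(-10, 10), method="bounded").x

mu = 0.3                                               # arbitrary initialization
for _ in range(15):
    mu = em_step(mu)
print(mu)   # expected to approach MU_STAR (or -MU_STAR, the same mixture)
```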
Statistical Convergence of the EM Algorithm on Gaussian Mixture Models
We study the convergence behavior of the Expectation Maximization (EM)
algorithm on Gaussian mixture models with an arbitrary number of mixture
components and mixing weights. We show that as long as the means of the
components are separated by at least $\Omega(\sqrt{\min(M, d)\,\log\min(M, d)})$,
where $M$ is the number of components and $d$ is the dimension, the EM
algorithm converges locally to the global optimum of the log-likelihood.
Further, we show that the convergence rate is linear and characterize the size
of the basin of attraction to the global optimum.
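For reference, the EM updates being analyzed take the following form in the known-weight, identity-covariance setting; this is a generic sketch, not code from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def em_gmm_means(X, mus, weights, iters=100):
    """EM mean-updates for a spherical GMM with known mixing weights and
    identity covariances; mus has shape (k, d)."""
    for _ in range(iters):
        # E-step: log responsibilities, shape (n, k).
        logp = -0.5 * ((X[:, None, :] - mus[None]) ** 2).sum(-1) + np.log(weights)
        r = np.exp(logp - logsumexp(logp, axis=1, keepdims=True))
        # M-step: responsibility-weighted means.
        mus = (r[:, :, None] * X[:, None, :]).sum(0) / r.sum(0)[:, None]
    return mus

rng = np.random.default_rng(2)
true_mus = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
labels = rng.integers(3, size=3000)
X = true_mus[labels] + rng.normal(size=(3000, 2))
init = true_mus + 0.5 * rng.normal(size=(3, 2))   # local initialization
print(em_gmm_means(X, init, weights=np.full(3, 1 / 3)))
```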
Learning Mixture of Gaussians with Streaming Data
In this paper, we study the problem of learning a mixture of Gaussians with
streaming data: given a stream of points in $d$ dimensions generated by an
unknown mixture of $k$ spherical Gaussians, the goal is to estimate the model
parameters using a single pass over the data stream. We analyze a streaming
version of the popular Lloyd's heuristic and show that the algorithm estimates
all the unknown centers of the component Gaussians accurately if they are
sufficiently separated. Assuming each pair of centers is at least $C\sigma$
apart, with $C = \widetilde{\Omega}(k^{1/4})$ and where $\sigma$ is the maximum
variance of any Gaussian component, we show that asymptotically the algorithm
estimates the centers optimally (up to constants); our center separation
requirement matches the best known result for spherical Gaussians
\citep{vempalawang}. For finite samples, we show that a bias term based on
the initial estimate decreases with the number of streamed points, while the
variance decreases at a nearly optimal rate.
Our analysis requires seeding the algorithm with a good initial estimate of
the true cluster centers, for which we provide an online PCA-based clustering
algorithm. Indeed, the asymptotic per-step time complexity of our algorithm is
the optimal $O(dk)$, and its space complexity is of the same order up to
logarithmic factors.
In addition to the bias and variance terms, which tend to $0$ as more points
are streamed, the hard-thresholding-based updates of streaming Lloyd's
algorithm are agnostic to the data distribution and hence incur an
approximation error that cannot be avoided. However, by using a streaming
version of the classical (soft-thresholding-based) EM method that exploits the
Gaussian distribution explicitly, we show that for a mixture of two Gaussians
the true means can be estimated consistently, with estimation error decreasing
at a nearly optimal rate and tending to $0$ as the stream grows.
Comment: 20 pages, 1 figure
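A minimal sketch of the streaming (hard-thresholding) Lloyd update analyzed here: each arriving point moves only its nearest center, in a single pass. The PCA-based seeding the analysis requires is assumed and not shown.

```python
import numpy as np

def streaming_lloyd(stream, centers):
    """Single-pass streaming Lloyd's: each point updates only its nearest
    center (hard thresholding) with an online-mean step size."""
    counts = np.ones(len(centers))
    for x in stream:
        j = np.argmin(((centers - x) ** 2).sum(axis=1))   # hard assignment
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]        # online mean update
    return centers

rng = np.random.default_rng(3)
true = np.array([[0.0, 0.0], [10.0, 10.0]])
stream = true[rng.integers(2, size=20000)] + rng.normal(size=(20000, 2))
seed = true + rng.normal(size=(2, 2))   # stands in for the PCA-based seeding
print(streaming_lloyd(stream, seed))
```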
Alternating Minimization Converges Super-Linearly for Mixed Linear Regression
We address the problem of solving mixed random linear equations. We have
unlabeled observations coming from multiple linear regressions, and each
observation corresponds to exactly one of the regression models. The goal is to
learn the linear regressors from the observations. Classically, Alternating
Minimization (AM) (which is a variant of Expectation Maximization (EM)) is used
to solve this problem. AM iteratively alternates between the estimation of
labels and solving the regression problems with the estimated labels.
Empirically, it is observed that, for a large variety of non-convex problems
including mixed linear regression, AM converges at a much faster rate compared
to gradient-based algorithms. However, the existing theory suggests a similar
rate of convergence for AM and gradient-based methods, failing to capture this
empirical behavior. In this paper, we close this gap between theory and
practice for the special case of a mixture of linear regressions. We show
that, provided it is initialized properly, AM enjoys a \emph{super-linear} rate of
convergence in certain parameter regimes. To the best of our knowledge, this is
the first work that theoretically establishes such a rate for AM. Hence, if we
want to recover the unknown regressors up to an error of $\epsilon$ (in $\ell_2$
norm), AM only takes $\mathcal{O}(\log \log (1/\epsilon))$ iterations.
Furthermore, we compare AM with a gradient-based heuristic algorithm
empirically and show that AM dominates in iteration complexity as well as
wall-clock time.
Comment: Accepted for publication at AISTATS, 2020
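For concreteness, one AM round as described above alternates a hard label estimate with per-group least squares; the sketch below assumes a well-separated initialization, as the theory requires.

```python
import numpy as np

def am_mixed_regression(X, y, thetas, iters=25):
    """Alternating Minimization for mixed linear regression: assign each
    observation to the regressor with the smallest residual, then re-solve
    least squares within each group; thetas has shape (k, d)."""
    for _ in range(iters):
        labels = ((y[:, None] - X @ thetas.T) ** 2).argmin(axis=1)  # label step
        for j in range(thetas.shape[0]):
            mask = labels == j
            if mask.any():
                thetas[j] = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    return thetas

rng = np.random.default_rng(4)
n, d = 2000, 4
stars = np.stack([np.ones(d), -np.ones(d)])
X = rng.normal(size=(n, d))
y = np.einsum('nd,nd->n', X, stars[rng.integers(2, size=n)]) + 0.1 * rng.normal(size=n)
print(am_mixed_regression(X, y, thetas=stars + 0.3 * rng.normal(size=(2, d))))
```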
EM Converges for a Mixture of Many Linear Regressions
We study the convergence of the Expectation-Maximization (EM) algorithm for
mixtures of linear regressions with an arbitrary number $k$ of components. We
show that as long as the signal-to-noise ratio (SNR) is sufficiently large,
well-initialized EM converges to the true regression parameters. Previous
results for $k \ge 3$ had only established local convergence for the
noiseless setting, i.e., where the SNR is infinitely large. Our results enlarge
the scope to the noisy setting and, notably, we establish a statistical
error rate that is independent of the norm (or pairwise distance) of the
regression parameters. In particular, our results imply exact recovery as
$\mathrm{SNR} \to \infty$, in contrast to most previous local convergence results
for EM, where the statistical error scaled with the norm of the parameters.
Standard moment-method approaches may be applied to guarantee that we are in the
region where our local convergence guarantees apply.
Comment: SNR and initialization conditions improved from the previous version
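The soft-EM counterpart of alternating minimization for $k$ regressions replaces hard labels with responsibilities followed by weighted least squares; here is a sketch under an assumed uniform mixing and known noise level.

```python
import numpy as np
from scipy.special import softmax

def em_k_regressions(X, y, thetas, sigma=1.0, iters=50):
    """Soft EM for a uniform mixture of k linear regressions with known
    noise level sigma; thetas has shape (k, d)."""
    for _ in range(iters):
        # E-step: responsibilities from per-component residuals.
        r = softmax(-0.5 * (y[:, None] - X @ thetas.T) ** 2 / sigma**2, axis=1)
        for j in range(thetas.shape[0]):                 # M-step: weighted LS
            w = r[:, j]
            thetas[j] = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return thetas

rng = np.random.default_rng(5)
n, d, k = 3000, 4, 3
stars = 3.0 * rng.normal(size=(k, d))
X = rng.normal(size=(n, d))
y = np.einsum('nd,nd->n', X, stars[rng.integers(k, size=n)]) + 0.5 * rng.normal(size=n)
print(em_k_regressions(X, y, thetas=stars + 0.3 * rng.normal(size=(k, d))))
```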
Clustering Semi-Random Mixtures of Gaussians
Gaussian mixture models (GMM) are the most widely used statistical model for
the $k$-means clustering problem and form a popular framework for clustering in
machine learning and data analysis. In this paper, we propose a natural
semi-random model for $k$-means clustering that generalizes the Gaussian
mixture model, and that we believe will be useful in identifying robust
algorithms. In our model, a semi-random adversary is allowed to make arbitrary
"monotone" or helpful changes to the data generated from the Gaussian mixture
model.
Our first contribution is a polynomial time algorithm that provably recovers
the ground-truth up to small classification error w.h.p., assuming certain
separation between the components. Perhaps surprisingly, the algorithm we
analyze is the popular Lloyd's algorithm for $k$-means clustering that is the
method-of-choice in practice. Our second result complements the upper bound by
giving a nearly matching information-theoretic lower bound on the number of
misclassified points incurred by any $k$-means clustering algorithm on the
semi-random model.
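The algorithm analyzed is the standard batch Lloyd's iteration, sketched below for reference; the semi-random perturbations affect the data, not the algorithm.

```python
import numpy as np

def lloyds(X, centers, iters=50):
    """Batch Lloyd's algorithm for k-means: alternate nearest-center
    assignment and centroid recomputation."""
    for _ in range(iters):
        labels = ((X[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(centers.shape[0]):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

rng = np.random.default_rng(6)
X = np.concatenate([rng.normal(size=(500, 2)), rng.normal(8.0, 1.0, size=(500, 2))])
print(lloyds(X, centers=rng.normal(4.0, 4.0, size=(2, 2))))
```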
Super-resolution multi-reference alignment
We study super-resolution multi-reference alignment, the problem of
estimating a signal from many circularly shifted, down-sampled, and noisy
observations. We focus on the low SNR regime, and show that a signal in
$\mathbb{R}^M$ is uniquely determined when the number of samples per
observation is of the order of the square root of the signal's length $M$.
Phrased more informally, one can square the resolution. This result holds if
the number of observations is proportional to at least $1/\mathrm{SNR}^3$. In
contrast, with fewer observations recovery is impossible even when the
observations are not down-sampled. The analysis combines tools
from statistical signal processing and invariant theory. We design an
expectation-maximization algorithm and demonstrate that it can super-resolve
the signal in challenging SNR regimes.
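As an illustration of the EM approach mentioned above, here is a sketch for plain (non-down-sampled) multi-reference alignment: the E-step places a posterior over circular shifts, and the M-step averages the un-shifted observations. The down-sampling operator of the super-resolution setting is omitted, so this is not the paper's algorithm.

```python
import numpy as np
from scipy.special import softmax

def em_mra(Y, x, sigma, iters=100):
    """EM for multi-reference alignment without down-sampling: Y holds n
    noisy, circularly shifted copies of the length-L signal x. Recovery is
    only up to a global circular shift."""
    L = x.size
    for _ in range(iters):
        # E-step: posterior over shifts; ||roll(x, s)|| is constant in s,
        # so correlations with each shifted template suffice.
        C = np.stack([Y @ np.roll(x, s) for s in range(L)], axis=1)
        W = softmax(C / sigma**2, axis=1)
        # M-step: average the observations after un-shifting.
        x = sum(W[:, s, None] * np.roll(Y, -s, axis=1) for s in range(L)).mean(axis=0)
    return x

rng = np.random.default_rng(7)
L, n, sigma = 16, 2000, 0.5
signal = rng.normal(size=L)
Y = np.stack([np.roll(signal, s) for s in rng.integers(L, size=n)])
Y = Y + sigma * rng.normal(size=(n, L))
x_hat = em_mra(Y, x=rng.normal(size=L), sigma=sigma)
```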
An Efficient Framework for Clustered Federated Learning
We address the problem of federated learning (FL) where users are distributed
and partitioned into clusters. This setup captures settings where different
groups of users have their own objectives (learning tasks) but by aggregating
their data with others in the same cluster (same learning task), they can
leverage the strength in numbers in order to perform more efficient federated
learning. For this new framework of clustered federated learning, we propose
the Iterative Federated Clustering Algorithm (IFCA), which alternately
estimates the cluster identities of the users and optimizes model parameters
for the user clusters via gradient descent. We analyze the convergence rate of
this algorithm first in a linear model with squared loss and then for generic
strongly convex and smooth loss functions. We show that in both settings, with
good initialization, IFCA is guaranteed to converge, and discuss the optimality
of the statistical error rate. In particular, for the linear model with two
clusters, we can guarantee that our algorithm converges as long as the
initialization is slightly better than random. When the clustering structure is
ambiguous, we propose to train the models by combining IFCA with the weight
sharing technique in multi-task learning. In our experiments, we show that our
algorithm can succeed even when the initialization requirement is relaxed to
random initialization with multiple restarts. We also present experimental
results showing that our algorithm is efficient in non-convex problems such as
neural networks. We demonstrate the benefits of IFCA over the baselines on
several clustered FL benchmarks.
Comment: Preliminary results appeared at NeurIPS 2020
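One IFCA round, as described above, can be sketched as follows for a linear model with squared loss; `ifca_round` is a hypothetical helper, not the authors' implementation, and the cluster-identity estimate is simply the argmin of each user's per-model loss.

```python
import numpy as np

def ifca_round(models, user_data, lr=0.1):
    """One IFCA round (sketch): each user adopts the cluster model with the
    lowest loss on its own data, then the server averages gradients within
    each estimated cluster. Linear model with squared loss assumed."""
    k = len(models)
    grads = [np.zeros_like(m) for m in models]
    counts = [0] * k
    for X, y in user_data:
        j = int(np.argmin([np.mean((X @ m - y) ** 2) for m in models]))
        grads[j] += 2.0 * X.T @ (X @ models[j] - y) / len(y)  # local gradient
        counts[j] += 1
    return [m - lr * g / c if c else m for m, g, c in zip(models, grads, counts)]

rng = np.random.default_rng(8)
stars = [np.ones(3), -np.ones(3)]
users = []
for u in range(20):
    X = rng.normal(size=(50, 3))
    users.append((X, X @ stars[u % 2] + 0.1 * rng.normal(size=50)))
models = [s + 0.4 * rng.normal(size=3) for s in stars]   # good initialization
for _ in range(100):
    models = ifca_round(models, users)
print(models)
```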