31 research outputs found

    On the Behavior of the Expectation-Maximization Algorithm for Mixture Models

    Full text link
    Finite mixture models are among the most popular statistical models used across data science disciplines. Despite their broad applicability, inference under these models typically leads to computationally challenging non-convex problems. While the Expectation-Maximization (EM) algorithm is the most popular approach for solving these non-convex problems, its behavior is not well understood. In this work, we focus on the case of a mixture of Laplacian (or Gaussian) distributions. We start by analyzing a simple, equally weighted mixture of two one-dimensional Laplacian distributions and show that every local optimum of the population maximum likelihood estimation problem is globally optimal. We then prove that the EM algorithm converges to the ground-truth parameters almost surely under random initialization. Our result extends the existing results for Gaussian distributions to Laplacian distributions. We then numerically study the behavior of mixture models with more than two components. Motivated by our extensive numerical experiments, we propose a novel stochastic method for estimating the means of the components of a mixture model. Our numerical experiments show that our algorithm outperforms the naive EM algorithm in almost all scenarios.
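
    As a rough illustration of the setting above, here is a minimal sketch of sample EM for an equally weighted mixture of two one-dimensional Laplacian distributions with a known scale. It is not the paper's stochastic method; the function names (em_laplace_mixture, weighted_median) and the toy data are our own. The M-step uses the fact that the location MLE of a Laplace distribution is a weighted median.

```python
import numpy as np

def laplace_pdf(x, mu, b):
    """Density of Laplace(mu, b)."""
    return np.exp(-np.abs(x - mu) / b) / (2.0 * b)

def weighted_median(x, w):
    """Weighted median: minimizer of sum_i w_i |x_i - m|."""
    order = np.argsort(x)
    x, w = x[order], w[order]
    cdf = np.cumsum(w) / np.sum(w)
    return x[np.searchsorted(cdf, 0.5)]

def em_laplace_mixture(x, mu1, mu2, b=1.0, iters=100):
    """EM for an equally weighted mixture of two Laplacians with known scale b."""
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        p1 = laplace_pdf(x, mu1, b)
        p2 = laplace_pdf(x, mu2, b)
        r1 = p1 / (p1 + p2)
        # M-step: the location MLE of a Laplace is a weighted median
        mu1 = weighted_median(x, r1)
        mu2 = weighted_median(x, 1.0 - r1)
    return mu1, mu2

# toy usage with a random initialization
rng = np.random.default_rng(0)
x = np.concatenate([rng.laplace(-2.0, 1.0, 5000), rng.laplace(2.0, 1.0, 5000)])
print(em_laplace_mixture(x, *rng.normal(size=2)))
```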

    Estimating the Coefficients of a Mixture of Two Linear Regressions by Expectation Maximization

    Full text link
    We give convergence guarantees for estimating the coefficients of a symmetric mixture of two linear regressions by expectation maximization (EM). In particular, we show that the empirical EM iterates converge to the target parameter vector at the parametric rate, provided the algorithm is initialized in an unbounded cone. More precisely, if the initial guess has a sufficiently large cosine angle with the target parameter vector, a sample-splitting version of the EM algorithm converges to the true coefficient vector with high probability. Interestingly, our analysis borrows from tools used in the problem of estimating the centers of a symmetric mixture of two Gaussians by EM. We also show that the population EM operator for mixtures of two regressions is anti-contractive from the target parameter vector if the cosine angle between the input vector and the target parameter vector is too small, thereby establishing the necessity of our conic condition. Finally, we give empirical evidence supporting this theoretical observation, which suggests that the sample-based EM algorithm performs poorly when initial guesses are drawn accordingly. Our simulation study also suggests that the EM algorithm performs well even under model misspecification (i.e., when the covariate and error distributions violate the model assumptions).
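
    A minimal sketch of the EM iteration for the symmetric mixture of two linear regressions $y_i = r_i \langle x_i, \beta^* \rangle + \varepsilon_i$ with hidden signs $r_i = \pm 1$ equally likely and a known noise level. This is a plain (non-sample-splitting) version; the closed-form M-step follows from the weighted least-squares objective, and the function name and toy data are our own.

```python
import numpy as np

def em_symmetric_2mlr(X, y, beta0, sigma=1.0, iters=50):
    """EM for y_i = r_i * <x_i, beta> + noise, with r_i = +/-1 each with prob 1/2."""
    beta = beta0.copy()
    XtX = X.T @ X
    for _ in range(iters):
        # E-step: posterior probability that the hidden sign r_i = +1
        w = 1.0 / (1.0 + np.exp(-2.0 * y * (X @ beta) / sigma**2))
        # M-step: the weighted least-squares problem has this closed form
        beta = np.linalg.solve(XtX, X.T @ ((2.0 * w - 1.0) * y))
    return beta

# toy usage: initialization within a cone around the true parameter
rng = np.random.default_rng(1)
n, d = 2000, 5
X = rng.normal(size=(n, d))
beta_star = rng.normal(size=d)
y = rng.choice([-1.0, 1.0], size=n) * (X @ beta_star) + 0.1 * rng.normal(size=n)
print(em_symmetric_2mlr(X, y, beta0=beta_star + 0.5 * rng.normal(size=d), sigma=0.1))
```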

    On the Analysis of EM for truncated mixtures of two Gaussians

    Full text link
    Motivated by a recent result of Daskalakis et al. 2018, we analyze the population version of the Expectation-Maximization (EM) algorithm for the case of \textit{truncated} mixtures of two Gaussians. Truncation of samples from a $d$-dimensional mixture of two Gaussians $\frac{1}{2}\mathcal{N}(\vec{\mu}, \vec{\Sigma}) + \frac{1}{2}\mathcal{N}(-\vec{\mu}, \vec{\Sigma})$ means that a sample is only revealed if it falls in some subset $S \subset \mathbb{R}^d$ of positive (Lebesgue) measure. We show that for $d=1$, EM converges almost surely (under random initialization) to the true mean (the variance $\sigma^2$ is known) for any measurable set $S$. Moreover, for $d>1$ we show that EM almost surely converges to the true mean for any measurable set $S$ when the map of EM has only three fixed points, namely $-\vec{\mu}, \vec{0}, \vec{\mu}$ (the covariance matrix $\vec{\Sigma}$ is known), and prove local convergence if there are more than three fixed points. We also provide convergence rates for our findings. Our techniques deviate from those of Daskalakis et al. 2017, which heavily depend on the symmetry that the untruncated problem exhibits. For example, for an arbitrary measurable set $S$, it is impossible to compute a closed form of the update rule of EM. Moreover, arbitrarily truncating the mixture induces further correlations among the variables. We circumvent these challenges by using techniques from dynamical systems, probability, and statistics: the implicit function theorem, stability analysis around the fixed points of the update rule of EM, and correlation inequalities (FKG). Comment: Appeared in ALT 2020. The last version fixes a statement about rates for the single-dimensional case.
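
    For context, here is a minimal sketch of the population EM operator $M(\mu) = \mathbb{E}[X \tanh(\mu X/\sigma^2)]$ for the untruncated symmetric two-Gaussian mixture, approximated by Monte Carlo. This is the classical baseline that the truncated analysis above generalizes, not the paper's truncated update rule (which, as noted, has no closed form); the function name is our own.

```python
import numpy as np

def population_em_operator(mu, mu_star, sigma=1.0, n_mc=200_000, rng=None):
    """Monte-Carlo estimate of M(mu) = E[X * tanh(mu * X / sigma^2)]
    for X ~ 1/2 N(mu_star, sigma^2) + 1/2 N(-mu_star, sigma^2)."""
    rng = np.random.default_rng(0) if rng is None else rng
    signs = rng.choice([-1.0, 1.0], size=n_mc)
    x = signs * mu_star + sigma * rng.normal(size=n_mc)
    return np.mean(x * np.tanh(mu * x / sigma**2))

# iterate the operator from a (positive) random initialization
mu, mu_star = 0.3, 2.0
for _ in range(30):
    mu = population_em_operator(mu, mu_star)
print(mu)  # approaches +mu_star (a negative start approaches -mu_star)
```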

    Statistical Convergence of the EM Algorithm on Gaussian Mixture Models

    Full text link
    We study the convergence behavior of the Expectation-Maximization (EM) algorithm on Gaussian mixture models with an arbitrary number of mixture components and mixing weights. We show that as long as the means of the components are separated by at least $\Omega(\sqrt{\min\{M,d\}})$, where $M$ is the number of components and $d$ is the dimension, the EM algorithm converges locally to the global optimum of the log-likelihood. Further, we show that the convergence rate is linear and characterize the size of the basin of attraction to the global optimum.
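
    A minimal sketch of standard EM updates for a spherical Gaussian mixture with known variance, updating means and mixing weights. This is the generic algorithm rather than necessarily the exact variant analyzed above, and the function name is our own.

```python
import numpy as np

def em_spherical_gmm(X, mus, sigma=1.0, weights=None, iters=100):
    """EM for a spherical GMM with known variance sigma^2.
    X: (n, d) data; mus: (M, d) initial means; weights: (M,) mixing weights (uniform if None)."""
    M = mus.shape[0]
    pi = np.full(M, 1.0 / M) if weights is None else weights.copy()
    mus = mus.copy()
    for _ in range(iters):
        # E-step: responsibilities (n, M), computed in log space for stability
        sq = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        logr = np.log(pi)[None, :] - sq / (2.0 * sigma**2)
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: means are responsibility-weighted averages; weights are mean responsibilities
        mus = (r.T @ X) / r.sum(axis=0)[:, None]
        pi = r.mean(axis=0)
    return mus, pi
```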

    Learning Mixture of Gaussians with Streaming Data

    Full text link
    In this paper, we study the problem of learning a mixture of Gaussians with streaming data: given a stream of $N$ points in $d$ dimensions generated by an unknown mixture of $k$ spherical Gaussians, the goal is to estimate the model parameters using a single pass over the data stream. We analyze a streaming version of the popular Lloyd's heuristic and show that the algorithm estimates all the unknown centers of the component Gaussians accurately if they are sufficiently separated. Assuming each pair of centers is $C\sigma$ distant with $C=\Omega((k\log k)^{1/4}\sigma)$, where $\sigma^2$ is the maximum variance of any Gaussian component, we show that asymptotically the algorithm estimates the centers optimally (up to constants); our center-separation requirement matches the best known result for spherical Gaussians \citep{vempalawang}. For finite samples, we show that a bias term based on the initial estimate decreases at an $O(1/\mathrm{poly}(N))$ rate, while the variance decreases at the nearly optimal rate of $\sigma^2 d/N$. Our analysis requires seeding the algorithm with a good initial estimate of the true cluster centers, for which we provide an online-PCA-based clustering algorithm. Indeed, the asymptotic per-step time complexity of our algorithm is the optimal $d\cdot k$, while its space complexity is $O(dk\log k)$. In addition to the bias and variance terms, which tend to $0$, the hard-thresholding-based updates of the streaming Lloyd's algorithm are agnostic to the data distribution and hence incur an approximation error that cannot be avoided. However, by using a streaming version of the classical (soft-thresholding-based) EM method that exploits the Gaussian distribution explicitly, we show that for a mixture of two Gaussians the true means can be estimated consistently, with estimation error decreasing at a nearly optimal rate and tending to $0$ as $N\rightarrow \infty$. Comment: 20 pages, 1 figure.
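
    A minimal sketch of a single-pass streaming Lloyd's update in the spirit of the algorithm above: each arriving point updates only its nearest center via a running mean. Good initial centers (which the paper obtains from an online-PCA-based clustering step) are assumed to be given, and the function name is our own.

```python
import numpy as np

def streaming_lloyds(stream, centers):
    """Single-pass streaming Lloyd's heuristic.
    stream: iterable of d-dimensional points; centers: (k, d) initial estimates."""
    centers = centers.copy()
    counts = np.ones(len(centers))          # pseudo-counts for the running means
    for x in stream:
        j = np.argmin(((centers - x) ** 2).sum(axis=1))   # hard (nearest-center) assignment
        counts[j] += 1.0
        centers[j] += (x - centers[j]) / counts[j]        # running-mean update of that center
    return centers
```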

    Alternating Minimization Converges Super-Linearly for Mixed Linear Regression

    Full text link
    We address the problem of solving mixed random linear equations. We have unlabeled observations coming from multiple linear regressions, and each observation corresponds to exactly one of the regression models. The goal is to learn the linear regressors from the observations. Classically, Alternating Minimization (AM), which is a variant of Expectation Maximization (EM), is used to solve this problem. AM iteratively alternates between the estimation of labels and solving the regression problems with the estimated labels. Empirically, it is observed that, for a large variety of non-convex problems including mixed linear regression, AM converges at a much faster rate than gradient-based algorithms. However, the existing theory suggests a similar rate of convergence for AM and gradient-based methods, failing to capture this empirical behavior. In this paper, we close this gap between theory and practice for the special case of a mixture of $2$ linear regressions. We show that, provided it is initialized properly, AM enjoys a \emph{super-linear} rate of convergence in certain parameter regimes. To the best of our knowledge, this is the first work that theoretically establishes such a rate for AM. Hence, if we want to recover the unknown regressors up to an error (in $\ell_2$ norm) of $\epsilon$, AM only takes $\mathcal{O}(\log \log (1/\epsilon))$ iterations. Furthermore, we compare AM with a gradient-based heuristic algorithm empirically and show that AM dominates in iteration complexity as well as wall-clock time. Comment: Accepted for publication at AISTATS 2020.
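
    A minimal sketch of Alternating Minimization for a mixture of two linear regressions, matching the description above: a hard label step followed by per-label least squares. A reasonable initialization and non-empty groups are assumed, and the function name is our own.

```python
import numpy as np

def am_mixed_regression(X, y, beta1, beta2, iters=20):
    """Alternating Minimization for a mixture of two linear regressions."""
    for _ in range(iters):
        # label step: assign each observation to the regressor with the smaller residual
        r1 = (y - X @ beta1) ** 2
        r2 = (y - X @ beta2) ** 2
        mask = r1 <= r2
        # regression step: ordinary least squares within each group
        # (sketch assumes neither group becomes empty)
        beta1, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        beta2, *_ = np.linalg.lstsq(X[~mask], y[~mask], rcond=None)
    return beta1, beta2
```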

    EM Converges for a Mixture of Many Linear Regressions

    Full text link
    We study the convergence of the Expectation-Maximization (EM) algorithm for mixtures of linear regressions with an arbitrary number $k$ of components. We show that as long as the signal-to-noise ratio (SNR) is $\tilde{\Omega}(k)$, well-initialized EM converges to the true regression parameters. Previous results for $k \geq 3$ have only established local convergence for the noiseless setting, i.e., where the SNR is infinitely large. Our results enlarge the scope to the noisy setting and, notably, establish a statistical error rate that is independent of the norm (or pairwise distance) of the regression parameters. In particular, our results imply exact recovery as $\sigma \rightarrow 0$, in contrast to most previous local convergence results for EM, where the statistical error scales with the norm of the parameters. Standard moment-method approaches may be applied to guarantee that we are in the region where our local convergence guarantees apply. Comment: SNR and initialization conditions improved from the previous version.
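
    A minimal sketch of EM for a mixture of $k$ linear regressions with equal mixing weights and a known noise level $\sigma$. The responsibilities and weighted least-squares M-step are standard; well-initialized betas (e.g., from a moment method, as mentioned above) are assumed to be given, and the function name is our own.

```python
import numpy as np

def em_k_mixed_regression(X, y, betas, sigma=1.0, iters=50):
    """EM for a mixture of k linear regressions with equal weights.
    betas: (k, d) array of initial regression parameters."""
    betas = betas.copy()
    for _ in range(iters):
        # E-step: responsibility of component j for observation i, in log space
        resid = y[:, None] - X @ betas.T                  # (n, k)
        logw = -resid**2 / (2.0 * sigma**2)
        logw -= logw.max(axis=1, keepdims=True)
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted least squares for each component
        for j in range(betas.shape[0]):
            W = w[:, j]
            A = X.T @ (W[:, None] * X)
            b = X.T @ (W * y)
            betas[j] = np.linalg.solve(A, b)
    return betas
```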

    Clustering Semi-Random Mixtures of Gaussians

    Full text link
    Gaussian mixture models (GMMs) are the most widely used statistical models for the $k$-means clustering problem and form a popular framework for clustering in machine learning and data analysis. In this paper, we propose a natural semi-random model for $k$-means clustering that generalizes the Gaussian mixture model, and that we believe will be useful in identifying robust algorithms. In our model, a semi-random adversary is allowed to make arbitrary "monotone" or helpful changes to the data generated from the Gaussian mixture model. Our first contribution is a polynomial-time algorithm that provably recovers the ground truth up to small classification error w.h.p., assuming certain separation between the components. Perhaps surprisingly, the algorithm we analyze is the popular Lloyd's algorithm for $k$-means clustering, which is the method of choice in practice. Our second result complements the upper bound by giving a nearly matching information-theoretic lower bound on the number of misclassified points incurred by any $k$-means clustering algorithm on the semi-random model.
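
    A minimal sketch of plain Lloyd's iterations for $k$-means, the algorithm analyzed above. Initial centers are assumed to be given, and the function name is our own.

```python
import numpy as np

def lloyds(X, centers, iters=100):
    """Plain Lloyd's iterations: assign points, then recompute centroids."""
    centers = centers.copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # assignment step: nearest current center for every point
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update step: each center moves to the mean of its assigned points
        for j in range(centers.shape[0]):
            pts = X[labels == j]
            if len(pts) > 0:
                centers[j] = pts.mean(axis=0)
    return centers, labels
```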

    Super-resolution multi-reference alignment

    Full text link
    We study super-resolution multi-reference alignment, the problem of estimating a signal from many circularly shifted, down-sampled, and noisy observations. We focus on the low SNR regime, and show that a signal in $\mathbb{R}^M$ is uniquely determined when the number $L$ of samples per observation is of the order of the square root of the signal's length $(L=O(\sqrt{M}))$. Phrased more informally, one can square the resolution. This result holds if the number of observations is proportional to at least $1/\mathrm{SNR}^3$. In contrast, with fewer observations recovery is impossible even when the observations are not down-sampled $(L=M)$. The analysis combines tools from statistical signal processing and invariant theory. We design an expectation-maximization algorithm and demonstrate that it can super-resolve the signal in challenging SNR regimes.
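
    For context, a minimal sketch of EM for basic multi-reference alignment without down-sampling (the $L=M$ case). The paper's algorithm additionally handles down-sampled observations, so this is only the standard baseline, with our own function name and interface.

```python
import numpy as np

def em_mra(Y, x0, sigma, iters=100):
    """EM for basic multi-reference alignment (full observations, no down-sampling).
    Y: (n, M) circularly shifted noisy copies; x0: (M,) initial signal guess."""
    n, M = Y.shape
    x = x0.copy()
    for _ in range(iters):
        # all circular shifts of the current signal estimate, shape (M, M)
        shifts = np.stack([np.roll(x, l) for l in range(M)])
        # E-step: posterior over shifts; ||roll(x, l)||^2 is the same for every l,
        # so only the correlation term matters
        logw = Y @ shifts.T / sigma**2                    # (n, M)
        logw -= logw.max(axis=1, keepdims=True)
        w = np.exp(logw)
        w /= w.sum(axis=1, keepdims=True)
        # M-step: average the observations after undoing each candidate shift
        x = np.zeros(M)
        for l in range(M):
            x += (w[:, l][:, None] * np.roll(Y, -l, axis=1)).sum(axis=0)
        x /= n
    return x
```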

    An Efficient Framework for Clustered Federated Learning

    Full text link
    We address the problem of federated learning (FL) where users are distributed and partitioned into clusters. This setup captures settings where different groups of users have their own objectives (learning tasks), but by aggregating their data with others in the same cluster (same learning task), they can leverage strength in numbers to perform more efficient federated learning. For this new framework of clustered federated learning, we propose the Iterative Federated Clustering Algorithm (IFCA), which alternately estimates the cluster identities of the users and optimizes the model parameters for the user clusters via gradient descent. We analyze the convergence rate of this algorithm first in a linear model with squared loss and then for generic strongly convex and smooth loss functions. We show that in both settings, with good initialization, IFCA is guaranteed to converge, and we discuss the optimality of the statistical error rate. In particular, for the linear model with two clusters, we can guarantee that our algorithm converges as long as the initialization is slightly better than random. When the clustering structure is ambiguous, we propose to train the models by combining IFCA with the weight-sharing technique in multi-task learning. In our experiments, we show that our algorithm can succeed even if we relax the initialization requirements and instead use random initialization with multiple restarts. We also present experimental results showing that our algorithm is efficient on non-convex problems such as neural networks. We demonstrate the benefits of IFCA over the baselines on several clustered FL benchmarks. Comment: Preliminary results appeared at NeurIPS 2020.
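
    A minimal sketch of IFCA for linear models with squared loss, following the description above: each user picks the cluster model with the lowest loss on its own data and contributes a gradient to that cluster's update. The interface (users as a list of (X, y) pairs) and the function name are our own simplifications.

```python
import numpy as np

def ifca_linear(users, thetas, step=0.1, rounds=50):
    """IFCA sketch for linear models with squared loss.
    users: list of (X_i, y_i) per user; thetas: (k, d) initial cluster models."""
    thetas = thetas.copy()
    k, d = thetas.shape
    for _ in range(rounds):
        grads = np.zeros((k, d))
        counts = np.zeros(k)
        for X, y in users:
            # cluster-identity step: pick the model with the lowest loss on this user's data
            losses = [np.mean((y - X @ thetas[j]) ** 2) for j in range(k)]
            j = int(np.argmin(losses))
            # local gradient of the mean squared loss for the chosen cluster
            grads[j] += 2.0 * X.T @ (X @ thetas[j] - y) / len(y)
            counts[j] += 1.0
        # server step: gradient-descent update per cluster, averaged over its users
        for j in range(k):
            if counts[j] > 0:
                thetas[j] -= step * grads[j] / counts[j]
    return thetas
```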