    Manifold Optimization for Gaussian Mixture Models

    We take a new look at parameter estimation for Gaussian Mixture Models (GMMs). In particular, we propose using \emph{Riemannian manifold optimization} as a powerful counterpart to Expectation Maximization (EM). An out-of-the-box invocation of manifold optimization, however, fails spectacularly: it converges to the same solution but vastly slower. Driven by intuition from manifold convexity, we then propose a reparametrization that has remarkable empirical consequences. It makes manifold optimization not only match EM (a highly encouraging result in itself, given the poor record nonlinear programming methods have had against EM so far) but also outperform EM in many practical settings, while displaying much less variability in running times. We further highlight the strengths of manifold optimization by developing a somewhat tuned manifold LBFGS method that proves even more competitive and reliable than existing manifold optimization tools. We hope that our results encourage a wider consideration of manifold optimization for parameter estimation problems. Comment: 19 page

    An Alternative to EM for Gaussian Mixture Models: Batch and Stochastic Riemannian Optimization

    We consider maximum likelihood estimation for Gaussian Mixture Models (GMMs). This task is almost invariably solved (in theory and practice) via the Expectation Maximization (EM) algorithm. EM owes its success to various factors, of which its ability to fulfill positive definiteness constraints in closed form is of key importance. We propose an alternative to EM by appealing to the rich Riemannian geometry of positive definite matrices, using which we cast GMM parameter estimation as a Riemannian optimization problem. Surprisingly, such an out-of-the-box Riemannian formulation completely fails and proves much inferior to EM. This motivates us to take a closer look at the problem geometry and derive a better formulation that is much more amenable to Riemannian optimization. We then develop (Riemannian) batch and stochastic gradient algorithms that outperform EM, often substantially. We provide a non-asymptotic convergence analysis for our stochastic method, which is also the first (to our knowledge) such global analysis for Riemannian stochastic gradient. Numerous empirical results are included to demonstrate the effectiveness of our methods. Comment: 21 pages, 6 figure

    MixEst: An Estimation Toolbox for Mixture Models

    Mixture models are powerful statistical models used in many applications ranging from density estimation to clustering and classification. When dealing with mixture models, there are many issues that the experimenter should be aware of and needs to solve. The MixEst toolbox is a powerful and user-friendly package for MATLAB that implements several state-of-the-art approaches to address these problems. Additionally, MixEst offers manifold optimization for fitting the density model, a feature specific to this toolbox. MixEst simplifies the use and integration of mixture models in statistical models and applications. To develop mixture models of new densities, the user only needs to provide a few functions for that statistical distribution, and the toolbox takes care of all the issues regarding mixture models. MixEst is available at visionlab.ut.ac.ir/mixest; it is fully documented and licensed under the GPL. Comment: 5 page

    Mixtures of Multivariate Power Exponential Distributions

    An expanded family of mixtures of multivariate power exponential distributions is introduced. While fitting heavy tails and skewness has received much attention in the model-based clustering literature recently, we investigate the use of a distribution that can deal with both varying tail weight and peakedness of data. A family of parsimonious models is proposed using an eigen-decomposition of the scale matrix. A generalized expectation-maximization algorithm is presented that combines convex optimization via a minorization-maximization approach with optimization based on accelerated line search algorithms on the Stiefel manifold. Lastly, the utility of this family of models is illustrated using both toy and benchmark data.

    A review of mean-shift algorithms for clustering

    A natural way to characterize the cluster structure of a dataset is by finding regions containing a high density of data. This can be done in a nonparametric way with a kernel density estimate, whose modes and hence clusters can be found using mean-shift algorithms. We describe the theory and practice behind clustering based on kernel density estimates and mean-shift algorithms. We discuss the blurring and non-blurring versions of mean-shift; theoretical results about mean-shift algorithms and Gaussian mixtures; relations with scale-space theory, spectral clustering and other algorithms; extensions to tracking, to manifold and graph data, and to manifold denoising; K-modes and Laplacian K-modes algorithms; acceleration strategies for large datasets; and applications to image segmentation, manifold denoising and multivalued regression. Comment: 28 pages, 9 figures. Invited book chapter to appear in the CRC Handbook of Cluster Analysis (eds. Roberto Rocci, Fionn Murtagh, Marina Meila and Christian Hennig
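The non-blurring variant mentioned in this abstract can be sketched in a few lines. This is a minimal illustration and not the chapter's own code: it assumes a Gaussian kernel with a single bandwidth, and the function names `mean_shift` and `label_modes`, the tolerances, and the toy data are all hypothetical choices made for the example.

```python
import numpy as np

def mean_shift(points, bandwidth, n_iter=200, tol=1e-8):
    """Non-blurring Gaussian mean-shift: the dataset stays fixed while one
    iterate per data point climbs toward a mode of the kernel density estimate."""
    data = np.asarray(points, dtype=float)   # fixed points defining the KDE
    modes = data.copy()                      # iterates, initialised at the data
    for _ in range(n_iter):
        # Squared distances between every iterate and every data point: (n, n).
        d2 = ((modes[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
        w = np.exp(-0.5 * d2 / bandwidth**2)           # Gaussian kernel weights
        new_modes = (w @ data) / w.sum(axis=1, keepdims=True)
        shift = np.abs(new_modes - modes).max()
        modes = new_modes
        if shift < tol:                                # all iterates have settled
            break
    return modes

def label_modes(modes, merge_tol):
    """Give the same label to points whose iterates ended within merge_tol."""
    labels = np.full(len(modes), -1, dtype=int)
    centers = []
    for i, m in enumerate(modes):
        for k, c in enumerate(centers):
            if np.linalg.norm(m - c) < merge_tol:
                labels[i] = k
                break
        else:
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

# Toy data (assumed for illustration): two tight groups far apart.
pts = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
                [5.0, 5.0], [5.2, 5.0], [5.0, 5.2]])
labels, centers = label_modes(mean_shift(pts, bandwidth=1.0), merge_tol=0.1)
```

The blurring variant would instead replace `data` by the updated iterates after every pass, so the dataset itself contracts; the sketch keeps the data fixed to stay with the non-blurring case.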

    Free Component Analysis: Theory, Algorithms & Applications

    We describe a method for unmixing mixtures of freely independent random variables in a manner analogous to the independent component analysis (ICA) based method for unmixing independent random variables from their additive mixtures. Random matrices play the role of free random variables in this context, so the method we develop, which we call free component analysis (FCA), unmixes matrices from additive mixtures of matrices. Thus, while the mixing model is standard, the novelty and difference in unmixing performance come from the introduction of new statistical criteria, derived from free probability theory, that quantify freeness analogously to how kurtosis and entropy quantify independence. We describe the theory and the various algorithms, and compare FCA to vanilla ICA, which does not account for spatial or temporal structure. We highlight why the statistical criteria make FCA also vanilla despite its matricial underpinnings, and show that FCA performs comparably to, and often better than, (vanilla) ICA in every application, such as image and speech unmixing, where ICA has been known to succeed. Our computational experiments suggest that not-so-random matrices, such as images and spectrograms of waveforms, are (closer to being) freer "in the wild" than we might have theoretically expected. Comment: 68 pages, 16 figure

    When Gaussian Process Meets Big Data: A Review of Scalable GPs

    The vast quantity of information brought by big data, together with evolving computer hardware, has encouraged success stories in the machine learning community. Meanwhile, it poses challenges for Gaussian process (GP) regression, a well-known non-parametric and interpretable Bayesian model, which suffers from cubic complexity in the data size. To improve scalability while retaining desirable prediction quality, a variety of scalable GPs have been presented. However, they have not yet been comprehensively reviewed and analyzed so as to be well understood by both academia and industry. A review of scalable GPs is timely and important given the explosion of data size. To this end, this paper reviews state-of-the-art scalable GPs in two main categories: global approximations, which distill the entire data, and local approximations, which divide the data for subspace learning. For global approximations, we mainly focus on sparse approximations, comprising prior approximations, which modify the prior but perform exact inference; posterior approximations, which retain the exact prior but perform approximate inference; and structured sparse approximations, which exploit specific structures in the kernel matrix. For local approximations, we highlight the mixture/product of experts, which conducts model averaging from multiple local experts to boost predictions. To present a complete review, recent advances for improving the scalability and capability of scalable GPs are reviewed. Finally, the extensions and open issues regarding the implementation of scalable GPs in various scenarios are reviewed and discussed to inspire novel ideas for future research avenues. Comment: 20 pages, 6 figure

    Out-of-Sample Extension for Dimensionality Reduction of Noisy Time Series

    This paper proposes an out-of-sample extension framework for a global manifold learning algorithm (Isomap) that uses temporal information in out-of-sample points in order to make the embedding more robust to noise and artifacts. Given a set of noise-free training data and its embedding, the proposed framework extends the embedding for a noisy time series. This is achieved by adding a spatio-temporal compactness term to the optimization objective of the embedding. To the best of our knowledge, this is the first method for out-of-sample extension of manifold embeddings that leverages timing information available for the extension set. Experimental results demonstrate that our out-of-sample extension algorithm renders a more robust and accurate embedding of sequentially ordered image data in the presence of various noise and artifacts when compared to other timing-aware embeddings. Additionally, we show that an out-of-sample extension framework based on the proposed algorithm outperforms the state of the art in eye-gaze estimation.

    On $w$-mixtures: Finite convex combinations of prescribed component distributions

    We consider the space of $w$-mixtures, which is defined as the set of finite statistical mixtures sharing the same prescribed component distributions, closed under convex combinations. The information geometry induced by the Bregman generator set to the Shannon negentropy on this space yields a dually flat space called the mixture family manifold. We show how the Kullback-Leibler (KL) divergence can be recovered from the corresponding Bregman divergence for the negentropy generator: that is, the KL divergence between two $w$-mixtures amounts to a Bregman divergence (BD) induced by the Shannon negentropy generator. Thus the KL divergence between two Gaussian Mixture Models (GMMs) sharing the same Gaussian components is equivalent to a Bregman divergence. This KL-BD equivalence on a mixture family manifold implies that we can perform optimal KL-averaging aggregation of $w$-mixtures without information loss. More generally, we prove that the statistical skew Jensen-Shannon divergence between $w$-mixtures is equivalent to a skew Jensen divergence between their corresponding parameters. Finally, we state several properties, divergence identities, and inequalities relating to $w$-mixtures. Comment: 31 pages, extend a preliminary paper (ICASSP 2018
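The KL-BD equivalence stated in this abstract can be checked numerically. The sketch below is an illustration under simplifying assumptions rather than the paper's construction: the prescribed components are hypothetical distributions on a four-point support (so every integral is an exact finite sum), and the mixture is parameterised by the free weights `eta` on all components but the first.

```python
import numpy as np

# Hypothetical prescribed components on a 4-point support (rows sum to 1).
P = np.array([
    [0.70, 0.10, 0.10, 0.10],   # p_0, the reference component
    [0.10, 0.60, 0.20, 0.10],   # p_1
    [0.05, 0.15, 0.30, 0.50],   # p_2
])

def mixture(eta):
    """w-mixture with free weights eta on p_1, p_2; p_0 takes the remainder."""
    eta = np.asarray(eta, dtype=float)
    return (1.0 - eta.sum()) * P[0] + eta @ P[1:]

def negentropy(eta):
    """Shannon negentropy F(eta) = sum_x m(x) log m(x), the Bregman generator."""
    m = mixture(eta)
    return float(np.sum(m * np.log(m)))

def grad_negentropy(eta):
    """Gradient of F: dF/d eta_i = sum_x (p_i(x) - p_0(x)) (log m(x) + 1)."""
    m = mixture(eta)
    return (P[1:] - P[0]) @ (np.log(m) + 1.0)

def bregman(eta1, eta2):
    """Bregman divergence B_F(eta1 : eta2) for the negentropy generator."""
    eta1, eta2 = np.asarray(eta1, float), np.asarray(eta2, float)
    return negentropy(eta1) - negentropy(eta2) - (eta1 - eta2) @ grad_negentropy(eta2)

def kl(eta1, eta2):
    """KL divergence between the two mixtures, computed directly."""
    m1, m2 = mixture(eta1), mixture(eta2)
    return float(np.sum(m1 * np.log(m1 / m2)))

eta_a, eta_b = [0.2, 0.3], [0.5, 0.1]
gap = abs(bregman(eta_a, eta_b) - kl(eta_a, eta_b))  # zero up to rounding
```

The identity holds because the inner product of `eta1 - eta2` with the gradient equals the sum of `(m1 - m2) * log(m2)`; the `+ 1` term cancels since both mixtures sum to one. For continuous components such as Gaussians the sums become integrals, which generally have to be approximated numerically.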

    Riemannian Gaussian Distributions on the Space of Symmetric Positive Definite Matrices

    Data which lie in the space $\mathcal{P}_m$ of $m \times m$ symmetric positive definite matrices (sometimes called tensor data) play a fundamental role in applications including medical imaging, computer vision, and radar signal processing. An open challenge for these applications is to find a class of probability distributions which is able to capture the statistical properties of data in $\mathcal{P}_m$ as they arise in real-world situations. The present paper meets this challenge by introducing Riemannian Gaussian distributions on $\mathcal{P}_m$. Distributions of this kind were first considered by Pennec in 2006. However, the present paper gives an exact expression of their probability density function for the first time in the existing literature. This leads to two original contributions. First, a detailed study of statistical inference for Riemannian Gaussian distributions, uncovering the connection between maximum likelihood estimation and the concept of Riemannian centre of mass, widely used in applications. Second, the derivation and implementation of an expectation-maximisation algorithm for the estimation of mixtures of Riemannian Gaussian distributions. The paper applies this new algorithm to the classification of data in $\mathcal{P}_m$ (concretely, to the problem of texture classification in computer vision), showing that it yields significantly better performance in comparison to recent approaches. Comment: 21 pages, 1 table; accepted for publication in IEEE Trans Inf Theor
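To make the link between maximum likelihood estimation and the Riemannian centre of mass concrete, here is a minimal sketch. It rests on assumptions not spelled out in the abstract: the distance is taken to be the affine-invariant (Rao) distance on symmetric positive definite matrices, the density is taken to be proportional to exp(-d^2(Y, Ybar) / (2 sigma^2)) with the normalising factor left out, and the centre of mass is computed with a plain unit-step fixed-point iteration. All function names are hypothetical.

```python
import numpy as np

def _sym_fun(A, fun):
    """Apply a scalar function to a symmetric matrix through its eigenvalues."""
    A = (A + A.T) / 2.0                      # guard against rounding asymmetry
    w, V = np.linalg.eigh(A)
    return (V * fun(w)) @ V.T

def rao_distance(Y, Z):
    """Affine-invariant distance: sqrt(sum_i log^2 lambda_i(Y^{-1} Z))."""
    L_inv = np.linalg.inv(np.linalg.cholesky(Y))
    M = L_inv @ Z @ L_inv.T                  # same eigenvalues as Y^{-1} Z
    lam = np.linalg.eigvalsh((M + M.T) / 2.0)
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))

def unnormalised_log_density(Y, Y_bar, sigma):
    """Assumed form: -d^2(Y, Y_bar) / (2 sigma^2), normalising term omitted."""
    return -rao_distance(Y, Y_bar) ** 2 / (2.0 * sigma**2)

def riemannian_centre_of_mass(mats, n_iter=100, tol=1e-12):
    """Minimise the sum of squared Rao distances by fixed-point iteration."""
    mats = [np.asarray(Y, dtype=float) for Y in mats]
    X = np.mean(mats, axis=0)                # arithmetic mean as starting point
    for _ in range(n_iter):
        X_half = _sym_fun(X, np.sqrt)
        X_ihalf = _sym_fun(X, lambda w: 1.0 / np.sqrt(w))
        # Average of the data mapped to the tangent space at X.
        T = np.mean([_sym_fun(X_ihalf @ Y @ X_ihalf, np.log) for Y in mats], axis=0)
        X = X_half @ _sym_fun(T, np.exp) @ X_half
        if np.linalg.norm(T) < tol:          # tangent-space mean vanished
            break
    return X

# For commuting matrices the centre of mass is the entrywise geometric mean.
centre = riemannian_centre_of_mass([np.diag([1.0, 4.0]), np.diag([4.0, 1.0])])
```

Under the assumed density, maximising the likelihood over `Y_bar` for fixed `sigma` amounts to minimising the sum of squared distances to the data, which is what `riemannian_centre_of_mass` targets.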