Manifold Optimization for Gaussian Mixture Models
We take a new look at parameter estimation for Gaussian Mixture Models
(GMMs). In particular, we propose using \emph{Riemannian manifold optimization}
as a powerful counterpart to Expectation Maximization (EM). An out-of-the-box
invocation of manifold optimization, however, fails spectacularly: it converges
to the same solution but vastly slower. Driven by intuition from manifold
convexity, we then propose a reparameterization that has remarkable empirical
consequences. It makes manifold optimization not only match EM---a highly
encouraging result in itself given the poor record nonlinear programming
methods have had against EM so far---but also outperform EM in many practical
settings, while displaying much less variability in running times. We further
highlight the strengths of manifold optimization by developing a somewhat tuned
manifold LBFGS method that proves even more competitive and reliable than
existing manifold optimization tools. We hope that our results encourage a
wider consideration of manifold optimization for parameter estimation problems.
Comment: 19 pages.
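For context, the reparameterization at the heart of this line of work (our reconstruction from the related literature, not a formula quoted from this abstract) lifts each component's mean and covariance into a single augmented symmetric positive definite (SPD) matrix, so that the whole problem lives on a product of SPD manifolds:

\[
  (\mu_j, \Sigma_j) \;\longmapsto\;
  S_j \;=\;
  \begin{pmatrix}
    \Sigma_j + \mu_j \mu_j^{\top} & \mu_j \\
    \mu_j^{\top} & 1
  \end{pmatrix}
  \;\succ\; 0 .
\]

Under this lifting each Gaussian component's log-density becomes a function of a single SPD variable, which is what makes the manifold geometry (and geodesic-convexity arguments) applicable and, empirically, what lets Riemannian solvers match EM.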
An Alternative to EM for Gaussian Mixture Models: Batch and Stochastic Riemannian Optimization
We consider maximum likelihood estimation for Gaussian Mixture Models (GMMs).
This task is almost invariably solved (in theory and practice) via the
Expectation Maximization (EM) algorithm. EM owes its success to various
factors, of which its ability to fulfill positive definiteness constraints
in closed form is of key importance. We propose an alternative to EM by
appealing to the rich Riemannian geometry of positive definite matrices, using
which we cast GMM parameter estimation as a Riemannian optimization problem.
Surprisingly, such an out-of-the-box Riemannian formulation completely fails
and proves much inferior to EM. This motivates us to take a closer look at the
problem geometry, and derive a better formulation that is much more amenable to
Riemannian optimization. We then develop (Riemannian) batch and stochastic
gradient algorithms that outperform EM, often substantially. We provide a
non-asymptotic convergence analysis for our stochastic method, which is also
the first (to our knowledge) such global analysis for Riemannian stochastic
gradient. Numerous empirical results are included to demonstrate the
effectiveness of our methods.
Comment: 21 pages, 6 figures.
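To make the ingredients concrete, here is a minimal sketch (not the authors' code) of the basic building block such batch and stochastic methods share: one Riemannian gradient step on the manifold of symmetric positive definite matrices under the affine-invariant metric, with the exponential map used as the retraction. The function `euclidean_grad` is a placeholder for whatever routine returns the ordinary Euclidean gradient of the (mini-batch) log-likelihood.

```python
# Hedged sketch: one Riemannian gradient ascent step on the SPD manifold.
import numpy as np
from scipy.linalg import expm, sqrtm

def spd_step(S, euclidean_grad, step_size):
    """Take one ascent step at the SPD matrix S along the Riemannian gradient."""
    G = euclidean_grad(S)
    G = 0.5 * (G + G.T)                 # symmetrize the Euclidean gradient
    rgrad = S @ G @ S                   # Riemannian gradient under the affine-invariant metric
    S_half = np.real(sqrtm(S))
    S_half_inv = np.linalg.inv(S_half)
    inner = S_half_inv @ (step_size * rgrad) @ S_half_inv
    inner = 0.5 * (inner + inner.T)     # guard against round-off asymmetry
    S_new = S_half @ expm(inner) @ S_half   # exponential-map retraction back onto the manifold
    return 0.5 * (S_new + S_new.T)
```

A stochastic variant simply evaluates `euclidean_grad` on a random mini-batch and decays `step_size` over iterations.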
MixEst: An Estimation Toolbox for Mixture Models
Mixture models are powerful statistical models used in many applications
ranging from density estimation to clustering and classification. When dealing
with mixture models, there are many issues that the experimenter should be
aware of and needs to solve. The MixEst toolbox is a powerful and user-friendly
package for MATLAB that implements several state-of-the-art approaches to
address these problems. Additionally, MixEst offers the option of using
manifold optimization for fitting the density model, a feature specific to this
toolbox. MixEst simplifies the use and integration of mixture models in
statistical models and applications. For developing mixture models of new
densities, the user just needs to provide a few functions for that statistical
distribution and the toolbox takes care of all the issues regarding mixture
models. MixEst is available at visionlab.ut.ac.ir/mixest, is fully
documented, and is licensed under the GPL.
Comment: 5 pages.
Mixtures of Multivariate Power Exponential Distributions
An expanded family of mixtures of multivariate power exponential
distributions is introduced. While fitting heavy tails and skewness has
received much attention in the model-based clustering literature recently, we
investigate the use of a distribution that can deal with both varying
tail-weight and peakedness of data. A family of parsimonious models is proposed
using an eigen-decomposition of the scale matrix. A generalized
expectation-maximization algorithm is presented that combines convex
optimization via a minorization-maximization approach and optimization based on
accelerated line search algorithms on the Stiefel manifold. Lastly, the utility
of this family of models is illustrated using both toy and benchmark data.
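As a rough sketch of the two ingredients named above (our paraphrase; the normalizing constant, which depends on the dimension and on the shape parameter beta, is omitted), the multivariate power exponential density and the eigen-decomposition of its scale matrix read

\[
  f(\mathbf{x}\mid \boldsymbol{\mu}, \boldsymbol{\Sigma}, \beta)
  \;\propto\;
  |\boldsymbol{\Sigma}|^{-1/2}
  \exp\!\Big\{ -\tfrac{1}{2}
    \big[ (\mathbf{x}-\boldsymbol{\mu})^{\top}
          \boldsymbol{\Sigma}^{-1}
          (\mathbf{x}-\boldsymbol{\mu}) \big]^{\beta} \Big\},
  \qquad
  \boldsymbol{\Sigma} \;=\; \lambda\, \boldsymbol{\Gamma}\boldsymbol{\Delta}\boldsymbol{\Gamma}^{\top} .
\]

Here beta controls tail-weight and peakedness, while constraining the volume lambda, the orientation Gamma (the Stiefel-manifold variable the accelerated line search operates on), and the shape Delta to be equal or variable across components is what generates the parsimonious family.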
A review of mean-shift algorithms for clustering
A natural way to characterize the cluster structure of a dataset is by
finding regions containing a high density of data. This can be done in a
nonparametric way with a kernel density estimate, whose modes and hence
clusters can be found using mean-shift algorithms. We describe the theory and
practice behind clustering based on kernel density estimates and mean-shift
algorithms. We discuss the blurring and non-blurring versions of mean-shift;
theoretical results about mean-shift algorithms and Gaussian mixtures;
relations with scale-space theory, spectral clustering and other algorithms;
extensions to tracking, to manifold and graph data, and to manifold denoising;
K-modes and Laplacian K-modes algorithms; acceleration strategies for large
datasets; and applications to image segmentation, manifold denoising and
multivalued regression.
Comment: 28 pages, 9 figures. Invited book chapter to appear in the CRC
Handbook of Cluster Analysis (eds. Roberto Rocci, Fionn Murtagh, Marina Meila
and Christian Hennig).
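For concreteness, a minimal sketch of the (non-blurring) Gaussian mean-shift iteration discussed in this chapter, assuming a simple isotropic bandwidth: each query point is repeatedly moved to the kernel-weighted mean of the data until it settles at a mode of the kernel density estimate.

```python
# Hedged sketch of Gaussian mean-shift for a single query point.
import numpy as np

def mean_shift(X, x, bandwidth, n_iter=100, tol=1e-6):
    """Run mean-shift from the starting point x over the data matrix X (n x d)."""
    for _ in range(n_iter):
        w = np.exp(-0.5 * np.sum((X - x) ** 2, axis=1) / bandwidth ** 2)  # Gaussian kernel weights
        x_new = w @ X / w.sum()          # kernel-weighted mean of the data
        if np.linalg.norm(x_new - x) < tol:
            return x_new                 # converged to a density mode
        x = x_new
    return x
```

Points whose iterations converge to the same mode are assigned to the same cluster; the blurring variant instead updates all data points simultaneously at every pass.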
Free Component Analysis: Theory, Algorithms & Applications
We describe a method for unmixing mixtures of freely independent random
variables in a manner analogous to the independent component analysis (ICA)
based method for unmixing independent random variables from their additive
mixtures. Random matrices play the role of free random variables in this
context, so the method we develop, which we call Free component analysis (FCA),
unmixes matrices from additive mixtures of matrices. Thus, while the mixing
model is standard, the novelty and difference in unmixing performance comes
from the introduction of new statistical criteria, derived from free
probability theory, that quantify freeness analogously to how kurtosis and
entropy quantify independence. We describe the theory, the various algorithms,
and compare FCA to vanilla ICA which does not account for spatial or temporal
structure. We highlight why the statistical criteria make FCA also vanilla
despite its matricial underpinnings and show that FCA performs comparably to,
and often better than, (vanilla) ICA in every application, such as image and
speech unmixing, where ICA has been known to succeed. Our computational
experiments suggest that not-so-random matrices, such as images and
spectrograms of waveforms, are (closer to being) freer "in the wild" than we
might have theoretically expected.
Comment: 68 pages, 16 figures.
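As a hedged illustration of what such a freeness criterion can look like (a standard free-probability quantity, not necessarily the exact contrast used in FCA): for a self-adjoint random matrix variable a with phi(a) = 0, where phi denotes the normalized trace, the fourth free cumulant, or free kurtosis, is

\[
  \kappa_4(a) \;=\; \varphi(a^4) \;-\; 2\,\varphi(a^2)^2 ,
\]

which vanishes for semicircular variables, so extremizing it over rotations of whitened mixtures plays the role that ordinary kurtosis plays in classical ICA.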
When Gaussian Process Meets Big Data: A Review of Scalable GPs
The vast quantity of information brought by big data, together with evolving
computer hardware, has driven success stories in the machine learning
community. Meanwhile, it poses challenges for Gaussian process (GP) regression,
a well-known non-parametric and interpretable Bayesian model, which suffers
from cubic complexity in the data size. To improve the scalability while retaining
desirable prediction quality, a variety of scalable GPs have been presented.
But they have not yet been comprehensively reviewed and analyzed in order to be
well understood by both academia and industry. The review of scalable GPs in
the GP community is timely and important due to the explosion of data size. To
this end, this paper is devoted to reviewing state-of-the-art scalable GPs
involving two main categories: global approximations which distill the
entire data and local approximations which divide the data for subspace
learning. Particularly, for global approximations, we mainly focus on sparse
approximations comprising prior approximations which modify the prior but
perform exact inference, posterior approximations which retain exact prior but
perform approximate inference, and structured sparse approximations which
exploit specific structures in the kernel matrix; for local approximations, we
highlight the mixture/product of experts that conducts model averaging from
multiple local experts to boost predictions. To present a complete review,
recent advances for improving the scalability and capability of scalable GPs
are reviewed. Finally, the extensions and open issues regarding the
implementation of scalable GPs in various scenarios are reviewed and discussed
to inspire novel ideas for future research avenues.
Comment: 20 pages, 6 figures.
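As a representative example of the sparse (prior) approximations surveyed here, written generically rather than quoted from the review: with m << n inducing points, the n x n kernel matrix is replaced by a low-rank Nystrom-type surrogate,

\[
  \mathbf{K}_{nn} \;\approx\; \mathbf{Q}_{nn}
  \;=\; \mathbf{K}_{nm}\,\mathbf{K}_{mm}^{-1}\,\mathbf{K}_{mn},
\]

which brings training cost down from O(n^3) to O(nm^2) and per-test-point prediction to O(m^2), at the price of an approximation error controlled by the number and placement of the inducing points.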
Out-of-Sample Extension for Dimensionality Reduction of Noisy Time Series
This paper proposes an out-of-sample extension framework for a global
manifold learning algorithm (Isomap) that uses temporal information in
out-of-sample points in order to make the embedding more robust to noise and
artifacts. Given a set of noise-free training data and its embedding, the
proposed framework extends the embedding for a noisy time series. This is
achieved by adding a spatio-temporal compactness term to the optimization
objective of the embedding. To the best of our knowledge, this is the first
method for out-of-sample extension of manifold embeddings that leverages timing
information available for the extension set. Experimental results demonstrate
that our out-of-sample extension algorithm renders a more robust and accurate
embedding of sequentially ordered image data in the presence of various noise
and artifacts when compared to other timing-aware embeddings. Additionally, we
show that an out-of-sample extension framework based on the proposed algorithm
outperforms the state of the art in eye-gaze estimation.
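The abstract does not spell out the objective, but a plausible illustrative form of the spatio-temporal compactness idea (our assumption, purely for intuition) is to augment the usual per-frame out-of-sample embedding error with a penalty tying consecutive embedded frames together:

\[
  \min_{\mathbf{y}_1,\dots,\mathbf{y}_T}\;
  \sum_{t=1}^{T} E_{\mathrm{oos}}(\mathbf{y}_t)
  \;+\; \lambda \sum_{t=2}^{T} \big\lVert \mathbf{y}_t - \mathbf{y}_{t-1} \big\rVert^2 ,
\]

where E_oos is the ordinary out-of-sample reconstruction error for frame t and lambda controls how strongly temporal smoothness of the time series is enforced.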
On w-mixtures: Finite convex combinations of prescribed component distributions
We consider the space of w-mixtures which is defined as the set of finite
statistical mixtures sharing the same prescribed component distributions,
closed under convex combinations. The information geometry induced by the
Bregman generator set to the Shannon negentropy on this space yields a dually
flat space called the mixture family manifold. We show how the
Kullback-Leibler (KL) divergence can be recovered from the corresponding
Bregman divergence for the negentropy generator: that is, the KL divergence
between two w-mixtures amounts to a Bregman divergence (BD) induced by the
Shannon negentropy generator. Thus the KL divergence between two Gaussian
Mixture Models (GMMs) sharing the same Gaussian components is equivalent to a
Bregman divergence. This KL-BD equivalence on a mixture family manifold
implies that we can perform optimal KL-averaging aggregation of w-mixtures
without information loss. More generally, we prove that the statistical skew
Jensen-Shannon divergence between w-mixtures is equivalent to a skew Jensen
divergence between their corresponding parameters. Finally, we state several
properties, divergence identities, and inequalities relating to w-mixtures.
Comment: 31 pages; extends a preliminary paper (ICASSP 2018).
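In symbols (our paraphrase of the construction above, with the k prescribed component densities written p_1, ..., p_k), the mixture family and its negentropy generator are

\[
  m_{\theta}(x) \;=\; \sum_{i=1}^{k-1} \theta_i\, p_i(x)
  \;+\; \Big(1-\sum_{i=1}^{k-1}\theta_i\Big)\, p_k(x),
  \qquad
  F(\theta) \;=\; -h(m_{\theta}) \;=\; \int m_{\theta}(x)\,\log m_{\theta}(x)\,\mathrm{d}x,
\]

and the stated KL-BD equivalence reads

\[
  \mathrm{KL}\!\big(m_{\theta} : m_{\theta'}\big)
  \;=\; B_{F}(\theta : \theta')
  \;=\; F(\theta) - F(\theta') - \big\langle \theta - \theta',\, \nabla F(\theta') \big\rangle .
\]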
Riemannian Gaussian Distributions on the Space of Symmetric Positive Definite Matrices
Data which lie in the space P_m of symmetric positive definite matrices
(sometimes called tensor data) play a fundamental role in applications
including medical imaging, computer vision, and radar signal processing. An
open challenge, for these applications, is to find a class of probability
distributions which is able to capture the statistical properties of data in
P_m as they arise in real-world situations. The present paper meets this
challenge by introducing Riemannian Gaussian distributions on P_m.
Distributions of this kind were first considered by Pennec. However, the
present paper gives an exact expression of their probability density function
for the first time in the existing literature. This leads to two original
contributions. First, a detailed study of statistical inference for Riemannian
Gaussian distributions, uncovering the connection between maximum likelihood
estimation and the concept of Riemannian centre of mass, widely used in
applications. Second, the derivation and implementation of an
expectation-maximisation algorithm for the estimation of mixtures of
Riemannian Gaussian distributions. The paper applies this new algorithm to the
classification of data in P_m (concretely, to the problem of texture
classification in computer vision), showing that it yields significantly
better performance in comparison to recent approaches.
Comment: 21 pages, 1 table; accepted for publication in IEEE Trans Inf Theory.
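For concreteness, a sketch of the model as it is usually written (the paper's contribution includes the exact evaluation of the density, which we do not reproduce here): a Riemannian Gaussian on P_m has density

\[
  p(X \mid \bar{X}, \sigma) \;=\; \frac{1}{Z(\sigma)}
  \exp\!\left( -\,\frac{d^2(X, \bar{X})}{2\sigma^2} \right),
  \qquad
  d(X, \bar{X}) \;=\; \big\lVert \log\!\big(\bar{X}^{-1/2}\, X\, \bar{X}^{-1/2}\big) \big\rVert_{F},
\]

where d is the affine-invariant Riemannian distance on P_m, the centre parameter plays the role of the Riemannian centre of mass, sigma is a dispersion parameter, and Z(sigma) is the normalizing factor.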