14,814 research outputs found
Statistical physics, mixtures of distributions, and the EM algorithm
We show that there are strong relationships between approaches to optmization and learning based on statistical physics or mixtures of experts. In particular, the EM algorithm can be interpreted as converging either to a local maximum of the mixtures model or to a saddle point solution to the statistical physics system. An advantage of the statistical physics approach is that it naturally gives rise to a heuristic continuation method, deterministic annealing, for finding good solutions
EMMIXcskew: an R Package for the Fitting of a Mixture of Canonical Fundamental Skew t-Distributions
This paper presents an R package EMMIXcskew for the fitting of the canonical
fundamental skew t-distribution (CFUST) and finite mixtures of this
distribution (FM-CFUST) via maximum likelihood (ML). The CFUST distribution
provides a flexible family of models to handle non-normal data, with parameters
for capturing skewness and heavy-tails in the data. It formally encompasses the
normal, t, and skew-normal distributions as special and/or limiting cases. A
few other versions of the skew t-distributions are also nested within the CFUST
distribution. In this paper, an Expectation-Maximization (EM) algorithm is
described for computing the ML estimates of the parameters of the FM-CFUST
model, and different strategies for initializing the algorithm are discussed
and illustrated. The methodology is implemented in the EMMIXcskew package, and
examples are presented using two real datasets. The EMMIXcskew package contains
functions to fit the FM-CFUST model, including procedures for generating
different initial values. Additional features include random sample generation
and contour visualization in 2D and 3D
Scattering statistics of rock outcrops: Model-data comparisons and Bayesian inference using mixture distributions
The probability density function of the acoustic field amplitude scattered by
the seafloor was measured in a rocky environment off the coast of Norway using
a synthetic aperture sonar system, and is reported here in terms of the
probability of false alarm. Interpretation of the measurements focused on
finding appropriate class of statistical models (single versus two-component
mixture models), and on appropriate models within these two classes. It was
found that two-component mixture models performed better than single models.
The two mixture models that performed the best (and had a basis in the physics
of scattering) were a mixture between two K distributions, and a mixture
between a Rayleigh and generalized Pareto distribution. Bayes' theorem was used
to estimate the probability density function of the mixture model parameters.
It was found that the K-K mixture exhibits significant correlation between its
parameters. The mixture between the Rayleigh and generalized Pareto
distributions also had significant parameter correlation, but also contained
multiple modes. We conclude that the mixture between two K distributions is the
most applicable to this dataset.Comment: 15 pages, 7 figures, Accepted to the Journal of the Acoustical
Society of Americ
Finite mixtures of matrix-variate Poisson-log normal distributions for three-way count data
Three-way data structures, characterized by three entities, the units, the
variables and the occasions, are frequent in biological studies. In RNA
sequencing, three-way data structures are obtained when high-throughput
transcriptome sequencing data are collected for n genes across p conditions at
r occasions. Matrix-variate distributions offer a natural way to model
three-way data and mixtures of matrix-variate distributions can be used to
cluster three-way data. Clustering of gene expression data is carried out as
means to discovering gene co-expression networks. In this work, a mixture of
matrix-variate Poisson-log normal distributions is proposed for clustering read
counts from RNA sequencing. By considering the matrix-variate structure, full
information on the conditions and occasions of the RNA sequencing dataset is
simultaneously considered, and the number of covariance parameters to be
estimated is reduced. A Markov chain Monte Carlo expectation-maximization
algorithm is used for parameter estimation and information criteria are used
for model selection. The models are applied to both real and simulated data,
giving favourable clustering results
Scalable Text and Link Analysis with Mixed-Topic Link Models
Many data sets contain rich information about objects, as well as pairwise
relations between them. For instance, in networks of websites, scientific
papers, and other documents, each node has content consisting of a collection
of words, as well as hyperlinks or citations to other nodes. In order to
perform inference on such data sets, and make predictions and recommendations,
it is useful to have models that are able to capture the processes which
generate the text at each node and the links between them. In this paper, we
combine classic ideas in topic modeling with a variant of the mixed-membership
block model recently developed in the statistical physics community. The
resulting model has the advantage that its parameters, including the mixture of
topics of each document and the resulting overlapping communities, can be
inferred with a simple and scalable expectation-maximization algorithm. We test
our model on three data sets, performing unsupervised topic classification and
link prediction. For both tasks, our model outperforms several existing
state-of-the-art methods, achieving higher accuracy with significantly less
computation, analyzing a data set with 1.3 million words and 44 thousand links
in a few minutes.Comment: 11 pages, 4 figure
- …