Estimating Mixture Entropy with Pairwise Distances
Mixture distributions arise in many parametric and non-parametric settings --
for example, in Gaussian mixture models and in non-parametric estimation. It is
often necessary to compute the entropy of a mixture, but, in most cases, this
quantity has no closed-form expression, making some form of approximation
necessary. We propose a family of estimators based on a pairwise distance
function between mixture components, and show that this estimator class has
many attractive properties. For many distributions of interest, the proposed
estimators are efficient to compute, differentiable in the mixture parameters,
and become exact when the mixture components are clustered. We prove this
family includes lower and upper bounds on the mixture entropy. The Chernoff
α-divergence gives a lower bound when chosen as the distance function,
with the Bhattacharyya distance providing the tightest lower bound for
components that are symmetric and members of a location family. The
Kullback-Leibler divergence gives an upper bound when used as the distance
function. We provide closed-form expressions of these bounds for mixtures of
Gaussians, and discuss their applications to the estimation of mutual
information. Using numerical simulations, we then demonstrate that our bounds
are significantly tighter than well-known existing bounds. This estimator class is
very useful in optimization problems involving maximization/minimization of
entropy and mutual information, such as MaxEnt and rate-distortion problems.
Comment: Corrects several errata in the published version, in particular in
Section V (bounds on mutual information).
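The pairwise estimator the abstract describes can be sketched numerically. This is a hypothetical 1-D Gaussian illustration (not the authors' code), using the form Ĥ_D = H(X|C) − Σ_i w_i ln Σ_j w_j exp(−D(p_i‖p_j)) with the closed-form KL and Bhattacharyya distances between Gaussians:

```python
import numpy as np

def gauss_entropy(s2):
    # differential entropy of N(mu, s2); s2 is a variance
    return 0.5 * np.log(2 * np.pi * np.e * s2)

def kl_gauss(m1, s1, m2, s2):
    # KL(N(m1, s1) || N(m2, s2)); s1, s2 are variances
    return 0.5 * (np.log(s2 / s1) + (s1 + (m1 - m2) ** 2) / s2 - 1.0)

def bhat_gauss(m1, s1, m2, s2):
    # Bhattacharyya distance between two 1-D Gaussians
    return (0.25 * (m1 - m2) ** 2 / (s1 + s2)
            + 0.5 * np.log((s1 + s2) / (2 * np.sqrt(s1 * s2))))

def pairwise_entropy_estimate(w, mu, s2, dist):
    # H_hat = H(X|C) - sum_i w_i ln sum_j w_j exp(-D(p_i || p_j))
    w, mu, s2 = map(np.asarray, (w, mu, s2))
    n = len(w)
    h_cond = np.sum(w * gauss_entropy(s2))
    D = np.array([[dist(mu[i], s2[i], mu[j], s2[j]) for j in range(n)]
                  for i in range(n)])
    return h_cond - np.sum(w * np.log(np.exp(-D) @ w))

# two well-separated equal-weight components: both bounds approach
# H(C) + H(X|C) = ln 2 + 0.5 ln(2 pi e)
w, mu, s2 = [0.5, 0.5], [0.0, 10.0], [1.0, 1.0]
upper = pairwise_entropy_estimate(w, mu, s2, kl_gauss)    # KL -> upper bound
lower = pairwise_entropy_estimate(w, mu, s2, bhat_gauss)  # Bhattacharyya -> lower bound
```

Because the estimate is a smooth function of the means, variances, and weights, it stays differentiable in the mixture parameters, which is what makes it usable inside gradient-based MaxEnt or rate-distortion optimization.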
Bayesian model comparison in cosmology with Population Monte Carlo
We use Bayesian model selection techniques to test extensions of the standard
flat LambdaCDM paradigm. Dark-energy and curvature scenarios, and primordial
perturbation models are considered. To that end, we calculate the Bayesian
evidence in favour of each model using Population Monte Carlo (PMC), a new
adaptive sampling technique which was recently applied in a cosmological
context. The Bayesian evidence is immediately available from the PMC sample
used for parameter estimation without further computational effort, and it
comes with an associated error evaluation. Moreover, it provides an unbiased
estimator of the evidence after any fixed number of iterations and is
naturally parallelizable, in contrast with MCMC and nested sampling methods. By
comparison with analytical predictions for simulated data, we show that our
results obtained with PMC are reliable and robust. The variability in the
evidence evaluation and the stability for various cases are estimated both from
simulations and from data. For the cases we consider, the log-evidence is
calculated with a precision of better than 0.08.
Using a combined set of recent CMB, SNIa and BAO data, we find inconclusive
evidence between flat LambdaCDM and simple dark-energy models. A curved
Universe is moderately to strongly disfavoured with respect to a flat
cosmology. Using physically well-motivated priors within the slow-roll
approximation of inflation, we find a weak preference for a running spectral
index. A Harrison-Zel'dovich spectrum is weakly disfavoured. With the current
data, tensor modes are not detected; the large prior volume on the
tensor-to-scalar ratio r results in moderate evidence in favour of r=0.
[Abridged]
Comment: 11 pages, 6 figures. Matches version accepted for publication by MNRAS.
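The evidence-for-free property mentioned above follows from the importance-sampling identity Z = E_q[π(θ)L(θ)/q(θ)]. A minimal one-parameter toy sketch (my own illustration, not the paper's pipeline) with a Gaussian prior, a single Gaussian datum, and a few PMC-style proposal-adaptation steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_prior(x):                  # N(0, 1) prior on the parameter
    return -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)

def log_like(x, d=1.0):            # single datum d with unit noise
    return -0.5 * (d - x) ** 2 - 0.5 * np.log(2 * np.pi)

def log_q(x, m, s2):               # Gaussian proposal density
    return -0.5 * (x - m) ** 2 / s2 - 0.5 * np.log(2 * np.pi * s2)

m, s2 = 0.0, 4.0                   # initial proposal moments
for it in range(3):                # a few PMC adaptation iterations
    x = rng.normal(m, np.sqrt(s2), 20000)
    logw = log_prior(x) + log_like(x) - log_q(x, m, s2)
    Z = np.mean(np.exp(logw))      # unbiased evidence estimate, no extra cost
    wn = np.exp(logw)
    wn /= wn.sum()                 # normalized importance weights
    m = np.sum(wn * x)             # adapt proposal mean ...
    s2 = np.sum(wn * (x - m) ** 2) # ... and variance from the weighted sample

# analytic evidence for this toy: Z = N(d; 0, 2) with d = 1
Z_true = np.exp(-0.25) / np.sqrt(4 * np.pi)
```

The same normalized weights used for parameter estimation yield Z as a simple mean, and each iteration's estimate is unbiased, which is the contrast with MCMC and nested sampling the abstract draws.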
Accelerated Computation of a High Dimensional Kolmogorov-Smirnov Distance
Surrogate modeling and data-model convergence are important in any field utilizing probabilistic modeling, including High Energy Physics and Nuclear Physics. However, demonstrating that the model produces samples from the same underlying distribution as the true source can be problematic if the data is many-dimensional. The multi-dimensional Kolmogorov-Smirnov test (ddKS), like its 1-D counterpart, is a statistically powerful nonparametric test which can be implemented as a one- or two-sample test. We have developed three algorithms, one exact and two approximate, for the multi-dimensional Kolmogorov-Smirnov test proposed by Fasano. We apply ddKS to the comparison of photon distributions in the Belle II time-of-propagation detector using the collaboration's Geant4 simulation and our own neural network surrogate model. Additionally, we have derived an analytic form for the statistical significance of ddKS. Our approximations reduce the input time complexity from quadratic to log-linear (vdKS) and reduce the dimensional time complexity from exponential to linear (rdKS). The approximation methods maintain the statistical power of the exact method, requiring only tens of data points to indicate differences between most sampled distributions.
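The Fasano-style statistic underlying ddKS can be illustrated with a brute-force sketch (my own toy version, not the paper's exact/vdKS/rdKS implementations): each sample point is taken as an origin, and the two samples are compared by the fraction of points falling in each of the 2^d orthants around it.

```python
import numpy as np
from itertools import product

def ddks_distance(a, b):
    # Brute-force d-dimensional two-sample KS statistic a la Fasano:
    # for every sample point taken as an origin, compare the fraction of
    # each sample falling in each of the 2^d orthants around it.
    # Cost is quadratic in sample size and exponential in d, which is
    # exactly what the paper's vdKS/rdKS approximations reduce.
    d = a.shape[1]
    stat = 0.0
    for origin in np.vstack([a, b]):
        for signs in product((1.0, -1.0), repeat=d):
            in_a = np.all((a - origin) * signs >= 0, axis=1).mean()
            in_b = np.all((b - origin) * signs >= 0, axis=1).mean()
            stat = max(stat, abs(in_a - in_b))
    return stat

rng = np.random.default_rng(1)
# identical 2-D sources vs. a mean-shifted source
same = ddks_distance(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = ddks_distance(rng.normal(size=(200, 2)),
                     rng.normal(2.0, 1.0, size=(200, 2)))
```

Even at this small sample size the statistic separates matched from shifted distributions, consistent with the abstract's claim that tens of points suffice to indicate differences.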
Bounds on mutual information of mixture data for classification tasks
The data for many classification problems, such as pattern and speech
recognition, follow mixture distributions. To quantify the optimum performance
for classification tasks, the Shannon mutual information is a natural
information-theoretic metric, as it is directly related to the probability of
error. The mutual information between mixture data and the class label does not
have an analytical expression, nor any efficient computational algorithms. We
introduce a variational upper bound, a lower bound, and three estimators, all
employing pair-wise divergences between mixture components. We compare the new
bounds and estimators with Monte Carlo stochastic sampling and bounds derived
from entropy bounds. Finally, we evaluate the performance of the bounds and
estimators through numerical simulations.
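The Monte Carlo baseline the abstract compares against can be written down directly from I(X;C) = Σ_i w_i E_{p_i}[ln p_i(X) − ln Σ_j w_j p_j(X)]. A 1-D Gaussian-mixture sketch of that stochastic-sampling estimator (my own illustration, not the paper's bounds):

```python
import numpy as np
from scipy.stats import norm

def mc_mutual_information(w, mu, sd, n=100_000, seed=0):
    # Monte Carlo estimate of I(X; C) for a 1-D Gaussian mixture:
    # I(X;C) = sum_i w_i E_{p_i}[ln p_i(X) - ln sum_j w_j p_j(X)]
    rng = np.random.default_rng(seed)
    total = 0.0
    for wi, mi, si in zip(w, mu, sd):
        x = rng.normal(mi, si, int(n * wi))          # samples from class i
        log_pi = norm.logpdf(x, mi, si)              # class-conditional density
        log_mix = np.logaddexp.reduce(               # mixture density, stably
            [np.log(wj) + norm.logpdf(x, mj, sj)
             for wj, mj, sj in zip(w, mu, sd)], axis=0)
        total += wi * np.mean(log_pi - log_mix)
    return total

# two far-apart equiprobable classes: I(X;C) approaches H(C) = ln 2,
# the ceiling set by the class label's entropy
mi = mc_mutual_information([0.5, 0.5], [0.0, 20.0], [1.0, 1.0])
```

Because I(X;C) is capped by H(C), it bounds the error probability via Fano's inequality, which is the sense in which it quantifies optimum classification performance.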
Statistically optimal continuous free energy surfaces from biased simulations and multistate reweighting
Free energies as a function of a selected set of collective variables are
commonly computed in molecular simulation and of significant value in
understanding and engineering molecular behavior. These free energy surfaces
are most commonly estimated using variants of histogramming techniques, but
such approaches obscure two important facets of these functions. First, the
empirical observations along the collective variable are defined by an ensemble
of discrete observations, and the coarsening of these observations into
histogram bins incurs unnecessary loss of information. Second, the free energy
surface is itself almost always a continuous function, and its representation
by a histogram introduces inherent approximations due to the discretization. In
this study, we relate the observed discrete observations from biased
simulations to the inferred underlying continuous probability distribution over
the collective variables and derive histogram-free techniques for estimating
this free energy surface. We reformulate free energy surface estimation as
minimization of a Kullback-Leibler divergence between a continuous trial
function and the discrete empirical distribution and show that this is
equivalent to likelihood maximization of a trial function given a set of
sampled data. We then present a fully Bayesian treatment of this formalism,
which enables the incorporation of powerful Bayesian tools such as the
inclusion of regularizing priors, uncertainty quantification, and model
selection techniques. We demonstrate this new formalism in the analysis of
umbrella sampling simulations for the torsion of a valine sidechain in
the L99A mutant of T4 lysozyme with benzene bound in the cavity.
Comment: 24 pages, 5 figures
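The KL-minimization/likelihood-maximization equivalence at the core of the method can be sketched in a stripped-down toy: a polynomial trial free energy F_c(x) is fit to unbiased samples by maximizing the likelihood of p_c(x) ∝ exp(−F_c(x)), with the normalization done by quadrature. This drops the umbrella biasing and multistate reweighting of the actual paper and is only an illustration of the histogram-free fitting step:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def neg_log_likelihood(c, x, grid):
    # -(1/N) sum_i ln p_c(x_i), with p_c(x) = exp(-F_c(x)) / Z(c);
    # F_c is a polynomial trial free energy (units of kT) and Z(c) is
    # evaluated by quadrature on a fixed grid, kept stable via logsumexp
    dx = grid[1] - grid[0]
    log_z = logsumexp(-np.polyval(c, grid)) + np.log(dx)
    return np.mean(np.polyval(c, x)) + log_z

rng = np.random.default_rng(2)
x = rng.normal(0.0, 0.7, 5000)       # synthetic samples from F(x) = x^2 / (2 * 0.7^2)
grid = np.linspace(-4.0, 4.0, 801)

res = minimize(neg_log_likelihood, x0=np.zeros(4), args=(x, grid))
c3, c2, c1, c0 = res.x               # fitted F_c(x) = c3 x^3 + c2 x^2 + c1 x + c0
```

The fitted quadratic coefficient should recover 1/(2·0.7²) ≈ 1.02 while the cubic term stays near zero; in the Bayesian extension described above, this likelihood becomes the data term to which priors on c and model-selection machinery are attached.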