Multiresolution analysis in statistical mechanics. I. Using wavelets to calculate thermodynamic properties
The wavelet transform, a family of orthonormal bases, is introduced as a
technique for performing multiresolution analysis in statistical mechanics. The
wavelet transform is a hierarchical technique designed to separate data sets
into sets representing local averages and local differences. Although
one-to-one transformations of data sets are possible, the advantage of the
wavelet transform is as an approximation scheme for the efficient calculation
of thermodynamic and ensemble properties. Even under the most drastic of
approximations, the resulting errors in the values obtained for average
absolute magnetization, free energy, and heat capacity are on the order of 10%,
with a corresponding computational efficiency gain of two orders of magnitude
for a system such as an Ising lattice. In addition, the errors in
the results tend toward zero in the neighborhood of fixed points, as determined
by renormalization group theory.
Comment: 13 pages plus 7 figures (PNG
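The local-average/local-difference hierarchy described above is what a single Haar wavelet step produces. A minimal sketch of one such step and its exact inverse (a generic Haar transform, not the paper's specific coarse-graining scheme; the sample "spin" configuration is illustrative):

```python
import numpy as np

def haar_step(data):
    """One level of the Haar transform: split a data set into
    local averages (coarse level) and local differences (detail)."""
    pairs = data.reshape(-1, 2)
    averages = pairs.mean(axis=1)
    differences = (pairs[:, 0] - pairs[:, 1]) / 2.0
    return averages, differences

def haar_reconstruct(averages, differences):
    """Invert one Haar step exactly (the transform is one-to-one)."""
    out = np.empty(2 * len(averages))
    out[0::2] = averages + differences
    out[1::2] = averages - differences
    return out

# Example: coarse-graining a 1D "spin" configuration.
spins = np.array([1.0, 1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0])
avg, diff = haar_step(spins)
```

Discarding or truncating the `diff` components is the approximation step: the `avg` sequence is a coarse-grained configuration half the size, which is where the computational gain comes from.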
Maximum Fidelity
The most fundamental problem in statistics is the inference of an unknown
probability distribution from a finite number of samples. For a specific
observed data set, answers to the following questions would be desirable: (1)
Estimation: Which candidate distribution provides the best fit to the observed
data?, (2) Goodness-of-fit: How concordant is this distribution with the
observed data?, and (3) Uncertainty: How concordant are other candidate
distributions with the observed data? A simple, unified approach for univariate
data, called "maximum fidelity", that addresses these traditionally distinct
statistical notions is presented. Maximum fidelity is a strict frequentist
approach that is fundamentally based on model concordance with the observed
data. The fidelity statistic is a general information measure based on the
coordinate-independent cumulative distribution and critical yet previously
neglected symmetry considerations. An approximation for the null distribution
of the fidelity allows its direct conversion to absolute model concordance (p
value). Fidelity maximization allows identification of the most concordant
model distribution, generating a method for parameter estimation, with
neighboring, less concordant distributions providing the "uncertainty" in this
estimate. Maximum fidelity provides an optimal approach for parameter
estimation (superior to maximum likelihood) and a generally optimal approach
for goodness-of-fit assessment of arbitrary models applied to univariate data.
Extensions to binary data, binned data, multidimensional data, and classical
parametric and nonparametric statistical tests are described. Maximum fidelity
provides a philosophically consistent, robust, and seemingly optimal foundation
for statistical inference. All findings are presented in an elementary way to
be immediately accessible to all researchers utilizing statistical analysis.
Comment: 66 pages, 32 figures, 7 tables, submitted
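The paper's fidelity statistic is not reproduced here, but its core ingredient (judging model concordance through the cumulative distribution of the observed data) can be illustrated with a toy CDF-based score. Everything below is an illustrative assumption: the data values, the squared-distance form, and the plotting positions i/(n+1); the actual fidelity is a different information measure.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a normal candidate model."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def concordance_score(data, mu, sigma):
    """Toy concordance score: mean squared distance between the sorted
    model-CDF values of the data and the uniform plotting positions
    i/(n+1). Smaller means more concordant. A stand-in for (not a
    reproduction of) the paper's fidelity statistic."""
    n = len(data)
    u = sorted(normal_cdf(x, mu, sigma) for x in data)
    return sum((ui - (i + 1) / (n + 1)) ** 2 for i, ui in enumerate(u)) / n

# Parameter estimation by maximizing concordance over a candidate grid.
data = [-1.2, -0.4, 0.1, 0.5, 1.0]
best_mu = min((concordance_score(data, mu, 1.0), mu) for mu in [-1, 0, 1, 2])[1]
```

Neighboring `mu` values with nearly as good a score play the role of the "uncertainty" in the estimate.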
Efficient Cosmological Parameter Estimation from Microwave Background Anisotropies
We revisit the issue of cosmological parameter estimation in light of current
and upcoming high-precision measurements of the cosmic microwave background
power spectrum. Physical quantities which determine the power spectrum are
reviewed, and their connection to familiar cosmological parameters is
explicated. We present a set of physical parameters, analytic functions of the
usual cosmological parameters, upon which the microwave background power
spectrum depends linearly (or with some other simple dependence) over a wide
range of parameter values. With such a set of parameters, microwave background
power spectra can be estimated with high accuracy and negligible computational
effort, vastly increasing the efficiency of cosmological parameter error
determination. The techniques presented here allow calculation of microwave
background power spectra many times faster than comparably accurate direct
codes (after precomputing a handful of power spectra). We discuss various
issues of parameter estimation, including parameter degeneracies, numerical
precision, mapping between physical and cosmological parameters, and systematic
errors, and illustrate these considerations with an idealized model of the MAP
experiment.
Comment: 22 pages, 12 figures
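The efficiency gain rests on the (near-)linear dependence of the spectrum on the physical parameters: once a handful of spectra are precomputed on a parameter grid, spectra at new parameter values follow by interpolation at negligible cost. A toy sketch with a made-up one-parameter "spectrum" (the functional form, grid, and parameter are illustrative, not the paper's parametrization):

```python
import numpy as np

def precompute_spectrum(theta, ells):
    """Stand-in for an expensive Boltzmann-code call; here the spectrum
    is exactly linear in the physical parameter theta by construction."""
    return (1.0 + 0.3 * theta) * np.exp(-((ells - 200.0) / 150.0) ** 2)

ells = np.arange(2, 1001)
grid = np.array([0.0, 0.5, 1.0])            # precomputed parameter values
table = np.array([precompute_spectrum(t, ells) for t in grid])

def interpolated_spectrum(theta):
    """Linear interpolation between precomputed spectra."""
    w = np.interp(theta, grid, np.arange(len(grid)))  # fractional grid index
    lo = int(np.floor(w))
    frac = w - lo
    hi = min(lo + 1, len(grid) - 1)
    return (1.0 - frac) * table[lo] + frac * table[hi]

approx = interpolated_spectrum(0.25)
exact = precompute_spectrum(0.25, ells)
```

Because the toy spectrum is exactly linear in `theta`, the interpolation is exact here; for real spectra the point of choosing physical parameters is to make the residual nonlinearity small over the range of interest.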
Statistical mechanics of transcription-factor binding site discovery using Hidden Markov Models
Hidden Markov Models (HMMs) are a commonly used tool for inference of
transcription factor (TF) binding sites from DNA sequence data. We exploit the
mathematical equivalence between HMMs for TF binding and the "inverse"
statistical mechanics of hard rods in a one-dimensional disordered potential to
investigate learning in HMMs. We derive analytic expressions for the Fisher
information, a commonly employed measure of confidence in learned parameters,
in the biologically relevant limit where the density of binding sites is low.
We then use techniques from statistical mechanics to derive a scaling principle
relating the specificity (binding energy) of a TF to the minimum amount of
training data necessary to learn it.
Comment: 25 pages, 2 figures, 1 table; v2: typos fixed and new references added
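As a reminder of the machinery involved, here is a minimal two-state HMM (background vs. bound) scored with the forward recursion. All transition and emission probabilities below are made-up illustrative numbers, not the paper's model, which treats binding sites as extended hard rods rather than a single "bound" state:

```python
import math

states = ("background", "bound")
trans = {"background": {"background": 0.95, "bound": 0.05},
         "bound":      {"background": 0.10, "bound": 0.90}}
emit = {"background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "bound":      {"A": 0.40, "C": 0.10, "G": 0.10, "T": 0.40}}  # AT-rich motif
start = {"background": 0.99, "bound": 0.01}

def forward_loglik(seq):
    """Log-likelihood of a DNA sequence via the forward recursion,
    rescaling at each step to avoid numerical underflow."""
    alpha = {s: start[s] * emit[s][seq[0]] for s in states}
    norm = sum(alpha.values())
    logp = math.log(norm)
    alpha = {s: a / norm for s, a in alpha.items()}
    for base in seq[1:]:
        alpha = {s: emit[s][base] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
        norm = sum(alpha.values())
        logp += math.log(norm)
        alpha = {s: a / norm for s, a in alpha.items()}
    return logp
```

Gradients of this log-likelihood with respect to the model parameters are what enter the Fisher information the abstract analyzes.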
Bayesian Cluster Enumeration Criterion for Unsupervised Learning
We derive a new Bayesian Information Criterion (BIC) by formulating the
problem of estimating the number of clusters in an observed data set as
maximization of the posterior probability of the candidate models. Given that
some mild assumptions are satisfied, we provide a general BIC expression for a
broad class of data distributions. This serves as a starting point when
deriving the BIC for specific distributions. Along this line, we provide a
closed-form BIC expression for multivariate Gaussian distributed variables. We
show that incorporating the data structure of the clustering problem into the
derivation of the BIC results in an expression whose penalty term is different
from that of the original BIC. We propose a two-step cluster enumeration
algorithm. First, a model-based unsupervised learning algorithm partitions the
data according to a given set of candidate models. Subsequently, the number of
clusters is determined as the one associated with the model for which the
proposed BIC is maximal. The performance of the proposed two-step algorithm is
tested using synthetic and real data sets.
Comment: 14 pages, 7 figures
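The two-step procedure can be sketched generically: partition the data for each candidate number of clusters (plain k-means below stands in for the model-based learning step), then keep the candidate with the largest BIC. The criterion used here is a standard spherical-Gaussian classification BIC, not the paper's refined expression, and the synthetic data are illustrative:

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def bic(X, labels, centers):
    """Spherical-Gaussian classification BIC for a hard partition."""
    n, d = X.shape
    k = len(centers)
    loglik = 0.0
    for j in range(k):
        pts = X[labels == j]
        nj = len(pts)
        if nj < 2:
            return -np.inf                    # degenerate partition
        var = np.sum((pts - centers[j]) ** 2) / (nj * d)
        if var <= 0:
            return -np.inf
        loglik += nj * np.log(nj / n) \
                  - 0.5 * nj * d * (np.log(2 * np.pi * var) + 1.0)
    n_params = (k - 1) + k * d + k            # weights, means, variances
    return loglik - 0.5 * n_params * np.log(n)

def enumerate_clusters(X, candidates):
    """Step 1: partition for each candidate k; step 2: maximize BIC."""
    fits = {k: kmeans(X, k) for k in candidates}
    return max(candidates, key=lambda k: bic(X, *fits[k]))

# Two well-separated synthetic clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])
best_k = enumerate_clusters(X, [1, 2, 3, 4])
```

The penalty term in a criterion like this is exactly where the paper's derivation differs from the original BIC, by accounting for the clustering structure of the data.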