On the power laws of language: word frequency distributions
About eight decades ago, Zipf postulated that the word frequency distribution of languages is a power law, i.e., a straight line on a log-log plot. Over the years, this phenomenon has been documented and studied extensively. For many corpora, however, the empirical distribution barely resembles a power law: when plotted on a log-log scale, the distribution is concave and appears to be composed of two differently sloped straight lines joined by a smooth curve. A simple generative model is proposed to capture this phenomenon. The word frequency distributions produced by this model are shown to match the observations both analytically and empirically. © 2017 Copyright held by the owner/author(s)
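The abstract's central claim, that a pure power law appears as a straight line of slope about -1 on a log-log rank-frequency plot, can be checked numerically. Below is a minimal sketch using synthetic ideal-Zipf frequencies (not real corpus data, and not the paper's generative model):

```python
import math

# Synthetic "ideal Zipf" frequencies: f(r) = C / r for ranks 1..N.
N = 1000
freqs = [1000.0 / r for r in range(1, N + 1)]

# Under Zipf's law, log(freq) vs log(rank) is exactly linear.
xs = [math.log(r) for r in range(1, N + 1)]
ys = [math.log(f) for f in freqs]

# Ordinary least-squares slope of log(freq) on log(rank):
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
print(round(slope, 3))  # -1.0 for a pure power law
```

For a real corpus, the same regression applied to observed rank-frequency data would reveal the concave, two-slope departure from a single power law that the paper models.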
Maximum Fidelity
The most fundamental problem in statistics is the inference of an unknown
probability distribution from a finite number of samples. For a specific
observed data set, answers to the following questions would be desirable: (1)
Estimation: Which candidate distribution provides the best fit to the observed
data?, (2) Goodness-of-fit: How concordant is this distribution with the
observed data?, and (3) Uncertainty: How concordant are other candidate
distributions with the observed data? A simple unified approach for univariate
data that addresses these traditionally distinct statistical notions is
presented, called "maximum fidelity". Maximum fidelity is a strict frequentist
approach that is fundamentally based on model concordance with the observed
data. The fidelity statistic is a general information measure based on the
coordinate-independent cumulative distribution and critical yet previously
neglected symmetry considerations. An approximation for the null distribution
of the fidelity allows its direct conversion to absolute model concordance (p
value). Fidelity maximization allows identification of the most concordant
model distribution, generating a method for parameter estimation, with
neighboring, less concordant distributions providing the "uncertainty" in this
estimate. Maximum fidelity provides an optimal approach for parameter
estimation (superior to maximum likelihood) and a generally optimal approach
for goodness-of-fit assessment of arbitrary models applied to univariate data.
Extensions to binary data, binned data, multidimensional data, and classical
parametric and nonparametric statistical tests are described. Maximum fidelity
provides a philosophically consistent, robust, and seemingly optimal foundation
for statistical inference. All findings are presented in an elementary way to
be immediately accessible to all researchers utilizing statistical analysis. Comment: 66 pages, 32 figures, 7 tables, submitted
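The fidelity statistic itself is defined in the paper; as a minimal illustration of the general idea of measuring model concordance through the cumulative distribution, here is a classical Kolmogorov-Smirnov-style distance (a different, standard statistic, not the paper's fidelity measure) between a sample and a candidate model CDF:

```python
# Illustrative only: a classical CDF-based concordance measure
# (Kolmogorov-Smirnov distance), NOT the paper's fidelity statistic.
def ks_distance(samples, model_cdf):
    """Maximum gap between the empirical CDF and a candidate model CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = model_cdf(x)
        # The ECDF jumps from i/n to (i+1)/n at x; check both sides of the step.
        d = max(d, abs(f - i / n), abs(f - (i + 1) / n))
    return d

uniform_cdf = lambda x: min(max(x, 0.0), 1.0)  # candidate U(0,1) model
samples = [0.1, 0.3, 0.5, 0.7, 0.9]  # evenly spread, so the gap is small
print(ks_distance(samples, uniform_cdf))
```

A smaller distance means the candidate distribution is more concordant with the observed data, the same intuition the abstract describes for fidelity-based estimation and goodness-of-fit.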
Statistical methods in cosmology
The advent of large data sets in cosmology has meant that in the past 10 or 20
years our knowledge and understanding of the Universe has changed not only
quantitatively but also, and most importantly, qualitatively. Cosmologists rely
on data in which a host of useful information is enclosed, but encoded in a
non-trivial way. The challenges in extracting this information must be overcome
to make the most of a large experimental effort. Even after having converged to
a standard cosmological model (the LCDM model) we should keep in mind that this
model is described by 10 or more physical parameters and if we want to study
deviations from it, the number of parameters is even larger. Dealing with such
a high-dimensional parameter space and finding parameter constraints is a
challenge in itself. Cosmologists want to be able to compare and combine
different data sets both for testing for possible disagreements (which could
indicate new physics) and for improving parameter determinations. Finally,
cosmologists in many cases want to find out, before actually doing the
experiment, how much one would be able to learn from it. For all these reasons,
sophisticated statistical techniques are being employed in cosmology, and it
has become crucial to know some statistical background to understand recent
literature in the field. I will introduce some statistical tools that any
cosmologist should know about in order to be able to understand recently
published results from the analysis of cosmological data sets. I will not
present a complete and rigorous introduction to statistics as there are several
good books which are reported in the references. The reader should refer to
those. Comment: 31 pages, 6 figures; notes from the 2nd Trans-Regio Winter School in
Passo del Tonale. To appear in Lecture Notes in Physics, "Lectures on
cosmology: Accelerated expansion of the universe", Feb 201
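To illustrate the kind of parameter-constraint problem the abstract describes, here is a minimal sketch of a grid-based fit of a toy one-parameter model y = a*x with Gaussian errors; the data values and model are invented for illustration and are not cosmological:

```python
# Toy "data" with assumed Gaussian errors (invented, not a real data set).
data_x = [1.0, 2.0, 3.0]
data_y = [2.1, 3.9, 6.2]
sigma = 0.1

def chi2(a):
    """Chi-square misfit of the one-parameter model y = a * x."""
    return sum(((y - a * x) / sigma) ** 2 for x, y in zip(data_x, data_y))

# Brute-force scan of the parameter grid a in [1.5, 2.5), step 0.001.
grid = [i / 1000 for i in range(1500, 2500)]
best = min(grid, key=chi2)
print(best)  # best-fit slope, close to 2
```

Real cosmological analyses face the same minimization over 10 or more parameters, where a brute-force grid becomes infeasible and sampling methods such as MCMC are used instead.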