Stochastic model for the vocabulary growth in natural languages
We propose a stochastic model for the number of different words in a given
database which incorporates the dependence on the database size and historical
changes. The main feature of our model is the existence of two different
classes of words: (i) a finite number of core words, which have higher
frequency and do not affect the probability that a new word will be used; and
(ii) the remaining, virtually infinite number of noncore words, which have
lower frequency and which, once used, reduce the probability that a new word
will be used in the future.
Our model relies on a careful analysis of the Google Ngram database of books
published in the last centuries and its main consequence is the generalization
of Zipf's and Heaps' law to two scaling regimes. We confirm that these
generalizations yield the best simple description of the data among generic
descriptive models and that the two free parameters depend only on the language
but not on the database. From the point of view of our model, the main change on
historical time scales is the composition of the specific words included in the
finite list of core words, which we observe to decay exponentially in time at a
rate of approximately 30 words per year for English.
Comment: corrected typos and errors in reference list; 10 pages of text, 15 pages of supplemental material; to appear in Physical Review
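To make the mechanism concrete, here is a minimal toy simulation in Python (our own sketch, not the paper's exact model; the pool size, core-word probability, and the Pitman-Yor-style parameters theta and alpha are all assumptions). A fixed finite pool plays the role of the core words, while noncore words follow a reuse rule in which every noncore token used lowers the probability that the next word is new, producing Heaps-like sublinear vocabulary growth:

    import random

    def simulate_vocabulary(n_tokens, n_core=1000, p_core=0.4, theta=10.0, alpha=0.6):
        seen_core = set()
        noncore_counts = []      # frequency of each noncore type seen so far
        n_noncore = 0            # total number of noncore tokens so far
        growth = []              # vocabulary size V after each token
        for _ in range(n_tokens):
            if random.random() < p_core:
                # core word from a fixed finite pool; does not affect innovation
                seen_core.add(random.randrange(n_core))
            else:
                v = len(noncore_counts)
                # probability that the next noncore word is brand new decreases
                # as noncore tokens accumulate (Pitman-Yor-style rule, assumed)
                p_new = (theta + alpha * v) / (theta + n_noncore)
                if random.random() < p_new:
                    noncore_counts.append(1)   # a brand-new word type
                else:
                    # reuse an existing type, preferentially by past frequency
                    i = random.choices(range(v),
                                       weights=[c - alpha for c in noncore_counts])[0]
                    noncore_counts[i] += 1
                n_noncore += 1
            growth.append(len(seen_core) + len(noncore_counts))
        return growth

    growth = simulate_vocabulary(100_000)
    print(growth[999], growth[9_999], growth[99_999])   # V(N) at N = 1e3, 1e4, 1e5

With these (arbitrary) parameters, early growth is dominated by the finite core pool and late growth by the sublinear noncore regime, mimicking the two scaling regimes described above.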
Scaling laws and fluctuations in the statistics of word frequencies
In this paper we combine statistical analysis of large text databases and
simple stochastic models to explain the appearance of scaling laws in the
statistics of word frequencies. Besides the sublinear scaling of the vocabulary
size with database size (Heaps' law), here we report a new scaling of the
fluctuations around this average (fluctuation scaling analysis). We explain
both scaling laws by modeling the usage of words by simple stochastic processes
in which the overall distribution of word-frequencies is fat tailed (Zipf's
law) and the frequency of a single word is subject to fluctuations across
documents (as in topic models). In this framework, the mean and the variance of
the vocabulary size can be expressed as quenched averages, implying that: i)
the inhomogeneous dissemination of words causes a reduction of the average
vocabulary size in comparison to the homogeneous case, and ii) correlations in
the co-occurrence of words lead to an increase in the variance and the
vocabulary size becomes a non-self-averaging quantity. We address the
implications of these observations for the measurement of lexical richness. We
test our results in three large text databases (Google Ngram, English
Wikipedia, and a collection of scientific articles).
Comment: 19 pages, 4 figures
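The reduction of the mean vocabulary in the inhomogeneous case, point (i), can be illustrated in a few lines of numpy (a sketch under assumed ingredients: Zipf exponent 1, Poisson word usage, gamma-distributed topic modulation; the paper's actual models may differ):

    import numpy as np

    rng = np.random.default_rng(0)

    W = 50_000                        # number of word types (assumed)
    p = 1.0 / np.arange(1, W + 1)     # Zipf's law with exponent 1 (assumed)
    p /= p.sum()
    N = 100_000                       # database size in tokens

    # Homogeneous case: each word used as a Poisson process with its global
    # frequency; the expected vocabulary is the quenched average
    #   E[V] = sum_w (1 - exp(-N * p_w)).
    V_hom = np.sum(1.0 - np.exp(-N * p))

    # Inhomogeneous ("topical") case: each word's rate is modulated across
    # realizations by a random factor x with E[x] = 1 (gamma here; shape k
    # assumed, small k = strong burstiness).
    k = 0.3
    x = rng.gamma(shape=k, scale=1.0 / k, size=(100, W))
    V_inh = np.sum(1.0 - np.exp(-N * p * x), axis=1)

    print(f"homogeneous   E[V] ~ {V_hom:9.0f}")
    print(f"inhomogeneous E[V] ~ {V_inh.mean():9.0f}  (std {V_inh.std():.0f})")

Because 1 - exp(-N p x) is concave in x, Jensen's inequality guarantees that the inhomogeneous mean lies below the homogeneous one, while the realization-to-realization spread of V illustrates point (ii).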
Dynamics and symmetries of a field partitioned by an accelerated frame
The canonical evolution and symmetry generators are exhibited for a
Klein-Gordon (K-G) system which has been partitioned by an accelerated
coordinate frame into a pair of subsystems. This partitioning of the K-G system
is conveyed to the canonical generators by the eigenfunction property of the
Minkowski Bessel (M-B) modes. In terms of the M-B degrees of freedom, which are
unitarily related to those of the Minkowski plane waves, a near-complete
diagonalization of these generators can be realized.
Comment: 14 pages, PlainTeX. Related papers on accelerated frames available at http://www.math.ohio-state.edu/~gerlac
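For orientation, the standard 1+1-dimensional picture in our notation (textbook relations, not necessarily the paper's conventions): the right Rindler wedge is charted by

    t = \xi \sinh\tau, \qquad z = \xi \cosh\tau \qquad (\xi > 0),

and the K-G modes that diagonalize the boost generator \partial_\tau (the M-B modes) reduce in the wedge to

    \phi_\omega(\tau,\xi) \propto K_{i\omega}(\kappa\xi)\, e^{-i\omega\tau}, \qquad \kappa^2 = m^2 + k_\perp^2 .

Schematically, they are built from plane waves labeled by rapidity \theta, with (\omega_k, k) = \kappa(\cosh\theta, \sinh\theta), via a Fourier transform in \theta; since the Fourier transform is unitary, the M-B basis is unitarily equivalent to the plane-wave basis, which is the property the canonical generators inherit.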
A network approach to topic models
One of the main computational and scientific challenges in the modern age is
to extract useful information from unstructured texts. Topic models are one
popular machine-learning approach which infers the latent topical structure of
a collection of documents. Despite their success, in particular that of the
most widely used variant, Latent Dirichlet Allocation (LDA), and their numerous
applications in sociology, history, and linguistics, topic models are known to
suffer from severe conceptual and practical problems, e.g. a lack of
justification for the Bayesian priors, discrepancies with statistical
properties of real texts, and the inability to properly choose the number of
topics. Here we obtain a fresh view on the problem of identifying topical
structures by relating it to the problem of finding communities in complex
networks. This is achieved by representing text corpora as bipartite networks
of documents and words. By adapting existing community-detection methods --
using a stochastic block model (SBM) with non-parametric priors -- we obtain a
more versatile and principled framework for topic modeling (e.g., it
automatically detects the number of topics and hierarchically clusters both the
words and documents). The analysis of artificial and real corpora demonstrates
that our SBM approach leads to better topic models than LDA in terms of
statistical model selection. More importantly, our work shows how to formally
relate methods from community detection and topic modeling, opening the
possibility of cross-fertilization between these two fields.
Comment: 22 pages, 10 figures; code available at https://topsbm.github.io
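The corpus-to-network mapping itself is simple; the following Python sketch (networkx, with a toy three-document corpus of our own invention) builds the weighted bipartite document-word graph that replaces the usual document-term matrix. Fitting the nested SBM on this graph is done with graph-tool in the authors' released code (linked above), which this sketch does not reproduce:

    import networkx as nx

    # Toy corpus; real corpora would be tokenized and filtered properly.
    docs = {
        "doc1": "network community detection structure".split(),
        "doc2": "topic model text corpus".split(),
        "doc3": "community structure topic text".split(),
    }

    g = nx.Graph()
    for doc, words in docs.items():
        g.add_node(doc, kind="document")     # one node per document
        for w in words:
            g.add_node(w, kind="word")       # one node per word type
            # edge weight = number of occurrences of word w in doc
            if g.has_edge(doc, w):
                g[doc][w]["weight"] += 1
            else:
                g.add_edge(doc, w, weight=1)

    # Communities found on the word side of this bipartite graph play the
    # role of topics; those on the document side cluster the documents.
    print(g)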
Using text analysis to quantify the similarity and evolution of scientific disciplines
We use an information-theoretic measure of linguistic similarity to
investigate the organization and evolution of scientific fields. An analysis of
almost 20M papers from the past three decades reveals that linguistic
similarity is related to, but different from, expert- and citation-based
classifications, leading to an improved view of the organization of science. A
temporal analysis of the similarity of fields shows that some fields (e.g.,
computer science) are becoming increasingly central, but that on average the
similarity between pairs has not changed in the last decades. This suggests
that tendencies of convergence (e.g., multi-disciplinarity) and divergence
(e.g., specialization) of disciplines are in balance.
Comment: 9 pages, 4 figures
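As an illustration of this family of measures, the sketch below computes the Jensen-Shannon divergence between the word-frequency distributions of two toy "fields" (the counts are invented; the paper uses an information-theoretic similarity of this kind, not necessarily plain JSD):

    import numpy as np
    from collections import Counter

    def jsd(p, q):
        # Jensen-Shannon divergence between two discrete distributions:
        # 0 for identical vocabularies, log(2) for fully disjoint ones
        # (natural-log units).
        p, q = np.asarray(p, float), np.asarray(q, float)
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        def kl(a, b):
            mask = a > 0
            return np.sum(a[mask] * np.log(a[mask] / b[mask]))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # Invented word counts over a shared vocabulary, one Counter per field.
    vocab = ["model", "quantum", "network", "data", "field"]
    field_A = Counter({"model": 40, "network": 30, "data": 25, "field": 5})
    field_B = Counter({"quantum": 50, "field": 35, "model": 10, "data": 5})
    pA = [field_A[w] for w in vocab]
    pB = [field_B[w] for w in vocab]
    print(f"JSD(A, B) = {jsd(pA, pB):.3f} nats")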
Extracting information from S-curves of language change
It is well accepted that the adoption of innovations is described by S-curves
(a slow start, an accelerating period, and a slow end). In this paper, we analyze how
much information on the dynamics of innovation spreading can be obtained from a
quantitative description of S-curves. We focus on the adoption of linguistic
innovations for which detailed databases of written texts from the last 200
years allow for unprecedented statistical precision. Combining data analysis
with simulations of simple models (e.g., the Bass dynamics on complex networks)
we identify signatures of endogenous and exogenous factors in the S-curves of
adoption. We propose a measure to quantify the strength of these factors and
three different methods to estimate it from S-curves. We obtain cases in which
the exogenous factors are dominant (in the adoption of German orthographic
reforms and of one irregular verb) and cases in which endogenous factors are
dominant (in the adoption of conventions for romanization of Russian names and
in the regularization of most of the studied verbs). These results show that
the shape of the S-curve is not universal and contains information on the
adoption mechanism. (Published in J. R. Soc. Interface, vol. 11, no. 101,
(2014) 1044; DOI: http://dx.doi.org/10.1098/rsif.2014.1044)
Comment: 9 pages, 5 figures; Supplementary Material is available at http://dx.doi.org/10.6084/m9.figshare.122178
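A minimal version of the Bass dynamics mentioned above, in its well-mixed (mean-field) form, already shows the two limiting shapes (a Python sketch with assumed rates; the paper also runs these dynamics on complex networks, which this sketch ignores):

    def bass_curve(p, q, n_steps=400, dt=0.05, x0=1e-3):
        # Mean-field Bass dynamics for the adopter fraction x(t):
        #     dx/dt = (p + q*x) * (1 - x)
        # p: exogenous (external) pressure; q: endogenous (peer) influence.
        # Simple Euler integration.
        xs = [x0]
        for _ in range(n_steps - 1):
            x = xs[-1]
            xs.append(x + dt * (p + q * x) * (1 - x))
        return xs

    exogenous  = bass_curve(p=0.05, q=0.0)   # externally driven adoption
    endogenous = bass_curve(p=0.0,  q=1.0)   # peer-driven adoption
    print(f"x(t=10): exogenous {exogenous[200]:.2f}, endogenous {endogenous[200]:.2f}")

A purely exogenous curve (q = 0) is a saturating exponential with no slow start, while a purely endogenous one (p near 0) is the symmetric logistic; real S-curves sit in between, and their departure from these limits is what the proposed measure quantifies.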
High order three part split symplectic integrators: Efficient techniques for the long time simulation of the disordered discrete nonlinear Schroedinger equation
While symplectic integration methods based on operator splitting are well
established in many branches of science, high order methods for Hamiltonian
systems that split in more than two parts have not been studied in great
detail. Here, we present several high order symplectic integrators for
Hamiltonian systems that can be split in exactly three integrable parts. We
apply these techniques, as a practical case, for the integration of the
disordered, discrete nonlinear Schroedinger equation (DDNLS) and compare their
efficiencies. Three part split algorithms provide effective means to
numerically study the asymptotic behavior of wave packet spreading in the DDNLS
- a hotly debated subject in the current scientific literature.
Comment: 5 figures; Physics Letters A (accepted)
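The building block of such schemes can be shown on a toy Hamiltonian that, like the DDNLS, splits into three exactly integrable parts (a Python sketch with an assumed H = p^2/2 + q^2/2 + eps*q^4/4, not the DDNLS itself):

    # Each part's flow below is exact, and the palindromic composition
    #     S2(t) = e^{t/2 LA} e^{t/2 LB} e^{t LC} e^{t/2 LB} e^{t/2 LA}
    # is symplectic and second order; high-order schemes arise by composing
    # such a kernel with Yoshida/Suzuki-type coefficient sequences.
    eps = 0.5

    def flow_A(q, p, t):          # drift under A = p^2/2
        return q + t * p, p

    def flow_B(q, p, t):          # kick under B = q^2/2
        return q, p - t * q

    def flow_C(q, p, t):          # kick under C = eps*q^4/4
        return q, p - t * eps * q**3

    def step_S2(q, p, t):
        q, p = flow_A(q, p, t / 2)
        q, p = flow_B(q, p, t / 2)
        q, p = flow_C(q, p, t)
        q, p = flow_B(q, p, t / 2)
        q, p = flow_A(q, p, t / 2)
        return q, p

    def energy(q, p):
        return 0.5 * p * p + 0.5 * q * q + 0.25 * eps * q**4

    q, p = 1.0, 0.0
    E0 = energy(q, p)
    for _ in range(100_000):
        q, p = step_S2(q, p, 0.01)
    print(f"relative energy error after 1e5 steps: {abs(energy(q, p) - E0) / E0:.1e}")

Because the composition is symplectic, the energy error stays bounded over long integrations instead of drifting, which is the property that makes such splittings attractive for the long-time wave-packet spreading studies described above.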
Inappropriateness of the Rindler quantization
It is argued that the Rindler quantization is not a correct approach to study
the effects of acceleration on quantum fields. First, the "particle"-detector
approach based on the Minkowski quantization is not equivalent to the approach
based on the Rindler quantization. Second, the event horizon, which plays the
essential role in the Rindler quantization, cannot play any physical role for a
local noninertial observer.
Comment: 3 pages; accepted for publication in Mod. Phys. Lett.
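For context, the quantity at stake in this debate (a textbook relation, not a result of this paper) is the Unruh temperature registered by a detector with proper acceleration g,

    k_B T = \frac{\hbar g}{2\pi c},

and the question is whether this thermal response must be interpreted through Rindler-mode particles, or can be derived entirely within Minkowski quantization from the local response of the detector; the abstract above argues for the latter view.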
Coulomb field of an accelerated charge: physical and mathematical aspects
The Maxwell field equations relative to a uniformly accelerated frame, and
the variational principle from which they are obtained, are formulated in terms
of the technique of geometrical gauge invariant potentials. They refer to the
transverse magnetic (TM) and the transeverse electric (TE) modes. This gauge
invariant "2+2" decomposition is used to see how the Coulomb field of a charge,
static in an accelerated frame, has properties that suggest features of
electromagnetism which are different from those in an inertial frame. In
particular, (1) an illustrative calculation shows that the Larmor radiation
reaction equals the electrostatic attraction between the accelerated charge and
the charge induced on the surface whose history is the event horizon, and (2) a
spectral decomposition of the Coulomb potential in the accelerated frame
suggests the possibility that the distortive effects of this charge on the
Rindler vacuum are akin to those of a charge on a crystal lattice.
Comment: 27 pages, PlainTeX. Related papers available at http://www.math.ohio-state.edu/~gerlac
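Standard background for claim (1), in Gaussian units (textbook relations in our notation, not the paper's derivation): a charge e with constant proper acceleration g radiates the Larmor power and rides at a fixed distance c^2/g from the event horizon of its Rindler frame,

    P = \frac{2}{3}\,\frac{e^2 g^2}{c^3}, \qquad \xi_{\text{charge}} = \frac{c^2}{g},

so the paper's illustrative calculation amounts to balancing the radiation-reaction scale set by P against the electrostatic pull of the surface charge induced on the horizon.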