Stochastic model for the vocabulary growth in natural languages
We propose a stochastic model for the number of different words in a given
database which incorporates the dependence on the database size and historical
changes. The main feature of our model is the existence of two different
classes of words: (i) a finite number of core words, which have higher
frequency and do not affect the probability that a new word will be used; and
(ii) the remaining, virtually infinite number of noncore words, which have
lower frequency and which, once used, reduce the probability that a new word
will be used in the future.
Our model relies on a careful analysis of the Google Ngram database of books
published in the last centuries and its main consequence is the generalization
of Zipf's and Heaps' law to two scaling regimes. We confirm that these
generalizations yield the best simple description of the data among generic
descriptive models and that the two free parameters depend only on the language
but not on the database. From the point of view of our model, the main change on
historical time scales is the composition of the specific words included in the
finite list of core words, which we observe to decay exponentially in time at a
rate of approximately 30 words per year for English.
Comment: corrected typos and errors in reference list; 10 pages of text, 15 pages of supplemental material; to appear in Physical Review
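To make the mechanism concrete, here is a minimal toy simulation in Python (our own sketch, not the paper's exact model; the pool size, core-word probability, and the Pitman-Yor-style parameters theta and alpha are all assumptions). A fixed finite pool plays the role of the core words, while noncore words follow a reuse rule in which every noncore token used lowers the probability that the next word is new, producing Heaps-like sublinear vocabulary growth:

    import random

    def simulate_vocabulary(n_tokens, n_core=1000, p_core=0.4, theta=10.0, alpha=0.6):
        seen_core = set()
        noncore_counts = []      # frequency of each noncore type seen so far
        n_noncore = 0            # total number of noncore tokens so far
        growth = []              # vocabulary size V after each token
        for _ in range(n_tokens):
            if random.random() < p_core:
                # core word from a fixed finite pool; does not affect innovation
                seen_core.add(random.randrange(n_core))
            else:
                v = len(noncore_counts)
                # probability that the next noncore word is brand new decreases
                # as noncore tokens accumulate (Pitman-Yor-style rule, assumed)
                p_new = (theta + alpha * v) / (theta + n_noncore)
                if random.random() < p_new:
                    noncore_counts.append(1)   # a brand-new word type
                else:
                    # reuse an existing type, preferentially by past frequency
                    i = random.choices(range(v),
                                       weights=[c - alpha for c in noncore_counts])[0]
                    noncore_counts[i] += 1
                n_noncore += 1
            growth.append(len(seen_core) + len(noncore_counts))
        return growth

    growth = simulate_vocabulary(100_000)
    print(growth[999], growth[9_999], growth[99_999])   # V(N) at N = 1e3, 1e4, 1e5

With these (arbitrary) parameters, early growth is dominated by the finite core pool and late growth by the sublinear noncore regime, mimicking the two scaling regimes described above.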
Scaling laws and fluctuations in the statistics of word frequencies
In this paper we combine statistical analysis of large text databases and
simple stochastic models to explain the appearance of scaling laws in the
statistics of word frequencies. Besides the sublinear scaling of the vocabulary
size with database size (Heaps' law), here we report a new scaling of the
fluctuations around this average (fluctuation scaling analysis). We explain
both scaling laws by modeling the usage of words by simple stochastic processes
in which the overall distribution of word-frequencies is fat tailed (Zipf's
law) and the frequency of a single word is subject to fluctuations across
documents (as in topic models). In this framework, the mean and the variance of
the vocabulary size can be expressed as quenched averages, implying that: i)
the inhomogeneous dissemination of words causes a reduction of the average
vocabulary size in comparison to the homogeneous case, and ii) correlations in
the co-occurrence of words lead to an increase in the variance and the
vocabulary size becomes a non-self-averaging quantity. We address the
implications of these observations for the measurement of lexical richness. We
test our results in three large text databases (Google Ngram, English
Wikipedia, and a collection of scientific articles).
Comment: 19 pages, 4 figures
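The reduction of the mean vocabulary in the inhomogeneous case, point (i), can be illustrated in a few lines of numpy (a sketch under assumed ingredients: Zipf exponent 1, Poisson word usage, gamma-distributed topic modulation; the paper's actual models may differ):

    import numpy as np

    rng = np.random.default_rng(0)

    W = 50_000                        # number of word types (assumed)
    p = 1.0 / np.arange(1, W + 1)     # Zipf's law with exponent 1 (assumed)
    p /= p.sum()
    N = 100_000                       # database size in tokens

    # Homogeneous case: each word used as a Poisson process with its global
    # frequency; the expected vocabulary is the quenched average
    #   E[V] = sum_w (1 - exp(-N * p_w)).
    V_hom = np.sum(1.0 - np.exp(-N * p))

    # Inhomogeneous ("topical") case: each word's rate is modulated across
    # realizations by a random factor x with E[x] = 1 (gamma here; shape k
    # assumed, small k = strong burstiness).
    k = 0.3
    x = rng.gamma(shape=k, scale=1.0 / k, size=(100, W))
    V_inh = np.sum(1.0 - np.exp(-N * p * x), axis=1)

    print(f"homogeneous   E[V] ~ {V_hom:9.0f}")
    print(f"inhomogeneous E[V] ~ {V_inh.mean():9.0f}  (std {V_inh.std():.0f})")

Because 1 - exp(-N p x) is concave in x, Jensen's inequality guarantees that the inhomogeneous mean lies below the homogeneous one, while the realization-to-realization spread of V illustrates point (ii).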
Dynamics and symmetries of a field partitioned by an accelerated frame
The canonical evolution and symmetry generators are exhibited for a
Klein-Gordon (K-G) system which has been partitioned by an accelerated
coordinate frame into a pair of subsystems. This partitioning of the K-G system
is conveyed to the canonical generators by the eigenfunction property of the
Minkowski Bessel (M-B) modes. In terms of the M-B degrees of freedom, which are
unitarily related to those of the Minkowski plane waves, a near-complete
diagonalization of these generators can be realized.
Comment: 14 pages, PlainTeX. Related papers on accelerated frames available at http://www.math.ohio-state.edu/~gerlac
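For orientation, the standard 1+1-dimensional picture in our notation (textbook relations, not necessarily the paper's conventions): the right Rindler wedge is charted by

    t = \xi \sinh\tau, \qquad z = \xi \cosh\tau \qquad (\xi > 0),

and the K-G modes that diagonalize the boost generator \partial_\tau (the M-B modes) reduce in the wedge to

    \phi_\omega(\tau,\xi) \propto K_{i\omega}(\kappa\xi)\, e^{-i\omega\tau}, \qquad \kappa^2 = m^2 + k_\perp^2 .

Schematically, they are built from plane waves labeled by rapidity \theta, with (\omega_k, k) = \kappa(\cosh\theta, \sinh\theta), via a Fourier transform in \theta; since the Fourier transform is unitary, the M-B basis is unitarily equivalent to the plane-wave basis, which is the property the canonical generators inherit.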
A network approach to topic models
One of the main computational and scientific challenges in the modern age is
to extract useful information from unstructured texts. Topic models are one
popular machine-learning approach which infers the latent topical structure of
a collection of documents. Despite their success, in particular that of the
most widely used variant, Latent Dirichlet Allocation (LDA), and their numerous
applications in sociology, history, and linguistics, topic models are known to
suffer from severe conceptual and practical problems, e.g. a lack of
justification for the Bayesian priors, discrepancies with statistical
properties of real texts, and the inability to properly choose the number of
topics. Here we obtain a fresh view on the problem of identifying topical
structures by relating it to the problem of finding communities in complex
networks. This is achieved by representing text corpora as bipartite networks
of documents and words. By adapting existing community-detection methods --
using a stochastic block model (SBM) with non-parametric priors -- we obtain a
more versatile and principled framework for topic modeling (e.g., it
automatically detects the number of topics and hierarchically clusters both the
words and documents). The analysis of artificial and real corpora demonstrates
that our SBM approach leads to better topic models than LDA in terms of
statistical model selection. More importantly, our work shows how to formally
relate methods from community detection and topic modeling, opening the
possibility of cross-fertilization between these two fields.
Comment: 22 pages, 10 figures; code available at https://topsbm.github.io
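The corpus-to-network mapping itself is simple; the following Python sketch (networkx, with a toy three-document corpus of our own invention) builds the weighted bipartite document-word graph that replaces the usual document-term matrix. Fitting the nested SBM on this graph is done with graph-tool in the authors' released code (linked above), which this sketch does not reproduce:

    import networkx as nx

    # Toy corpus; real corpora would be tokenized and filtered properly.
    docs = {
        "doc1": "network community detection structure".split(),
        "doc2": "topic model text corpus".split(),
        "doc3": "community structure topic text".split(),
    }

    g = nx.Graph()
    for doc, words in docs.items():
        g.add_node(doc, kind="document")     # one node per document
        for w in words:
            g.add_node(w, kind="word")       # one node per word type
            # edge weight = number of occurrences of word w in doc
            if g.has_edge(doc, w):
                g[doc][w]["weight"] += 1
            else:
                g.add_edge(doc, w, weight=1)

    # Communities found on the word side of this bipartite graph play the
    # role of topics; those on the document side cluster the documents.
    print(g)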
Using text analysis to quantify the similarity and evolution of scientific disciplines
We use an information-theoretic measure of linguistic similarity to
investigate the organization and evolution of scientific fields. An analysis of
almost 20M papers from the past three decades reveals that linguistic
similarity is related to, but different from, expert- and citation-based
classifications, leading to an improved view of the organization of science. A
temporal analysis of the similarity of fields shows that some fields (e.g.,
computer science) are becoming increasingly central, but that on average the
similarity between pairs has not changed in the last decades. This suggests
that tendencies of convergence (e.g., multi-disciplinarity) and divergence
(e.g., specialization) of disciplines are in balance.
Comment: 9 pages, 4 figures
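As an illustration of this family of measures, the sketch below computes the Jensen-Shannon divergence between the word-frequency distributions of two toy "fields" (the counts are invented; the paper uses an information-theoretic similarity of this kind, not necessarily plain JSD):

    import numpy as np
    from collections import Counter

    def jsd(p, q):
        # Jensen-Shannon divergence between two discrete distributions:
        # 0 for identical vocabularies, log(2) for fully disjoint ones
        # (natural-log units).
        p, q = np.asarray(p, float), np.asarray(q, float)
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        def kl(a, b):
            mask = a > 0
            return np.sum(a[mask] * np.log(a[mask] / b[mask]))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # Invented word counts over a shared vocabulary, one Counter per field.
    vocab = ["model", "quantum", "network", "data", "field"]
    field_A = Counter({"model": 40, "network": 30, "data": 25, "field": 5})
    field_B = Counter({"quantum": 50, "field": 35, "model": 10, "data": 5})
    pA = [field_A[w] for w in vocab]
    pB = [field_B[w] for w in vocab]
    print(f"JSD(A, B) = {jsd(pA, pB):.3f} nats")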
Extracting information from S-curves of language change
It is well accepted that the adoption of innovations is described by S-curves
(a slow start, an accelerating period, and a slow end). In this paper, we analyze how
much information on the dynamics of innovation spreading can be obtained from a
quantitative description of S-curves. We focus on the adoption of linguistic
innovations for which detailed databases of written texts from the last 200
years allow for unprecedented statistical precision. Combining data analysis
with simulations of simple models (e.g., the Bass dynamics on complex networks)
we identify signatures of endogenous and exogenous factors in the S-curves of
adoption. We propose a measure to quantify the strength of these factors and
three different methods to estimate it from S-curves. We obtain cases in which
the exogenous factors are dominant (in the adoption of German orthographic
reforms and of one irregular verb) and cases in which endogenous factors are
dominant (in the adoption of conventions for romanization of Russian names and
in the regularization of most of the studied verbs). These results show that
the shape of the S-curve is not universal and contains information on the
adoption mechanism. (Published in J. R. Soc. Interface, vol. 11, no. 101,
(2014) 1044; DOI: http://dx.doi.org/10.1098/rsif.2014.1044)
Comment: 9 pages, 5 figures; Supplementary Material is available at http://dx.doi.org/10.6084/m9.figshare.122178
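A minimal version of the Bass dynamics mentioned above, in its well-mixed (mean-field) form, already shows the two limiting shapes (a Python sketch with assumed rates; the paper also runs these dynamics on complex networks, which this sketch ignores):

    def bass_curve(p, q, n_steps=400, dt=0.05, x0=1e-3):
        # Mean-field Bass dynamics for the adopter fraction x(t):
        #     dx/dt = (p + q*x) * (1 - x)
        # p: exogenous (external) pressure; q: endogenous (peer) influence.
        # Simple Euler integration.
        xs = [x0]
        for _ in range(n_steps - 1):
            x = xs[-1]
            xs.append(x + dt * (p + q * x) * (1 - x))
        return xs

    exogenous  = bass_curve(p=0.05, q=0.0)   # externally driven adoption
    endogenous = bass_curve(p=0.0,  q=1.0)   # peer-driven adoption
    print(f"x(t=10): exogenous {exogenous[200]:.2f}, endogenous {endogenous[200]:.2f}")

A purely exogenous curve (q = 0) is a saturating exponential with no slow start, while a purely endogenous one (p near 0) is the symmetric logistic; real S-curves sit in between, and their departure from these limits is what the proposed measure quantifies.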
High order three part split symplectic integrators: Efficient techniques for the long time simulation of the disordered discrete nonlinear Schroedinger equation
While symplectic integration methods based on operator splitting are well
established in many branches of science, high order methods for Hamiltonian
systems that split in more than two parts have not been studied in great
detail. Here, we present several high order symplectic integrators for
Hamiltonian systems that can be split in exactly three integrable parts. We
apply these techniques, as a practical case, for the integration of the
disordered, discrete nonlinear Schroedinger equation (DDNLS) and compare their
efficiencies. Three part split algorithms provide effective means to
numerically study the asymptotic behavior of wave packet spreading in the DDNLS
- a hotly debated subject in the current scientific literature.
Comment: 5 figures; Physics Letters A (accepted)
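The building block of such schemes can be shown on a toy Hamiltonian that, like the DDNLS, splits into three exactly integrable parts (a Python sketch with an assumed H = p^2/2 + q^2/2 + eps*q^4/4, not the DDNLS itself):

    # Each part's flow below is exact, and the palindromic composition
    #     S2(t) = e^{t/2 LA} e^{t/2 LB} e^{t LC} e^{t/2 LB} e^{t/2 LA}
    # is symplectic and second order; high-order schemes arise by composing
    # such a kernel with Yoshida/Suzuki-type coefficient sequences.
    eps = 0.5

    def flow_A(q, p, t):          # drift under A = p^2/2
        return q + t * p, p

    def flow_B(q, p, t):          # kick under B = q^2/2
        return q, p - t * q

    def flow_C(q, p, t):          # kick under C = eps*q^4/4
        return q, p - t * eps * q**3

    def step_S2(q, p, t):
        q, p = flow_A(q, p, t / 2)
        q, p = flow_B(q, p, t / 2)
        q, p = flow_C(q, p, t)
        q, p = flow_B(q, p, t / 2)
        q, p = flow_A(q, p, t / 2)
        return q, p

    def energy(q, p):
        return 0.5 * p * p + 0.5 * q * q + 0.25 * eps * q**4

    q, p = 1.0, 0.0
    E0 = energy(q, p)
    for _ in range(100_000):
        q, p = step_S2(q, p, 0.01)
    print(f"relative energy error after 1e5 steps: {abs(energy(q, p) - E0) / E0:.1e}")

Because the composition is symplectic, the energy error stays bounded over long integrations instead of drifting, which is the property that makes such splittings attractive for the long-time wave-packet spreading studies described above.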
Inappropriateness of the Rindler quantization
It is argued that the Rindler quantization is not a correct approach to study
the effects of acceleration on quantum fields. First, the "particle"-detector
approach based on the Minkowski quantization is not equivalent to the approach
based on the Rindler quantization. Second, the event horizon, which plays the
essential role in the Rindler quantization, cannot play any physical role for a
local noninertial observer.
Comment: 3 pages; accepted for publication in Mod. Phys. Lett.
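For context, the quantity at stake in this debate (a textbook relation, not a result of this paper) is the Unruh temperature registered by a detector with proper acceleration g,

    k_B T = \frac{\hbar g}{2\pi c},

and the question is whether this thermal response must be interpreted through Rindler-mode particles, or can be derived entirely within Minkowski quantization from the local response of the detector; the abstract above argues for the latter view.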
Coulomb field of an accelerated charge: physical and mathematical aspects
The Maxwell field equations relative to a uniformly accelerated frame, and
the variational principle from which they are obtained, are formulated in terms
of the technique of geometrical gauge invariant potentials. They refer to the
transverse magnetic (TM) and the transeverse electric (TE) modes. This gauge
invariant "2+2" decomposition is used to see how the Coulomb field of a charge,
static in an accelerated frame, has properties that suggest features of
electromagnetism which are different from those in an inertial frame. In
particular, (1) an illustrative calculation shows that the Larmor radiation
reaction equals the electrostatic attraction between the accelerated charge and
the charge induced on the surface whose history is the event horizon, and (2) a
spectral decomposition of the Coulomb potential in the accelerated frame
suggests the possibility that the distortive effects of this charge on the
Rindler vacuum are akin to those of a charge on a crystal lattice.
Comment: 27 pages, PlainTeX. Related papers available at http://www.math.ohio-state.edu/~gerlac
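Standard background for claim (1), in Gaussian units (textbook relations in our notation, not the paper's derivation): a charge e with constant proper acceleration g radiates the Larmor power and rides at a fixed distance c^2/g from the event horizon of its Rindler frame,

    P = \frac{2}{3}\,\frac{e^2 g^2}{c^3}, \qquad \xi_{\text{charge}} = \frac{c^2}{g},

so the paper's illustrative calculation amounts to balancing the radiation-reaction scale set by P against the electrostatic pull of the surface charge induced on the horizon.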