
    Stochastic model for the vocabulary growth in natural languages

    We propose a stochastic model for the number of different words in a given database, incorporating the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core-words, which have higher frequency and do not affect the probability that a new word will be used; and (ii) the remaining, virtually infinite number of noncore-words, which have lower frequency and which, once used, reduce the probability that a new word will be used in the future. Our model relies on a careful analysis of the Google Ngram database of books published over the last centuries, and its main consequence is the generalization of Zipf's and Heaps' laws to two scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language, not on the database. From the point of view of our model, the main change on historical time scales is in the composition of the specific words included in the finite list of core-words, which we observe to decay exponentially in time at a rate of approximately 30 words per year for English. Comment: corrected typos and errors in reference list; 10 pages text, 15 pages supplemental material; to appear in Physical Review
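    As an illustration of the two scaling regimes (the exponents and the crossover rank b below are generic placeholders, not the fitted values of the paper):

        F(r) \propto \begin{cases} r^{-\gamma_1}, & r \le b \quad \text{(core-words)}, \\ r^{-\gamma_2}, & r > b \quad \text{(noncore-words)}, \end{cases}
        \qquad
        N(M) \propto \begin{cases} M^{\lambda_1}, & M \ll M_b, \\ M^{\lambda_2}, & M \gg M_b, \end{cases}

    where F(r) is the frequency of the word of rank r (generalized Zipf's law), N(M) is the number of distinct words in a database of M tokens (generalized Heaps' law), and M_b is the database size at which the crossover rank b is exhausted.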

    Scaling laws and fluctuations in the statistics of word frequencies

    In this paper we combine statistical analysis of large text databases with simple stochastic models to explain the appearance of scaling laws in the statistics of word frequencies. Besides the sublinear scaling of the vocabulary size with database size (Heaps' law), we report a new scaling of the fluctuations around this average (fluctuation scaling analysis). We explain both scaling laws by modeling the usage of words with simple stochastic processes in which the overall distribution of word frequencies is fat-tailed (Zipf's law) and the frequency of a single word is subject to fluctuations across documents (as in topic models). In this framework, the mean and the variance of the vocabulary size can be expressed as quenched averages, implying that: (i) the inhomogeneous dissemination of words causes a reduction of the average vocabulary size in comparison to the homogeneous case, and (ii) correlations in the co-occurrence of words lead to an increase in the variance, so that the vocabulary size becomes a non-self-averaging quantity. We address the implications of these observations for the measurement of lexical richness. We test our results in three large text databases (Google Ngram, English Wikipedia, and a collection of scientific articles). Comment: 19 pages, 4 figures
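    A minimal numerical sketch (not the authors' analysis): draw word tokens i.i.d. from a Zipf-distributed vocabulary and measure the mean and the fluctuations of the vocabulary size V(M) across realizations. All parameter values are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)
        W = 50_000                       # vocabulary size (illustrative)
        alpha = 1.8                      # Zipf exponent (illustrative)
        p = 1.0 / np.arange(1, W + 1) ** alpha
        p /= p.sum()

        def vocabulary_size(M):
            """Number of distinct words among M tokens drawn i.i.d. from p."""
            tokens = rng.choice(W, size=M, p=p)
            return np.unique(tokens).size

        for M in (10**3, 10**4, 10**5):
            samples = [vocabulary_size(M) for _ in range(20)]
            print(M, np.mean(samples), np.var(samples))

    This homogeneous (i.i.d.) baseline yields comparatively small fluctuations; the point of the paper is that topical correlations across documents make the fluctuations of V(M) in real texts much larger than this null model predicts.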

    Dynamics and symmetries of a field partitioned by an accelerated frame

    The canonical evolution and symmetry generators are exhibited for a Klein-Gordon (K-G) system which has been partitioned by an accelerated coordinate frame into a pair of subsystems. This partitioning of the K-G system is conveyed to the canonical generators by the eigenfunction property of the Minkowski Bessel (M-B) modes. In terms of the M-B degrees of freedom, which are unitarily related to those of the Minkowski plane waves, a near-complete diagonalization of these generators can be realized. Comment: 14 pages, PlainTeX. Related papers on accelerated frames available at http://www.math.ohio-state.edu/~gerlac
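    For orientation, a standard textbook sketch of the partition (in units c = 1; not specific to this paper's conventions): a uniformly accelerated frame covers the right Rindler wedge z > |t| of Minkowski space through

        t = \xi \sinh\tau, \qquad z = \xi \cosh\tau \qquad (\xi > 0),

    where an observer at fixed \xi has proper acceleration 1/\xi and the boost generator \partial_\tau acts as the time-translation generator of the wedge; the mirror wedge z < -|t| supplies the second subsystem of the pair.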

    A network approach to topic models

    One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are a popular machine-learning approach that infers the latent topical structure of a collection of documents. Despite their success, in particular of the most widely used variant, Latent Dirichlet Allocation (LDA), and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with the statistical properties of real texts, and the inability to properly choose the number of topics. Here we obtain a fresh view on the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. This is achieved by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods, using a stochastic block model (SBM) with non-parametric priors, we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and the documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. More importantly, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields. Comment: 22 pages, 10 figures, code available at https://topsbm.github.io
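    A minimal sketch of the bipartite document-word representation, assuming the graph-tool library (the toy corpus and property names are illustrative; the authors' own implementation is linked at https://topsbm.github.io):

        import graph_tool.all as gt

        docs = [["topic", "model", "inference"],
                ["community", "detection", "network"],
                ["network", "model"]]

        g = gt.Graph(directed=False)
        kind = g.new_vertex_property("int")   # 0 = document node, 1 = word node
        word_vertex = {}                      # word -> vertex

        for words in docs:
            d = g.add_vertex()
            kind[d] = 0
            for w in words:
                if w not in word_vertex:
                    v = g.add_vertex()
                    kind[v] = 1
                    word_vertex[w] = v
                g.add_edge(d, word_vertex[w])  # one edge per word occurrence

        # Fit a nested (hierarchical) SBM by minimizing the description length;
        # the number of groups -- i.e. topics -- is inferred, not fixed a priori.
        state = gt.minimize_nested_blockmodel_dl(g)
        state.print_summary()

    In a faithful application one would additionally constrain document and word nodes to separate groups, as the authors do; this sketch only illustrates the network construction and the non-parametric inference step.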

    Using text analysis to quantify the similarity and evolution of scientific disciplines

    We use an information-theoretic measure of linguistic similarity to investigate the organization and evolution of scientific fields. An analysis of almost 20M papers from the past three decades reveals that the linguistic similarity is related to, but different from, expert- and citation-based classifications, leading to an improved view on the organization of science. A temporal analysis of the similarity of fields shows that some fields (e.g., computer science) are becoming increasingly central, but that on average the similarity between pairs of fields has not changed in the last decades. This suggests that tendencies of convergence (e.g., multi-disciplinarity) and divergence (e.g., specialization) of disciplines are in balance. Comment: 9 pages, 4 figures
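    The abstract does not spell out the measure; one standard information-theoretic choice for comparing word-frequency distributions is the Jensen-Shannon divergence, sketched here with toy counts (not the paper's data or exact estimator):

        import numpy as np

        def jensen_shannon_divergence(p, q):
            """JSD (in bits) between two frequency distributions over a shared vocabulary."""
            p = np.asarray(p, dtype=float); p = p / p.sum()
            q = np.asarray(q, dtype=float); q = q / q.sum()
            m = 0.5 * (p + q)
            def kl(a, b):
                mask = a > 0
                return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
            return 0.5 * kl(p, m) + 0.5 * kl(q, m)

        # word counts of two "fields" over a shared vocabulary (toy numbers)
        field_a = [120, 30, 5, 0, 1]
        field_b = [80, 10, 2, 40, 3]
        print(jensen_shannon_divergence(field_a, field_b))  # 0 = identical, 1 = disjoint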

    Extracting information from S-curves of language change

    It is well accepted that the adoption of innovations is described by S-curves (slow start, accelerating period, slow end). In this paper, we analyze how much information on the dynamics of innovation spreading can be obtained from a quantitative description of S-curves. We focus on the adoption of linguistic innovations, for which detailed databases of written texts from the last 200 years allow for an unprecedented statistical precision. Combining data analysis with simulations of simple models (e.g., the Bass dynamics on complex networks), we identify signatures of endogenous and exogenous factors in the S-curves of adoption. We propose a measure to quantify the strength of these factors and three different methods to estimate it from S-curves. We find cases in which the exogenous factors are dominant (the adoption of German orthographic reforms and of one irregular verb) and cases in which endogenous factors are dominant (the adoption of conventions for the romanization of Russian names and the regularization of most studied verbs). These results show that the shape of the S-curve is not universal and contains information on the adoption mechanism. (Published in J. R. Soc. Interface, vol. 11, no. 101 (2014) 1044; DOI: http://dx.doi.org/10.1098/rsif.2014.1044) Comment: 9 pages, 5 figures, Supplementary Material is available at http://dx.doi.org/10.6084/m9.figshare.122178
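    A minimal sketch of the mean-field Bass dynamics the abstract refers to (forward-Euler integration; the paper also studies the dynamics on complex networks, and the parameter values below are illustrative, not the authors' estimates). The exogenous rate p models external pressure (e.g., a reform), while the endogenous rate q models imitation of earlier adopters; their ratio controls the shape of the S-curve:

        import numpy as np

        def bass_curve(p, q, T=100.0, dt=0.1):
            """Fraction of adopters x(t) from dx/dt = (p + q*x) * (1 - x)."""
            steps = int(T / dt)
            x = np.zeros(steps)
            for t in range(1, steps):
                x[t] = x[t-1] + dt * (p + q * x[t-1]) * (1 - x[t-1])
            return x

        exogenous_dominated  = bass_curve(p=0.05, q=0.0)   # saturating, no slow start
        endogenous_dominated = bass_curve(p=1e-3, q=0.5)   # sigmoidal S-curve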

    High order three part split symplectic integrators: Efficient techniques for the long time simulation of the disordered discrete nonlinear Schroedinger equation

    While symplectic integration methods based on operator splitting are well established in many branches of science, high-order methods for Hamiltonian systems that split into more than two parts have not been studied in great detail. Here, we present several high-order symplectic integrators for Hamiltonian systems that can be split into exactly three integrable parts. We apply these techniques, as a practical case, to the integration of the disordered discrete nonlinear Schroedinger equation (DDNLS) and compare their efficiencies. Three-part split algorithms provide effective means to numerically study the asymptotic behavior of wave-packet spreading in the DDNLS, a hotly debated subject in the current scientific literature. Comment: 5 figures, Physics Letters A (accepted)
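    A minimal sketch of three-part splitting on a toy one-degree-of-freedom system (not the DDNLS, and not necessarily the specific compositions studied in the paper): each part H = A + B + C is integrated exactly, a symmetric composition gives second order, and Yoshida's "triple jump" raises it to fourth order.

        import numpy as np

        def phi_A(q, p, t):   # drift:  H_A = p^2 / 2
            return q + t * p, p

        def phi_B(q, p, t):   # kick 1: H_B = q^2 / 2
            return q, p - t * q

        def phi_C(q, p, t):   # kick 2: H_C = q^4 / 4
            return q, p - t * q**3

        def abc2(q, p, tau):
            """Symmetric 2nd-order split: A(t/2) B(t/2) C(t) B(t/2) A(t/2)."""
            q, p = phi_A(q, p, tau / 2)
            q, p = phi_B(q, p, tau / 2)
            q, p = phi_C(q, p, tau)
            q, p = phi_B(q, p, tau / 2)
            q, p = phi_A(q, p, tau / 2)
            return q, p

        def abc4(q, p, tau):
            """4th order via the Yoshida triple-jump composition of abc2."""
            x1 = 1.0 / (2.0 - 2.0 ** (1.0 / 3.0))
            x0 = 1.0 - 2.0 * x1
            for c in (x1, x0, x1):
                q, p = abc2(q, p, c * tau)
            return q, p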

    Inappropriateness of the Rindler quantization

    It is argued that the Rindler quantization is not a correct approach to studying the effects of acceleration on quantum fields. First, the "particle"-detector approach based on the Minkowski quantization is not equivalent to the approach based on the Rindler quantization. Second, the event horizon, which plays the essential role in the Rindler quantization, cannot play any physical role for a local noninertial observer. Comment: 3 pages, accepted for publication in Mod. Phys. Lett.

    Coulomb field of an accelerated charge: physical and mathematical aspects

    The Maxwell field equations relative to a uniformly accelerated frame, and the variational principle from which they are obtained, are formulated in terms of the technique of geometrical gauge-invariant potentials. They refer to the transverse magnetic (TM) and the transverse electric (TE) modes. This gauge-invariant "2+2" decomposition is used to see how the Coulomb field of a charge, static in an accelerated frame, has properties that suggest features of electromagnetism which are different from those in an inertial frame. In particular, (1) an illustrative calculation shows that the Larmor radiation reaction equals the electrostatic attraction between the accelerated charge and the charge induced on the surface whose history is the event horizon, and (2) a spectral decomposition of the Coulomb potential in the accelerated frame suggests the possibility that the distortive effects of this charge on the Rindler vacuum are akin to those of a charge on a crystal lattice. Comment: 27 pages, PlainTeX. Related papers available at http://www.math.ohio-state.edu/~gerlac