14,940 research outputs found

    Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death

    Get PDF
    We analyze the dynamic properties of 10^7 words recorded in English, Spanish and Hebrew over the period 1800--2008 in order to gain insight into the coevolution of language and culture. We report language independent patterns useful as benchmarks for theoretical models of language evolution. A significantly decreasing (increasing) trend in the birth (death) rate of words indicates a recent shift in the selection laws governing word use. For new words, we observe a peak in the growth-rate fluctuations around 40 years after introduction, consistent with the typical entry time into standard dictionaries and the human generational timescale. Pronounced changes in the dynamics of language during periods of war shows that word correlations, occurring across time and between words, are largely influenced by coevolutionary social, technological, and political factors. We quantify cultural memory by analyzing the long-term correlations in the use of individual words using detrended fluctuation analysis.Comment: Version 1: 31 pages, 17 figures, 3 tables. Version 2 is streamlined, eliminates substantial material and incorporates referee comments: 19 pages, 14 figures, 3 table

    Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death

    Get PDF
    How often a given word is used, relative to other words, can convey information about the word’s linguistic utility. Using Google word data for 3 languages over the 209-year period 1800–2008, we found by analyzing word use an anomalous recent change in the birth and death rates of words, which indicates a shift towards increased levels of competition between words as a result of new standardization technology. We demonstrate unexpected analogies between the growth dynamics of word use and the growth dynamics of economic institutions. Our results support the intriguing concept that a language’s lexicon is a generic arena for competition which evolves according to selection laws that are related to social, technological, and political trends. Specifically, the aggregate properties of language show pronounced differences during periods of world conflict, e.g. World War II

    Languages cool as they expand: Allometric scaling and the decreasing need for new words

    Get PDF
    We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use which has a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This ‘‘cooling pattern’’ forms the basis of a third statistical regularity, which unlike the Zipf and the Heaps law, is dynamical in nature

    Stochastic model for the vocabulary growth in natural languages

    Full text link
    We propose a stochastic model for the number of different words in a given database which incorporates the dependence on the database size and historical changes. The main feature of our model is the existence of two different classes of words: (i) a finite number of core-words which have higher frequency and do not affect the probability of a new word to be used; and (ii) the remaining virtually infinite number of noncore-words which have lower frequency and once used reduce the probability of a new word to be used in the future. Our model relies on a careful analysis of the google-ngram database of books published in the last centuries and its main consequence is the generalization of Zipf's and Heaps' law to two scaling regimes. We confirm that these generalizations yield the best simple description of the data among generic descriptive models and that the two free parameters depend only on the language but not on the database. From the point of view of our model the main change on historical time scales is the composition of the specific words included in the finite list of core-words, which we observe to decay exponentially in time with a rate of approximately 30 words per year for English.Comment: corrected typos and errors in reference list; 10 pages text, 15 pages supplemental material; to appear in Physical Review

    Rank diversity of languages: Generic behavior in computational linguistics

    Full text link
    Statistical studies of languages have focused on the rank-frequency distribution of words. Instead, we introduce here a measure of how word ranks change in time and call this distribution \emph{rank diversity}. We calculate this diversity for books published in six European languages since 1800, and find that it follows a universal lognormal distribution. Based on the mean and standard deviation associated with the lognormal distribution, we define three different word regimes of languages: "heads" consist of words which almost do not change their rank in time, "bodies" are words of general use, while "tails" are comprised by context-specific words and vary their rank considerably in time. The heads and bodies reflect the size of language cores identified by linguists for basic communication. We propose a Gaussian random walk model which reproduces the rank variation of words in time and thus the diversity. Rank diversity of words can be understood as the result of random variations in rank, where the size of the variation depends on the rank itself. We find that the core size is similar for all languages studied

    Neutral evolution and turnover over centuries of English word popularity

    Get PDF
    Here we test Neutral models against the evolution of English word frequency and vocabulary at the population scale, as recorded in annual word frequencies from three centuries of English language books. Against these data, we test both static and dynamic predictions of two neutral models, including the relation between corpus size and vocabulary size, frequency distributions, and turnover within those frequency distributions. Although a commonly used Neutral model fails to replicate all these emergent properties at once, we find that modified two-stage Neutral model does replicate the static and dynamic properties of the corpus data. This two-stage model is meant to represent a relatively small corpus (population) of English books, analogous to a `canon', sampled by an exponentially increasing corpus of books in the wider population of authors. More broadly, this mode -- a smaller neutral model within a larger neutral model -- could represent more broadly those situations where mass attention is focused on a small subset of the cultural variants.Comment: 12 pages, 5 figures, 1 tabl
    • 

    corecore