Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death
We analyze the dynamic properties of 10^7 words recorded in English, Spanish
and Hebrew over the period 1800--2008 in order to gain insight into the
coevolution of language and culture. We report language independent patterns
useful as benchmarks for theoretical models of language evolution. A
significantly decreasing (increasing) trend in the birth (death) rate of words
indicates a recent shift in the selection laws governing word use. For new
words, we observe a peak in the growth-rate fluctuations around 40 years after
introduction, consistent with the typical entry time into standard dictionaries
and the human generational timescale. Pronounced changes in the dynamics of
language during periods of war shows that word correlations, occurring across
time and between words, are largely influenced by coevolutionary social,
technological, and political factors. We quantify cultural memory by analyzing
the long-term correlations in the use of individual words using detrended
fluctuation analysis.
Comment: Version 1: 31 pages, 17 figures, 3 tables. Version 2 is streamlined,
eliminates substantial material and incorporates referee comments: 19 pages,
14 figures, 3 tables
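The detrended fluctuation analysis used to quantify cultural memory can be sketched in a few lines. This is a minimal illustration of the standard DFA procedure, not the authors' exact pipeline; the function name, window sizes, and test data are invented:

```python
import numpy as np

def dfa_exponent(x, scales):
    """Detrended fluctuation analysis (DFA).

    Returns the scaling exponent alpha from F(n) ~ n^alpha, where F(n)
    is the RMS fluctuation of the linearly detrended integrated series
    over non-overlapping windows of size n. alpha ~ 0.5 for uncorrelated
    noise; alpha > 0.5 signals long-term positive correlations.
    """
    y = np.cumsum(x - np.mean(x))                 # integrated profile
    fluct = []
    for n in scales:
        n_win = len(y) // n
        segments = y[: n_win * n].reshape(n_win, n)
        idx = np.arange(n)
        mean_sq = []
        for seg in segments:
            trend = np.polyval(np.polyfit(idx, seg, 1), idx)  # linear detrend
            mean_sq.append(np.mean((seg - trend) ** 2))
        fluct.append(np.sqrt(np.mean(mean_sq)))
    # slope of log F(n) vs log n is the scaling exponent
    return np.polyfit(np.log(scales), np.log(fluct), 1)[0]
```

Applied to an annual time series of a word's relative use, an exponent above 0.5 would indicate the long-term correlations the abstract interprets as cultural memory.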
Statistical Laws Governing Fluctuations in Word Use from Word Birth to Word Death
How often a given word is used, relative to other words, can convey information about the word's linguistic utility. Using Google word data for 3 languages over the 209-year period 1800--2008, we found by analyzing word use an anomalous recent change in the birth and death rates of words, which indicates a shift towards increased levels of competition between words as a result of new standardization technology. We demonstrate unexpected analogies between the growth dynamics of word use and the growth dynamics of economic institutions. Our results support the intriguing concept that a language's lexicon is a generic arena for competition which evolves according to selection laws that are related to social, technological, and political trends. Specifically, the aggregate properties of language show pronounced differences during periods of world conflict, e.g. World War II.
Languages cool as they expand: Allometric scaling and the decreasing need for new words
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This "cooling pattern" forms the basis of a third statistical regularity which, unlike the Zipf and Heaps laws, is dynamical in nature.
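The vocabulary-growth (Heaps) relation tested here, V ~ N^b with b < 1, can be illustrated on a toy corpus drawn from a Zipf-like distribution. This is a hedged sketch with invented names; the paper's actual analysis uses the Google Books corpora:

```python
import numpy as np

def vocabulary_growth(tokens):
    """Vocabulary size V(N) after each of the first N tokens."""
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return np.array(growth)

# Toy corpus: token types drawn from a heavy-tailed (Zipf-like) distribution.
rng = np.random.default_rng(42)
tokens = rng.zipf(1.5, size=20000)

V = vocabulary_growth(tokens)
N = np.arange(1, len(V) + 1)

# Fit the Heaps exponent b in V ~ N^b on the tail of the curve.
b = np.polyfit(np.log(N[1000:]), np.log(V[1000:]), 1)[0]
# b < 1: sublinear growth, i.e. a decreasing marginal need for new words
```

The sublinear exponent is the "decreasing marginal need for new words" in miniature: each additional token is less and less likely to be a new type.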
Stochastic model for the vocabulary growth in natural languages
We propose a stochastic model for the number of different words in a given
database which incorporates the dependence on the database size and historical
changes. The main feature of our model is the existence of two different
classes of words: (i) a finite number of core-words which have higher frequency
and do not affect the probability of a new word to be used; and (ii) the
remaining virtually infinite number of noncore-words which have lower frequency
and once used reduce the probability of a new word to be used in the future.
Our model relies on a careful analysis of the google-ngram database of books
published in the last centuries and its main consequence is the generalization
of Zipf's and Heaps' law to two scaling regimes. We confirm that these
generalizations yield the best simple description of the data among generic
descriptive models and that the two free parameters depend only on the language
but not on the database. From the point of view of our model the main change on
historical time scales is the composition of the specific words included in the
finite list of core-words, which we observe to decay exponentially in time with
a rate of approximately 30 words per year for English.
Comment: corrected typos and errors in reference list; 10 pages text, 15 pages
supplemental material; to appear in Physical Review
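The model's key ingredient, an innovation probability that falls as already-used noncore words accumulate, can be illustrated with a Pitman-Yor-style toy process. This is an illustrative stand-in with invented parameters, not the authors' exact two-class model:

```python
import numpy as np

def pitman_yor_stream(n_tokens, theta=1.0, d=0.6, seed=0):
    """Toy Pitman-Yor token stream.

    The probability of introducing a brand-new word decreases as the
    vocabulary grows, so the vocabulary size grows sublinearly (roughly
    like n^d), a generalized Heaps-type behavior.
    """
    rng = np.random.default_rng(seed)
    counts = []                      # counts[i] = frequency of word i
    vocab_over_time = []
    n = 0
    for _ in range(n_tokens):
        k = len(counts)
        if n == 0 or rng.random() < (theta + d * k) / (theta + n):
            counts.append(1)         # innovate: a brand-new word
        else:
            c = np.array(counts, dtype=float)
            # reuse an existing word, rich-get-richer with discount d
            w = rng.choice(k, p=(c - d) / (n - d * k))
            counts[w] += 1
        n += 1
        vocab_over_time.append(len(counts))
    return counts, vocab_over_time
```

In the paper's model, core words would sit outside this feedback loop; here every word contributes to suppressing innovation, which is the simplification being made.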
Rank diversity of languages: Generic behavior in computational linguistics
Statistical studies of languages have focused on the rank-frequency
distribution of words. Instead, we introduce here a measure of how word ranks
change in time and call this distribution \emph{rank diversity}. We calculate
this diversity for books published in six European languages since 1800, and
find that it follows a universal lognormal distribution. Based on the mean and
standard deviation associated with the lognormal distribution, we define three
different word regimes of languages: "heads" consist of words which almost do
not change their rank in time, "bodies" are words of general use, while "tails"
consist of context-specific words and vary their rank considerably in
time. The heads and bodies reflect the size of language cores identified by
linguists for basic communication. We propose a Gaussian random walk model
which reproduces the rank variation of words in time and thus the diversity.
Rank diversity of words can be understood as the result of random variations in
rank, where the size of the variation depends on the rank itself. We find that
the core size is similar for all languages studied.
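Rank diversity as defined here, the number of distinct words that ever occupy a given rank, normalized by the number of time points, is straightforward to compute. A minimal sketch with invented names and toy data:

```python
def rank_diversity(rank_history):
    """Rank diversity d(r) for a sequence of rankings.

    rank_history[t] lists words ordered by rank at time t; d(r) is the
    number of distinct words ever seen at rank r, divided by the number
    of time points. d(r) is small for stable 'head' words and approaches
    1 for volatile 'tail' words.
    """
    n_times = len(rank_history)
    n_ranks = min(len(ranking) for ranking in rank_history)
    return [len({ranking[r] for ranking in rank_history}) / n_times
            for r in range(n_ranks)]

# Three toy yearly rankings: 'the' is a stable head word, lower ranks churn.
history = [["the", "of", "war"],
           ["the", "war", "of"],
           ["the", "of", "peace"]]
print(rank_diversity(history))   # -> [1/3, 2/3, 1.0]
```

Fitting the resulting d(r) curve against rank is what yields the lognormal shape the abstract reports.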
Neutral evolution and turnover over centuries of English word popularity
Here we test Neutral models against the evolution of English word frequency
and vocabulary at the population scale, as recorded in annual word frequencies
from three centuries of English language books. Against these data, we test
both static and dynamic predictions of two neutral models, including the
relation between corpus size and vocabulary size, frequency distributions, and
turnover within those frequency distributions. Although a commonly used Neutral
model fails to replicate all these emergent properties at once, we find that a
modified two-stage Neutral model does replicate the static and dynamic
properties of the corpus data. This two-stage model is meant to represent a
relatively small corpus (population) of English books, analogous to a `canon',
sampled by an exponentially increasing corpus of books in the wider population
of authors. More broadly, this mode -- a smaller neutral model within a larger
neutral model -- could represent more broadly those situations where mass
attention is focused on a small subset of the cultural variants.Comment: 12 pages, 5 figures, 1 tabl
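The single-stage neutral (random-copying) building block tested here can be sketched in a few lines. This is an illustrative sketch with invented names; the paper's two-stage variant nests one such process inside another:

```python
import numpy as np

def neutral_copying(n_tokens, mu, seed=0):
    """Neutral model of word use: each new token is an innovation (a
    never-before-seen word) with probability mu, otherwise an exact copy
    of a uniformly chosen earlier token. No word has intrinsic fitness,
    so popularity differences arise from drift alone."""
    rng = np.random.default_rng(seed)
    tokens = [0]                     # word ids; start with one seed word
    next_id = 1
    for _ in range(n_tokens - 1):
        if rng.random() < mu:
            tokens.append(next_id)   # innovate
            next_id += 1
        else:
            tokens.append(tokens[rng.integers(len(tokens))])  # copy
    return tokens

tokens = neutral_copying(5000, mu=0.1)
vocab_size = len(set(tokens))        # grows roughly as mu * n_tokens
```

The two-stage variant described in the abstract would run one such process for the small 'canon' and then sample an exponentially growing corpus from it; only the single-stage building block is shown here.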