Stochastic model for the vocabulary growth in natural languages
We propose a stochastic model for the number of different words in a given
database which incorporates the dependence on the database size and historical
changes. The main feature of our model is the existence of two different
classes of words: (i) a finite number of core-words which have higher frequency
and do not affect the probability of a new word to be used; and (ii) the
remaining virtually infinite number of noncore-words which have lower frequency
and once used reduce the probability of a new word to be used in the future.
Our model relies on a careful analysis of the google-ngram database of books
published in the last centuries and its main consequence is the generalization
of Zipf's and Heaps' law to two scaling regimes. We confirm that these
generalizations yield the best simple description of the data among generic
descriptive models and that the two free parameters depend only on the language
but not on the database. From the point of view of our model the main change on
historical time scales is the composition of the specific words included in the
finite list of core-words, which we observe to decay exponentially in time with
a rate of approximately 30 words per year for English.
Comment: corrected typos and errors in reference list; 10 pages text, 15 pages
supplemental material; to appear in Physical Review
The meaning-frequency law in Zipfian optimization models of communication
According to Zipf's meaning-frequency law, words that are more frequent tend
to have more meanings. Here it is shown that a linear dependency between the
frequency of a form and its number of meanings is found in a family of models
of Zipf's law for word frequencies. This is evidence for a weak version of the
meaning-frequency law. Interestingly, that weak law (a) is not an inevitable
property of the assumptions of the family and (b) is found at least in the
narrow regime where those models exhibit Zipf's law for word frequencies.
Log-log Convexity of Type-Token Growth in Zipf's Systems
It is traditionally assumed that Zipf's law implies the power-law growth of
the number of different elements with the total number of elements in a system
- the so-called Heaps' law. We show that a careful definition of Zipf's law
leads to the violation of Heaps' law in random systems, and obtain alternative
growth curves. These curves fulfill universal data collapses that only depend
on the value of Zipf's exponent. We observe that real books behave very
much in the same way as random systems, despite the presence of burstiness in
word occurrence. We advance an explanation for this unexpected correspondence.
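The random-system comparison described in this abstract can be illustrated with a minimal sketch. It assumes independent draws from a truncated Zipf distribution; the vocabulary size, exponent, seed, and the function names `random_zipf_text` and `type_token_curve` are illustrative choices, not the authors' setup.

```python
import random
from itertools import accumulate


def zipf_weights(vocab_size, exponent):
    """Unnormalized Zipf weights r**(-exponent) for ranks 1..vocab_size."""
    return [rank ** -exponent for rank in range(1, vocab_size + 1)]


def random_zipf_text(length, vocab_size=50_000, exponent=1.0, seed=0):
    """Draw `length` independent tokens (identified by rank) from a Zipf law.

    Assumption: tokens are independent, so there is no burstiness.
    """
    rng = random.Random(seed)
    cumulative = list(accumulate(zipf_weights(vocab_size, exponent)))
    ranks = range(1, vocab_size + 1)
    return rng.choices(ranks, cum_weights=cumulative, k=length)


def type_token_curve(tokens):
    """Return the number of distinct types seen after each token."""
    seen = set()
    curve = []
    for token in tokens:
        seen.add(token)
        curve.append(len(seen))
    return curve
```

Plotting `type_token_curve(random_zipf_text(100_000))` against the token index on log-log axes gives a growth curve that can be compared with a straight line, which is what a pure power-law (Heaps-type) growth would produce.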
Innovation and Nested Preferential Growth in Chess Playing Behavior
Complexity develops via the incorporation of innovative properties. Chess is
one of the most complex strategy games, where expert contenders exercise
decision making by imitating old games or introducing innovations. In this
work, we study innovation in chess by analyzing how different move sequences
are played at the population level. It is found that the probability of
exploring a new or innovative move decreases as a power law with the frequency
of the preceding move sequence. Chess players also exploit already known move
sequences according to their frequencies, following a preferential growth
mechanism. Furthermore, innovation in chess exhibits Heaps' law, suggesting
similarities with the process of vocabulary growth. We propose a robust
generative mechanism based on nested Yule-Simon preferential growth processes
that reproduces the empirical observations. These results, supporting the
self-similar nature of innovations in chess, are important in the context of
decision making in a competitive scenario, and extend the scope of relevant
findings recently discovered regarding the emergence of Zipf's law in chess.
Comment: 8 pages, 4 figures, accepted for publication in Europhysics Letters
(EPL)
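As a rough illustration of the building block named in this abstract, the following sketch simulates a single, non-nested Yule-Simon preferential growth process. It is not the nested mechanism the authors propose: the constant innovation probability, the seed, and the function name `simon_process` are assumptions made for the example.

```python
import random


def simon_process(steps, innovation_prob=0.1, seed=0):
    """Simulate a plain (single-level) Yule-Simon process.

    At each step a brand-new type is introduced with probability
    `innovation_prob`; otherwise a uniformly random earlier element is
    copied, so existing types are reused in proportion to their frequency.
    Returns the sequence and the running count of distinct types.
    """
    rng = random.Random(seed)
    sequence = [0]          # start with a single element of type 0
    next_type = 1           # label for the next innovation
    distinct_counts = [1]
    for _ in range(steps - 1):
        if rng.random() < innovation_prob:
            sequence.append(next_type)
            next_type += 1
        else:
            sequence.append(rng.choice(sequence))
        distinct_counts.append(next_type)
    return sequence, distinct_counts
```

Counting how often each type appears in `sequence` (for example with `collections.Counter`) gives the frequency distribution generated by preferential reuse. In the abstract the innovation probability itself depends on the frequency of the preceding move sequence, which this constant-probability sketch does not attempt to reproduce.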
Universal temporal features of rankings in competitive sports and games
Many complex phenomena, from the selection of traits in biological systems to
hierarchy formation in social and economic entities, show signs of competition
and heterogeneous performance in the temporal evolution of their components,
which may eventually lead to stratified structures such as the wealth
distribution worldwide. However, it is still unclear whether the road to
hierarchical complexity is determined by the particularities of each phenomenon,
or if there are universal mechanisms of stratification common to many systems.
Human sports and games, with their (varied but simplified) rules of competition
and measures of performance, serve as an ideal test bed to look for universal
features of hierarchy formation. With this goal in mind, we analyse here the
behaviour of players and team rankings over time for several sports and games.
Even though, for a given time, the distribution of performance ranks varies
across activities, we find statistical regularities in the dynamics of ranks.
Specifically, the rank diversity, a measure of the number of elements occupying
a given rank over a length of time, has the same functional form in sports and
games as in languages, another system where competition is determined by the
use or disuse of grammatical structures. Our results support the notion that
hierarchical phenomena may be driven by the same underlying mechanisms of rank
formation, regardless of the nature of their components. Moreover, such
regularities can in principle be used to predict lifetimes of rank occupancy,
thus increasing our ability to forecast stratification in the presence of
competition.
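The rank diversity mentioned in this abstract can be sketched as follows, assuming the rankings are available as a list of time-ordered snapshots. Dividing by the number of snapshots is a normalization chosen for this example, and the function name `rank_diversity` is hypothetical.

```python
def rank_diversity(rankings):
    """Compute a normalized rank diversity for each rank.

    `rankings` is a list of snapshots, one per time step, each an ordered
    list of items (best first). For rank k the result is the number of
    distinct items that ever occupied rank k, divided by the number of
    snapshots (assumed normalization). Only ranks present in every
    snapshot are evaluated.
    """
    if not rankings:
        return []
    depth = min(len(snapshot) for snapshot in rankings)
    n_snapshots = len(rankings)
    diversity = []
    for k in range(depth):
        occupants = {snapshot[k] for snapshot in rankings}
        diversity.append(len(occupants) / n_snapshots)
    return diversity
```

For three snapshots `["A", "B", "C"]`, `["A", "C", "B"]`, `["A", "B", "C"]`, the top rank is always held by `A`, so its diversity is 1/3, while the second and third ranks each see two different occupants and get 2/3.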
Mapping the Americanization of English in Space and Time
As global political preeminence gradually shifted from the United Kingdom to
the United States, so did the capacity to culturally influence the rest of the
world. In this work, we analyze how the world-wide varieties of written English
are evolving. We study both the spatial and temporal variations of vocabulary
and spelling of English using a large corpus of geolocated tweets and the
Google Books datasets corresponding to books published in the US and the UK.
The advantage of our approach is that we can address both standard written
language (Google Books) and the more colloquial forms of microblogging messages
(Twitter). We find that American English is the dominant form of English
outside the UK and that its influence is felt even within the UK borders.
Finally, we analyze how this trend has evolved over time and the impact that
some cultural events have had in shaping it.
Comment: 16 pages, 6 figures, 2 tables. Published version