Zipf's law, 1/f noise, and fractal hierarchy
Fractals, 1/f noise, Zipf's law, and the occurrence of large catastrophic
events are ubiquitous empirical observations across the individual sciences
that cannot be understood within the frames of reference developed inside the
specific scientific domains. All these observations are associated with
scaling laws and have attracted broad research interest across scientific
circles. However, the inherent relationships between these scaling phenomena
remain open questions. In this
paper, theoretical derivation and mathematical experiments are employed to
reveal the analogy between fractal patterns, 1/f noise, and the Zipf
distribution. First, multifractal processes are shown empirically to follow the
generalized Zipf's law. Second, a 1/f spectrum is identical in mathematical form to
Zipf's law. Third, both 1/f spectra and Zipf's law can be converted into a
self-similar hierarchy. Fourth, fractals, 1/f spectra, Zipf's law, and the
occurrence of large catastrophic events can be described with similar
exponential laws and power laws. The self-similar hierarchy is a more general
framework that can encompass and unify different scaling phenomena and rules in
both physical and social systems, such as cities, rivers, earthquakes, fractals,
1/f noise, and rank-size distributions. The mathematical laws of the
hierarchical structure can provide a holistic perspective on complexity,
including self-organized criticality (SOC).
Comment: 20 pages, 9 figures, 3 tables
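As a quick numerical illustration of the stated analogy (our sketch, not the paper's experiments; it assumes only NumPy), the following snippet synthesizes a 1/f signal, recovers its spectral exponent from a log-log fit, and fits a Zipf rank-size exponent in exactly the same way: both relations reduce to straight lines in log-log coordinates.

```python
# Minimal sketch: a 1/f spectrum S(f) ~ f**(-beta) and Zipf's rank-size law
# P(r) ~ r**(-alpha) share the same power-law form. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# 1/f noise via spectral synthesis: random phases, amplitudes ~ f**(-1/2)
n = 2**14
freqs = np.fft.rfftfreq(n, d=1.0)[1:]             # drop the zero frequency
phases = rng.uniform(0, 2 * np.pi, freqs.size)
spectrum = freqs**-0.5 * np.exp(1j * phases)      # power = amplitude**2 ~ 1/f
signal = np.fft.irfft(np.concatenate(([0], spectrum)), n)  # restore DC term

# estimate the spectral exponent beta from log S(f) = -beta * log f + c
power = np.abs(np.fft.rfft(signal))[1:]**2
beta = -np.polyfit(np.log(freqs), np.log(power), 1)[0]

# Zipf rank-size data P(r) = r**(-alpha); fit alpha the same way
ranks = np.arange(1, 1001)
sizes = ranks**-1.0
alpha = -np.polyfit(np.log(ranks), np.log(sizes), 1)[0]

print(f"spectral exponent beta ~ {beta:.2f}, Zipf exponent alpha ~ {alpha:.2f}")
# Both fits are straight lines in log-log coordinates: the identity of
# mathematical form that the abstract formalizes.
```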
Long-Range Correlation Underlying Childhood Language and Generative Models
Long-range correlation, a property of time series exhibiting long-term
memory, is mainly studied in the statistical physics domain and has been
reported to exist in natural language. Using a state-of-the-art method for such
analysis, long-range correlation is first shown to occur in long CHILDES data
sets. To understand why, Bayesian generative models of language, originally
proposed in the cognitive scientific domain, are investigated. Among
representative models, the Simon model was found to exhibit surprisingly good
long-range correlation, whereas the Pitman-Yor model did not. Since the Simon
model is known not to reflect the vocabulary growth of natural language
correctly, a simple new model is devised as a combination of the Simon and
Pitman-Yor models, such that long-range correlation holds with a correct
vocabulary growth rate. The investigation overall suggests that uniform
sampling is one cause of long-range correlation and could thus be related to
actual linguistic processes.
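For concreteness, here is a minimal sketch (our illustration, not the paper's code) of the Simon model: at each step a brand-new word is introduced with probability a, and otherwise a token is copied from a position sampled uniformly over the entire past sequence. Uniform sampling over the whole history is the ingredient the abstract links to long-range correlation; the printed checkpoints also expose the model's known flaw, linear rather than sublinear (Heaps-like) vocabulary growth.

```python
import random

def simon_model(length, a=0.1, seed=0):
    """Simon (1955) style generator: innovate with probability a, else copy
    a token drawn uniformly from the whole past sequence."""
    rng = random.Random(seed)
    seq = [0]                       # word types are integers; start with word 0
    next_word = 1
    while len(seq) < length:
        if rng.random() < a:        # innovation: a brand-new word type
            seq.append(next_word)
            next_word += 1
        else:                       # uniform sampling over the entire history
            seq.append(rng.choice(seq))
    return seq

seq = simon_model(100_000)
for n in (10_000, 50_000, 100_000):
    print(n, len(set(seq[:n])))     # vocabulary grows ~ a*n: linear, not Heaps-like
```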
Rank diversity of languages: Generic behavior in computational linguistics
Statistical studies of languages have focused on the rank-frequency
distribution of words. Instead, we introduce here a measure of how word ranks
change in time and call this distribution "rank diversity". We calculate
this diversity for books published in six European languages since 1800, and
find that it follows a universal lognormal distribution. Based on the mean and
standard deviation associated with the lognormal distribution, we define three
different word regimes of languages: "heads" consist of words whose rank hardly
changes in time, "bodies" are words of general use, while "tails" consist of
context-specific words whose rank varies considerably in time. The heads and
bodies reflect the size of language cores identified by
linguists for basic communication. We propose a Gaussian random walk model
which reproduces the rank variation of words in time and thus the diversity.
Rank diversity of words can be understood as the result of random variations in
rank, where the size of the variation depends on the rank itself. We find that
the core size is similar for all languages studied.
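A toy version of such a model can be written in a few lines (our sketch, with illustrative values for N, T, and sigma rather than fitted parameters): word scores perform Gaussian random walks, words are re-ranked at every step, and the diversity of a rank is the number of distinct words that have occupied it, normalized by the number of time steps. Note that even a constant step in score space yields rank-dependent rank variation, because top scores are much more widely spaced than mid-table ones.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, sigma = 1000, 200, 0.05                    # illustrative choices

scores = np.sort(rng.normal(0, 1, N))[::-1]      # initial scores, already ranked
occupants = np.zeros((T, N), dtype=int)          # occupants[t, k] = word at rank k

for t in range(T):
    scores = scores + rng.normal(0, sigma, N)    # Gaussian step for every word
    occupants[t] = np.argsort(-scores)           # re-rank by descending score

# rank diversity: distinct words seen at rank k over time, normalized by T
diversity = np.array([len(set(occupants[:, k])) for k in range(N)]) / T
print("top ranks   :", diversity[:5])                   # low: "heads" barely move
print("middle ranks:", diversity[N // 2:N // 2 + 5])    # high: constant churn
```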
Stochastic model for the vocabulary growth in natural languages
We propose a stochastic model for the number of different words in a given
database which incorporates the dependence on the database size and historical
changes. The main feature of our model is the existence of two different
classes of words: (i) a finite number of core-words, which have higher frequency
and do not affect the probability that a new word will be used; and (ii) the
remaining, virtually infinite number of noncore-words, which have lower
frequency and, once used, reduce the probability that a new word will be used
in the future. Our model relies on a careful analysis of the Google Ngram
database of books published over the past centuries; its main consequence is
the generalization of Zipf's and Heaps' laws to two scaling regimes. We confirm
that these
generalizations yield the best simple description of the data among generic
descriptive models and that the two free parameters depend only on the language
but not on the database. From the point of view of our model, the main change
on historical time scales is in the composition of the specific words included
in the finite list of core-words, which we observe to decay exponentially in
time at a rate of approximately 30 words per year for English.
Comment: corrected typos and errors in reference list; 10 pages text, 15 pages supplemental material; to appear in Physical Review
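A hedged toy implementation of the two-class idea (our sketch, not the paper's exact model; b, p_core, alpha, and gamma are illustrative parameters, not fitted values): core words are drawn with a fixed probability and leave the innovation rate untouched, while each new noncore word lowers the chance of further new words, here via a decay alpha/V**gamma in the noncore vocabulary size V. This produces the sublinear, generalized-Heaps growth the abstract describes.

```python
import random

def generate(n_tokens, b=100, p_core=0.5, alpha=1.0, gamma=0.5, seed=0):
    """Toy two-class vocabulary model; returns (tokens, vocabulary) checkpoints."""
    rng = random.Random(seed)
    V = 1                                   # noncore vocabulary used so far
    checkpoints = []
    for t in range(1, n_tokens + 1):
        if rng.random() >= p_core:          # this token is a noncore word
            if rng.random() < min(1.0, alpha * V ** -gamma):
                V += 1                      # a brand-new noncore word appears
        if t % (n_tokens // 10) == 0:
            checkpoints.append((t, b + V))  # total vocabulary = core + noncore
    return checkpoints

for t, v in generate(100_000):
    print(t, v)                             # sublinear: V ~ t**(1 / (1 + gamma))
```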
Large-scale analysis of Zipf's law in English texts
Despite being a paradigm of quantitative linguistics, Zipf's law for words
suffers from three main problems: its formulation is ambiguous, its validity
has not been tested rigorously from a statistical point of view, and it has not
been confronted with a representatively large number of texts. Thus, the
current support for Zipf's law in texts can be summarized as anecdotal.
We try to solve these issues by studying three different versions of Zipf's
law and fitting them to all available English texts in the Project Gutenberg
database (consisting of more than 30,000 texts). To do so we use
state-of-the-art tools for fitting and goodness-of-fit testing, carefully
tailored to the
peculiarities of text statistics. Remarkably, one of the three versions of
Zipf's law, consisting of a pure power-law form in the complementary cumulative
distribution function of word frequencies, is able to fit more than 40% of the
texts in the database (at the 0.05 significance level), for the whole domain of
frequencies (from 1 to the maximum value) and with only one free parameter (the
exponent).
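The successful version corresponds to a discrete power law p(n) proportional to n**(-gamma) over the full range of word frequencies n = 1, 2, .... A minimal sketch of the fitting step (our illustration; it assumes SciPy is available, and the goodness-of-fit test via parametric bootstrap of the Kolmogorov-Smirnov statistic is omitted for brevity) estimates gamma by maximum likelihood:

```python
import numpy as np
from scipy.special import zeta
from scipy.optimize import minimize_scalar

def fit_discrete_power_law(freqs):
    """MLE for gamma in p(n) = n**(-gamma) / zeta(gamma), n = 1, 2, ..."""
    mean_log = np.log(np.asarray(freqs, dtype=float)).mean()
    # negative log-likelihood per word: gamma * <log n> + log zeta(gamma)
    nll = lambda g: g * mean_log + np.log(zeta(g, 1))
    return minimize_scalar(nll, bounds=(1.0001, 10.0), method="bounded").x

# toy check on synthetic Zipfian frequencies; a real analysis would feed in
# the word-frequency list of each Project Gutenberg text instead
rng = np.random.default_rng(2)
sample = rng.zipf(1.9, size=5000)
print(f"fitted exponent: {fit_discrete_power_law(sample):.2f}")   # close to 1.9
```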
Optimal coding and the origins of Zipfian laws
The problem of compression in standard information theory consists of
assigning codes as short as possible to numbers. Here we consider the problem
of optimal coding -- under an arbitrary coding scheme -- and show that it
predicts Zipf's law of abbreviation, namely a tendency in natural languages for
more frequent words to be shorter. We apply this result to investigate optimal
coding also under so-called non-singular coding, a scheme where unique
segmentation is not guaranteed but each code stands for a distinct number. Optimal
non-singular coding predicts that the length of a word should grow
approximately as the logarithm of its frequency rank, which is again consistent
with Zipf's law of abbreviation. Optimal non-singular coding in combination
with the maximum entropy principle also predicts Zipf's rank-frequency
distribution. Furthermore, our findings on optimal non-singular coding
challenge common beliefs about random typing. It turns out that random typing
is in fact an optimal coding process, in stark contrast with the common
assumption that it is detached from cost-cutting considerations. Finally, we
discuss the implications of optimal coding for the construction of a compact
theory of Zipfian laws and other linguistic laws.
Comment: in press in the Journal of Quantitative Linguistics; definition of concordant pair corrected, proofs polished, references updated
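The logarithmic prediction has a simple counting argument behind it: over an alphabet of size A there are only A**L distinct strings of length L, so an optimal non-singular code hands the shortest available strings to the most frequent words, and the word of frequency rank r ends up with a code of length roughly log_A(r). A small sketch (our illustration, not the paper's proofs):

```python
import math

def optimal_nonsingular_lengths(n_words, alphabet_size=2):
    """Code lengths when the shortest distinct strings go to the most frequent
    words (rank 1 first); there are alphabet_size**L strings of length L."""
    lengths, L = [], 1
    while len(lengths) < n_words:
        lengths.extend([L] * (alphabet_size ** L))
        L += 1
    return lengths[:n_words]

A = 2
lengths = optimal_nonsingular_lengths(1000, A)
for rank in (1, 10, 100, 1000):
    print(rank, lengths[rank - 1], round(math.log(rank + 1, A), 1))
    # code length grows step-wise like log_A(rank): Zipf's law of abbreviation
```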