1,648 research outputs found
The frequency spectrum of finite samples from the intermittent silence process
It has been argued that the actual distribution of word frequencies could be reproduced or explained by generating a random sequence of letters and spaces according to the so-called intermittent silence process. The same kind of process could reproduce or explain the counts of other kinds of units from a wide range of disciplines. Taking the linguistic metaphor, we focus on the frequency spectrum, i.e., the number of words with a certain frequency, and the vocabulary size, i.e., the number of different words in a text generated by an intermittent silence process. We derive the expected frequency spectrum and the expected vocabulary size as functions of the text size, and explain how to calculate them accurately and efficiently. Peer Reviewed. Postprint (author's final draft)
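As a rough illustration of the quantities named in this abstract, the sketch below simulates one simple variant of the process and counts the empirical frequency spectrum and vocabulary size. It is not the exact calculation derived in the paper. The function names, the three-letter alphabet, the space probability of 0.2, and the choice to discard empty words between consecutive spaces are all assumptions made for the example.

```python
import random
from collections import Counter

def intermittent_silence_text(n_chars, alphabet="abc", p_space=0.2, seed=0):
    """Generate n_chars symbols: a space with probability p_space,
    otherwise a letter drawn uniformly from the alphabet (assumed variant)."""
    rng = random.Random(seed)
    chars = []
    for _ in range(n_chars):
        if rng.random() < p_space:
            chars.append(" ")
        else:
            chars.append(rng.choice(alphabet))
    return "".join(chars)

def frequency_spectrum(text):
    """Return (word -> frequency, frequency -> number of words with it)."""
    # str.split() with no argument drops the empty words between repeated spaces.
    word_counts = Counter(text.split())
    spectrum = Counter(word_counts.values())
    return word_counts, spectrum

text = intermittent_silence_text(10_000)
word_counts, spectrum = frequency_spectrum(text)
vocabulary_size = len(word_counts)  # number of different words

print("vocabulary size:", vocabulary_size)
for freq in sorted(spectrum)[:5]:
    print(f"words occurring {freq} time(s): {spectrum[freq]}")
```

Averaging the spectrum over many seeds would approximate the expected values that the paper computes analytically.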
Can simple models explain Zipf’s law for all exponents?
H. Simon proposed a simple stochastic process for explaining Zipf’s law for word frequencies. Here we introduce two similar generalizations of Simon’s model that cover the same range of exponents as the standard Simon model. The mathematical approach followed minimizes the
amount of mathematical background needed for deriving the exponent, compared to previous approaches to the standard Simon’s model. Reviewing what is known from other simple explanations of Zipf’s law, we conclude there is no single radically simple explanation covering the whole range of variation of the exponent of Zipf’s law in humans. The meaningfulness of Zipf’s law for word frequencies remains an open question. Peer Reviewed. Postprint (published version)
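For readers unfamiliar with the process this abstract builds on, here is a minimal sketch of the standard Simon model as it is commonly described, not of the two generalizations the paper introduces. At each step a new word is added with probability alpha, and otherwise a token is copied from a uniformly chosen position in the text so far, which makes reuse proportional to current frequency. The function name, the integer word labels, and the value alpha = 0.1 are assumptions for the example.

```python
import random
from collections import Counter

def simon_model(n_tokens, alpha=0.1, seed=0):
    """Generate n_tokens word tokens (labelled by integers) with the
    standard Simon process; alpha is the probability of a new word."""
    rng = random.Random(seed)
    text = [0]          # the text starts with a single word
    next_word = 1       # label for the next unseen word
    while len(text) < n_tokens:
        if rng.random() < alpha:
            text.append(next_word)
            next_word += 1
        else:
            # Copying a uniformly chosen earlier token reuses each word
            # with probability proportional to its current frequency.
            text.append(rng.choice(text))
    return text

tokens = simon_model(5000)
counts = Counter(tokens)
ranked = sorted(counts.values(), reverse=True)  # frequencies by rank

for rank, freq in enumerate(ranked[:5], start=1):
    print(rank, freq)
```

Plotting `ranked` against rank on doubly logarithmic axes is the usual way to inspect the resulting rank-frequency curve.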
Current challenges for preseismic electromagnetic emissions: shedding light from micro-scale plastic flow, granular packings, phase transitions and self-affinity notion of fracture process
Are there credible electromagnetic (EM) EQ precursors? This is a question
debated in the scientific community and there may be legitimate reasons for the
critical views. The negative view concerning the existence of EM precursors is
enhanced by features that accompany their observation and are considered
paradoxical, namely, these signals: (i) are not observed at the time of EQ
occurrence and during the aftershock period, (ii) are not accompanied by large
precursory strain changes, (iii) are not accompanied by simultaneous geodetic
or seismological precursors and (iv) their traceability is considered
problematic. In this work, the detected candidate EM precursors are studied
through a shift in thinking towards the basic science findings relative to
granular packings, micron-scale plastic flow, interface depinning, fracture
size effects, concepts drawn from phase transitions, self-affine notion of
fracture and faulting process, universal features of fracture surfaces, recent
high quality laboratory studies, theoretical models and numerical simulations.
Strict criteria are established for the definition of an emerged EM anomaly as
a preseismic one, while precursory EM features, which have been considered
paradoxes, are explained. A three-stage model for EQ generation by means of
preseismic fracture-induced EM emissions is proposed. The claim that the
observed EM precursors may permit a real-time and step-by-step monitoring of
the EQ generation is tested.
Compression and the origins of Zipf's law for word frequencies
Here we sketch a new derivation of Zipf's law for word frequencies based on
optimal coding. The structure of the derivation is reminiscent of Mandelbrot's
random typing model but it has multiple advantages over random typing: (1) it
starts from realistic cognitive pressures (2) it does not require fine tuning
of parameters and (3) it sheds light on the origins of other statistical laws
of language and thus can lead to a compact theory of linguistic laws. Our
findings suggest that the recurrence of Zipf's law in human languages could
originate from pressure for easy and fast communication. Comment: arguments have been improved; in press in Complexity (Wiley)
IMPROVED MULTIPLE BIRDSONG TRACKING WITH DISTRIBUTION DERIVATIVE METHOD AND MARKOV RENEWAL PROCESS CLUSTERING
DS & MP are supported by an EPSRC Leadership Fellowship EP/G007144/1
Optimal coding and the origins of Zipfian laws
The problem of compression in standard information theory consists of
assigning codes as short as possible to numbers. Here we consider the problem
of optimal coding -- under an arbitrary coding scheme -- and show that it
predicts Zipf's law of abbreviation, namely a tendency in natural languages for
more frequent words to be shorter. We apply this result to investigate optimal
coding also under so-called non-singular coding, a scheme where unique
segmentation is not warranted but codes stand for a distinct number. Optimal
non-singular coding predicts that the length of a word should grow
approximately as the logarithm of its frequency rank, which is again consistent
with Zipf's law of abbreviation. Optimal non-singular coding in combination
with the maximum entropy principle also predicts Zipf's rank-frequency
distribution. Furthermore, our findings on optimal non-singular coding
challenge common beliefs about random typing. It turns out that random typing
is in fact an optimal coding process, in stark contrast with the common
assumption that it is detached from cost cutting considerations. Finally, we
discuss the implications of optimal coding for the construction of a compact
theory of Zipfian laws and other linguistic laws. Comment: in press in the Journal of Quantitative Linguistics; definition of
concordant pair corrected, proofs polished, references updated
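The logarithmic growth of length with rank under non-singular coding can be seen with a small counting sketch. It assumes the intuitive construction in which the most frequent words receive the shortest distinct strings; the paper's own derivation may proceed differently. An alphabet of N letters offers N^k strings of length k, so the length needed to reach rank i grows roughly like the base-N logarithm of i. The function names and the two-letter alphabet are assumptions for the example.

```python
from itertools import count, islice, product

def shortest_strings(alphabet):
    """Yield every non-empty string over the alphabet, shortest first."""
    for length in count(1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)

def nonsingular_code(n_words, alphabet="ab"):
    """Assign the n_words shortest distinct strings to ranks 1..n_words.
    Codes are distinct (non-singular) but not uniquely segmentable."""
    return list(islice(shortest_strings(alphabet), n_words))

codes = nonsingular_code(14, "ab")
lengths = [len(code) for code in codes]

for rank, (code, length) in enumerate(zip(codes, lengths), start=1):
    print(rank, code, length)
```

With two letters, ranks 1 to 2 get length 1, ranks 3 to 6 get length 2, and ranks 7 to 14 get length 3, so each extra letter doubles the number of ranks covered.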
Information content versus word length in random typing
Recently, it has been claimed that a linear relationship between a measure of
information content and word length is expected from word length optimization
and it has been shown that this linearity is supported by a strong correlation
between information content and word length in many languages (Piantadosi et
al. 2011, PNAS 108, 3825-3826). Here, we study in detail some connections
between this measure and standard information theory. The relationship between
the measure and word length is studied for the popular random typing process
where a text is constructed by pressing keys at random from a keyboard
containing letters and a space behaving as a word delimiter. Although this
random process does not optimize word lengths according to information content,
it exhibits a linear relationship between information content and word length.
The exact slope and intercept are presented for three major variants of the
random typing process. A strong correlation between information content and
word length can simply arise from the units making a word (e.g., letters) and
not necessarily from the interplay between a word and its context as proposed
by Piantadosi et al. In itself, the linear relation does not entail the results
of any optimization process.
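A minimal sketch of why a linear relation appears without any optimization, under assumptions that are simpler than the paper's setting: letters are equally likely, a space ends the word with probability p_space, and information content is taken as the plain surprisal of the whole word, which is not necessarily the contextual measure of Piantadosi et al. A word of length L then has probability ((1 - p_space) / N)^L * p_space, so its surprisal is L * log2(N / (1 - p_space)) - log2(p_space), a straight line in L. The function names and parameter values are hypothetical.

```python
import math

def word_probability(length, n_letters=26, p_space=0.2):
    """Probability of typing one specific word of the given length,
    then a space, with equally likely letters (assumed variant)."""
    p_letter = (1 - p_space) / n_letters
    return p_letter ** length * p_space

def surprisal_bits(length, n_letters=26, p_space=0.2):
    """Surprisal -log2 p(word) in bits."""
    return -math.log2(word_probability(length, n_letters, p_space))

values = [surprisal_bits(length) for length in range(1, 8)]
steps = [b - a for a, b in zip(values, values[1:])]  # increments per letter

slope = math.log2(26 / (1 - 0.2))   # predicted slope
intercept = -math.log2(0.2)         # predicted intercept
print("constant step per letter:", steps[0], "predicted slope:", slope)
```

Every added letter raises the surprisal by the same amount, which is all that linearity requires here.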
Parallels of human language in the behavior of bottlenose dolphins
A short review of similarities between dolphins and humans with the help of
quantitative linguistics and information theory.
Two Universality Properties Associated with the Monkey Model of Zipf's Law
The distribution of word probabilities in the monkey model of Zipf's law is
associated with two universality properties: (1) the power law exponent
converges strongly to -1 as the alphabet size increases and the letter
probabilities are specified as the spacings from a random division of the unit
interval for any distribution with a bounded density function on [0, 1]; and
(2), on a logarithmic scale the version of the model with a finite word length
cutoff and unequal letter probabilities is approximately normally distributed
in the part of the distribution away from the tails. The first property is
proved using a remarkably general limit theorem for the logarithm of sample
spacings from Shao and Hahn, and the second property follows from Anscombe's
central limit theorem for a random number of i.i.d. random variables. The
finite word length model leads to a hybrid Zipf-lognormal mixture distribution
closely related to work in other areas. Comment: 14 pages, 3 figures