
    The frequency spectrum of finite samples from the intermittent silence process

    It has been argued that the actual distribution of word frequencies could be reproduced or explained by generating a random sequence of letters and spaces according to the so-called intermittent silence process. The same kind of process could reproduce or explain the counts of other kinds of units from a wide range of disciplines. Adopting the linguistic metaphor, we focus on the frequency spectrum, i.e., the number of words with a certain frequency, and the vocabulary size, i.e., the number of different words in a text generated by an intermittent silence process. We derive and explain how to calculate accurately and efficiently the expected frequency spectrum and the expected vocabulary size as a function of the text size. Peer Reviewed. Postprint (author's final draft)
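    The two quantities defined in the abstract above can be illustrated with a small simulation. This is only a minimal sketch, not the paper's analytical calculation: the function names, the equal letter probabilities, the default parameter values, and the decision to discard empty words between consecutive spaces are all assumptions made for illustration.

```python
import random
from collections import Counter


def intermittent_silence_text(n_chars, n_letters=2, p_space=0.2, seed=0):
    """Emit n_chars characters: a space with probability p_space,
    otherwise one of n_letters equally likely letters (assumed variant)."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"[:n_letters]
    return "".join(
        " " if rng.random() < p_space else rng.choice(letters)
        for _ in range(n_chars)
    )


def frequency_spectrum(text):
    """Return (spectrum, vocabulary_size), where spectrum[f] is the number
    of distinct words that occur exactly f times in the text."""
    # str.split() with no argument drops empty words (an assumption here).
    word_counts = Counter(text.split())
    spectrum = Counter(word_counts.values())
    return spectrum, len(word_counts)


text = intermittent_silence_text(5000)
spectrum, vocabulary_size = frequency_spectrum(text)
print("vocabulary size:", vocabulary_size)
print("words occurring once:", spectrum[1])
```

    Averaging these counts over many seeds would give a Monte Carlo estimate of the expected values as a function of text size, which is the slow alternative to the exact calculation the abstract describes.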

    Can simple models explain Zipf’s law for all exponents?

    H. Simon proposed a simple stochastic process for explaining Zipf’s law for word frequencies. Here we introduce two similar generalizations of Simon’s model that cover the same range of exponents as the standard Simon model. The mathematical approach followed minimizes the amount of mathematical background needed for deriving the exponent, compared to previous approaches to the standard Simon model. Reviewing what is known from other simple explanations of Zipf’s law, we conclude there is no single radically simple explanation covering the whole range of variation of the exponent of Zipf’s law in humans. The meaningfulness of Zipf’s law for word frequencies remains an open question. Peer Reviewed. Postprint (published version)

    Current challenges for preseismic electromagnetic emissions: shedding light from micro-scale plastic flow, granular packings, phase transitions and self-affinity notion of fracture process

    Are there credible electromagnetic (EM) earthquake (EQ) precursors? This is a question debated in the scientific community, and there may be legitimate reasons for the critical views. The negative view concerning the existence of EM precursors is reinforced by features accompanying their observation that are considered paradoxical, namely that these signals (i) are not observed at the time of EQ occurrence or during the aftershock period, (ii) are not accompanied by large precursory strain changes, (iii) are not accompanied by simultaneous geodetic or seismological precursors, and (iv) have traceability that is considered problematic. In this work, the detected candidate EM precursors are studied through a shift in thinking towards basic science findings on granular packings, micron-scale plastic flow, interface depinning, fracture size effects, concepts drawn from phase transitions, the self-affine notion of the fracture and faulting process, universal features of fracture surfaces, recent high-quality laboratory studies, theoretical models and numerical simulations. Strict criteria are established for classifying an emerged EM anomaly as preseismic, and precursory EM features that have been considered paradoxes are explained. A three-stage model for EQ generation by means of preseismic fracture-induced EM emissions is proposed. The claim that the observed EM precursors may permit real-time, step-by-step monitoring of EQ generation is tested.

    Compression and the origins of Zipf's law for word frequencies

    Here we sketch a new derivation of Zipf's law for word frequencies based on optimal coding. The structure of the derivation is reminiscent of Mandelbrot's random typing model, but it has multiple advantages over random typing: (1) it starts from realistic cognitive pressures, (2) it does not require fine tuning of parameters, and (3) it sheds light on the origins of other statistical laws of language and thus can lead to a compact theory of linguistic laws. Our findings suggest that the recurrence of Zipf's law in human languages could originate from pressure for easy and fast communication. Comment: arguments have been improved; in press in Complexity (Wiley)

    Optimal coding and the origins of Zipfian laws

    The problem of compression in standard information theory consists of assigning codes as short as possible to numbers. Here we consider the problem of optimal coding under an arbitrary coding scheme and show that it predicts Zipf's law of abbreviation, namely a tendency in natural languages for more frequent words to be shorter. We apply this result to investigate optimal coding also under so-called non-singular coding, a scheme where unique segmentation is not warranted but codes stand for a distinct number. Optimal non-singular coding predicts that the length of a word should grow approximately as the logarithm of its frequency rank, which is again consistent with Zipf's law of abbreviation. Optimal non-singular coding in combination with the maximum entropy principle also predicts Zipf's rank-frequency distribution. Furthermore, our findings on optimal non-singular coding challenge common beliefs about random typing. It turns out that random typing is in fact an optimal coding process, in stark contrast with the common assumption that it is detached from cost-cutting considerations. Finally, we discuss the implications of optimal coding for the construction of a compact theory of Zipfian laws and other linguistic laws. Comment: in press in the Journal of Quantitative Linguistics; definition of concordant pair corrected, proofs polished, references updated
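    The logarithmic growth of word length with frequency rank mentioned in the abstract above can be seen in a small sketch. It assumes one simple reading of optimal non-singular coding: words are sorted by decreasing frequency and each receives the shortest string not yet used, with the empty string excluded. The function name and this exact assignment rule are illustrative assumptions, not taken from the paper.

```python
def nonsingular_code_lengths(n_words, alphabet_size):
    """Code length for frequency ranks 1..n_words when every word gets the
    shortest unused string over an alphabet of the given size.
    Assumption: the empty string is not a valid code."""
    if n_words < 0 or alphabet_size < 1:
        raise ValueError("need n_words >= 0 and alphabet_size >= 1")
    lengths = []
    length = 1
    remaining = alphabet_size  # unused strings of the current length
    for _ in range(n_words):
        if remaining == 0:
            # All strings of this length are taken; move to the next length.
            length += 1
            remaining = alphabet_size ** length
        lengths.append(length)
        remaining -= 1
    return lengths


# With a binary alphabet: 2 codes of length 1, then 4 of length 2, then 8 of length 3.
print(nonsingular_code_lengths(7, 2))  # [1, 1, 2, 2, 2, 2, 3]
```

    Because the number of available strings multiplies by the alphabet size with each extra letter, the length assigned to rank i grows roughly as the logarithm of i in that base.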

    Information content versus word length in random typing

    Recently, it has been claimed that a linear relationship between a measure of information content and word length is expected from word length optimization, and it has been shown that this linearity is supported by a strong correlation between information content and word length in many languages (Piantadosi et al. 2011, PNAS 108, 3825-3826). Here, we study in detail some connections between this measure and standard information theory. The relationship between the measure and word length is studied for the popular random typing process, where a text is constructed by pressing keys at random on a keyboard containing letters and a space that acts as a word delimiter. Although this random process does not optimize word lengths according to information content, it exhibits a linear relationship between information content and word length. The exact slope and intercept are presented for three major variants of the random typing process. A strong correlation between information content and word length can simply arise from the units making up a word (e.g., letters) and not necessarily from the interplay between a word and its context as proposed by Piantadosi et al. In itself, the linear relation does not entail the results of any optimization process.

    Parallels of human language in the behavior of bottlenose dolphins

    A short review of similarities between dolphins and humans with the help of quantitative linguistics and information theory.

    Two Universality Properties Associated with the Monkey Model of Zipf's Law

    The distribution of word probabilities in the monkey model of Zipf's law is associated with two universality properties: (1) the power law exponent converges strongly to -1 as the alphabet size increases and the letter probabilities are specified as the spacings from a random division of the unit interval for any distribution with a bounded density function on [0,1]; and (2) on a logarithmic scale, the version of the model with a finite word length cutoff and unequal letter probabilities is approximately normally distributed in the part of the distribution away from the tails. The first property is proved using a remarkably general limit theorem for the logarithm of sample spacings from Shao and Hahn, and the second property follows from Anscombe's central limit theorem for a random number of i.i.d. random variables. The finite word length model leads to a hybrid Zipf-lognormal mixture distribution closely related to work in other areas. Comment: 14 pages, 3 figures