Information content versus word length in random typing
Recently, it has been claimed that a linear relationship between a measure of
information content and word length is expected from word length optimization
and it has been shown that this linearity is supported by a strong correlation
between information content and word length in many languages (Piantadosi et
al. 2011, PNAS 108, 3825-3826). Here, we study in detail some connections
between this measure and standard information theory. The relationship between
the measure and word length is studied for the popular random typing process
where a text is constructed by pressing keys at random from a keyboard
containing letters and a space behaving as a word delimiter. Although this
random process does not optimize word lengths according to information content,
it exhibits a linear relationship between information content and word length.
The exact slope and intercept are presented for three major variants of the
random typing process. A strong correlation between information content and
word length can simply arise from the units making a word (e.g., letters) and
not necessarily from the interplay between a word and its context as proposed
by Piantadosi et al. In itself, the linear relation does not entail the results
of any optimization process.
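The linear relation described above can be illustrated with a minimal simulation of the random typing process. The keyboard size (10 letters) and space probability (0.2) below are hypothetical choices, not parameters from the paper; the mean empirical information content, -log2 p(word), then grows roughly linearly with word length, at about log2(A / (1 - P_SPACE)) bits per extra letter, without any optimization taking place:

```python
import math
import random
from collections import Counter

random.seed(0)
ALPHABET = "abcdefghij"  # hypothetical 10-letter keyboard
P_SPACE = 0.2            # hypothetical probability of hitting the space bar

def random_typing(n_keys):
    """Press n_keys keys uniformly at random; spaces delimit words."""
    keys = [" " if random.random() < P_SPACE else random.choice(ALPHABET)
            for _ in range(n_keys)]
    return [w for w in "".join(keys).split(" ") if w]

words = random_typing(200_000)
freq = Counter(words)
total = sum(freq.values())

# Mean empirical information content, -log2 p(word), by word length:
# for random typing it increases roughly linearly with length.
mean_info = {}
for length in range(1, 5):
    infos = [(-math.log2(c / total), c)
             for w, c in freq.items() if len(w) == length]
    mean_info[length] = (sum(i * c for i, c in infos)
                         / sum(c for _, c in infos))
    print(length, round(mean_info[length], 2))
```

With these illustrative parameters the slope per letter is close to log2(10 / 0.8), about 3.6 bits.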
The span of correlations in dolphin whistle sequences
Long-range correlations are found in symbolic sequences from human language, music and DNA. Determining the span of correlations in dolphin whistle sequences is crucial for shedding light on their communicative complexity. Dolphin whistles share various statistical properties with human words, e.g. Zipf's law for word frequencies (namely, that the probability of the i-th most frequent word of a text is approximately proportional to i^(-a)) and a parallel of the tendency of more frequent words to have more meanings. The finding of Zipf's law for word frequencies in dolphin whistles has been the topic of an intense debate on its implications. One of the major arguments against the relevance of Zipf's law in dolphin whistles is that it is not possible to distinguish the outcome of a die-rolling experiment from that of a linguistic or communicative source producing Zipf's law for word frequencies. Here we show that statistically significant whistle-whistle correlations extend back to the second previous whistle in the sequence, using a global randomization test, and to the fourth previous whistle, using a local randomization test. None of these correlations are expected from a die-rolling experiment or from other simple explanations of Zipf's law for word frequencies, such as Simon's model, that produce sequences of unpredictable elements.
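A global randomization test of the kind mentioned above can be sketched as follows. The whistle recordings themselves are not reproduced here, so a synthetic Markov sequence serves as a stand-in; the test statistic is a plug-in mutual information between symbols a fixed lag apart, compared against its distribution over fully shuffled copies of the sequence:

```python
import math
import random
from collections import Counter

random.seed(1)

def mutual_info(seq, lag):
    """Plug-in mutual information (bits) between symbols `lag` apart."""
    pairs = Counter(zip(seq, seq[lag:]))
    singles = Counter(seq)
    n_pairs = sum(pairs.values())
    n = len(seq)
    mi = 0.0
    for (a, b), c in pairs.items():
        p_ab = c / n_pairs
        mi += p_ab * math.log2(p_ab / ((singles[a] / n) * (singles[b] / n)))
    return mi

# Synthetic stand-in with short-range structure: each symbol repeats the
# previous one with probability 0.5, otherwise it is drawn uniformly.
symbols = "ABCD"
seq = [random.choice(symbols)]
for _ in range(5000):
    seq.append(seq[-1] if random.random() < 0.5 else random.choice(symbols))

# Global randomization test: compare the observed statistic at lag 1
# against its null distribution over shuffled copies of the sequence.
observed = mutual_info(seq, 1)
null = []
for _ in range(200):
    shuffled = seq[:]
    random.shuffle(shuffled)
    null.append(mutual_info(shuffled, 1))
p_value = sum(m >= observed for m in null) / len(null)
print(f"MI(lag=1) = {observed:.3f} bits, p = {p_value:.3f}")
```

A die-rolling sequence run through the same test would give a non-significant p-value at every lag, which is the contrast the abstract exploits.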
Memory and long-range correlations in chess games
In this paper we report the existence of long-range memory in the opening
moves of a chronologically ordered set of chess games using an extensive chess
database. We used two mapping rules to build discrete time series and analyzed
them using two methods for detecting long-range correlations: rescaled range
analysis and detrended fluctuation analysis. We found that long-range memory is
related to the level of the players. When the database is filtered according to
player levels we found differences in the persistence of the different subsets.
For high level players, correlations are stronger at long time scales; whereas
in intermediate and low level players they reach the maximum value at shorter
time scales. This can be interpreted as a signature of the different strategies
used by players with different levels of expertise. These results are robust
against the mapping rules and the method employed in the analysis of the
time series.
Comment: 12 pages, 5 figures. Published in Physica
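Detrended fluctuation analysis, one of the two methods named above, can be sketched as follows. The window sizes and series length are illustrative choices, and white noise is used as a reference signal whose expected exponent is alpha = 0.5 (no long-range memory); persistent series such as the high-level-player subsets would yield alpha > 0.5:

```python
import math
import random

random.seed(2)

def dfa_exponent(x, scales):
    """Detrended fluctuation analysis: slope of log F(s) vs log s."""
    mean = sum(x) / len(x)
    profile, acc = [], 0.0
    for v in x:
        acc += v - mean
        profile.append(acc)  # profile = cumulative sum of centered series
    log_s, log_f = [], []
    for s in scales:
        n_win = len(profile) // s
        t = list(range(s))
        tm = sum(t) / s
        denom = sum((ti - tm) ** 2 for ti in t)
        f2 = 0.0
        for w in range(n_win):
            seg = profile[w * s:(w + 1) * s]
            sm = sum(seg) / s
            slope = sum((ti - tm) * (si - sm)
                        for ti, si in zip(t, seg)) / denom
            # Residual variance after removing the window's linear trend.
            f2 += sum((si - sm - slope * (ti - tm)) ** 2
                      for ti, si in zip(t, seg)) / s
        log_s.append(math.log(s))
        log_f.append(math.log(math.sqrt(f2 / n_win)))
    n = len(log_s)
    xm = sum(log_s) / n
    ym = sum(log_f) / n
    return (sum((a - xm) * (b - ym) for a, b in zip(log_s, log_f))
            / sum((a - xm) ** 2 for a in log_s))

# White noise has no long-range memory: the expected exponent is 0.5.
noise = [random.gauss(0, 1) for _ in range(4000)]
alpha = dfa_exponent(noise, [8, 16, 32, 64, 128])
print(round(alpha, 2))
```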
Dynamic Analysis of Turkish Texts (Türkçe Metinlerin Dinamik Analizi)
Conference Paper -- Theoretical and Applied Mechanics Turkish National Committee, 2008. In this work, a new dependent variable that is not based on word-frequency corpora has been proposed. Using this parameterization, time series have been formed from texts in several natural languages, and the existence of long-range correlations and fractal structures has been demonstrated. A scaling parameter characterizing the long-range correlations has been sought using both Detrended Fluctuation Analysis (DFA) and Diffusion Entropy Analysis. Results from the two analyses are in agreement, reveal two distinct regimes, and show that different languages can be characterized and distinguished by this analysis.
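Diffusion entropy analysis, the second method mentioned above, can be sketched under simple assumptions: an uncorrelated +/-1 series stands in for the text-derived series, and the scaling exponent delta is read off from the growth of the Shannon entropy of windowed sums, S(t) = A + delta * ln(t), with delta = 0.5 expected for uncorrelated data:

```python
import math
import random

random.seed(3)

def diffusion_entropy(x, window):
    """Shannon entropy of the windowed sums (diffusion trajectories) of x."""
    prefix = [0]
    for v in x:
        prefix.append(prefix[-1] + v)
    sums = [prefix[i + window] - prefix[i]
            for i in range(len(x) - window + 1)]
    counts = {}
    for s in sums:
        counts[s] = counts.get(s, 0) + 1
    n = len(sums)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# For an uncorrelated +/-1 series, S(t) = A + delta * ln(t) with delta = 0.5;
# long-range correlated series would give a different scaling exponent.
series = [random.choice((-1, 1)) for _ in range(20_000)]
t1, t2 = 16, 256
delta = ((diffusion_entropy(series, t2) - diffusion_entropy(series, t1))
         / (math.log(t2) - math.log(t1)))
print(round(delta, 2))
```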
Quantifying origin and character of long-range correlations in narrative texts
In natural language, using short sentences is considered efficient for
communication. However, a text composed exclusively of short sentences looks
technical and is tedious to read, while a text composed of long sentences
demands significantly more effort to comprehend. Studying characteristics
of the sentence length variability (SLV) in a large corpus of world-famous
literary texts shows that an appealing and aesthetic optimum appears somewhere
in between and involves a self-similar, cascade-like alternation of sentences
of various lengths. A related quantitative observation is that the power spectra
S(f) of thus characterized SLV universally develop a convincing `1/f^beta'
scaling with the average exponent beta =~ 1/2, close to what has been
identified before in musical compositions or in the brain waves. An
overwhelming majority of the studied texts exhibits such fractal attributes,
but hypertext-like, "stream of consciousness" novels are especially spectacular
in this respect. In addition, they appear to develop structures
characteristic of irreducibly interwoven sets of fractals called multifractals.
Scaling of S(f) in the present context implies the existence of long-range
correlations in texts, and the appearance of multifractality indicates that
these correlations even carry a nonlinear component. A distinct role of the
full stops in inducing the long-range correlations is evidenced by the fact
that the above quantitative characteristics manifest themselves in the
variation of the full-stop recurrence times along texts, thus in the SLV, but
to a much lesser degree in the recurrence times of the most frequent words. In
the latter case the nonlinear correlations, and thus multifractality, disappear
completely for all the texts considered. Treated as one extra word, however,
the full stops nevertheless appear to obey the Zipfian rank-frequency
distribution.
Comment: 28 pages, 8 figures, accepted for publication in Information Science
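The `1/f^beta' characterization of sentence-length variability can be sketched with a naive periodogram. An uncorrelated stand-in series of "sentence lengths" (not an actual literary text) is used below, for which the fitted exponent should be near 0, in contrast with the beta close to 1/2 reported above for literary texts:

```python
import cmath
import math
import random

random.seed(4)

def periodogram(x):
    """Naive DFT periodogram S(f_k) = |X_k|^2 / N for k = 1 .. N/2 - 1."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    spec = []
    for k in range(1, n // 2):
        xk = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                 for i, v in enumerate(centered))
        spec.append(abs(xk) ** 2 / n)
    return spec

def spectral_exponent(spec):
    """beta from a least-squares fit of log S(f) ~ -beta * log f."""
    pts = [(math.log(k), math.log(s))
           for k, s in enumerate(spec, start=1) if s > 0]
    xm = sum(a for a, _ in pts) / len(pts)
    ym = sum(b for _, b in pts) / len(pts)
    slope = (sum((a - xm) * (b - ym) for a, b in pts)
             / sum((a - xm) ** 2 for a, _ in pts))
    return -slope

# Uncorrelated stand-in "sentence lengths": the spectrum is flat (beta ~ 0);
# self-similar SLV in literary texts instead scales with beta close to 1/2.
lengths = [random.randint(3, 40) for _ in range(512)]
beta = spectral_exponent(periodogram(lengths))
print(round(beta, 2))
```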
Do Neural Nets Learn Statistical Laws behind Natural Language?
The performance of deep learning in natural language processing has been
spectacular, but the reasons for this success remain unclear because of the
inherent complexity of deep learning. This paper provides empirical evidence of
its effectiveness and of a limitation of neural networks for language
engineering. Precisely, we demonstrate that a neural language model based on
long short-term memory (LSTM) effectively reproduces Zipf's law and Heaps' law,
two representative statistical properties underlying natural language. We
discuss the quality of reproducibility and the emergence of Zipf's law and
Heaps' law as training progresses. We also point out that the neural language
model has a limitation in reproducing long-range correlation, another
statistical property of natural language. This understanding could provide a
direction for improving the architectures of neural networks.
Comment: 21 pages, 11 figures
Languages cool as they expand: Allometric scaling and the decreasing need for new words
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This "cooling pattern" forms the basis of a third statistical regularity which, unlike the Zipf and Heaps laws, is dynamical in nature.
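The allometric scaling test can be sketched with a synthetic Zipfian corpus as a hypothetical stand-in for the book data: fitting V = k * N^b on log-log axes yields an exponent b < 1, the signature of a decreasing marginal need for new words as the corpus grows:

```python
import math
import random

random.seed(6)

# Hypothetical corpus stand-in: Zipfian tokens over a 20000-word lexicon.
W = 20000
tokens = random.choices(range(W), weights=[1 / r for r in range(1, W + 1)],
                        k=200_000)

# Vocabulary size V(N) at geometrically spaced corpus sizes N.
checkpoints = [2_000, 20_000, 200_000]
seen, vocab = set(), []
for i, t in enumerate(tokens, 1):
    seen.add(t)
    if i in checkpoints:
        vocab.append(len(seen))

# Log-log fit of V against N: an allometric exponent b < 1 reflects a
# decreasing marginal need for new words as the corpus grows.
pts = [(math.log(n), math.log(v)) for n, v in zip(checkpoints, vocab)]
xm = sum(a for a, _ in pts) / len(pts)
ym = sum(c for _, c in pts) / len(pts)
b_exp = (sum((a - xm) * (c - ym) for a, c in pts)
         / sum((a - xm) ** 2 for a, _ in pts))
print(round(b_exp, 2))
```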
Constant conditional entropy and related hypotheses
Constant entropy rate (conditional entropies must remain constant as the sequence length increases) and uniform information density (conditional probabilities must remain constant as the sequence length increases) are two information-theoretic principles that are argued to underlie a wide range of linguistic phenomena. Here we revise the predictions of these principles in the light of Hilberg's law on the scaling of conditional entropy in language and related laws. We show that constant entropy rate (CER) and two interpretations of uniform information density (UID), full UID and strong UID, are inconsistent with these laws. Strong UID implies CER, but the reverse is not true. Full UID, a particular case of UID, leads to costly uncorrelated sequences that are totally unrealistic. We conclude that CER and its particular cases are incomplete hypotheses about the scaling of conditional entropies.
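The scaling of conditional entropies can be probed with plug-in n-gram estimates. The sketch below uses an i.i.d. four-symbol ("die-rolling") sequence, for which the conditional entropy stays flat as the context grows, so CER holds trivially; Hilberg's law instead predicts decaying conditional entropies for natural language:

```python
import math
import random
from collections import Counter

random.seed(7)

def conditional_entropy(seq, n):
    """Plug-in estimate of H(X_n | X_1 .. X_{n-1}) from n-gram counts."""
    ngrams = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    contexts = Counter(tuple(seq[i:i + n - 1])
                       for i in range(len(seq) - n + 2))
    total = sum(ngrams.values())
    h = 0.0
    for g, c in ngrams.items():
        h -= (c / total) * math.log2(c / contexts[g[:-1]])
    return h

# i.i.d. four-symbol sequence: H(X_n | past) stays near 2 bits for every
# context length n, the flat profile that CER takes for granted.
seq = [random.randrange(4) for _ in range(50_000)]
for n in (1, 2, 3):
    print(n, round(conditional_entropy(seq, n), 3))
```

Note that plug-in estimates are biased downward once the number of contexts approaches the sample size, so context lengths are kept small here.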