Information content versus word length in random typing
Recently, it has been claimed that a linear relationship between a measure of
information content and word length is expected from word length optimization
and it has been shown that this linearity is supported by a strong correlation
between information content and word length in many languages (Piantadosi et
al. 2011, PNAS 108, 3825-3826). Here, we study in detail some connections
between this measure and standard information theory. The relationship between
the measure and word length is studied for the popular random typing process
where a text is constructed by pressing keys at random from a keyboard
containing letters and a space behaving as a word delimiter. Although this
random process does not optimize word lengths according to information content,
it exhibits a linear relationship between information content and word length.
The exact slope and intercept are presented for three major variants of the
random typing process. A strong correlation between information content and
word length can simply arise from the units making a word (e.g., letters) and
not necessarily from the interplay between a word and its context as proposed
by Piantadosi et al. In itself, the linear relation does not entail the results
of any optimization process.
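The linear relation described above can be illustrated with a minimal simulation of the random typing process. The keyboard size (10 letters) and space probability (0.2) below are hypothetical choices, not parameters from the paper; the mean empirical information content, -log2 p(word), then grows roughly linearly with word length, at about log2(A / (1 - P_SPACE)) bits per extra letter, without any optimization taking place:

```python
import math
import random
from collections import Counter

random.seed(0)
ALPHABET = "abcdefghij"  # hypothetical 10-letter keyboard
P_SPACE = 0.2            # hypothetical probability of hitting the space bar

def random_typing(n_keys):
    """Press n_keys keys uniformly at random; spaces delimit words."""
    keys = [" " if random.random() < P_SPACE else random.choice(ALPHABET)
            for _ in range(n_keys)]
    return [w for w in "".join(keys).split(" ") if w]

words = random_typing(200_000)
freq = Counter(words)
total = sum(freq.values())

# Mean empirical information content, -log2 p(word), by word length:
# for random typing it increases roughly linearly with length.
mean_info = {}
for length in range(1, 5):
    infos = [(-math.log2(c / total), c)
             for w, c in freq.items() if len(w) == length]
    mean_info[length] = (sum(i * c for i, c in infos)
                         / sum(c for _, c in infos))
    print(length, round(mean_info[length], 2))
```

With these illustrative parameters the slope per letter is close to log2(10 / 0.8), about 3.6 bits.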
The span of correlations in dolphin whistle sequences
Long-range correlations are found in symbolic sequences from human language, music and DNA. Determining the span of correlations in dolphin whistle sequences is crucial for shedding light on their communicative complexity. Dolphin whistles share various statistical properties with human words, e.g. Zipf's law for word frequencies (namely, that the probability of the i-th most frequent word of a text is approximately proportional to i^(-a)) and a parallel of the tendency of more frequent words to have more meanings. The finding of Zipf's law for word frequencies in dolphin whistles has been the topic of an intense debate on its implications. One of the major arguments against the relevance of Zipf's law in dolphin whistles is that it is not possible to distinguish the outcome of a die-rolling experiment from that of a linguistic or communicative source producing Zipf's law for word frequencies. Here we show that statistically significant whistle-whistle correlations extend back to the second previous whistle in the sequence, using a global randomization test, and to the fourth previous whistle, using a local randomization test. None of these correlations are expected from a die-rolling experiment or from other simple explanations of Zipf's law for word frequencies, such as Simon's model, that produce sequences of unpredictable elements.
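A global randomization test of the kind mentioned above can be sketched as follows. The whistle recordings themselves are not reproduced here, so a synthetic Markov sequence serves as a stand-in; the test statistic is a plug-in mutual information between symbols a fixed lag apart, compared against its distribution over fully shuffled copies of the sequence:

```python
import math
import random
from collections import Counter

random.seed(1)

def mutual_info(seq, lag):
    """Plug-in mutual information (bits) between symbols `lag` apart."""
    pairs = Counter(zip(seq, seq[lag:]))
    singles = Counter(seq)
    n_pairs = sum(pairs.values())
    n = len(seq)
    mi = 0.0
    for (a, b), c in pairs.items():
        p_ab = c / n_pairs
        mi += p_ab * math.log2(p_ab / ((singles[a] / n) * (singles[b] / n)))
    return mi

# Synthetic stand-in with short-range structure: each symbol repeats the
# previous one with probability 0.5, otherwise it is drawn uniformly.
symbols = "ABCD"
seq = [random.choice(symbols)]
for _ in range(5000):
    seq.append(seq[-1] if random.random() < 0.5 else random.choice(symbols))

# Global randomization test: compare the observed statistic at lag 1
# against its null distribution over shuffled copies of the sequence.
observed = mutual_info(seq, 1)
null = []
for _ in range(200):
    shuffled = seq[:]
    random.shuffle(shuffled)
    null.append(mutual_info(shuffled, 1))
p_value = sum(m >= observed for m in null) / len(null)
print(f"MI(lag=1) = {observed:.3f} bits, p = {p_value:.3f}")
```

A die-rolling sequence run through the same test would give a non-significant p-value at every lag, which is the contrast the abstract exploits.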
Memory and long-range correlations in chess games
In this paper we report the existence of long-range memory in the opening
moves of a chronologically ordered set of chess games using an extensive chess
database. We used two mapping rules to build discrete time series and analyzed
them using two methods for detecting long-range correlations: rescaled range
analysis and detrended fluctuation analysis. We found that long-range memory is
related to the level of the players. When the database is filtered according to
player levels we found differences in the persistence of the different subsets.
For high level players, correlations are stronger at long time scales; whereas
in intermediate and low level players they reach the maximum value at shorter
time scales. This can be interpreted as a signature of the different strategies
used by players with different levels of expertise. These results are robust
against the mapping rules and the method employed in the analysis of the
time series.
Comment: 12 pages, 5 figures. Published in Physica
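Detrended fluctuation analysis, one of the two methods named above, can be sketched as follows. The window sizes and series length are illustrative choices, and white noise is used as a reference signal whose expected exponent is alpha = 0.5 (no long-range memory); persistent series such as the high-level-player subsets would yield alpha > 0.5:

```python
import math
import random

random.seed(2)

def dfa_exponent(x, scales):
    """Detrended fluctuation analysis: slope of log F(s) vs log s."""
    mean = sum(x) / len(x)
    profile, acc = [], 0.0
    for v in x:
        acc += v - mean
        profile.append(acc)  # profile = cumulative sum of centered series
    log_s, log_f = [], []
    for s in scales:
        n_win = len(profile) // s
        t = list(range(s))
        tm = sum(t) / s
        denom = sum((ti - tm) ** 2 for ti in t)
        f2 = 0.0
        for w in range(n_win):
            seg = profile[w * s:(w + 1) * s]
            sm = sum(seg) / s
            slope = sum((ti - tm) * (si - sm)
                        for ti, si in zip(t, seg)) / denom
            # Residual variance after removing the window's linear trend.
            f2 += sum((si - sm - slope * (ti - tm)) ** 2
                      for ti, si in zip(t, seg)) / s
        log_s.append(math.log(s))
        log_f.append(math.log(math.sqrt(f2 / n_win)))
    n = len(log_s)
    xm = sum(log_s) / n
    ym = sum(log_f) / n
    return (sum((a - xm) * (b - ym) for a, b in zip(log_s, log_f))
            / sum((a - xm) ** 2 for a in log_s))

# White noise has no long-range memory: the expected exponent is 0.5.
noise = [random.gauss(0, 1) for _ in range(4000)]
alpha = dfa_exponent(noise, [8, 16, 32, 64, 128])
print(round(alpha, 2))
```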
Dynamic Analysis of Turkish Texts (Türkçe Metinlerin Dinamik Analizi)
Conference Paper -- Theoretical and Applied Mechanics Turkish National Committee, 2008. In this work, a new dependent variable that is not based on word-frequency corpora has been proposed. Using this parameterization, time series have been formed from texts in several natural languages, and the existence of long-range correlations and fractal structures has been demonstrated. A scaling parameter characterizing the long-range correlations has been sought using both Detrended Fluctuation Analysis (DFA) and Diffusion Entropy Analysis. Results from the two analyses are in agreement, reveal two distinct regimes, and show that different languages can be characterized and distinguished by this analysis.
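Diffusion entropy analysis, the second method mentioned above, can be sketched under simple assumptions: an uncorrelated +/-1 series stands in for the text-derived series, and the scaling exponent delta is read off from the growth of the Shannon entropy of windowed sums, S(t) = A + delta * ln(t), with delta = 0.5 expected for uncorrelated data:

```python
import math
import random

random.seed(3)

def diffusion_entropy(x, window):
    """Shannon entropy of the windowed sums (diffusion trajectories) of x."""
    prefix = [0]
    for v in x:
        prefix.append(prefix[-1] + v)
    sums = [prefix[i + window] - prefix[i]
            for i in range(len(x) - window + 1)]
    counts = {}
    for s in sums:
        counts[s] = counts.get(s, 0) + 1
    n = len(sums)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# For an uncorrelated +/-1 series, S(t) = A + delta * ln(t) with delta = 0.5;
# long-range correlated series would give a different scaling exponent.
series = [random.choice((-1, 1)) for _ in range(20_000)]
t1, t2 = 16, 256
delta = ((diffusion_entropy(series, t2) - diffusion_entropy(series, t1))
         / (math.log(t2) - math.log(t1)))
print(round(delta, 2))
```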
Quantifying origin and character of long-range correlations in narrative texts
In natural language, using short sentences is considered efficient for
communication. However, a text composed exclusively of short sentences looks
technical and is tedious to read, while a text composed of long sentences
demands significantly more effort to comprehend. Studying characteristics
of the sentence length variability (SLV) in a large corpus of world-famous
literary texts shows that an appealing and aesthetic optimum appears somewhere
in between and involves a self-similar, cascade-like alternation of sentences
of various lengths. A related quantitative observation is that the power spectra
S(f) of thus characterized SLV universally develop a convincing `1/f^beta'
scaling with the average exponent beta =~ 1/2, close to what has been
identified before in musical compositions or in the brain waves. An
overwhelming majority of the studied texts exhibits such fractal attributes,
but hypertext-like, "stream of consciousness" novels are especially spectacular
in this respect. In addition, they appear to develop structures
characteristic of irreducibly interwoven sets of fractals called multifractals.
Scaling of S(f) in the present context implies the existence of long-range
correlations in texts, and the appearance of multifractality indicates that
these correlations even carry a nonlinear component. A distinct role of the
full stops in inducing the long-range correlations is evidenced by the fact
that the above quantitative characteristics manifest themselves in the
variation of the full-stop recurrence times along texts, thus in the SLV, but
to a much lesser degree in the recurrence times of the most frequent words. In
the latter case the nonlinear correlations, and thus multifractality, disappear
completely for all the texts considered. Treated as one extra word, however,
the full stops nevertheless appear to obey the Zipfian rank-frequency
distribution.
Comment: 28 pages, 8 figures, accepted for publication in Information Science
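The `1/f^beta' characterization of sentence-length variability can be sketched with a naive periodogram. An uncorrelated stand-in series of "sentence lengths" (not an actual literary text) is used below, for which the fitted exponent should be near 0, in contrast with the beta close to 1/2 reported above for literary texts:

```python
import cmath
import math
import random

random.seed(4)

def periodogram(x):
    """Naive DFT periodogram S(f_k) = |X_k|^2 / N for k = 1 .. N/2 - 1."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    spec = []
    for k in range(1, n // 2):
        xk = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                 for i, v in enumerate(centered))
        spec.append(abs(xk) ** 2 / n)
    return spec

def spectral_exponent(spec):
    """beta from a least-squares fit of log S(f) ~ -beta * log f."""
    pts = [(math.log(k), math.log(s))
           for k, s in enumerate(spec, start=1) if s > 0]
    xm = sum(a for a, _ in pts) / len(pts)
    ym = sum(b for _, b in pts) / len(pts)
    slope = (sum((a - xm) * (b - ym) for a, b in pts)
             / sum((a - xm) ** 2 for a, _ in pts))
    return -slope

# Uncorrelated stand-in "sentence lengths": the spectrum is flat (beta ~ 0);
# self-similar SLV in literary texts instead scales with beta close to 1/2.
lengths = [random.randint(3, 40) for _ in range(512)]
beta = spectral_exponent(periodogram(lengths))
print(round(beta, 2))
```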
Do Neural Nets Learn Statistical Laws behind Natural Language?
The performance of deep learning in natural language processing has been
spectacular, but the reasons for this success remain unclear because of the
inherent complexity of deep learning. This paper provides empirical evidence of
its effectiveness and of a limitation of neural networks for language
engineering. Precisely, we demonstrate that a neural language model based on
long short-term memory (LSTM) effectively reproduces Zipf's law and Heaps' law,
two representative statistical properties underlying natural language. We
discuss the quality of reproducibility and the emergence of Zipf's law and
Heaps' law as training progresses. We also point out that the neural language
model has a limitation in reproducing long-range correlation, another
statistical property of natural language. This understanding could provide a
direction for improving the architectures of neural networks.
Comment: 21 pages, 11 figures
Languages cool as they expand: Allometric scaling and the decreasing need for new words
We analyze the occurrence frequencies of over 15 million words recorded in millions of books published during the past two centuries in seven different languages. For all languages and chronological subsets of the data we confirm that two scaling regimes characterize the word frequency distributions, with only the more common words obeying the classic Zipf law. Using corpora of unprecedented size, we test the allometric scaling relation between the corpus size and the vocabulary size of growing languages to demonstrate a decreasing marginal need for new words, a feature that is likely related to the underlying correlations between words. We calculate the annual growth fluctuations of word use, which show a decreasing trend as the corpus size increases, indicating a slowdown in linguistic evolution following language expansion. This "cooling pattern" forms the basis of a third statistical regularity which, unlike the Zipf and Heaps laws, is dynamical in nature.
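The allometric scaling test can be sketched with a synthetic Zipfian corpus as a hypothetical stand-in for the book data: fitting V = k * N^b on log-log axes yields an exponent b < 1, the signature of a decreasing marginal need for new words as the corpus grows:

```python
import math
import random

random.seed(6)

# Hypothetical corpus stand-in: Zipfian tokens over a 20000-word lexicon.
W = 20000
tokens = random.choices(range(W), weights=[1 / r for r in range(1, W + 1)],
                        k=200_000)

# Vocabulary size V(N) at geometrically spaced corpus sizes N.
checkpoints = [2_000, 20_000, 200_000]
seen, vocab = set(), []
for i, t in enumerate(tokens, 1):
    seen.add(t)
    if i in checkpoints:
        vocab.append(len(seen))

# Log-log fit of V against N: an allometric exponent b < 1 reflects a
# decreasing marginal need for new words as the corpus grows.
pts = [(math.log(n), math.log(v)) for n, v in zip(checkpoints, vocab)]
xm = sum(a for a, _ in pts) / len(pts)
ym = sum(c for _, c in pts) / len(pts)
b_exp = (sum((a - xm) * (c - ym) for a, c in pts)
         / sum((a - xm) ** 2 for a, _ in pts))
print(round(b_exp, 2))
```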
Constant conditional entropy and related hypotheses
Constant entropy rate (conditional entropies must remain constant as the sequence length increases) and uniform information density (conditional probabilities must remain constant as the sequence length increases) are two information-theoretic principles that are argued to underlie a wide range of linguistic phenomena. Here we revise the predictions of these principles in the light of Hilberg's law on the scaling of conditional entropy in language and related laws. We show that constant entropy rate (CER) and two interpretations of uniform information density (UID), full UID and strong UID, are inconsistent with these laws. Strong UID implies CER, but the reverse is not true. Full UID, a particular case of UID, leads to costly uncorrelated sequences that are totally unrealistic. We conclude that CER and its particular cases are incomplete hypotheses about the scaling of conditional entropies.
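The scaling of conditional entropies can be probed with plug-in n-gram estimates. The sketch below uses an i.i.d. four-symbol ("die-rolling") sequence, for which the conditional entropy stays flat as the context grows, so CER holds trivially; Hilberg's law instead predicts decaying conditional entropies for natural language:

```python
import math
import random
from collections import Counter

random.seed(7)

def conditional_entropy(seq, n):
    """Plug-in estimate of H(X_n | X_1 .. X_{n-1}) from n-gram counts."""
    ngrams = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    contexts = Counter(tuple(seq[i:i + n - 1])
                       for i in range(len(seq) - n + 2))
    total = sum(ngrams.values())
    h = 0.0
    for g, c in ngrams.items():
        h -= (c / total) * math.log2(c / contexts[g[:-1]])
    return h

# i.i.d. four-symbol sequence: H(X_n | past) stays near 2 bits for every
# context length n, the flat profile that CER takes for granted.
seq = [random.randrange(4) for _ in range(50_000)]
for n in (1, 2, 3):
    print(n, round(conditional_entropy(seq, n), 3))
```

Note that plug-in estimates are biased downward once the number of contexts approaches the sample size, so context lengths are kept small here.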