29,020 research outputs found
Rank-frequency relation for Chinese characters
We show that Zipf's law for Chinese characters holds perfectly for
sufficiently short texts (a few thousand different characters). The scenario of
its validity is similar to that of Zipf's law for words in short English texts. For
long Chinese texts (or for mixtures of short Chinese texts), rank-frequency
relations for Chinese characters display a two-layer, hierarchic structure that
combines a Zipfian power-law regime for frequent characters (first layer) with
an exponential-like regime for less frequent characters (second layer). For
these two layers we provide different (though related) theoretical descriptions
that include the range of low-frequency characters (hapax legomena). The
comparative analysis of rank-frequency relations for Chinese characters versus
English words illustrates the extent to which characters play the same role for
Chinese writers as words do for those writing in alphabetical systems.
Comment: To appear in European Physical Journal B (EPJ B), 2014 (22 pages, 7 figures)
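As an illustration of the rank-frequency analysis discussed in this abstract, the following minimal Python sketch (not taken from the paper; the input file name is hypothetical) counts characters and prepares a Zipf plot on log-log axes:

    # Minimal sketch (not the authors' code): rank-frequency relation for the
    # characters of a text, for inspection on a log-log Zipf plot.
    from collections import Counter
    import math

    def rank_frequency(text):
        """Return (rank, frequency) pairs, most frequent character first."""
        counts = Counter(ch for ch in text if not ch.isspace())
        freqs = sorted(counts.values(), reverse=True)
        return list(enumerate(freqs, start=1))

    def log_log(pairs):
        """Log-transform ranks and frequencies; a straight segment indicates a Zipfian regime."""
        return [(math.log10(r), math.log10(f)) for r, f in pairs]

    # Usage (hypothetical file name):
    # text = open("sample_zh.txt", encoding="utf-8").read()
    # points = log_log(rank_frequency(text))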
Do Neural Nets Learn Statistical Laws behind Natural Language?
The performance of deep learning in natural language processing has been
spectacular, but the reasons for this success remain unclear because of the
inherent complexity of deep learning. This paper provides empirical evidence of
its effectiveness and of a limitation of neural networks for language
engineering. Specifically, we demonstrate that a neural language model based on
long short-term memory (LSTM) effectively reproduces Zipf's law and Heaps' law,
two representative statistical properties underlying natural language. We
discuss the quality of reproducibility and the emergence of Zipf's law and
Heaps' law as training progresses. We also point out that the neural language
model has a limitation in reproducing long-range correlation, another
statistical property of natural language. This understanding could provide a
direction for improving the architectures of neural networks.
Comment: 21 pages, 11 figures
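To make the two statistical properties concrete, here is a minimal Python sketch (an assumed setup, not the authors' experimental code) that measures Zipf and Heaps curves on any token sequence, such as text sampled from a trained LSTM language model:

    # Minimal sketch (assumed setup): Zipf and Heaps curves for a token sequence.
    from collections import Counter

    def zipf_curve(tokens):
        """(rank, frequency) pairs; Zipf's law predicts frequency ~ rank**(-1)."""
        freqs = sorted(Counter(tokens).values(), reverse=True)
        return list(enumerate(freqs, start=1))

    def heaps_curve(tokens, step=1000):
        """(N, V(N)) pairs; Heaps' law predicts vocabulary size V(N) ~ N**beta."""
        seen, curve = set(), []
        for n, tok in enumerate(tokens, start=1):
            seen.add(tok)
            if n % step == 0:
                curve.append((n, len(seen)))
        return curve

    # Usage with a hypothetical generated sample:
    # tokens = generated_text.split()
    # zipf, heaps = zipf_curve(tokens), heaps_curve(tokens)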
Scaling Laws in Human Language
Zipf's law on word frequency is observed in English, French, Spanish,
Italian, and so on, yet it does not hold for Chinese, Japanese or Korean
characters. A model of the writing process, which takes into account the effect
of a finite vocabulary size, is proposed to explain this difference.
Experiments, simulations, and the analytical solution agree well with one
another. The results show that the frequency distribution follows a power law
with exponent equal to 1, at which the corresponding Zipf exponent diverges;
accordingly, the distribution takes an exponential form in the Zipf plot.
Deviating from Heaps' law, the number of distinct words grows with text length
in three stages: it grows linearly at first, then logarithmically, and
eventually saturates. This work refines previous understanding of Zipf's law
and Heaps' law in language systems.
Comment: 6 pages, 4 figures
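The connection between the two exponents mentioned here can be made explicit with a standard back-of-the-envelope relation (stated for context; it is not quoted from the paper):

    % If the probability that a word has frequency f scales as P(f) ~ f^{-gamma},
    % the rank of a word with frequency f counts the words at least that frequent:
    \[
      r(f) \;\propto\; \int_f^{\infty} P(f')\,\mathrm{d}f' \;\propto\; f^{\,1-\gamma},
      \qquad\text{hence}\qquad
      f(r) \;\propto\; r^{-\alpha},
      \quad
      \alpha = \frac{1}{\gamma - 1}.
    \]
    % As gamma -> 1 the Zipf exponent alpha diverges, and the rank-frequency
    % curve bends away from a power law toward the exponential-like form
    % described in the abstract.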
Polysemy and brevity versus frequency in language
The pioneering research of G. K. Zipf on the relationship between word
frequency and other word features led to the formulation of various linguistic
laws. The most popular is Zipf's law for word frequencies. Here we focus on two
laws that have been studied less intensively: the meaning-frequency law, i.e.
the tendency of more frequent words to be more polysemous, and the law of
abbreviation, i.e. the tendency of more frequent words to be shorter. In a
previous work, we tested the robustness of these Zipfian laws for English,
measuring word length roughly as the number of characters and distinguishing adult
from child speech. In the present article, we extend our study to other
languages (Dutch and Spanish) and introduce two additional measures of length:
syllabic length and phonemic length. Our correlation analysis indicates that
both the meaning-frequency law and the law of abbreviation hold overall in all
the analyzed languages.
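As a sketch of the kind of correlation analysis described above (assumed per-word data layout, not the authors' pipeline), both laws reduce to rank correlations between word frequency and either length or number of senses:

    # Minimal sketch (assumed per-word lists, aligned by index): test the
    # law of abbreviation and the meaning-frequency law via Spearman correlation.
    from scipy.stats import spearmanr

    def law_of_abbreviation(frequencies, lengths):
        """Predicts a negative correlation between frequency and length."""
        return spearmanr(frequencies, lengths)

    def meaning_frequency_law(frequencies, sense_counts):
        """Predicts a positive correlation between frequency and polysemy."""
        return spearmanr(frequencies, sense_counts)

    # Usage with hypothetical lists (length in characters, syllables, or phonemes):
    # rho_len, p_len = law_of_abbreviation(freqs, syllable_lengths)
    # rho_pol, p_pol = meaning_frequency_law(freqs, sense_counts)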
Long-Range Correlation Underlying Childhood Language and Generative Models
Long-range correlation, a property of time series exhibiting long-term
memory, is mainly studied in the statistical physics domain and has been
reported to exist in natural language. Using a state-of-the-art method for such
analysis, long-range correlation is first shown to occur in long CHILDES data
sets. To understand why, Bayesian generative models of language, originally
proposed in the cognitive scientific domain, are investigated. Among
representative models, the Simon model was found to exhibit surprisingly good
long-range correlation, whereas the Pitman-Yor model did not. Since the Simon
model is known not to reflect the vocabulary growth of natural language
correctly, a simple new model is devised as a combination of the Simon and
Pitman-Yor models, such that long-range correlation holds with a correct
vocabulary growth rate. Overall, the investigation suggests that uniform
sampling is one cause of long-range correlation and may thus be related to
actual linguistic processes.
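For reference, the Simon model discussed above can be written in a few lines; the sketch below is the textbook form of the model (parameter names are illustrative, not the paper's code), highlighting the uniform sampling step suggested as a cause of long-range correlation:

    # Minimal sketch (textbook Simon model): with probability a, introduce a
    # new word type; otherwise copy a token drawn uniformly at random from
    # the sequence generated so far.
    import random

    def simon_model(n_tokens, a=0.05, seed=0):
        rng = random.Random(seed)
        seq = [0]            # first token is the first word type
        next_type = 1
        while len(seq) < n_tokens:
            if rng.random() < a:
                seq.append(next_type)        # innovation: new word type
                next_type += 1
            else:
                seq.append(rng.choice(seq))  # uniform copy of a past token
        return seq

    # The resulting sequence can then be passed to a long-range correlation
    # estimator (e.g. detrended fluctuation analysis) as in the analysis above.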