29,020 research outputs found
Rank-frequency relation for Chinese characters
We show that Zipf's law for Chinese characters holds perfectly for
sufficiently short texts (a few thousand different characters). The scenario of
its validity is similar to that of Zipf's law for words in short English texts. For
long Chinese texts (or for mixtures of short Chinese texts), rank-frequency
relations for Chinese characters display a two-layer, hierarchic structure that
combines a Zipfian power-law regime for frequent characters (first layer) with
an exponential-like regime for less frequent characters (second layer). For
these two layers we provide different (though related) theoretical descriptions
that include the range of low-frequency characters (hapax legomena). The
comparative analysis of rank-frequency relations for Chinese characters versus
English words illustrates the extent to which characters play the same role for
Chinese writers as words do for those writing in alphabetical systems.
Comment: To appear in European Physical Journal B (EPJ B), 2014 (22 pages, 7 figures)
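As an illustration of the rank-frequency analysis discussed in this abstract, the following minimal Python sketch (not taken from the paper; the input file name is hypothetical) counts characters and prepares a Zipf plot on log-log axes:

    # Minimal sketch (not the authors' code): rank-frequency relation for the
    # characters of a text, for inspection on a log-log Zipf plot.
    from collections import Counter
    import math

    def rank_frequency(text):
        """Return (rank, frequency) pairs, most frequent character first."""
        counts = Counter(ch for ch in text if not ch.isspace())
        freqs = sorted(counts.values(), reverse=True)
        return list(enumerate(freqs, start=1))

    def log_log(pairs):
        """Log-transform ranks and frequencies; a straight segment indicates a Zipfian regime."""
        return [(math.log10(r), math.log10(f)) for r, f in pairs]

    # Usage (hypothetical file name):
    # text = open("sample_zh.txt", encoding="utf-8").read()
    # points = log_log(rank_frequency(text))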
Do Neural Nets Learn Statistical Laws behind Natural Language?
The performance of deep learning in natural language processing has been
spectacular, but the reasons for this success remain unclear because of the
inherent complexity of deep learning. This paper provides empirical evidence of
its effectiveness and of a limitation of neural networks for language
engineering. Specifically, we demonstrate that a neural language model based on
long short-term memory (LSTM) effectively reproduces Zipf's law and Heaps' law,
two representative statistical properties underlying natural language. We
discuss the quality of reproducibility and the emergence of Zipf's law and
Heaps' law as training progresses. We also point out that the neural language
model has a limitation in reproducing long-range correlation, another
statistical property of natural language. This understanding could provide a
direction for improving the architectures of neural networks.
Comment: 21 pages, 11 figures
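To make the two statistical properties concrete, here is a minimal Python sketch (an assumed setup, not the authors' experimental code) that measures Zipf and Heaps curves on any token sequence, such as text sampled from a trained LSTM language model:

    # Minimal sketch (assumed setup): Zipf and Heaps curves for a token sequence.
    from collections import Counter

    def zipf_curve(tokens):
        """(rank, frequency) pairs; Zipf's law predicts frequency ~ rank**(-1)."""
        freqs = sorted(Counter(tokens).values(), reverse=True)
        return list(enumerate(freqs, start=1))

    def heaps_curve(tokens, step=1000):
        """(N, V(N)) pairs; Heaps' law predicts vocabulary size V(N) ~ N**beta."""
        seen, curve = set(), []
        for n, tok in enumerate(tokens, start=1):
            seen.add(tok)
            if n % step == 0:
                curve.append((n, len(seen)))
        return curve

    # Usage with a hypothetical generated sample:
    # tokens = generated_text.split()
    # zipf, heaps = zipf_curve(tokens), heaps_curve(tokens)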
Scaling Laws in Human Language
Zipf's law on word frequency is observed in English, French, Spanish,
Italian, and so on, yet it does not hold for Chinese, Japanese or Korean
characters. A model of the writing process, which takes into account the effect
of a finite vocabulary size, is proposed to explain this difference.
Experiments, simulations, and the analytical solution agree well with one
another. The results show that the frequency distribution follows a power law
with exponent equal to 1, at which the corresponding Zipf exponent diverges;
accordingly, the distribution takes an exponential form in the Zipf plot.
Deviating from Heaps' law, the number of distinct words grows with text length
in three stages: it grows linearly at first, then logarithmically, and
eventually saturates. This work refines previous understanding of Zipf's law
and Heaps' law in language systems.
Comment: 6 pages, 4 figures
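The connection between the two exponents mentioned here can be made explicit with a standard back-of-the-envelope relation (stated for context; it is not quoted from the paper):

    % If the probability that a word has frequency f scales as P(f) ~ f^{-gamma},
    % the rank of a word with frequency f counts the words at least that frequent:
    \[
      r(f) \;\propto\; \int_f^{\infty} P(f')\,\mathrm{d}f' \;\propto\; f^{\,1-\gamma},
      \qquad\text{hence}\qquad
      f(r) \;\propto\; r^{-\alpha},
      \quad
      \alpha = \frac{1}{\gamma - 1}.
    \]
    % As gamma -> 1 the Zipf exponent alpha diverges, and the rank-frequency
    % curve bends away from a power law toward the exponential-like form
    % described in the abstract.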
Polysemy and brevity versus frequency in language
The pioneering research of G. K. Zipf on the relationship between word
frequency and other word features led to the formulation of various linguistic
laws. The most popular is Zipf's law for word frequencies. Here we focus on two
laws that have been studied less intensively: the meaning-frequency law, i.e.
the tendency of more frequent words to be more polysemous, and the law of
abbreviation, i.e. the tendency of more frequent words to be shorter. In a
previous work, we tested the robustness of these Zipfian laws for English,
measuring word length roughly as the number of characters and distinguishing adult
from child speech. In the present article, we extend our study to other
languages (Dutch and Spanish) and introduce two additional measures of length:
syllabic length and phonemic length. Our correlation analysis indicates that
both the meaning-frequency law and the law of abbreviation hold overall in all
the analyzed languages.
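As a sketch of the kind of correlation analysis described above (assumed per-word data layout, not the authors' pipeline), both laws reduce to rank correlations between word frequency and either length or number of senses:

    # Minimal sketch (assumed per-word lists, aligned by index): test the
    # law of abbreviation and the meaning-frequency law via Spearman correlation.
    from scipy.stats import spearmanr

    def law_of_abbreviation(frequencies, lengths):
        """Predicts a negative correlation between frequency and length."""
        return spearmanr(frequencies, lengths)

    def meaning_frequency_law(frequencies, sense_counts):
        """Predicts a positive correlation between frequency and polysemy."""
        return spearmanr(frequencies, sense_counts)

    # Usage with hypothetical lists (length in characters, syllables, or phonemes):
    # rho_len, p_len = law_of_abbreviation(freqs, syllable_lengths)
    # rho_pol, p_pol = meaning_frequency_law(freqs, sense_counts)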
Long-Range Correlation Underlying Childhood Language and Generative Models
Long-range correlation, a property of time series exhibiting long-term
memory, is mainly studied in the statistical physics domain and has been
reported to exist in natural language. Using a state-of-the-art method for such
analysis, long-range correlation is first shown to occur in long CHILDES data
sets. To understand why, Bayesian generative models of language, originally
proposed in the cognitive scientific domain, are investigated. Among
representative models, the Simon model was found to exhibit surprisingly good
long-range correlation, whereas the Pitman-Yor model did not. Since the Simon
model is known not to reflect the vocabulary growth of natural language
correctly, a simple new model is devised as a combination of the Simon and
Pitman-Yor models, such that long-range correlation holds with a correct
vocabulary growth rate. Overall, the investigation suggests that uniform
sampling is one cause of long-range correlation and may thus be related to
actual linguistic processes.
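For reference, the Simon model discussed above can be written in a few lines; the sketch below is the textbook form of the model (parameter names are illustrative, not the paper's code), highlighting the uniform sampling step suggested as a cause of long-range correlation:

    # Minimal sketch (textbook Simon model): with probability a, introduce a
    # new word type; otherwise copy a token drawn uniformly at random from
    # the sequence generated so far.
    import random

    def simon_model(n_tokens, a=0.05, seed=0):
        rng = random.Random(seed)
        seq = [0]            # first token is the first word type
        next_type = 1
        while len(seq) < n_tokens:
            if rng.random() < a:
                seq.append(next_type)        # innovation: new word type
                next_type += 1
            else:
                seq.append(rng.choice(seq))  # uniform copy of a past token
        return seq

    # The resulting sequence can then be passed to a long-range correlation
    # estimator (e.g. detrended fluctuation analysis) as in the analysis above.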