A Note on Zipf's Law, Natural Languages, and Noncoding DNA regions
In Phys. Rev. Letters (73:2, 5 Dec. 94), Mantegna et al. conclude on the
basis of Zipf rank frequency data that noncoding DNA sequence regions are more
like natural languages than coding regions. We argue on the contrary that an
empirical fit to Zipf's ``law'' cannot be used as a criterion for similarity to
natural languages. Although DNA is presumably an ``organized system of
signs'' in Mandelbrot's (1961) sense, an observation of statistical features of
the sort presented in the Mantegna et al. paper does not shed light on the
similarity between DNA's ``grammar'' and natural language grammars, just as the
observation of exact Zipf-like behavior cannot distinguish between the
underlying processes of tossing an $M$-sided die or a finite-state branching
process.
Comment: compressed uuencoded postscript file: 14 pages
Optimal coding and the origins of Zipfian laws
The problem of compression in standard information theory consists of
assigning codes as short as possible to numbers. Here we consider the problem
of optimal coding -- under an arbitrary coding scheme -- and show that it
predicts Zipf's law of abbreviation, namely a tendency in natural languages for
more frequent words to be shorter. We apply this result to investigate optimal
coding also under so-called non-singular coding, a scheme where unique
segmentation is not warranted but codes stand for a distinct number. Optimal
non-singular coding predicts that the length of a word should grow
approximately as the logarithm of its frequency rank, which is again consistent
with Zipf's law of abbreviation. Optimal non-singular coding in combination
with the maximum entropy principle also predicts Zipf's rank-frequency
distribution. Furthermore, our findings on optimal non-singular coding
challenge common beliefs about random typing. It turns out that random typing
is in fact an optimal coding process, in stark contrast with the common
assumption that it is detached from cost cutting considerations. Finally, we
discuss the implications of optimal coding for the construction of a compact
theory of Zipfian laws and other linguistic laws.
Comment: in press in the Journal of Quantitative Linguistics; definition of
concordant pair corrected, proofs polished, references updated
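The logarithmic growth of code length under non-singular coding can be illustrated with a minimal sketch. It assumes that "optimal" means giving the shortest unused strings to the most frequent ranks; the function names and the two-letter alphabet are hypothetical choices for illustration and are not taken from the paper.

```python
import itertools


def nonsingular_codes(n_items, alphabet="ab"):
    """Assign distinct strings to ranks 1..n_items, shortest strings first.

    Assumption: rank 1 is the most frequent item, so it receives one of the
    shortest strings. Unique segmentation of concatenated codes is NOT
    guaranteed, which is what distinguishes non-singular coding.
    """
    codes = []
    length = 1
    while len(codes) < n_items:
        for letters in itertools.product(alphabet, repeat=length):
            codes.append("".join(letters))
            if len(codes) == n_items:
                break
        length += 1
    return codes


def code_length(rank, alphabet_size):
    """Length of the code given to a 1-indexed rank, using integer arithmetic."""
    length, capacity = 1, alphabet_size
    while rank > capacity:
        length += 1
        capacity += alphabet_size ** length
    return length


if __name__ == "__main__":
    for rank, code in enumerate(nonsingular_codes(7), start=1):
        print(rank, code, code_length(rank, 2))
```

Because the number of strings of length at most l grows geometrically in l, the length assigned to a rank grows roughly as the logarithm of that rank, in line with the abstract's statement.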
Fast Entropy Estimation for Natural Sequences
It is well known that accurately estimating the Shannon entropy of symbolic
sequences requires a large number of samples. When some aspects of the data
are known, it is plausible to use this knowledge to compute entropy more
efficiently. A number of methods with various assumptions have been proposed
for calculating entropy from small sample sizes. In this paper,
we examine this problem and propose a method for estimating the Shannon entropy
for a set of ranked symbolic natural events. Using a modified
Zipf-Mandelbrot-Li law and a new rank-based coincidence counting method, we
propose an efficient algorithm which enables the entropy to be estimated with
surprising accuracy using only a small number of samples. The algorithm is
tested on some natural sequences and shown to yield accurate results with very
small amounts of data.
Regimes in Babel are Confirmed: Report on Findings in Several Indonesian Ethnic Biblical Texts
The paper reports the presence of three statistical regimes in the Zipfian analysis of texts in quantitative linguistics: the Mandelbrot, original Zipf, and Cancho-Solé-Montemurro regimes. The work covers nine languages with the same semantic content: the Bible in several Indonesian ethnic languages and in the national language. As usual, the same analysis is also applied to an English version of the Bible for reference. The existence of the three regimes is confirmed, and the length of the texts also becomes an important issue. We outline further work on the quantitative parameterization used to analyze the three regimes and on a broader explanation of them, especially regarding the microstructure of language in human decision or linguistic effort, from which their robustness emerges.
Two Universality Properties Associated with the Monkey Model of Zipf's Law
The distribution of word probabilities in the monkey model of Zipf's law is
associated with two universality properties: (1) the power law exponent
converges strongly to $-1$ as the alphabet size increases and the letter
probabilities are specified as the spacings from a random division of the unit
interval for any distribution with a bounded density function on $[0,1]$; and
(2), on a logarithmic scale the version of the model with a finite word length
cutoff and unequal letter probabilities is approximately normally distributed
in the part of the distribution away from the tails. The first property is
proved using a remarkably general limit theorem for the logarithm of sample
spacings from Shao and Hahn, and the second property follows from Anscombe's
central limit theorem for a random number of i.i.d. random variables. The
finite word length model leads to a hybrid Zipf-lognormal mixture distribution
closely related to work in other areas.
Comment: 14 pages, 3 figures
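The monkey (random-typing) model with unequal letter probabilities can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the space probability `space_prob`, the word-length cutoff `max_len`, and the least-squares fit on the log-log scale are all choices made here, and the cutoff distorts the tail of the distribution.

```python
import itertools
import math
import random


def spacing_probabilities(k, rng):
    """Letter probabilities as the spacings of k-1 uniform cut points on [0, 1]."""
    cuts = sorted(rng.random() for _ in range(k - 1))
    edges = [0.0] + cuts + [1.0]
    return [b - a for a, b in zip(edges, edges[1:])]


def word_probabilities(letter_probs, space_prob, max_len):
    """Exact probabilities of every word of length 1..max_len, sorted descending.

    Assumption: each keystroke is a space with probability space_prob and
    otherwise a letter drawn according to letter_probs.
    """
    probs = []
    for length in range(1, max_len + 1):
        for combo in itertools.product(letter_probs, repeat=length):
            word_prob = space_prob * math.prod((1 - space_prob) * p for p in combo)
            probs.append(word_prob)
    return sorted(probs, reverse=True)


def loglog_slope(probs):
    """Least-squares slope of log(probability) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(probs) + 1)]
    ys = [math.log(p) for p in probs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


if __name__ == "__main__":
    rng = random.Random(1)
    letters = spacing_probabilities(5, rng)
    words = word_probabilities(letters, space_prob=0.2, max_len=6)
    print("fitted log-log slope:", loglog_slope(words))
```

Repeating the run for larger alphabets gives a rough empirical view of how the fitted slope behaves, though a crude fit of this kind is no substitute for the limit theorem the abstract refers to.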
When do finite sample effects significantly affect entropy estimates ?
An expression is proposed for determining the error caused on entropy
estimates by finite sample effects. This expression is based on the Ansatz that
the ranked distribution of probabilities tends to follow an empirical Zipf law.
Comment: 10 pages, 2 figures
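The size of finite sample effects can be seen in a small numerical experiment. The sketch below does not implement the expression proposed in the paper; it only compares the naive plug-in entropy estimate with the true entropy of a Zipf distribution. The alphabet size, exponent, sample size, and function names are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def zipf_distribution(m, exponent=1.0):
    """Probabilities p_r proportional to r**(-exponent) for ranks 1..m."""
    weights = [r ** -exponent for r in range(1, m + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def entropy(probs):
    """Shannon entropy in nats of a known probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def plugin_entropy(samples):
    """Naive estimate: entropy of the empirical frequencies of the samples."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())


def compare(m=1000, n_samples=50, seed=0):
    """Return (true entropy, plug-in estimate) for a Zipf source."""
    rng = random.Random(seed)
    probs = zipf_distribution(m)
    samples = rng.choices(range(m), weights=probs, k=n_samples)
    return entropy(probs), plugin_entropy(samples)


if __name__ == "__main__":
    true_h, est_h = compare()
    print(f"true entropy: {true_h:.3f} nats, plug-in estimate: {est_h:.3f} nats")
```

The plug-in estimate from n samples can never exceed log(n), so whenever the true entropy is larger than that bound the estimate is necessarily too low, which is the kind of finite sample effect the abstract addresses.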
Are citations of scientific papers a case of nonextensivity ?
The distribution of citations of scientific papers has recently been
illustrated (on ISI and PRE data sets) and analyzed by Redner [Eur. Phys. J. B
4, 131 (1998)]. To fit the data, a stretched exponential has been used with only partial success. The success
is not complete because the data exhibit, for large citation counts, a power
law (for the ISI data), which, clearly, the
stretched exponential does not reproduce. This fact is then attributed to a
possibly different nature of rarely cited and largely cited papers. We show
here that, within a nonextensive thermostatistical formalism, the same data can
be quite satisfactorily fitted with a single curve over the available range of values. This is
consistent with the connection recently established by Denisov [Phys. Lett. A
235, 447 (1997)] between this nonextensive formalism and the
Zipf-Mandelbrot law. What the present analysis ultimately suggests is that, in
contrast to Redner's conclusion, the phenomenon might essentially be one and
the same along the entire range of the citation number.
Comment: Revtex, 1 figure postscript