551 research outputs found

    A Note on Zipf's Law, Natural Languages, and Noncoding DNA regions

    In Phys. Rev. Letters (73:2, 5 Dec. 94), Mantegna et al. conclude on the basis of Zipf rank frequency data that noncoding DNA sequence regions are more like natural languages than coding regions. We argue on the contrary that an empirical fit to Zipf's ``law'' cannot be used as a criterion for similarity to natural languages. Although DNA is presumably an ``organized system of signs'' in Mandelbrot's (1961) sense, an observation of statistical features of the sort presented in the Mantegna et al. paper does not shed light on the similarity between DNA's ``grammar'' and natural language grammars, just as the observation of exact Zipf-like behavior cannot distinguish between the underlying processes of tossing an $M$-sided die or a finite-state branching process. Comment: compressed uuencoded postscript file: 14 pages

    Optimal coding and the origins of Zipfian laws

    The problem of compression in standard information theory consists of assigning codes as short as possible to numbers. Here we consider the problem of optimal coding -- under an arbitrary coding scheme -- and show that it predicts Zipf's law of abbreviation, namely a tendency in natural languages for more frequent words to be shorter. We apply this result to investigate optimal coding also under so-called non-singular coding, a scheme where unique segmentation is not warranted but codes stand for a distinct number. Optimal non-singular coding predicts that the length of a word should grow approximately as the logarithm of its frequency rank, which is again consistent with Zipf's law of abbreviation. Optimal non-singular coding in combination with the maximum entropy principle also predicts Zipf's rank-frequency distribution. Furthermore, our findings on optimal non-singular coding challenge common beliefs about random typing. It turns out that random typing is in fact an optimal coding process, in stark contrast with the common assumption that it is detached from cost-cutting considerations. Finally, we discuss the implications of optimal coding for the construction of a compact theory of Zipfian laws and other linguistic laws. Comment: in press in the Journal of Quantitative Linguistics; definition of concordant pair corrected, proofs polished, references updated
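
The logarithmic growth of code length with rank can be illustrated with a minimal sketch. It assumes a two-letter alphabet and assumes that "optimal non-singular coding" simply means giving the most frequent item the shortest unused string; the function names are hypothetical and are not taken from the paper.

```python
from itertools import count, islice, product

def shortest_codes(alphabet):
    """Yield every non-empty string over `alphabet`, shortest first."""
    for length in count(1):
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)

def code_lengths(n_items, alphabet="ab"):
    """Length of the code given to rank 1, 2, ..., n_items (assumed scheme)."""
    return [len(code) for code in islice(shortest_codes(alphabet), n_items)]

def binary_length(rank):
    """Exact length for a two-letter alphabet: floor(log2(rank + 1))."""
    return (rank + 1).bit_length() - 1

lengths = code_lengths(7)
print(lengths)
print([binary_length(rank) for rank in range(1, 8)])
```

For two letters there are 2 strings of length 1, 4 of length 2, and so on, so rank `i` receives length `floor(log2(i + 1))`, which is the logarithmic dependence on rank that the abstract describes.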

    Fast Entropy Estimation for Natural Sequences

    It is well known that estimating the Shannon entropy of symbolic sequences accurately requires a large number of samples. When some aspects of the data are known, it is plausible to use this knowledge to compute entropy more efficiently. A number of methods, each with its own assumptions, have been proposed for calculating entropy from small samples. In this paper, we examine this problem and propose a method for estimating the Shannon entropy for a set of ranked symbolic natural events. Using a modified Zipf-Mandelbrot-Li law and a new rank-based coincidence counting method, we propose an efficient algorithm which enables the entropy to be estimated with surprising accuracy using only a small number of samples. The algorithm is tested on some natural sequences and shown to yield accurate results with very small amounts of data.

    Regimes in Babel are Confirmed: Report on Findings in Several Indonesian Ethnic Biblical Texts

    The paper reports the presence of three statistical regimes in the Zipfian analysis of texts in quantitative linguistics: the Mandelbrot, original Zipf, and Cancho-Solé-Montemurro regimes. The work is carried out over nine languages carrying the same semantic content: the Bible in Indonesian ethnic languages and the national language. The same analysis is also applied to the English version of the Bible for reference. The existence of the three regimes is confirmed, and the length of the texts also turns out to be an important issue. We outline further work on the quantitative analysis and parameterization of the three regimes, and on the task of finding a broad explanation for them, especially in terms of the microstructure of language in human decisions or linguistic effort, from which their robustness emerges.

    Two Universality Properties Associated with the Monkey Model of Zipf's Law

    The distribution of word probabilities in the monkey model of Zipf's law is associated with two universality properties: (1) the power law exponent converges strongly to $-1$ as the alphabet size increases and the letter probabilities are specified as the spacings from a random division of the unit interval for any distribution with a bounded density function on $[0,1]$; and (2), on a logarithmic scale the version of the model with a finite word length cutoff and unequal letter probabilities is approximately normally distributed in the part of the distribution away from the tails. The first property is proved using a remarkably general limit theorem for the logarithm of sample spacings from Shao and Hahn, and the second property follows from Anscombe's central limit theorem for a random number of i.i.d. random variables. The finite word length model leads to a hybrid Zipf-lognormal mixture distribution closely related to work in other areas. Comment: 14 pages, 3 figures
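
A small sketch of the monkey model with unequal letter probabilities may help make the setup concrete. It assumes five letters plus a space key, takes the six key probabilities as the spacings of a uniform random division of the unit interval, and truncates words at five letters; the alphabet size, cutoff, seed, and function names are illustrative assumptions rather than the paper's choices.

```python
import math
import random
from itertools import product

def random_spacings(n_parts, rng):
    """Cut [0, 1] at n_parts - 1 uniform points and return the gap widths."""
    cuts = sorted(rng.random() for _ in range(n_parts - 1))
    edges = [0.0] + cuts + [1.0]
    return [b - a for a, b in zip(edges, edges[1:])]

def word_probabilities(letter_probs, space_prob, max_len):
    """Probabilities of all words with 1..max_len letters, largest first."""
    probs = []
    for length in range(1, max_len + 1):
        for combo in product(letter_probs, repeat=length):
            # A word is its letters followed by one press of the space key.
            probs.append(math.prod(combo) * space_prob)
    return sorted(probs, reverse=True)

def loglog_slope(probs):
    """Least-squares slope of log(probability) against log(rank)."""
    xs = [math.log(rank) for rank in range(1, len(probs) + 1)]
    ys = [math.log(p) for p in probs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)                 # fixed seed, for repeatability only
parts = random_spacings(6, rng)        # 5 letters + 1 space key (assumed)
letters, space = parts[:-1], parts[-1]
probs = word_probabilities(letters, space, max_len=5)
print(len(probs), loglog_slope(probs))
```

The fitted slope from such a short cutoff is only indicative: the abstract's first property concerns the limit of growing alphabet size, and its second property says the finite-cutoff model departs from a pure power law in the tails.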

    When do finite sample effects significantly affect entropy estimates ?

    An expression is proposed for determining the error caused on entropy estimates by finite sample effects. This expression is based on the Ansatz that the ranked distribution of probabilities tends to follow an empirical Zipf law. Comment: 10 pages, 2 figures
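
The finite-sample effect itself is easy to reproduce. The sketch below does not implement the expression proposed in the paper; it assumes a Zipf distribution with exponent 1 over 1000 symbols and uses the naive plug-in estimator on 50 samples, all of which are illustrative choices with hypothetical function names.

```python
import math
import random
from collections import Counter

def zipf_probs(n_symbols, exponent=1.0):
    """Normalised Zipf probabilities p(rank) proportional to rank ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, n_symbols + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def plugin_entropy(samples):
    """Naive estimate: entropy of the observed relative frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return entropy([c / n for c in counts.values()])

rng = random.Random(0)                       # fixed seed, for repeatability only
probs = zipf_probs(1000)
true_h = entropy(probs)
sample = rng.choices(range(1000), weights=probs, k=50)
estimate = plugin_entropy(sample)
print(true_h, estimate)
```

With only 50 samples the plug-in estimate cannot exceed `log(50)`, which is below the true entropy of this distribution, so the estimate is necessarily biased downward here.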

    Are citations of scientific papers a case of nonextensivity ?

    The distribution $N(x)$ of citations of scientific papers has recently been illustrated (on ISI and PRE data sets) and analyzed by Redner [Eur. Phys. J. B {\bf 4}, 131 (1998)]. To fit the data, a stretched exponential ($N(x) \propto \exp[-(x/x_0)^{\beta}]$) has been used with only partial success. The success is not complete because the data exhibit, for large citation count $x$, a power law (roughly $N(x) \propto x^{-3}$ for the ISI data), which, clearly, the stretched exponential does not reproduce. This fact is then attributed to a possibly different nature of rarely cited and largely cited papers. We show here that, within a nonextensive thermostatistical formalism, the same data can be quite satisfactorily fitted with a single curve (namely, $N(x) \propto 1/[1+(q-1) \lambda x]^{q/(q-1)}$) for the available values of $x$. This is consistent with the connection recently established by Denisov [Phys. Lett. A {\bf 235}, 447 (1997)] between this nonextensive formalism and the Zipf-Mandelbrot law. What the present analysis ultimately suggests is that, in contrast to Redner's conclusion, the phenomenon might essentially be one and the same along the entire range of the citation number $x$. Comment: Revtex, 1 Figure postscript; [email protected]
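
How a single curve of the form $N(x) \propto 1/[1+(q-1)\lambda x]^{q/(q-1)}$ can cover both regimes is sketched below. The values `q = 1.5` and `lam = 0.1` are illustrative assumptions, chosen only so that the tail exponent $q/(q-1)$ equals 3; they are not the fitted values from the paper, and the function names are hypothetical.

```python
import math

def q_curve(x, q, lam):
    """Unnormalised N(x) = [1 + (q - 1) * lam * x] ** (-q / (q - 1)), q > 1."""
    return (1.0 + (q - 1.0) * lam * x) ** (-q / (q - 1.0))

def local_slope(f, x, factor=1.01):
    """Numerical d(log f) / d(log x) at x."""
    return (math.log(f(x * factor)) - math.log(f(x))) / math.log(factor)

q, lam = 1.5, 0.1                       # illustrative values only

def curve(x):
    return q_curve(x, q, lam)

tail = local_slope(curve, 1e6)          # large x: power law, slope near -q/(q-1)
head = local_slope(curve, 1e-6)         # small x: nearly flat
near_exponential = q_curve(2.0, 1.0 + 1e-6, lam)   # q close to 1
print(tail, head, near_exponential, math.exp(-lam * 2.0))
```

For large $x$ the constant 1 in the bracket becomes negligible, leaving a power law with exponent $q/(q-1)$, while for $q \to 1$ the same expression tends to the ordinary exponential $\exp(-\lambda x)$.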