
    Guessing probability distributions from small samples

    We propose a new method for calculating statistical properties, such as the entropy, of unknown generators of symbolic sequences. The probability distribution p(k) of the elements k of a population can be approximated by the frequencies f(k) of a sample, provided the sample is long enough that each element k occurs many times. Our method yields an approximation even when this precondition does not hold. For a given f(k) we recalculate the Zipf-ordered probability distribution by optimizing the parameters of a guessed distribution. We demonstrate that our method yields reliable results.
    Comment: 10 pages, uuencoded compressed PostScript
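    The following minimal sketch (not the paper's actual algorithm) illustrates the general idea: fit a one-parameter Zipf-like guess to the rank-ordered sample frequencies and take the entropy of the fitted distribution as the estimate. The function names and the least-squares objective are illustrative assumptions.

        # Minimal sketch, not the paper's method: fit a Zipf-like guess
        # q_k ~ k^(-a) to rank-ordered sample frequencies, then report the
        # entropy of the fitted distribution as the estimate.
        import numpy as np
        from scipy.optimize import minimize_scalar

        def zipf(a, n):
            q = np.arange(1, n + 1, dtype=float) ** (-a)
            return q / q.sum()

        def fitted_entropy(counts, n_species):
            f = np.sort(np.asarray(counts, float))[::-1] / np.sum(counts)  # rank-ordered frequencies
            def loss(a):
                return np.sum((f - zipf(a, n_species)[:len(f)]) ** 2)      # least-squares mismatch
            a = minimize_scalar(loss, bounds=(0.01, 5.0), method="bounded").x
            q = zipf(a, n_species)
            return -np.sum(q * np.log2(q))                                 # entropy in bits

        # Example: a short sample drawn from a 50-species population
        print(fitted_entropy([9, 5, 3, 2, 1, 1, 1], n_species=50))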

    Universal lossless source coding with the Burrows Wheeler transform

    The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
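    As a brief illustration of the transform that these codes build on (the entropy-coding stage that follows it is omitted), here is a straightforward forward BWT in Python; the sentinel character and the example string are arbitrary choices, not taken from the paper.

        # Minimal sketch of the forward Burrows Wheeler transform: sort all
        # cyclic rotations of the input and output the last column.
        def bwt(s, sentinel="\0"):
            s = s + sentinel                               # unique end-of-string marker
            rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
            return "".join(r[-1] for r in rotations)       # last column of sorted rotations

        print(bwt("banana"))   # 'annb\x00aa': equal symbols cluster, which helps the later coding stage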

    Distributions of Triplets in Genetic Sequences

    Distributions of triplets in some genetic sequences are examined and found to be well described by a 2-parameter Markov process with a sparse transition matrix. The variances of all the relevant parameters are not large, indicating that most sequences gather in a small region of the parameter space. Different sequences have very similar values of the entropy, calculated directly from the data, and of the two parameters characterizing the Markov process fitted to the sequence. No clear relation to taxonomy or to the coding/noncoding distinction is observed.
    Comment: RevTeX, 17 pages, 8 figures, submitted to Physica
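    The entropy computed directly from the data, mentioned above, can be obtained from the empirical triplet distribution. The sketch below shows that step only (the 2-parameter Markov fit is not reproduced), and the example sequence is made up.

        # Minimal sketch: count non-overlapping triplets in a nucleotide
        # sequence and compute the entropy of their empirical distribution.
        from collections import Counter
        import math

        def triplet_entropy(seq):
            triplets = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]  # non-overlapping codons
            counts = Counter(triplets)
            total = sum(counts.values())
            return -sum((c / total) * math.log2(c / total) for c in counts.values())

        print(triplet_entropy("ATGGCGTACGATCGATTGACGTAGCTAGCTAA"))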

    Entropy and long-range correlations in literary English

    Recently, long-range correlations were detected in nucleotide sequences and in human writings by several authors. We undertake here a systematic investigation of two books, Moby Dick by H. Melville and Grimm's tales, with respect to the existence of long-range correlations. The analysis is based on the calculation of entropy-like quantities such as the mutual information for pairs of letters and the entropy per letter, i.e. the mean uncertainty per letter. We further estimate the number of different subwords of a given length n. Filtering out the contributions due to the finite length of the texts, we find correlations ranging over a few hundred letters. Scaling laws were found for the mutual information (decay with a power law), for the entropy per letter (decay with the inverse square root of n), and for the subword numbers (stretched exponential growth with n and a power law of the text length).
    Comment: 8 pages
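    A minimal sketch of one of the quantities mentioned, the mutual information between letters separated by a gap d, is given below. The estimator is the plain plug-in one and the test text is a placeholder; neither is taken from the paper.

        # Minimal sketch: plug-in mutual information between letters at distance d.
        from collections import Counter
        import math

        def mutual_information(text, d):
            pairs = Counter(zip(text, text[d:]))   # joint counts of (letter, letter d positions later)
            left = Counter(text[:-d])
            right = Counter(text[d:])
            n = sum(pairs.values())
            mi = 0.0
            for (a, b), c in pairs.items():
                p_ab = c / n
                mi += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
            return mi

        text = "the quick brown fox jumps over the lazy dog " * 50
        print(mutual_information(text, d=1), mutual_information(text, d=100))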

    Approximate entropy of network parameters

    We study the notion of approximate entropy within the framework of network theory. Approximate entropy is an uncertainty measure originally proposed in the context of dynamical systems and time series. We first define a purely structural entropy obtained by computing the approximate entropy of the so-called slide sequence. This is a surrogate of the degree sequence and is suggested by the frequency partition of a graph. We examine this quantity for standard scale-free and Erdős-Rényi networks. Using classical results of Pincus, we show that our entropy measure converges with network size to a certain binary Shannon entropy. In a second step, with specific attention to networks generated by dynamical processes, we investigate the approximate entropy of horizontal visibility graphs. Visibility graphs make it possible to associate a notion of temporal correlations with a network, thereby giving the measure a dynamical character. We show that approximate entropy distinguishes visibility graphs generated by processes of different complexity. This result further establishes these networks as probes for the study of dynamical systems. Applications to certain biological data arising in cancer genomics are finally considered in the light of both approaches.
    Comment: 11 pages, 5 EPS figures
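    For reference, here is a minimal sketch of approximate entropy (ApEn, in the sense of Pincus) applied to an integer sequence such as a degree or slide sequence. The embedding dimension m, tolerance r, and the random test sequence are illustrative choices, not values from the paper.

        # Minimal sketch of approximate entropy (ApEn) for a finite sequence.
        import numpy as np

        def approx_entropy(u, m=2, r=0.5):
            u = np.asarray(u, dtype=float)
            def phi(m):
                x = np.array([u[i:i + m] for i in range(len(u) - m + 1)])       # embedded templates
                dist = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)    # Chebyshev distances
                c = np.mean(dist <= r, axis=1)                                  # fraction of near templates
                return np.mean(np.log(c))
            return phi(m) - phi(m + 1)

        degree_sequence = np.random.default_rng(0).integers(1, 10, size=200)
        print(approx_entropy(degree_sequence, m=2, r=0.5))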

    Surveying structural complexity in quantum many-body systems

    Quantum many-body systems exhibit a rich and diverse range of exotic behaviours, owing to their underlying non-classical structure. These systems present a deep structure beyond what can be captured by measures of correlation and entanglement alone. Using tools from complexity science, we characterise such structure. We investigate the structural complexities that can be found within the patterns that manifest from the observational data of these systems. In particular, using two prototypical quantum many-body systems as test cases, the one-dimensional quantum Ising and Bose-Hubbard models, we explore how different information-theoretic measures of complexity are able to identify different features of such patterns. This work furthers the understanding of fully quantum notions of structure and complexity in quantum systems and dynamics.
    Comment: 9 pages, 5 figures

    Correction algorithm for finite sample statistics

    Assume that in a sample of size M one finds M_i representatives of species i, with i = 1, ..., N^*. The normalized frequencies p^*_i = M_i/M, based on the finite sample, may deviate considerably from the true probabilities p_i. We propose a method to infer rank-ordered true probabilities r_i from the measured frequencies M_i. We show that the rank-ordered probabilities provide important information about the system, e.g. the true number of species and the Shannon and Rényi entropies.
    Comment: 11 pages, 9 figures
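    Once a rank-ordered probability vector is available, the quoted quantities follow directly. The sketch below computes Shannon and Rényi entropies from naive rank-ordered frequencies as a baseline; the paper's correction step that infers the r_i is not reproduced, and the sample counts are invented.

        # Minimal sketch: Shannon and Renyi entropies of a rank-ordered
        # probability vector (here the naive frequencies p*_i = M_i / M).
        import numpy as np

        def shannon_entropy(p):
            p = np.asarray(p, float)
            p = p[p > 0]
            return -np.sum(p * np.log2(p))

        def renyi_entropy(p, q):
            p = np.asarray(p, float)
            if q == 1.0:
                return shannon_entropy(p)                 # Renyi entropy reduces to Shannon at q = 1
            return np.log2(np.sum(p ** q)) / (1.0 - q)

        counts = np.array([40, 25, 15, 10, 5, 3, 1, 1])   # invented M_i from a sample of size M = 100
        p_star = np.sort(counts / counts.sum())[::-1]     # naive rank-ordered frequencies
        print(shannon_entropy(p_star), renyi_entropy(p_star, q=2))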