Guessing probability distributions from small samples
We propose a new method for calculating statistical properties, such as
the entropy, of unknown generators of symbolic sequences. The probability
distribution of the elements of a population can be approximated by
the frequencies of a sample provided the sample is long enough so that
each element occurs many times. Our method yields an approximation if this
precondition does not hold. For a given sample we recalculate the Zipf-ordered
probability distribution by optimization of the parameters of a guessed
distribution. We demonstrate that our method yields reliable results.
Comment: 10 pages, uuencoded compressed PostScrip
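The finite-sample bias this abstract addresses shows up already in the naive plug-in estimator. As a minimal illustration of the problem (not the paper's parametric optimization method), here is the plug-in Shannon entropy together with the standard first-order Miller-Madow bias correction:

```python
import math
from collections import Counter

def plugin_entropy(sample):
    """Naive plug-in Shannon entropy (bits) from observed frequencies."""
    M = len(sample)
    return -sum((c / M) * math.log2(c / M) for c in Counter(sample).values())

def miller_madow_entropy(sample):
    """Plug-in entropy plus the first-order (K-1)/(2M) bias correction (in bits)."""
    K = len(set(sample))  # number of distinct species observed in the sample
    return plugin_entropy(sample) + (K - 1) / (2 * len(sample) * math.log(2))
```

Since the plug-in estimate systematically underestimates the true entropy for short samples, the corrected value is always at least as large as the naive one.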
Universal lossless source coding with the Burrows Wheeler transform
The Burrows Wheeler transform (1994) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: (1) statistical characterizations of the BWT output on both finite strings and sequences of length n → ∞, (2) a variety of very simple new techniques for BWT-based lossless source coding, and (3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
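The transform itself is easy to state. A minimal quadratic-time Python sketch of the BWT and its inverse, using a sentinel character to make the transform invertible (practical implementations use suffix arrays instead of explicit rotation sorting):

```python
def bwt(s: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (O(n^2 log n) sketch)."""
    s += sentinel  # unique end marker, assumed absent from s
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def ibwt(t: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending t and re-sorting."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    row = next(r for r in table if r.endswith(sentinel))
    return row.rstrip(sentinel)
```

The output tends to group identical symbols into runs (e.g. `bwt("banana")` yields `"annb$aa"`), which is the "natural ordering" that the follow-up source code exploits.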
Distributions of Triplets in Genetic Sequences
Distributions of triplets in some genetic sequences are examined and found to
be well described by a 2-parameter Markov process with a sparse transition
matrix. The variances of all the relevant parameters are not large, indicating
that most sequences gather in a small region in the parameter space. Different
sequences have very near values of the entropy calculated directly from the
data and the two parameters characterizing the Markov process fitting the
sequence. No clear relation to taxonomy or to coding/noncoding character is observed.
Comment: revtex, 17 pages, 8 figures, submitted to Physica
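As a minimal illustration of the kind of statistic examined here (not the paper's 2-parameter Markov fit), triplet frequencies read in a fixed frame can be tabulated as:

```python
from collections import Counter

def triplet_distribution(seq, step=3):
    """Relative frequencies of triplets read in a fixed frame of width `step`."""
    triplets = [seq[i:i + 3] for i in range(0, len(seq) - 2, step)]
    counts = Counter(triplets)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}
```

Using `step=1` instead of the codon-frame `step=3` gives the overlapping triplet statistics sometimes used for non-coding regions.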
Entropy and Long range correlations in literary English
Recently long range correlations were detected in nucleotide sequences and in
human writings by several authors. We undertake here a systematic investigation
of two books, Moby Dick by H. Melville and Grimm's tales, with respect to the
existence of long range correlations. The analysis is based on the calculation
of entropy-like quantities such as the mutual information for pairs of letters and
the entropy (the mean uncertainty) per letter. We further estimate the number
of different subwords of a given length n. Filtering out the contributions
due to the finite length of the texts, we find correlations
ranging over a few hundred letters. Scaling laws for the mutual information
(decay with a power law), for the entropy per letter (decay with the inverse
square root of n) and for the word numbers (stretched exponential growth with
n and with a power law of the text length) were found.
Comment: 8 page
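The mutual information between letters separated by a given distance, one of the quantities estimated above, can be computed from plug-in pair frequencies. A minimal sketch, without the finite-length corrections the authors apply:

```python
import math
from collections import Counter

def mutual_information(text, d):
    """Mutual information (bits) between letters at positions i and i+d."""
    pairs = [(text[i], text[i + d]) for i in range(len(text) - d)]
    n = len(pairs)
    pxy = Counter(pairs)                  # joint frequencies
    px = Counter(a for a, _ in pairs)     # marginal of the first letter
    py = Counter(b for _, b in pairs)     # marginal of the second letter
    return sum((c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())
```

For a perfectly alternating text the second letter is determined by the first, so the mutual information equals the entropy of the marginal distribution.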
Approximate entropy of network parameters
We study the notion of approximate entropy within the framework of network
theory. Approximate entropy is an uncertainty measure originally proposed in
the context of dynamical systems and time series. We first define a purely
structural entropy obtained by computing the approximate entropy of the
so-called slide sequence, a surrogate of the degree sequence suggested by
the frequency partition of a graph. We examine this quantity for
standard scale-free and Erd\H{o}s-R\'enyi networks. By using classical results
of Pincus, we show that our entropy measure converges with network size to a
certain binary Shannon entropy. In a second step, with specific attention to
networks generated by dynamical processes, we investigate approximate entropy
of horizontal visibility graphs. Visibility graphs make it possible to associate
the notion of temporal correlations with a network, thereby giving the measure
a dynamical interpretation. We show that approximate entropy distinguishes
visibility graphs generated by processes of different complexity, further
qualifying these networks as tools for the study of dynamical systems.
Applications to certain biological data arising in cancer genomics are finally
considered in the light of both approaches.
Comment: 11 pages, 5 EPS figure
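For reference, Pincus's approximate entropy ApEn(m, r) of a scalar sequence can be sketched as follows, in a direct O(n^2) implementation; the paper applies the same measure to graph-derived sequences such as the slide sequence:

```python
import math

def approx_entropy(u, m=2, r=0.2):
    """Approximate entropy ApEn(m, r) of a sequence u (Pincus's definition)."""
    def phi(m):
        n = len(u) - m + 1
        templates = [u[i:i + m] for i in range(n)]
        # for each template, count templates within Chebyshev distance r
        counts = [
            sum(1 for y in templates
                if max(abs(a - b) for a, b in zip(x, y)) <= r)
            for x in templates
        ]
        return sum(math.log(c / n) for c in counts) / n
    return phi(m) - phi(m + 1)
```

A constant sequence has ApEn exactly zero, and strictly periodic sequences stay close to zero, while irregular sequences score higher.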
Surveying structural complexity in quantum many-body systems
Quantum many-body systems exhibit a rich and diverse range of exotic
behaviours, owing to their underlying non-classical structure. These systems
present a deep structure beyond those that can be captured by measures of
correlation and entanglement alone. Using tools from complexity science, we
characterise such structure. We investigate the structural complexities that
can be found within the patterns that manifest from the observational data of
these systems. In particular, using two prototypical quantum many-body systems
as test cases - the one-dimensional quantum Ising and Bose-Hubbard models - we
explore how different information-theoretic measures of complexity are able to
identify different features of such patterns. This work furthers the
understanding of fully-quantum notions of structure and complexity in quantum
systems and dynamics.
Comment: 9 pages, 5 figure
Correction algorithm for finite sample statistics
Assume in a sample of size M one finds M_i representatives of species i with
i=1...N^*. The normalized frequency p^*_i=M_i/M, based on the finite sample,
may deviate considerably from the true probabilities p_i. We propose a method
to infer rank-ordered true probabilities r_i from the measured frequencies M_i. We
show that the rank-ordered probabilities provide important information on the
system, e.g., the true number of species and the Shannon and Rényi entropies.
Comment: 11 pages, 9 figure