
    Descriptive Complexity Approaches to Inductive Inference

    We present a critical review of descriptive complexity approaches to inductive inference. Inductive inference is defined as any process by which a model of the world is formed from observations. The descriptive complexity approach is a formalization of Occam's razor: choose the simplest model consistent with the data. Descriptive complexity as defined by Kolmogorov, Chaitin, and Solomonoff is presented as a generalization of Shannon's entropy. We discuss its relationship with randomness and present examples. However, a major result of the theory is negative: descriptive complexity is uncomputable. Rissanen's minimum description length (MDL) principle is presented as a restricted form of descriptive complexity which avoids the uncomputability problem. We demonstrate the effectiveness of MDL through its application to autoregressive (AR) processes. Lastly, we present and discuss LeClerc's application of MDL to the problem of image segmentation.
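
    The MDL application to AR processes mentioned above can be illustrated with a small order-selection sketch. This is not the paper's derivation: it assumes least-squares AR fitting and the common two-part code length (n/2) log(RSS/n) + (p/2) log n as the selection criterion, and the helper names fit_ar and mdl_order are hypothetical.

# Hypothetical sketch: MDL-style order selection for an autoregressive (AR) process.
# Assumed criterion (not taken from the paper): code length of the data under an
# AR(p) least-squares fit plus a (p/2) log n penalty for the p model parameters.
import numpy as np

def fit_ar(x, p):
    """Least-squares fit of an AR(p) model; returns the residual sum of squares."""
    n = len(x)
    # Design matrix of lagged values: row t holds x[t-1], ..., x[t-p].
    X = np.column_stack([x[p - k - 1:n - k - 1] for k in range(p)])
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef) ** 2)

def mdl_order(x, max_order=10):
    """Pick the AR order with the smallest two-part description length."""
    n = len(x)
    best_p, best_dl = None, np.inf
    for p in range(1, max_order + 1):
        rss = fit_ar(x, p)
        dl = 0.5 * n * np.log(rss / n) + 0.5 * p * np.log(n)  # data cost + model cost
        if dl < best_dl:
            best_p, best_dl = p, dl
    return best_p

# Example: data generated by an AR(2) process should usually be assigned order 2.
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(2, 2000):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
print(mdl_order(x))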

    A Mathematical Formalism of Infinite Coding for the Compression of Stochastic Process

    As mentioned in [5, page 6], there are two basic models for sources of data in information theory: finite length sources, that is, sources which produce finite length strings, and infinite length sources, which produce infinite length strings. Finite length sources provide a better model for files, for instance, since files consist of finite length strings of symbols. Infinite length sources provide a better model for communication lines, which produce strings of symbols that, if not infinite, typically have no readily apparent end. In fact, even in some cases in which the data is finite, it is convenient to use the infinite length source model. For instance, the widely used adaptive coding techniques (see, for instance, [5]) typically use arithmetic coding, which implicitly assumes an infinite length source (although practical implementations make modifications so that it may be used with finite length strings). In this paper, we formalize the notion of encoding an infinite length source. While such infinite codes are used intuitively throughout the literature, their mathematical formalization reveals certain subtleties which might otherwise be overlooked. For instance, it turns out that the pure arithmetic code for certain sources has not only unbounded but infinite delay; that is, in certain cases it is necessary to see a complete infinite source string before being able to determine even one bit of the encoded string. Fortunately, such cases occur with zero probability. The formalization presented here leads to a better understanding of infinite coding and a methodology for designing better infinite codes for adaptive data compression (see [1]).
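
    The delay phenomenon discussed above can be made concrete with a toy binary arithmetic encoder. This is a minimal sketch, not the paper's formalism: the hypothetical helper arithmetic_encode emits an output bit only once the coding interval falls entirely inside one half of [0, 1), so a source prefix that keeps the interval straddling 1/2 produces no output at all.

# Minimal sketch (not the paper's formalism): a binary arithmetic encoder for an
# i.i.d. Bernoulli(p) source.  A bit is emitted only once the current interval lies
# entirely inside [0, 1/2) or [1/2, 1); while the interval straddles 1/2, nothing
# can be emitted, which is the "delay" phenomenon discussed above.
def arithmetic_encode(symbols, p):
    """Encode a sequence of 0/1 symbols; returns the emitted bits and pending count."""
    lo, hi = 0.0, 1.0          # current coding interval [lo, hi)
    bits = []
    pending = 0                # symbols consumed since the last bit was emitted
    for s in symbols:
        mid = lo + (hi - lo) * (1 - p)   # P(symbol = 0) = 1 - p
        lo, hi = (lo, mid) if s == 0 else (mid, hi)
        pending += 1
        # Renormalize: emit bits while the interval sits in one half of [0, 1).
        while hi <= 0.5 or lo >= 0.5:
            if hi <= 0.5:
                bits.append(0)
                lo, hi = 2 * lo, 2 * hi
            else:
                bits.append(1)
                lo, hi = 2 * lo - 1, 2 * hi - 1
            pending = 0
    return bits, pending

# With P(1) = 1/3, the prefix 0,1,0,0 keeps the interval straddling 1/2, so no bits
# are emitted; suitably extended, such prefixes make the delay grow without bound.
print(arithmetic_encode([0, 1, 0, 0], 1 / 3))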

    Structural Analysis of Biodiversity

    Large, recently available genomic databases cover a wide range of life forms, suggesting an opportunity for insight into the genetic structure of biodiversity. In this study we refine our recently described technique using indicator vectors to analyze and visualize nucleotide sequences. The indicator vector approach generates correlation matrices, dubbed Klee diagrams, which represent a novel way of assembling and viewing large genomic datasets. To explore its potential utility, here we apply the improved algorithm to a collection of almost 17,000 DNA barcode sequences covering 12 widely separated animal taxa, demonstrating that indicator vectors for classification gave correct assignment in all 11,000 test cases. Indicator vector analysis revealed discontinuities corresponding to species- and higher-level taxonomic divisions, suggesting an efficient approach to classification of organisms from poorly studied groups. As compared to standard distance metrics, indicator vectors preserve diagnostic character probabilities, enable automated classification of test sequences, and generate high-information-density single-page displays. These results support application of indicator vectors for comparative analysis of large nucleotide data sets and raise the prospect of gaining insight into broad-scale patterns in the genetic structure of biodiversity.
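
    One way to picture the indicator-vector idea, under simplifying assumptions rather than the authors' exact construction, is to map each barcode sequence to a one-hot (indicator) encoding over A, C, G, T and display the matrix of pairwise correlations between those vectors. The helper names indicator_vector and correlation_matrix below are hypothetical.

# Simplified sketch (assumptions, not the paper's exact algorithm): each aligned DNA
# barcode is flattened into a binary indicator vector, and a Klee-diagram-like display
# is the matrix of pairwise Pearson correlations between those vectors.
import numpy as np

BASES = "ACGT"

def indicator_vector(seq):
    """Flatten a sequence into a 0/1 vector of length 4 * len(seq)."""
    v = np.zeros((len(seq), 4))
    for i, ch in enumerate(seq.upper()):
        if ch in BASES:                      # ambiguous bases are left as all zeros
            v[i, BASES.index(ch)] = 1.0
    return v.ravel()

def correlation_matrix(seqs):
    """Pairwise Pearson correlations between indicator vectors."""
    vectors = np.array([indicator_vector(s) for s in seqs])
    return np.corrcoef(vectors)

# Toy example: two near-identical sequences correlate strongly with each other and
# weakly with a diverged one.
seqs = ["ACGTACGT", "ACGTACGA", "TTTTCCCC"]
print(np.round(correlation_matrix(seqs), 2))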

    Alignment-Free Phylogenetic Reconstruction

    14th Annual International Conference, RECOMB 2010, Lisbon, Portugal, April 25-28, 2010. Proceedings. We introduce the first polynomial-time phylogenetic reconstruction algorithm under a model of sequence evolution allowing insertions and deletions (indels). Given appropriate assumptions, our algorithm requires sequence lengths growing polynomially in the number of leaf taxa. Our techniques are distance-based and largely bypass the problem of multiple alignment.
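
    The paper's reconstruction uses its own distance estimator under an indel model; purely for orientation, the sketch below shows the general flavor of an alignment-free, distance-based computation using k-mer frequency profiles. The names kmer_profile and kmer_distance are hypothetical and are not the authors' method.

# Generic illustration only: an alignment-free distance between two sequences based
# on k-mer frequency profiles, the kind of quantity a distance-based tree method
# could consume without ever computing a multiple alignment.
from collections import Counter
import math

def kmer_profile(seq, k=4):
    """Normalized k-mer frequency vector of a DNA sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def kmer_distance(a, b, k=4):
    """Euclidean distance between two k-mer frequency profiles."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    keys = set(pa) | set(pb)
    return math.sqrt(sum((pa.get(x, 0.0) - pb.get(x, 0.0)) ** 2 for x in keys))

print(kmer_distance("ACGTACGTACGTACGT", "ACGTACCTACGTACGT"))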

    Rec-DCM-Eigen: Reconstructing a Less Parsimonious but More Accurate Tree in Shorter Time

    Maximum parsimony (MP) methods aim to reconstruct the phylogeny of extant species by finding the most parsimonious evolutionary scenario using the species' genome data. MP methods are considered to be accurate, but they are also computationally expensive, especially for a large number of species. Several disk-covering methods (DCMs), which decompose the input species into multiple overlapping subgroups (or disks), have been proposed to solve the problem in a divide-and-conquer way.
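
    As background on the objective MP methods optimize (not the Rec-DCM-Eigen procedure itself), the sketch below computes the parsimony score of a single character on a fixed binary tree with Fitch's algorithm; fitch_score is a hypothetical helper name.

# Background illustration only: the parsimony score that MP methods minimize, computed
# for one character on a fixed tree with Fitch's algorithm.  A tree is a nested tuple;
# leaves carry observed nucleotide states.
def fitch_score(tree):
    """Return (candidate state set, minimum number of changes) for one character."""
    if isinstance(tree, str):                      # leaf: observed nucleotide
        return {tree}, 0
    left, right = tree
    lset, lcost = fitch_score(left)
    rset, rcost = fitch_score(right)
    if lset & rset:
        return lset & rset, lcost + rcost          # intersection: no extra change
    return lset | rset, lcost + rcost + 1          # union: one substitution needed

# The tree ((A, C), (A, G)) requires at least two substitutions at this site.
print(fitch_score((("A", "C"), ("A", "G")))[1])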

    Large-Scale Neighbor-Joining with NINJA

    Neighbor-joining is a well-established hierarchical clustering algorithm for inferring phylogenies. It begins with observed distances between pairs of sequences, and clustering order depends on a metric related to those distances. The canonical algorithm requires O(n³) time and O(n²) space for n sequences, which precludes application to very large sequence families, e.g. those containing 100,000 sequences. Datasets of this size are available today, and such phylogenies will play an increasingly important role in comparative genomics studies. Recent algorithmic advances have greatly sped up neighbor-joining for inputs of thousands of sequences, but are limited to fewer than 13,000 sequences on a system with 4 GB of RAM. In this paper, I describe an algorithm that speeds up neighbor-joining by dramatically reducing the number of distance values that are viewed in each iteration of the clustering procedure, while still computing a correct neighbor-joining tree. This algorithm can scale to inputs larger than 100,000 sequences because of external-memory-efficient data structures. A free implementation may be obtained from…
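
    The canonical cubic-time procedure the abstract refers to can be written down directly. The sketch below is that baseline algorithm (not NINJA's filtered, external-memory version), using the standard Q-criterion to choose each pair to join; neighbor_joining is a hypothetical function name.

# The canonical O(n^3) neighbor-joining baseline: at each step, join the pair that
# minimizes the Q-criterion, then collapse the distance matrix around the new node.
import numpy as np

def neighbor_joining(D, names):
    """Return a nested-tuple tree topology from a symmetric distance matrix D."""
    D = np.array(D, dtype=float)
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        r = D.sum(axis=1)
        # Q-criterion: pick the pair (i, j) minimizing (n - 2) d(i, j) - r_i - r_j.
        Q = (n - 2) * D - r[:, None] - r[None, :]
        np.fill_diagonal(Q, np.inf)
        i, j = np.unravel_index(np.argmin(Q), Q.shape)
        # Distances from the new internal node u to every remaining node k.
        du = 0.5 * (D[i] + D[j] - D[i, j])
        keep = [k for k in range(n) if k not in (i, j)]
        newD = np.zeros((len(keep) + 1, len(keep) + 1))
        newD[:-1, :-1] = D[np.ix_(keep, keep)]
        newD[-1, :-1] = newD[:-1, -1] = du[keep]
        nodes = [nodes[k] for k in keep] + [(nodes[i], nodes[j])]
        D = newD
    return (nodes[0], nodes[1])

# Classic 4-taxon example: the correct topology groups a with b and c with d.
D = [[0, 5, 9, 9],
     [5, 0, 10, 10],
     [9, 10, 0, 8],
     [9, 10, 8, 0]]
print(neighbor_joining(D, ["a", "b", "c", "d"]))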

    A formalism for the design of optimal adaptive text data compression rules

    Data compression is the transformation of data into representations which are as concise as possible. In particular, noiseless coding is the theory of concisely encoding randomly generated information in such a way that the data can be completely recovered from the encoded data. We present two abstract models of sources of information: the standard finite data model and a new infinite data model. For the finite data model, a technique known as Huffman coding is known to yield the smallest possible average coding length of the transformed data. In the more general infinite data model, the popular technique of arithmetic coding is optimal in a strong sense. Also, we demonstrate that arithmetic coding is practical in the sense that it has finite delay with probability one. In recent years, robust or adaptive data compression techniques have become popular. We present a methodology based upon statistical decision theory for deriving optimal adaptive data compression rules for a given class of stochastic processes. We demonstrate the use of this methodology by finding optimal data compression rules for the class of fixed-order stationary Markov chains with non-zero transition probabilities. The optimal rules for this class involve integrals which cannot be evaluated in closed form. We present an analysis of rules which are used in practice and compare these with the optimal rules. Finally, we present the results of simulations, which agree well with our asymptotic results. In our conclusions, we make suggestions on how to derive optimal rules for more general classes of stochastic processes, such as the class of Markov chains of any order.
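
    As a point of reference for the finite data model discussed above, the sketch below builds a Huffman code from a known symbol distribution. It illustrates only the baseline technique named in the abstract, not the adaptive rules derived in the work; huffman_codes is a hypothetical helper name.

# Minimal sketch of the finite-data baseline: Huffman coding, which achieves the
# smallest possible average code length for a known symbol distribution.
import heapq

def huffman_codes(freqs):
    """Map each symbol to a prefix-free bit string, given symbol frequencies."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Dyadic probabilities give code lengths equal to -log2 p (1, 2, 3, 3 bits here).
print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))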

    The Asymptotic Redundancy of Bayes Rules for Markov Chains

    We derive the asymptotics of the redundancy of Bayes rules for Markov chains with known order, extending the work of Barron and Clarke [6, 5] on i.i.d. sources. These asymptotics are derived when the actual source is in the class of φ-mixing sources, which includes Markov chains and functions of Markov chains. These results can be used to derive minimax asymptotic rates of convergence for universal codes when a Markov chain of known order is used as a model. Index terms: universal coding, Markov chains, Bayesian statistics, asymptotics. Given data generated by a known stochastic process, methods of encoding the data to achieve the minimal average coding length, such as Huffman and arithmetic coding, are known [7]. Universal codes [15, 8] encode data such that, asymptotically, the average per-symbol code length is equal to its minimal value (the entropy rate) for any source within a wide class. For the well-known Lempel-Ziv code, the average per-symbol code length…
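
    For orientation only, redundancy asymptotics of this kind typically take the following form; the paper's precise constants and conditions are its own, not this display. For a Markov chain of order r over an alphabet of size m, the model has k = m^r (m - 1) free parameters, and under standard regularity conditions the expected redundancy of a Bayes mixture code after n symbols behaves like

    \[
      R_n \;=\; \frac{k}{2}\,\log n \;+\; O(1),
      \qquad k = m^{r}\,(m-1),
    \]

    so the per-symbol redundancy vanishes at rate (k/2)(log n)/n, up to the choice of logarithm base and lower-order terms.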