
    Descriptive Complexity Approaches to Inductive Inference

    We present a critical review of descriptive complexity approaches to inductive inference. Inductive inference is defined as any process by which a model of the world is formed from observations. The descriptive complexity approach is a formalization of Occam's razor: choose the simplest model consistent with the data. Descriptive complexity as defined by Kolmogorov, Chaitin, and Solomonoff is presented as a generalization of Shannon's entropy. We discuss its relationship with randomness and present examples. However, a major result of the theory is negative: descriptive complexity is uncomputable. Rissanen's minimum description length (MDL) principle is presented as a restricted form of descriptive complexity which avoids the uncomputability problem. We demonstrate the effectiveness of MDL through its application to autoregressive (AR) processes. Lastly, we present and discuss LeClerc's application of MDL to the problem of image segmentation.
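
    The MDL application to AR processes mentioned above can be illustrated with a small order-selection sketch. This is not the paper's derivation: it assumes least-squares AR fitting and the common two-part code length (n/2) log(RSS/n) + (p/2) log n as the selection criterion, and the helper names fit_ar and mdl_order are hypothetical.

# Hypothetical sketch: MDL-style order selection for an autoregressive (AR) process.
# Assumed criterion (not taken from the paper): code length of the data under an
# AR(p) least-squares fit plus a (p/2) log n penalty for the p model parameters.
import numpy as np

def fit_ar(x, p):
    """Least-squares fit of an AR(p) model; returns the residual sum of squares."""
    n = len(x)
    # Design matrix of lagged values: row t holds x[t-1], ..., x[t-p].
    X = np.column_stack([x[p - k - 1:n - k - 1] for k in range(p)])
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef) ** 2)

def mdl_order(x, max_order=10):
    """Pick the AR order with the smallest two-part description length."""
    n = len(x)
    best_p, best_dl = None, np.inf
    for p in range(1, max_order + 1):
        rss = fit_ar(x, p)
        dl = 0.5 * n * np.log(rss / n) + 0.5 * p * np.log(n)  # data cost + model cost
        if dl < best_dl:
            best_p, best_dl = p, dl
    return best_p

# Example: data generated by an AR(2) process should usually be assigned order 2.
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(2, 2000):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
print(mdl_order(x))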

    A Mathematical Formalism of Infinite Coding for the Compression of Stochastic Process

    As mentioned in [5, page 6], there are two basic models for sources of data in information theory: finite length sources, that is, sources which produce finite length strings, and infinite length sources, which produce infinite length strings. Finite length sources provide a better model for files, for instance, since files consist of finite length strings of symbols. Infinite length sources provide a better model for communication lines, which produce strings of symbols that, if not infinite, typically have no readily apparent end. In fact, even in some cases in which the data is finite, it is convenient to use the infinite length source model. For instance, the widely used adaptive coding techniques (see, for instance, [5]) typically use arithmetic coding, which implicitly assumes an infinite length source (although practical implementations make modifications so that it may be used with finite length strings). In this paper, we formalize the notion of encoding an infinite length source. While such infinite codes are used intuitively throughout the literature, their mathematical formalization reveals certain subtleties which might otherwise be overlooked. For instance, it turns out that the pure arithmetic code for certain sources has not only unbounded but infinite delay; that is, in certain cases it is necessary to see a complete infinite source string before being able to determine even one bit of the encoded string. Fortunately, such cases occur with zero probability. The formalization presented here leads to a better understanding of infinite coding and a methodology for designing better infinite codes for adaptive data compression (see [1]).
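
    The delay phenomenon discussed above can be made concrete with a toy binary arithmetic encoder. This is a minimal sketch, not the paper's formalism: the hypothetical helper arithmetic_encode emits an output bit only once the coding interval falls entirely inside one half of [0, 1), so a source prefix that keeps the interval straddling 1/2 produces no output at all.

# Minimal sketch (not the paper's formalism): a binary arithmetic encoder for an
# i.i.d. Bernoulli(p) source.  A bit is emitted only once the current interval lies
# entirely inside [0, 1/2) or [1/2, 1); while the interval straddles 1/2, nothing
# can be emitted, which is the "delay" phenomenon discussed above.
def arithmetic_encode(symbols, p):
    """Encode a sequence of 0/1 symbols; returns the emitted bits and pending count."""
    lo, hi = 0.0, 1.0          # current coding interval [lo, hi)
    bits = []
    pending = 0                # symbols consumed since the last bit was emitted
    for s in symbols:
        mid = lo + (hi - lo) * (1 - p)   # P(symbol = 0) = 1 - p
        lo, hi = (lo, mid) if s == 0 else (mid, hi)
        pending += 1
        # Renormalize: emit bits while the interval sits in one half of [0, 1).
        while hi <= 0.5 or lo >= 0.5:
            if hi <= 0.5:
                bits.append(0)
                lo, hi = 2 * lo, 2 * hi
            else:
                bits.append(1)
                lo, hi = 2 * lo - 1, 2 * hi - 1
            pending = 0
    return bits, pending

# With P(1) = 1/3, the prefix 0,1,0,0 keeps the interval straddling 1/2, so no bits
# are emitted; suitably extended, such prefixes make the delay grow without bound.
print(arithmetic_encode([0, 1, 0, 0], 1 / 3))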

    Structural Analysis of Biodiversity

    Large, recently available genomic databases cover a wide range of life forms, suggesting an opportunity for insight into the genetic structure of biodiversity. In this study we refine our recently described technique using indicator vectors to analyze and visualize nucleotide sequences. The indicator vector approach generates correlation matrices, dubbed Klee diagrams, which represent a novel way of assembling and viewing large genomic datasets. To explore its potential utility, here we apply the improved algorithm to a collection of almost 17,000 DNA barcode sequences covering 12 widely separated animal taxa, demonstrating that indicator vectors for classification gave correct assignment in all 11,000 test cases. Indicator vector analysis revealed discontinuities corresponding to species- and higher-level taxonomic divisions, suggesting an efficient approach to classification of organisms from poorly studied groups. As compared to standard distance metrics, indicator vectors preserve diagnostic character probabilities, enable automated classification of test sequences, and generate high-information-density single-page displays. These results support application of indicator vectors for comparative analysis of large nucleotide data sets and raise the prospect of gaining insight into broad-scale patterns in the genetic structure of biodiversity.
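
    One way to picture the indicator-vector idea, under simplifying assumptions rather than the authors' exact construction, is to map each barcode sequence to a one-hot (indicator) encoding over A, C, G, T and display the matrix of pairwise correlations between those vectors. The helper names indicator_vector and correlation_matrix below are hypothetical.

# Simplified sketch (assumptions, not the paper's exact algorithm): each aligned DNA
# barcode is flattened into a binary indicator vector, and a Klee-diagram-like display
# is the matrix of pairwise Pearson correlations between those vectors.
import numpy as np

BASES = "ACGT"

def indicator_vector(seq):
    """Flatten a sequence into a 0/1 vector of length 4 * len(seq)."""
    v = np.zeros((len(seq), 4))
    for i, ch in enumerate(seq.upper()):
        if ch in BASES:                      # ambiguous bases are left as all zeros
            v[i, BASES.index(ch)] = 1.0
    return v.ravel()

def correlation_matrix(seqs):
    """Pairwise Pearson correlations between indicator vectors."""
    vectors = np.array([indicator_vector(s) for s in seqs])
    return np.corrcoef(vectors)

# Toy example: two near-identical sequences correlate strongly with each other and
# weakly with a diverged one.
seqs = ["ACGTACGT", "ACGTACGA", "TTTTCCCC"]
print(np.round(correlation_matrix(seqs), 2))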

    Alignment-Free Phylogenetic Reconstruction

    14th Annual International Conference, RECOMB 2010, Lisbon, Portugal, April 25-28, 2010. Proceedings. We introduce the first polynomial-time phylogenetic reconstruction algorithm under a model of sequence evolution allowing insertions and deletions (indels). Given appropriate assumptions, our algorithm requires sequence lengths growing polynomially in the number of leaf taxa. Our techniques are distance-based and largely bypass the problem of multiple alignment.
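
    The paper's reconstruction uses its own distance estimator under an indel model; purely for orientation, the sketch below shows the general flavor of an alignment-free, distance-based computation using k-mer frequency profiles. The names kmer_profile and kmer_distance are hypothetical and are not the authors' method.

# Generic illustration only: an alignment-free distance between two sequences based
# on k-mer frequency profiles, the kind of quantity a distance-based tree method
# could consume without ever computing a multiple alignment.
from collections import Counter
import math

def kmer_profile(seq, k=4):
    """Normalized k-mer frequency vector of a DNA sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def kmer_distance(a, b, k=4):
    """Euclidean distance between two k-mer frequency profiles."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    keys = set(pa) | set(pb)
    return math.sqrt(sum((pa.get(x, 0.0) - pb.get(x, 0.0)) ** 2 for x in keys))

print(kmer_distance("ACGTACGTACGTACGT", "ACGTACCTACGTACGT"))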

    Rec-DCM-Eigen: Reconstructing a Less Parsimonious but More Accurate Tree in Shorter Time

    Maximum parsimony (MP) methods aim to reconstruct the phylogeny of extant species by finding the most parsimonious evolutionary scenario using the species' genome data. MP methods are considered to be accurate, but they are also computationally expensive, especially for a large number of species. Several disk-covering methods (DCMs), which decompose the input species into multiple overlapping subgroups (or disks), have been proposed to solve the problem in a divide-and-conquer way.
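
    As background on the objective MP methods optimize (not the Rec-DCM-Eigen procedure itself), the sketch below computes the parsimony score of a single character on a fixed binary tree with Fitch's algorithm; fitch_score is a hypothetical helper name.

# Background illustration only: the parsimony score that MP methods minimize, computed
# for one character on a fixed tree with Fitch's algorithm.  A tree is a nested tuple;
# leaves carry observed nucleotide states.
def fitch_score(tree):
    """Return (candidate state set, minimum number of changes) for one character."""
    if isinstance(tree, str):                      # leaf: observed nucleotide
        return {tree}, 0
    left, right = tree
    lset, lcost = fitch_score(left)
    rset, rcost = fitch_score(right)
    if lset & rset:
        return lset & rset, lcost + rcost          # intersection: no extra change
    return lset | rset, lcost + rcost + 1          # union: one substitution needed

# The tree ((A, C), (A, G)) requires at least two substitutions at this site.
print(fitch_score((("A", "C"), ("A", "G")))[1])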

    Large-Scale Neighbor-Joining with NINJA

    Neighbor-joining is a well-established hierarchical clustering algorithm for inferring phylogenies. It begins with observed distances between pairs of sequences, and clustering order depends on a metric related to those distances. The canonical algorithm requires O(n³) time and O(n²) space for n sequences, which precludes application to very large sequence families, e.g. those containing 100,000 sequences. Datasets of this size are available today, and such phylogenies will play an increasingly important role in comparative genomics studies. Recent algorithmic advances have greatly sped up neighbor-joining for inputs of thousands of sequences, but are limited to fewer than 13,000 sequences on a system with 4 GB of RAM. In this paper, I describe an algorithm that speeds up neighbor-joining by dramatically reducing the number of distance values that are viewed in each iteration of the clustering procedure, while still computing a correct neighbor-joining tree. This algorithm can scale to inputs larger than 100,000 sequences because of external-memory-efficient data structures. A free implementation may be obtained from…
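
    The canonical cubic-time procedure the abstract refers to can be written down directly. The sketch below is that baseline algorithm (not NINJA's filtered, external-memory version), using the standard Q-criterion to choose each pair to join; neighbor_joining is a hypothetical function name.

# The canonical O(n^3) neighbor-joining baseline: at each step, join the pair that
# minimizes the Q-criterion, then collapse the distance matrix around the new node.
import numpy as np

def neighbor_joining(D, names):
    """Return a nested-tuple tree topology from a symmetric distance matrix D."""
    D = np.array(D, dtype=float)
    nodes = list(names)
    while len(nodes) > 2:
        n = len(nodes)
        r = D.sum(axis=1)
        # Q-criterion: pick the pair (i, j) minimizing (n - 2) d(i, j) - r_i - r_j.
        Q = (n - 2) * D - r[:, None] - r[None, :]
        np.fill_diagonal(Q, np.inf)
        i, j = np.unravel_index(np.argmin(Q), Q.shape)
        # Distances from the new internal node u to every remaining node k.
        du = 0.5 * (D[i] + D[j] - D[i, j])
        keep = [k for k in range(n) if k not in (i, j)]
        newD = np.zeros((len(keep) + 1, len(keep) + 1))
        newD[:-1, :-1] = D[np.ix_(keep, keep)]
        newD[-1, :-1] = newD[:-1, -1] = du[keep]
        nodes = [nodes[k] for k in keep] + [(nodes[i], nodes[j])]
        D = newD
    return (nodes[0], nodes[1])

# Classic 4-taxon example: the correct topology groups a with b and c with d.
D = [[0, 5, 9, 9],
     [5, 0, 10, 10],
     [9, 10, 0, 8],
     [9, 10, 8, 0]]
print(neighbor_joining(D, ["a", "b", "c", "d"]))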

    A formalism for the design of optimal adaptive text data compression rules

    Data compression is the transformation of data into representations which are as concise as possible. In particular, noiseless coding is the theory of concisely encoding randomly generated information in such a way that the data can be completely recovered from the encoded data. We present two abstract models of sources of information: the standard finite data model and a new infinite data model. For the finite data model, a technique known as Huffman coding is known to yield the smallest possible average coding length of the transformed data. In the more general infinite data model, the popular technique of arithmetic coding is optimal in a strong sense. Also, we demonstrate that arithmetic coding is practical in the sense that it has finite delay with probability one. In recent years, robust or adaptive data compression techniques have become popular. We present a methodology based upon statistical decision theory for deriving optimal adaptive data compression rules for a given class of stochastic processes. We demonstrate the use of this methodology by finding optimal data compression rules for the class of fixed-order stationary Markov chains with non-zero transition probabilities. The optimal rules for this class involve integrals which cannot be evaluated in closed form. We present an analysis of rules which are used in practice and compare these with the optimal rules. Finally, we present the results of simulations, which agree well with our asymptotic results. In our conclusions, we make suggestions on how to derive optimal rules for more general classes of stochastic processes, such as the class of Markov chains of any order.
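
    As a point of reference for the finite data model discussed above, the sketch below builds a Huffman code from a known symbol distribution. It illustrates only the baseline technique named in the abstract, not the adaptive rules derived in the work; huffman_codes is a hypothetical helper name.

# Minimal sketch of the finite-data baseline: Huffman coding, which achieves the
# smallest possible average code length for a known symbol distribution.
import heapq

def huffman_codes(freqs):
    """Map each symbol to a prefix-free bit string, given symbol frequencies."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)      # two least-frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Dyadic probabilities give code lengths equal to -log2 p (1, 2, 3, 3 bits here).
print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))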

    The Asymptotic Redundancy of Bayes Rules for Markov Chains

    We derive the asymptotics of the redundancy of Bayes rules for Markov chains with known order, extending the work of Barron and Clarke [6, 5] on i.i.d. sources. These asymptotics are derived when the actual source is in the class of φ-mixing sources, which includes Markov chains and functions of Markov chains. These results can be used to derive minimax asymptotic rates of convergence for universal codes when a Markov chain of known order is used as a model. Index terms: universal coding, Markov chains, Bayesian statistics, asymptotics. Given data generated by a known stochastic process, methods of encoding the data to achieve the minimal average coding length, such as Huffman and arithmetic coding, are known [7]. Universal codes [15, 8] encode data such that, asymptotically, the average per-symbol code length is equal to its minimal value (the entropy rate) for any source within a wide class. For the well-known Lempel-Ziv code, the average per-symbol code length…
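
    For orientation only, redundancy asymptotics of this kind typically take the following form; the paper's precise constants and conditions are its own, not this display. For a Markov chain of order r over an alphabet of size m, the model has k = m^r (m - 1) free parameters, and under standard regularity conditions the expected redundancy of a Bayes mixture code after n symbols behaves like

    \[
      R_n \;=\; \frac{k}{2}\,\log n \;+\; O(1),
      \qquad k = m^{r}\,(m-1),
    \]

    so the per-symbol redundancy vanishes at rate (k/2)(log n)/n, up to the choice of logarithm base and lower-order terms.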