Structural Analysis of Biodiversity
Large, recently available genomic databases cover a wide range of life forms, suggesting an opportunity for insight into the genetic structure of biodiversity. In this study we refine our recently described technique using indicator vectors to analyze and visualize nucleotide sequences. The indicator vector approach generates correlation matrices, dubbed Klee diagrams, which represent a novel way of assembling and viewing large genomic datasets. To explore its potential utility, here we apply the improved algorithm to a collection of almost 17000 DNA barcode sequences covering 12 widely separated animal taxa, demonstrating that indicator vectors for classification gave correct assignment in all 11000 test cases. Indicator vector analysis revealed discontinuities corresponding to species- and higher-level taxonomic divisions, suggesting an efficient approach to classification of organisms from poorly studied groups. As compared to standard distance metrics, indicator vectors preserve diagnostic character probabilities, enable automated classification of test sequences, and generate high-information-density single-page displays. These results support application of indicator vectors for comparative analysis of large nucleotide data sets and raise the prospect of gaining insight into broad-scale patterns in the genetic structure of biodiversity.
Alignment-Free Phylogenetic Reconstruction
14th Annual International Conference, RECOMB 2010, Lisbon, Portugal, April 25-28, 2010. Proceedings. We introduce the first polynomial-time phylogenetic reconstruction algorithm under a model of sequence evolution allowing insertions and deletions (or indels). Given appropriate assumptions, our algorithm requires sequence lengths growing polynomially in the number of leaf taxa. Our techniques are distance-based and largely bypass the problem of multiple alignment.
Large-Scale Neighbor-Joining with NINJA
Neighbor-joining is a well-established hierarchical clustering algorithm for inferring phylogenies. It begins with observed distances between pairs of sequences, and clustering order depends on a metric related to those distances. The canonical algorithm requires O(n^3) time and O(n^2) space for n sequences, which precludes application to very large sequence families, e.g. those containing 100,000 sequences. Datasets of this size are available today, and such phylogenies will play an increasingly important role in comparative genomics studies. Recent algorithmic advances have greatly sped up neighbor-joining for inputs of thousands of sequences, but are limited to fewer than 13,000 sequences on a system with 4GB RAM. In this paper, I describe an algorithm that speeds up neighbor-joining by dramatically reducing the number of distance values that are viewed in each iteration of the clustering procedure, while still computing a correct neighbor-joining tree. This algorithm can scale to inputs larger than 100,000 sequences because of external-memory-efficient data structures. A free implementation may be obtained from ...
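For orientation, the canonical O(n^3) procedure that the abstract above takes as its baseline can be sketched as follows. This is a minimal textbook neighbor-joining, not the NINJA algorithm and not its external-memory data structures. The function name `neighbor_joining`, the dict-of-dicts distance format, and the `nodeK` labels for internal nodes are assumptions made for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook neighbor-joining (illustrative sketch, not NINJA).

    names: list of leaf labels.
    dist:  symmetric dict-of-dicts with dist[a][b] for every a != b.
    Returns a list of (node, neighbor, branch_length) edges of the unrooted tree.
    """
    d = {a: dict(dist[a]) for a in names}  # working copy of the distances
    active = list(names)
    edges = []
    next_id = 0  # counter for hypothetical internal-node labels

    while len(active) > 2:
        n = len(active)
        # Row sums over the currently active nodes.
        r = {a: sum(d[a][b] for b in active if b != a) for a in active}

        # Join the pair minimizing Q(i, j) = (n - 2) * d(i, j) - r_i - r_j.
        # Scanning all pairs costs O(n^2) per iteration, O(n^3) overall.
        i, j = min(
            combinations(active, 2),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )

        u = f"node{next_id}"
        next_id += 1

        # Branch lengths from the joined pair to the new internal node.
        li = 0.5 * d[i][j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = d[i][j] - li
        edges.append((i, u, li))
        edges.append((j, u, lj))

        # Distances from the new node to every remaining node.
        d[u] = {}
        for k in active:
            if k != i and k != j:
                d[u][k] = d[k][u] = 0.5 * (d[i][k] + d[j][k] - d[i][j])

        active = [k for k in active if k != i and k != j] + [u]

    a, b = active
    edges.append((a, b, d[a][b]))
    return edges


def matrix_from_pairs(pairs):
    """Build the symmetric dict-of-dicts from {(a, b): distance}."""
    m = {}
    for (a, b), v in pairs.items():
        m.setdefault(a, {})[b] = v
        m.setdefault(b, {})[a] = v
    return m


# Small additive example with five taxa (made-up distances).
example = matrix_from_pairs({
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
})
tree_edges = neighbor_joining(["a", "b", "c", "d", "e"], example)
```

The full scan over all pairs in every iteration is the step that makes the canonical method cubic. According to the abstract, that scan is where the described algorithm saves work, by viewing far fewer distance values per iteration.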
Rec-DCM-Eigen: Reconstructing a Less Parsimonious but More Accurate Tree in Shorter Time
Maximum parsimony (MP) methods aim to reconstruct the phylogeny of extant species by finding the most parsimonious evolutionary scenario using the species' genome data. MP methods are considered to be accurate, but they are also computationally expensive, especially for a large number of species. Several disk-covering methods (DCMs), which decompose the input species into multiple overlapping subgroups (or disks), have been proposed to solve the problem in a divide-and-conquer way.
Final technical report: analysis of molecular data using statistical and evolutionary approaches
This document describes the research and training accomplishments of Dr. Kevin Atteson during the DOE fellowship period of September 1997 to September 1999. Dr. Atteson received training in molecular evolution during this period and made progress on seven research topics including: computation of DNA pattern probability, asymptotic redundancy of Bayes rules, performance of neighbor-joining evolutionary tree estimation, convex evolutionary tree estimation, identifiability of trees under mixed rates, gene expression analysis, and population genetics of unequal crossover
Exact-IEBP: A New Technique For Estimating Evolutionary Distances Between Whole Genomes
Evolution operates on whole genomes by operations that change the order and strandedness of genes within the genomes. This type of data presents new opportunities for discoveries about deep evolutionary rearrangement events, provided that sufficiently accurate methods can be developed to reconstruct evolutionary trees in these models [3, 11, 13, 18]. A necessary component of any such method is the ability to accurately estimate the true evolutionary distance between two genomes, which is the number of rearrangement events that took place in the evolutionary history between them. We improve the technique (IEBP) in [21] with a new method, Exact-IEBP, for estimating the true evolutionary distance between two signed genomes. Our simulation study shows that Exact-IEBP is a better estimator of true evolutionary distances. Furthermore, Exact-IEBP produces more accurate trees than IEBP when used with the popular distance-based method, neighbor-joining [16].
The Performance of Phylogenetic Methods on Trees of Bounded Diameter
We study the convergence rates of neighbor-joining and several new phylogenetic reconstruction methods on families of trees of bounded diameter. Our study presents theoretically obtained convergence rates, as well as an empirical study based upon simulation of evolution on random birth-death trees. We find that the new phylogenetic methods offer an advantage over the neighbor-joining method, except at low rates of evolution where they have comparable performance. The improvement in performance of the new methods over neighbor-joining increases with the number of taxa and the rate of evolution.
Sequence-Length Requirements for Phylogenetic Methods
We study the sequence lengths required by neighbor-joining, greedy parsimony, and a phylogenetic reconstruction method (DCM-NJ+MP) based on disk-covering and the maximum parsimony criterion. We use extensive simulations based on random birth-death trees, with controlled deviations from ultrametricity, to collect data on the scaling of sequence-length requirements for each of the three methods as a function of the number of taxa, the rate of evolution on the tree, and the deviation from ultrametricity. Our experiments show that DCM-NJ+MP has consistently lower sequence-length requirements than the other two methods when trees of high topological accuracy are desired, although all methods require much longer sequences as the deviation from ultrametricity or the height of the tree grows. Our study has significant implications for large-scale phylogenetic reconstruction (where sequence-length requirements are a crucial factor), but also for future performance analyses in phylogenetics (since deviations from ultrametricity are proving pivotal).