68,813 research outputs found

    Phylo.io: Interactive Viewing and Comparison of Large Phylogenetic Trees on the Web.

    Get PDF
    Phylogenetic trees are pervasively used to depict evolutionary relationships. Increasingly, researchers need to visualize large trees and compare multiple large trees inferred for the same set of taxa (reflecting uncertainty in the tree inference or genuine discordance among the loci analyzed). Existing tree visualization tools are however not well suited to these tasks. In particular, side-by-side comparison of trees can prove challenging beyond a few dozen taxa. Here, we introduce Phylo.io, a web application to visualize and compare phylogenetic trees side-by-side. Its distinctive features are: highlighting of similarities and differences between two trees, automatic identification of the best matching rooting and leaf order, scalability to large trees, high usability, multiplatform support via standard HTML5 implementation, and possibility to store and share visualizations. The tool can be freely accessed at http://phylo.io and can easily be embedded in other web servers. The code for the associated JavaScript library is available at https://github.com/DessimozLab/phylo-io under an MIT open source license

    Genome Trees from Conservation Profiles

    Get PDF
    The concept of the genome tree depends on the potential evolutionary significance in the clustering of species according to similarities in the gene content of their genomes. In this respect, genome trees have often been identified with species trees. With the rapid expansion of genome sequence data it becomes of increasing importance to develop accurate methods for grasping global trends for the phylogenetic signals that mutually link the various genomes. We therefore derive here the methodological concept of genome trees based on protein conservation profiles in multiple species. The basic idea in this derivation is that the multi-component “presence-absence” protein conservation profiles permit tracking of common evolutionary histories of genes across multiple genomes. We show that a significant reduction in informational redundancy is achieved by considering only the subset of distinct conservation profiles. Beyond these basic ideas, we point out various pitfalls and limitations associated with the data handling, paving the way for further improvements. As an illustration for the methods, we analyze a genome tree based on the above principles, along with a series of other trees derived from the same data and based on pair-wise comparisons (ancestral duplication-conservation and shared orthologs). In all trees we observe a sharp discrimination between the three primary domains of life: Bacteria, Archaea, and Eukarya. The new genome tree, based on conservation profiles, displays a significant correspondence with classically recognized taxonomical groupings, along with a series of departures from such conventional clusterings

    Evolutionary distances in the twilight zone -- a rational kernel approach

    Get PDF
    Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly. Alignment-free methods, on the other hand, are based on abstract string comparisons and avoid potential alignment problems. However, in general they are not biologically motivated and ignore our knowledge about the evolution of sequences. Thus, it is still a major open question how to define an evolutionary distance metric between divergent sequences that makes use of indel information and known substitution models without the need for a multiple alignment. Here we propose a new evolutionary distance metric to close this gap. It uses finite-state transducers to create a biologically motivated similarity score which models substitutions and indels, and does not depend on a multiple sequence alignment. The sequence similarity score is defined in analogy to pairwise alignments and additionally has the positive semi-definite property. We describe its derivation and show in simulation studies and real-world examples that it is more accurate in reconstructing phylogenies than competing methods. The result is a new and accurate way of determining evolutionary distances in and beyond the twilight zone of sequence alignments that is suitable for large datasets.Comment: to appear in PLoS ON

    On the inference of large phylogenies with long branches: How long is too long?

    Get PDF
    Recent work has highlighted deep connections between sequence-length requirements for high-probability phylogeny reconstruction and the related problem of the estimation of ancestral sequences. In [Daskalakis et al.'09], building on the work of [Mossel'04], a tight sequence-length requirement was obtained for the CFN model. In particular the required sequence length for high-probability reconstruction was shown to undergo a sharp transition (from O(logn)O(\log n) to poly(n)\hbox{poly}(n), where nn is the number of leaves) at the "critical" branch length \critmlq (if it exists) of the ancestral reconstruction problem. Here we consider the GTR model. For this model, recent results of [Roch'09] show that the tree can be accurately reconstructed with sequences of length O(log(n))O(\log(n)) when the branch lengths are below \critksq, known as the Kesten-Stigum (KS) bound. Although for the CFN model \critmlq = \critksq, it is known that for the more general GTR models one has \critmlq \geq \critksq with a strict inequality in many cases. Here, we show that this phenomenon also holds for phylogenetic reconstruction by exhibiting a family of symmetric models QQ and a phylogenetic reconstruction algorithm which recovers the tree from O(logn)O(\log n)-length sequences for some branch lengths in the range (\critksq,\critmlq). Second we prove that phylogenetic reconstruction under GTR models requires a polynomial sequence-length for branch lengths above \critmlq

    Optimisation of ensemble classifiers using genetic algorithm

    Get PDF

    Multivariate Approaches to Classification in Extragalactic Astronomy

    Get PDF
    Clustering objects into synthetic groups is a natural activity of any science. Astrophysics is not an exception and is now facing a deluge of data. For galaxies, the one-century old Hubble classification and the Hubble tuning fork are still largely in use, together with numerous mono-or bivariate classifications most often made by eye. However, a classification must be driven by the data, and sophisticated multivariate statistical tools are used more and more often. In this paper we review these different approaches in order to situate them in the general context of unsupervised and supervised learning. We insist on the astrophysical outcomes of these studies to show that multivariate analyses provide an obvious path toward a renewal of our classification of galaxies and are invaluable tools to investigate the physics and evolution of galaxies.Comment: Open Access paper. http://www.frontiersin.org/milky\_way\_and\_galaxies/10.3389/fspas.2015.00003/abstract\>. \<10.3389/fspas.2015.00003 \&g
    corecore