365 research outputs found

    GeneRax: A Tool for Species-Tree-Aware Maximum Likelihood-Based Gene Family Tree Inference under Gene Duplication, Transfer, and Loss

    Get PDF
    Inferring phylogenetic trees for individual homologous gene families is difficult because alignments are often too short, and thus contain insufficient signal, while substitution models inevitably fail to capture the complexity of the evolutionary processes. To overcome these challenges, species-tree-aware methods also leverage information from a putative species tree. However, only few methods are available that implement a full likelihood framework or account for horizontal gene transfers. Furthermore, these methods often require expensive data preprocessing (e.g., computing bootstrap trees) and rely on approximations and heuristics that limit the degree of tree space exploration. Here, we present GeneRax, the first maximum likelihood species-tree-aware phylogenetic inference software. It simultaneously accounts for substitutions at the sequence level as well as gene level events, such as duplication, transfer, and loss relying on established maximum likelihood optimization algorithms. GeneRax can infer rooted phylogenetic trees for multiple gene families, directly from the per-gene sequence alignments and a rooted, yet undated, species tree. We show that compared with competing tools, on simulated data GeneRax infers trees that are the closest to the true tree in 90% of the simulations in terms of relative Robinson–Foulds distance. On empirical data sets, GeneRax is the fastest among all tested methods when starting from aligned sequences, and it infers trees with the highest likelihood score, based on our model. GeneRax completed tree inferences and reconciliations for 1,099 Cyanobacteria families in 8 min on 512 CPU cores. Thus, its parallelization scheme enables large-scale analyses. GeneRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax (last accessed June 17, 2020)

    FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments

    Get PDF
    Background: We recently described FastTree, a tool for inferring phylogenies for alignments with up to hundreds of thousands of sequences. Here, we describe improvements to FastTree that improve its accuracy without sacrificing scalability. Methodology/Principal Findings: Where FastTree 1 used nearest-neighbor interchanges (NNIs) and the minimum-evolution criterion to improve the tree, FastTree 2 adds minimum-evolution subtree-pruning-regrafting (SPRs) and maximumlikelihood NNIs. FastTree 2 uses heuristics to restrict the search for better trees and estimates a rate of evolution for each site (the ‘‘CAT’ ’ approximation). Nevertheless, for both simulated and genuine alignments, FastTree 2 is slightly more accurate than a standard implementation of maximum-likelihood NNIs (PhyML 3 with default settings). Although FastTree 2 is not quite as accurate as methods that use maximum-likelihood SPRs, most of the splits that disagree are poorly supported, and for large alignments, FastTree 2 is 100–1,000 times faster. FastTree 2 inferred a topology and likelihood-based local support values for 237,882 distinct 16S ribosomal RNAs on a desktop computer in 22 hours and 5.8 gigabytes of memory. Conclusions/Significance: FastTree 2 allows the inference of maximum-likelihood phylogenies for huge alignments

    The inference of gene trees with species trees

    Get PDF
    Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can co-exist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice-versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. In this article we review the various models that have been used to describe the relationship between gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution.Comment: Review article in relation to the "Mathematical and Computational Evolutionary Biology" conference, Montpellier, 201

    Algorithms, load balancing strategies, and dynamic kernels for large-scale phylogenetic tree inference under Maximum Likelihood

    Get PDF
    Phylogenetik, die Analyse der evolutionären Beziehungen zwischen biologischen Einheiten, spielt eine wesentliche Rolle in der biologischen und medizinischen Forschung. Ihre Anwendungen reichen von der Beantwortung grundlegender Fragen, wie der nach dem Ursprungs des Lebens, bis hin zur Lösung praktischer Probleme, wie der Verfolgung von Pandemien in Echtzeit. Heutzutage werden Phylogenetische Bäume typischerweise anhand molekularer Daten über wahrscheinlichkeitsbasierte Methoden berechnet. Diese Verfahren suchen nach demjenigen Stammbaum, welcher eine Likelihood-basierte Bewertungsfunktion unter einem gegebenen stochastischen Modell der Sequenzevolution maximiert. Die vorliegende Arbeit konzentriert sich auf die Inferenz Phylogenetischer Bäume von Arten sowie Genen. Arten entwickeln sich durch Artbildungs- und Aussterbeereignisse. Gene entwickeln sich durch Ereignisse wie Genduplikation, Genverlust und horizontalen Gentransfer. Beide Ausprägungen der Evolution hängen miteinander zusammen, da Gene zu Arten gehören und sich innerhalb des Genoms der Arten entwickeln. Man kann Modelle der Gen-Evolution einsetzen, welche diesen Zusammenhang zwischen der Evolutionsgeschichte von Arten und Genen berücksichtigen, um die Genauigkeit phylogenetischer Baumsuchen zu verbessern. Die klassischen Methoden der phylogenetischen Inferenz ignorieren diese Phänomene und basieren ausschlie\ss lich auf Modellen der Sequenz-Evolution. Darüber hinaus sind aktuelle Maximum-Likelihood-Verfahren rechenaufwendig. Dies stellt eine große Herausforderung dar, zumal aufgrund der Fortschritte in der Sequenzierungstechnologie immer mehr molekulare Daten verfügbar werden und somit die verfügbare Datenmenge drastisch anwächst. Um diese Datenlawine zu bewältigen, benötigt die biologische Forschung dringend Werkzeuge, welche schnellere Algorithmen sowie effiziente parallele Implementierungen zur Verfügung stellen. In dieser Arbeit entwickle ich neue Maximum-Likelihood Methoden, welche auf einer expliziten Modellierung der gemeinsamen Evolutionsgeschichte von Arten und Genen basieren, um genauere phylogenetische Bäume abzuleiten. Außerdem implementiere ich neue Heuristiken und spezifische Parallelisierungsschemata um den Inferenzprozess zu beschleunigen. Mein erstes Projekt, ParGenes, ist eine parallele Softwarepipeline zum Ableiten von Genstammbäumen aus einer Menge genspezifischer Multipler Sequenzalignments. Für jedes Eingabealignment bestimmt ParGenes zunächst das am besten geeignete Modell der Sequenzevolution und sucht anschließend nach dem Genstammbaum mit der höchsten Likelihood unter diesem Modell. Dies erfolgt anhand von Methoden, welche dem aktuellen Stand der Wissenschaft entsprechen, parallel ausgeführt werden können und sich einer neuartigen Lastverteilungsstrategie bedienen. Mein zweites Projekt, SpeciesRax, ist eine Methode zum Ableiten eines gewurzelten Artenbaums aus einer Menge entsprechender ungewurzelter Genstammbäume. Berücksichtigt wird die Evolution eines Gens unter Genduplikation, Genverlust und horizontalem Gentransfer. SpeciesRax sucht den gewurzelten Artenbaum, der die Likelihood-basierte Bewertungsfunktion unter diesem Modell maximiert. Darüber hinaus führe ich eine neue Methode zur Berechnung von Konfidenzwerten auf den Kanten des resultierenden Artenbaumes ein und eine weitere Methode zur Schätzung der Kantenlängen des Artenbaumes. Mein drittes Projekt, GeneRax, ist eine neuartige Maximum-Likelihood-Methode zur Inferenz von Genstammbäumen. GeneRax liest als Eingabe einen gewurzelten Artenbaum sowie eine Menge genspezifischer Multipler Sequenz-Alignments und berechnet als Ausgabe einen Genstammbaum pro Eingabealignment. Dazu führe ich die sogenannte Joint Likelihood-Funktion ein, welche ein Modell der Sequenzevolution mit einem Modell der Genevolution kombiniert. Darüber hinaus kann GeneRax die Abfolge von Genduplikationen, Genverlusten und horizontalen Gentransfers abschätzen, die entlang des Eingabeartenbaums aufgetreten sind

    New Algorithms and Methods to Estimate Maximum-Likelihood Phylogenies: Assessing the Performance of PhyML 3.0

    Get PDF
    PhyML is a phylogeny software based on the maximum-likelihood principle. Early PhyML versions used a fast algorithm performing nearest neighbor interchanges to improve a reasonable starting tree topology. Since the original publication (Guindon S., Gascuel O. 2003. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696-704), PhyML has been widely used (>2500 citations in ISI Web of Science) because of its simplicity and a fair compromise between accuracy and speed. In the meantime, research around PhyML has continued, and this article describes the new algorithms and methods implemented in the program. First, we introduce a new algorithm to search the tree space with user-defined intensity using subtree pruning and regrafting topological moves. The parsimony criterion is used here to filter out the least promising topology modifications with respect to the likelihood function. The analysis of a large collection of real nucleotide and amino acid data sets of various sizes demonstrates the good performance of this method. Second, we describe a new test to assess the support of the data for internal branches of a phylogeny. This approach extends the recently proposed approximate likelihood-ratio test and relies on a nonparametric, Shimodaira-Hasegawa-like procedure. A detailed analysis of real alignments sheds light on the links between this new approach and the more classical nonparametric bootstrap method. Overall, our tests show that the last version (3.0) of PhyML is fast, accurate, stable, and ready to use. A Web server and binary files are available from http://www.atgc-montpellier.fr/phym

    Neighborhoods of trees in circular orderings

    Get PDF
    In phylogenetics, a common strategy used to construct an evolutionary tree for a set of species X is to search in the space of all such trees for one that optimizes some given score function (such as the minimum evolution, parsimony or likelihood score). As this can be computationally intensive, it was recently proposed to restrict such searches to the set of all those trees that are compatible with some circular ordering of the set X. To inform the design of efficient algorithms to perform such searches, it is therefore of interest to find bounds for the number of trees compatible with a fixed ordering in the neighborhood of a tree that is determined by certain tree operations commonly used to search for trees: the nearest neighbor interchange (nni), the subtree prune and regraft (spr) and the tree bisection and reconnection (tbr) operations. We show that the size of such a neighborhood of a binary tree associated with the nni operation is independent of the tree’s topology, but that this is not the case for the spr and tbr operations. We also give tight upper and lower bounds for the size of the neighborhood of a binary tree for the spr and tbr operations and characterize those trees for which these bounds are attained

    A LASSO-based approach to sample sites for phylogenetic tree search

    Get PDF
    Motivation In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require using a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree. Results Here, we propose an artificial-intelligence-based approach, which provides means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while putting a constraint on the number of sites used for the approximation. We show that computing the likelihood based on 5% of the sites already provides accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search decreased running-time substantially while retaining the same tree-search performance
    • …
    corecore