13 research outputs found

    SpeciesRax: A tool for maximum likelihood species tree inference from gene family trees under duplication, transfer, and loss

    Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modeling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated data sets, we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large data sets. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31,612 gene families in 1 h using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.

    SpeciesRax: A tool for maximum likelihood species tree inference from gene family trees under duplication, transfer, and loss

    Species tree inference from gene family trees is becoming increasingly popular because it can account for discordance between the species tree and the corresponding gene family trees. In particular, methods that can account for multiple-copy gene families exhibit potential to leverage paralogy as informative signal. At present, there does not exist any widely adopted inference method for this purpose. Here, we present SpeciesRax, the first maximum likelihood method that can infer a rooted species tree from a set of gene family trees and can account for gene duplication, loss, and transfer events. By explicitly modelling events by which gene trees can depart from the species tree, SpeciesRax leverages the phylogenetic rooting signal in gene trees. SpeciesRax infers species tree branch lengths in units of expected substitutions per site and branch support values via paralogy-aware quartets extracted from the gene family trees. Using both empirical and simulated datasets we show that SpeciesRax is at least as accurate as the best competing methods while being one order of magnitude faster on large datasets at the same time. We used SpeciesRax to infer a biologically plausible rooted phylogeny of the vertebrates comprising 188 species from 31612 gene families in one hour using 40 cores. SpeciesRax is available under GNU GPL at https://github.com/BenoitMorel/GeneRax and on BioConda.

    Embedding Four-directional Paths on Convex Point Sets

    A directed path whose edges are assigned labels "up", "down", "right", or "left" is called four-directional, and three-directional if at most three out of the four labels are used. A direction-consistent embedding of an n-vertex three- or four-directional path P on a set S of n points in the plane is a straight-line drawing of P where each vertex of P is mapped to a distinct point of S and every edge points to the direction specified by its label. We study planar direction-consistent embeddings of three- and four-directional paths and provide a complete picture of the problem for convex point sets.

    Parallel String Matching

    We explore the benefits of parallelizing 7 state-of-the-art string matching algorithms. Using SIMD and multi-threading techniques, we achieve a significant performance improvement of up to 43.3x over reference implementations and a speedup of up to 16.7x over the string matching program grep. We evaluate our implementations on the smart-corpora and the full human genome data set. We show scalability over the number of threads and the impact of pattern length.

    Phylogenetic Analysis of SARS-CoV-2 Data Is Difficult

    Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8,736 out of all 16,453 virus sequences available on May 5, 2020 from gisaid.org. We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be credible. Finally, an automatic classification of the current sequences into subclasses using the mPTP tool for molecular species delimitation is also, as might be expected, not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.

    NetRAX: Accurate and Fast Maximum Likelihood Phylogenetic Network Inference

    Phylogenetic networks are used to represent non-treelike evolutionary scenarios. Current, actively developed approaches for phylogenetic network inference jointly account for non-treelike evolution and incomplete lineage sorting (ILS). Unfortunately, this induces a very high computational complexity. Hence, current tools can only analyze small data sets. We present NetRAX, a tool for maximum likelihood inference of phylogenetic networks in the absence of incomplete lineage sorting. Our tool leverages state-of-the-art methods for efficiently computing the phylogenetic likelihood function on trees, and extends them to phylogenetic networks via the notion of “displayed trees”. NetRAX can infer maximum likelihood phylogenetic networks from partitioned multiple sequence alignments and returns the inferred networks in Extended Newick format. On simulated data, our results show a very low relative difference in BIC score and a near-zero unrooted softwired cluster distance to the true, simulated networks. With NetRAX, a network inference on a partitioned alignment with 8,000 sites, 30 taxa, and 3 reticulations completes within a few minutes on a standard laptop. Our implementation is available under the GNU General Public License v3.0 at https://github.com/lutteropp/NetRAX.

    Data from: Quartet-based computations of internode certainty provide robust measures of phylogenetic incongruence

    Incongruence, or topological conflict, is prevalent in genome-scale data sets. Internode Certainty (IC) and related measures were recently introduced to explicitly quantify the level of incongruence of a given internal branch among a set of phylogenetic trees and complement regular branch support measures (e.g., bootstrap, posterior probability) that instead assess the statistical confidence of inference. Since most phylogenomic studies contain data partitions (e.g., genes) with missing taxa and IC scores stem from the frequencies of bipartitions (or splits) on a set of trees, IC score calculation typically requires adjusting the frequencies of bipartitions from these partial gene trees. However, when the proportion of missing taxa is high, the scores yielded by current approaches that adjust bipartition frequencies in partial gene trees differ substantially from each other and tend to be overestimates. To overcome these issues, we developed three new IC measures based on the frequencies of quartets, which naturally apply to both complete and partial trees. Comparison of our new quartet-based measures to previous bipartition-based measures on simulated data shows that: 1) on complete data sets, both quartet-based and bipartition-based measures yield very similar IC scores; 2) IC scores of quartet-based measures on a given data set with and without missing taxa are more similar than the scores of bipartition-based measures; and 3) quartet-based measures are more robust to the absence of phylogenetic signal and errors in phylogenetic inference than bipartition-based measures. Additionally, the analysis of an empirical mammalian phylogenomic data set using our quartet-based measures reveals the presence of substantial levels of incongruence for numerous internal branches. An efficient open-source implementation of these quartet-based measures is freely available in the program QuartetScores (https://github.com/lutteropp/QuartetScores).

    Monotone Simultaneous Embeddings of Upward Planar Digraphs

    We study monotone simultaneous embeddings of upward planar digraphs, which are simultaneous embeddings where the drawing of each digraph is upward planar, and the directions of the upwardness of different graphs can differ. We first consider the special case where each digraph is a directed path. In contrast to the known result that any two directed paths admit a monotone simultaneous embedding, there exist examples of three paths that do not admit such an embedding for any possible choice of directions of monotonicity. We prove that if a monotone simultaneous embedding of three paths exists then it also exists for any possible choice of directions of monotonicity. We provide a polynomial-time algorithm that, given three paths, decides whether a monotone simultaneous embedding exists and, in the case of existence, also constructs such an embedding. On the other hand, we show that already for three paths, any monotone simultaneous embedding might need a grid whose size is exponential in the number of vertices. For more than three paths, we present a polynomial-time algorithm that, given any number of paths and predefined directions of monotonicity, decides whether the paths admit a monotone simultaneous embedding with respect to the given directions, including the construction of a solution if it exists. Further, we show several implications of our results on monotone simultaneous embeddings of general upward planar digraphs. Finally, we discuss complexity issues related to our problems.