TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees
Gene tree correction aims to improve the accuracy of a gene tree by using computational techniques along with a reference tree (and in some cases available sequence data). It is an active area of research when dealing with gene tree heterogeneity due to duplication and loss (GDL). Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to incomplete lineage sorting (ILS, a common problem in eukaryotic phylogenetics) and horizontal gene transfer (HGT, a common problem in bacterial phylogenetics). We introduce TRACTION, a simple polynomial time method that provably finds an optimal solution to the RF-Optimal Tree Refinement and Completion Problem, which seeks a refinement and completion of an input tree t with respect to a given binary tree T so as to minimize the Robinson-Foulds (RF) distance. We present the results of an extensive simulation study evaluating TRACTION within gene tree correction pipelines on 68,000 estimated gene trees, using estimated species trees as reference trees. We explore accuracy under conditions with varying levels of gene tree heterogeneity due to ILS and HGT. We show that TRACTION matches or improves the accuracy of well-established methods from the GDL literature under conditions with HGT and ILS, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. TRACTION is available at https://github.com/pranjalv123/TRACTION-RF and the study datasets are available at https://doi.org/10.13012/B2IDB-1747658_V1
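For readers unfamiliar with the Robinson-Foulds (RF) distance that this abstract optimizes, the following is a minimal sketch of how it can be computed for two trees on the same leaf set. It is not taken from TRACTION. The nested-tuple tree encoding, the string leaf labels, and the function names `leaf_clusters`, `bipartitions` and `rf_distance` are all assumptions made for illustration.

```python
def leaf_clusters(tree, out):
    """Return the leaf set below `tree`; append each internal node's leaf set to `out`."""
    if not isinstance(tree, tuple):
        # A leaf is any non-tuple label (assumed here to be a string).
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_clusters(child, out) for child in tree))
    out.append(leaves)
    return leaves


def bipartitions(tree):
    """Non-trivial leaf bipartitions of `tree`, viewed as an unrooted tree."""
    clusters = []
    all_leaves = leaf_clusters(tree, clusters)
    anchor = min(all_leaves)  # fixed reference leaf used to normalize each split
    splits = set()
    for cluster in clusters:
        # Represent each split by the side that does not contain the anchor.
        side = cluster if anchor not in cluster else all_leaves - cluster
        # Keep only splits with at least two leaves on both sides.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
    return splits


def rf_distance(tree_a, tree_b):
    """Number of bipartitions found in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical four-leaf examples.
t1 = (("A", "B"), ("C", "D"))
t2 = (("A", "C"), ("B", "D"))
t3 = ("A", ("B", ("C", "D")))  # same unrooted topology as t1, rooted differently
```

With these inputs, `rf_distance(t1, t2)` counts the two conflicting internal splits, while `rf_distance(t1, t3)` is zero because rooting does not affect the bipartitions. The sketch assumes both trees carry exactly the same leaves; the refinement and completion steps described in the abstract are not covered here.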
Gene tree correction guided by orthology
Background: Reconciled gene trees yield orthology and paralogy relationships between genes. This information may, however, contradict other information on orthology and paralogy provided by other footprints of evolution, such as conserved synteny. Results: We explore a way to include external information on orthology in the process of gene tree construction. Given an initial gene tree and a set of orthology constraints on pairs of genes or on clades, we give polynomial-time algorithms for producing a modified gene tree that satisfies the set of constraints and is as close as possible to the original one according to the Robinson-Foulds distance. We assess the validity of the modifications we propose by computing the likelihood ratio between initial and modified trees according to sequence alignments on Ensembl trees, showing that the two trees are often statistically equivalent. Availability: Software and data available upon request to the corresponding author
Correcting Gene Trees by Leaf Insertions: Complexity and Approximation
Gene tree correction has recently gained interest in phylogenomics, as it gives insight into the evolution of gene families. Following some recent approaches based on leaf edit operations, we consider a variant of the problem where a gene tree is corrected by inserting leaves with labels in a multiset M. We show that the problem of deciding whether a gene tree can be corrected by inserting leaves with labels in M is NP-complete. Then, we consider an optimization variant of the problem that asks for the correction of a gene tree with leaves labeled by a multiset M′, with M′ ⊇ M, having minimum size. For this optimization variant of the problem, we present a factor-2 approximation algorithm
Assessing the robustness of parsimonious predictions for gene neighborhoods from reconciled phylogenies
The availability of a large number of assembled genomes opens the way to
study the evolution of syntenic character within a phylogenetic context. The
DeCo algorithm, recently introduced by Bérard et al., allows the computation
of parsimonious evolutionary scenarios for gene adjacencies, from pairs of
reconciled gene trees. Following the approach pioneered by Sturmfels and
Pachter, we describe how to modify the DeCo dynamic programming algorithm to
identify classes of cost schemes that generate similar parsimonious
evolutionary scenarios for gene adjacencies, and to assess how robust the
presence or absence of specific ancestral gene adjacencies is to changes in the
cost scheme of evolutionary events. We apply our method to six thousand
mammalian gene families, and show that computing the robustness to changes to
cost schemes provides new and interesting insights on the evolution of gene
adjacencies and the DeCo model. Comment: Accepted, to appear in ISBRA - 11th International Symposium on
Bioinformatics Research and Applications - 2015, Jun 2015, Norfolk, Virginia,
United States
Reconstructing Gene Trees From Fitch's Xenology Relation
Two genes are xenologs in the sense of Fitch if they are separated by at
least one horizontal gene transfer event. Horizontal gene transfer is asymmetric
in the sense that the transferred copy is distinguished from the one that
remains within the ancestral lineage. Hence xenology is more precisely thought
of as a non-symmetric relation: y is xenologous to x if y has been
horizontally transferred at least once since it diverged from the least common
ancestor of x and y. We show that xenology relations are characterized by a
small set of forbidden induced subgraphs on three vertices. Furthermore, each
xenology relation can be derived from a unique least-resolved edge-labeled
phylogenetic tree. We provide a linear-time algorithm for the recognition of
xenology relations and for the construction of its least-resolved edge-labeled
phylogenetic tree. Finally, the fact that being a xenology relation is a
heritable graph property has far-reaching consequences for approximation
problems associated with xenology relations
The inference of gene trees with species trees
Molecular phylogeny has focused mainly on improving models for the
reconstruction of gene trees based on sequence alignments. Yet, most
phylogeneticists seek to reveal the history of species. Although the histories
of genes and species are tightly linked, they are seldom identical, because
genes duplicate, are lost or horizontally transferred, and because alleles can
co-exist in populations for periods that may span several speciation events.
Building models describing the relationship between gene and species trees can
thus improve the reconstruction of gene trees when a species tree is known, and
vice-versa. Several approaches have been proposed to solve the problem in one
direction or the other, but in general neither gene trees nor species trees are
known. Only a few studies have attempted to jointly infer gene trees and
species trees. In this article we review the various models that have been used
to describe the relationship between gene trees and species trees. These models
account for gene duplication and loss, transfer or incomplete lineage sorting.
Some of them consider several types of events together, but none exists
currently that considers the full repertoire of processes that generate gene
trees along the species tree. Simulations as well as empirical studies on
genomic data show that combining gene tree-species tree models with models of
sequence evolution improves gene tree reconstruction. In turn, these better
gene trees provide a better basis for studying genome evolution or
reconstructing ancestral chromosomes and ancestral gene sequences. We predict
that gene tree-species tree methods that can deal with genomic data sets will
be instrumental to advancing our understanding of genomic evolution. Comment: Review article in relation to the "Mathematical and Computational
Evolutionary Biology" conference, Montpellier, 201
Partial Homology Relations - Satisfiability in terms of Di-Cographs
Directed cographs (di-cographs) play a crucial role in the reconstruction of
evolutionary histories of genes based on homology relations which are binary
relations between genes. A variety of methods based on pairwise sequence
comparisons can be used to infer such homology relations (e.g. orthology,
paralogy, xenology). They are satisfiable if the relations can be
explained by an event-labeled gene tree, i.e., they can simultaneously co-exist
in an evolutionary history of the underlying genes. Every gene tree is
equivalently interpreted as a so-called cotree that entirely encodes the
structure of a di-cograph. Thus, satisfiable homology relations must
necessarily form a di-cograph. The inferred homology relations might not cover
each pair of genes and thus, provide only partial knowledge on the full set of
homology relations. Moreover, for particular pairs of genes, it might be known
with a high degree of certainty that they are not orthologs (resp. paralogs,
xenologs) which yields forbidden pairs of genes. Motivated by this observation,
we characterize (partial) satisfiable homology relations with or without
forbidden gene pairs, provide a quadratic-time algorithm for their recognition
and for the computation of a cotree that explains the given relations
Beyond representing orthology relations by trees
Reconstructing the evolutionary past of a family of genes is an important aspect of many genomic studies. To help with this, simple relations on a set of sequences, called orthology relations, may be employed. In addition to being interesting from a practical point of view, they are also attractive from a theoretical perspective in that, for example, a characterization is known for when such a relation is representable by a certain type of phylogenetic tree. For an orthology relation inferred from real biological data, however, it is generally too much to hope that it satisfies that characterization. Rather than trying to correct the data in some way, which has its own drawbacks, we propose to represent an orthology relation in terms of a structure more general than a phylogenetic tree, called a phylogenetic network. To compute such a network in the form of a level-1 representation for the relation, we formalize an orthology relation in terms of the novel concept of a symbolic 3-dissimilarity, which is motivated by the biological concept of a "cluster of orthologous groups", or COG for short. For such maps, which assign symbols rather than real values to elements, we introduce the novel Network-Popping algorithm, which has several attractive properties. In addition, we characterize an orthology relation on some set that has a level-1 representation in terms of eight natural properties of the relation, as well as in terms of level-1 representations of orthology relations on certain subsets of that set