102 research outputs found

    Improved gene tree error correction in the presence of horizontal gene transfer

    Motivation: The accurate inference of gene trees is a necessary step in many evolutionary studies. Although the problem of accurate gene tree inference has received considerable attention, most existing methods are only applicable to gene families unaffected by horizontal gene transfer. As a result, the accurate inference of gene trees affected by horizontal gene transfer remains a largely unaddressed problem. Results: In this study, we introduce a new and highly effective method for gene tree error correction in the presence of horizontal gene transfer. Our method efficiently models horizontal gene transfers, gene duplications and losses, and uses a statistical hypothesis testing framework [Shimodaira–Hasegawa (SH) test] to balance sequence likelihood with topological information from a known species tree. Using a thorough simulation study, we show that existing phylogenetic methods yield inaccurate gene trees when applied to horizontally transferred gene families and that our method dramatically improves gene tree accuracy. We apply our method to a dataset of 11 cyanobacterial species and demonstrate the large impact of gene tree accuracy on downstream evolutionary analyses. Availability and implementation: An implementation of our method is available at http://compbio.mit.edu/treefix-dtl/ Funding: National Science Foundation (U.S.) (CAREER Award 0644282); National Institutes of Health (U.S.) (RC2 HG005639); National Science Foundation (U.S.), Assembling the Tree of Life (Program) (0936234); University of Connecticu

    Gene Family Histories: Theory and Algorithms

    Detailed gene family histories and reconciliations with species trees are a prerequisite for studying associations between genetic and phenotypic innovations. Even though the true evolutionary scenarios are usually unknown, they impose certain constraints on the mathematical structure of data obtained from simple yes/no questions in pairwise comparisons of gene sequences. Recent advances in this field have led to the development of methods for reconstructing (aspects of) the scenarios on the basis of such relation data, which can most naturally be represented by graphs on the set of considered genes. We provide here novel characterizations of best match graphs (BMGs), which capture the notion of (reciprocal) best hits based on sequence similarities. BMGs provide the basis for the detection of orthologous genes (genes that diverged after a speciation event). There are two main sources of error in pipelines for orthology inference based on BMGs. First, measurement errors in the estimation of best matches from sequence similarity in general lead to violations of the characteristic properties of BMGs. The second issue concerns the reconstruction of the orthology relation from a BMG. We show how to correct estimated BMGs to mathematically valid ones and how much information about orthologs is contained in BMGs. We then discuss implicit methods for horizontal gene transfer (HGT) inference that focus on pairs of genes that have diverged only after the divergence of the two species in which the genes reside. This situation defines the edge set of an undirected graph, the later-divergence-time (LDT) graph. We explore the mathematical structure of LDT graphs and show how much information about all HGT events is contained in such LDT graphs.
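As a rough illustration of the "(reciprocal) best hits based on sequence similarities" mentioned in this abstract, the sketch below estimates best hits from a table of pairwise similarity scores. It is not the authors' method: the function names, the `species_of` mapping, and the symmetric score table are assumptions made for the example, and real BMGs are defined on evolutionary relatedness rather than on raw scores.

```python
from collections import defaultdict


def best_hits(species_of, similarity):
    """Map each gene to its top-scoring gene(s) in every other species.

    species_of: dict gene -> species label (hypothetical input format).
    similarity: dict (gene, gene) -> score, treated as symmetric.
    """
    hits = defaultdict(set)
    for x in species_of:
        by_species = defaultdict(list)
        for y in species_of:
            if species_of[y] == species_of[x]:
                continue  # best hits are only sought in other species
            score = similarity.get((x, y), similarity.get((y, x)))
            if score is not None:
                by_species[species_of[y]].append((score, y))
        for candidates in by_species.values():
            top = max(score for score, _ in candidates)
            # Ties are kept: several genes may share the top score.
            hits[x].update(y for score, y in candidates if score == top)
    return dict(hits)


def reciprocal_best_hits(species_of, similarity):
    """Return the unordered gene pairs that are best hits of each other."""
    hits = best_hits(species_of, similarity)
    return {
        frozenset((x, y))
        for x in hits
        for y in hits[x]
        if x in hits.get(y, set())
    }


if __name__ == "__main__":
    species_of = {"a1": "A", "a2": "A", "b1": "B"}
    similarity = {("a1", "b1"): 0.9, ("a2", "b1"): 0.5}
    print(reciprocal_best_hits(species_of, similarity))
```

In the toy data, `b1` is the best hit of both `a1` and `a2`, but only `a1` is the best hit of `b1`, so only the pair (`a1`, `b1`) is reciprocal. This asymmetry is why best hits are naturally stored as a directed graph.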

    SimPhy: Phylogenomic Simulation of Gene, Locus, and Species Trees

    We present a fast and flexible software package, SimPhy, for the simulation of multiple gene families evolving under incomplete lineage sorting, gene duplication and loss, horizontal gene transfer (all three potentially leading to species tree/gene tree discordance) and gene conversion. SimPhy implements a hierarchical phylogenetic model in which the evolution of species, locus, and gene trees is governed by global and local parameters (e.g., genome-wide, species-specific, locus-specific) that can be fixed or sampled from a priori statistical distributions. SimPhy also incorporates comprehensive models of substitution rate variation among lineages (uncorrelated relaxed clocks) and the capability of simulating partitioned nucleotide, codon, and protein multilocus sequence alignments under a plethora of substitution models using the program INDELible. We validate SimPhy's output using theoretical expectations and other programs, and show that it scales extremely well with complex models and/or large trees, being an order of magnitude faster than the most similar program (DLCoal-Sim). In addition, we demonstrate how SimPhy can be useful to understand interactions among different evolutionary processes, conducting a simulation study to characterize the systematic overestimation of the duplication time when using standard reconciliation methods. SimPhy is available at https://github.com/adamallo/SimPhy, where users can find the source code, precompiled executables, a detailed manual and example cases.

    Building alternative consensus trees and supertrees using k-means and Robinson and Foulds (RF) distance

    Each gene has its own evolutionary history which can substantially differ from the evolutionary histories of other genes. For example, some individual genes or operons can be affected by specific horizontal gene transfer and recombination events. Thus, the evolutionary history of each gene should be represented by its own phylogenetic tree, which may display different evolutionary patterns from the species tree that accounts for the main patterns of vertical descent. The output of traditional consensus tree or supertree inference methods is a unique consensus tree or supertree. We describe a new efficient method for inferring multiple alternative consensus trees and supertrees to best represent the most important evolutionary patterns of a given set of gene phylogenies. We show how an adapted version of the popular k-means clustering algorithm, based on some interesting properties of the Robinson and Foulds distance, can be used to partition a given set of trees into one (for homogeneous data) or multiple (for heterogeneous data) cluster(s) of trees. Moreover, we adapt the popular Caliński-Harabasz, Silhouette, Ball and Hall, and Gap cluster validity indices to tree clustering with k-means. Special attention is given to the relevant but very challenging problem of inferring alternative supertrees. The use of the Euclidean property of the objective function of the method makes it faster than the existing tree clustering techniques, and thus well suited to analyzing large evolutionary datasets. We apply the new method to discover alternative supertrees characterizing the main patterns of evolution of SARS-CoV-2 and genetically related betacoronaviruses.
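The Robinson and Foulds (RF) distance named in this abstract counts the groupings found in one tree but not in the other. The following is a minimal sketch of the rooted (clade-based) variant only, assuming trees are given as nested tuples of leaf names. That input format and the function names are hypothetical choices for the example; the k-means adaptation and the validity indices described in the abstract are not reproduced here.

```python
def clades(tree):
    """Return (leaf set, set of non-trivial clades) for a nested-tuple tree.

    A tree is either a leaf name (str) or a tuple of subtrees; this
    representation is an assumption made for the example.
    """
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)  # record the clade below this internal node
        return leaves

    all_leaves = walk(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return all_leaves, found


def rf_distance(tree_a, tree_b):
    """Rooted RF distance: number of clades present in exactly one tree."""
    leaves_a, clades_a = clades(tree_a)
    leaves_b, clades_b = clades(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    return len(clades_a ^ clades_b)


if __name__ == "__main__":
    t1 = ((("A", "B"), "C"), "D")
    t2 = ((("A", "C"), "B"), "D")
    print(rf_distance(t1, t2))
```

Here the two trees share the clade {A, B, C} and differ in {A, B} versus {A, C}, so the distance is 2. Unrooted comparisons would use bipartitions of the leaf set instead of clades.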

    Bacterial genomes: a history of gene transfers, recombination and cladogenesis (Les génomes bactériens, une histoire de transferts de gènes, de recombinaison et de cladogénèse)

    In bacterial genomes, frequent horizontal gene transfers (HGT) introduce genomic novelties that can promote the diversification of bacterial populations. In contrast, homologous recombination (HR) within populations homogenizes their genotypes, enforcing their cohesion. These processes of genetic exchange, and their patterns of occurrence among and within lineages, must have a great impact on bacterial cladogenesis. Beyond the pattern of exchanges actually occurring between bacteria, the traces of HR and HGT we observe in their genomes reflect which events were fixed throughout their history. This fixation process can be biased regarding the nature of genes or alleles that were introduced. Notably, natural selection can drive the fixation of transferred genes that bring new ecological adaptations. In addition, some mechanical biases in the recombination process itself may lead to the fixation of non-adaptive alleles. We aimed to characterize such adaptive and non-adaptive processes that are shaping bacterial genomes. To this end, several aspects of genome evolution, such as variations of their gene repertoires, of their architecture and of their nucleotide composition, were examined in the light of their history of transfer and recombination.

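    Detection of horizontal gene transfers: models and algorithms applied to the evolution of species and languages (Détection des transferts horizontaux de gènes : modèles et algorithmes appliqués à l'évolution des espèces et des langues)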

    Horizontal gene transfer (HGT, or lateral gene transfer) is a natural evolutionary mechanism consisting of the direct transfer of genetic material from one species to another. The possibility that horizontal gene transfer may play a key role in biological evolution is a fundamental change, arising in recent years, in our perception of the general aspects of evolutionary biology. For example, bacteria and viruses possess sophisticated mechanisms for acquiring new genes by horizontal transfer, allowing them to adapt and evolve appropriately in their environment. Until very recently, methods for detecting this mechanism relied essentially on sequence analysis and were very rarely automated. The evolution of organisms that have undergone HGT cannot be represented by acyclic phylogenetic trees; the appropriate representation is a network. In this thesis, we describe a new model of this evolutionary mechanism, based on the study of topological and metric differences between a species tree and a gene tree inferred for the same set of species. The resulting methods were applied to real datasets in which hypotheses of lateral gene transfer were plausible. Monte Carlo simulations were carried out to evaluate the quality of the results against existing methods. We also present a generalization of the complete horizontal transfer model that can detect partial transfers and identify mosaic genes. In this latter model, only part of the gene is assumed to have been transferred. Finally, we present an application of these new methods to modeling word borrowings that occurred during the evolution of Indo-European languages.
    Author keywords: phylogenetic tree, reticulate network, horizontal gene transfer, least-squares criterion, Robinson and Foulds distance, bipartition dissimilarity, biolinguistics