23 research outputs found

    Efficient Non-Binary Gene Tree Resolution with Weighted Reconciliation Cost

    Polytomies in gene trees are multifurcated nodes corresponding to unresolved parts of the tree, usually due to insufficient differentiation between sequences of homologous gene copies. Apart from gene sequences, other information, such as that contained in the species tree, can be used to resolve such intricate parts of a gene tree. The problem of resolving a multifurcated tree has been considered by many authors, the objective function often being the number of duplications and losses reflected by the reconciliation of the resolved gene tree with the species tree. Here, we present PolytomySolver, an algorithm accounting for a more general model allowing different costs for duplications and losses per species. The time complexity of this algorithm is linear for the unit cost and quadratic for the general cost, which outperforms the best known solutions so far by a linear factor. We show on simulated trees that the gain in theoretical complexity has a real practical impact on running times.
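
    The objective function mentioned in this abstract, a duplication and loss count weighted per species, can be illustrated on an already resolved (binary) gene tree. The sketch below is not PolytomySolver itself: it only scores one binary tree using the classical lowest-common-ancestor mapping. The nested-tuple tree encoding, the function names, and the default cost of 1.0 are assumptions made for illustration.

```python
def build_species_index(tree, parent=None, depth=0, index=None):
    """Map every node of a nested-tuple species tree to (parent, depth)."""
    if index is None:
        index = {}
    index[tree] = (parent, depth)
    if isinstance(tree, tuple):
        for child in tree:
            build_species_index(child, tree, depth + 1, index)
    return index


def lca(index, a, b):
    """Lowest common ancestor: repeatedly lift whichever node is deeper."""
    while a != b:
        if index[a][1] >= index[b][1]:
            a = index[a][0]
        else:
            b = index[b][0]
    return a


def sibling(index, node):
    """Other child of node's parent (the species tree is assumed binary)."""
    left, right = index[node][0]
    return right if node == left else left


def reconcile(gene, index, dup_cost, loss_cost):
    """Return (species mapping, weighted cost) for a binary gene tree.

    Gene leaves are labelled with the name of the species they belong to.
    """
    if not isinstance(gene, tuple):
        return gene, 0.0
    (ma, ca), (mb, cb) = (reconcile(c, index, dup_cost, loss_cost) for c in gene)
    m = lca(index, ma, mb)
    is_dup = m in (ma, mb)  # a child maps to the same species as its parent
    cost = ca + cb + (dup_cost.get(m, 1.0) if is_dup else 0.0)
    for node in (ma, mb):
        # After a duplication both copies are expected in m itself;
        # after a speciation they are expected in the two children of m.
        # Every species-tree branch skipped on the way down implies one loss
        # in the sibling lineage.
        while (node != m) if is_dup else (index[node][0] != m):
            cost += loss_cost.get(sibling(index, node), 1.0)
            node = index[node][0]
    return m, cost


def reconciliation_cost(gene_tree, species_tree, dup_cost=None, loss_cost=None):
    """Weighted duplication-loss cost; costs not listed default to 1.0."""
    index = build_species_index(species_tree)
    _, cost = reconcile(gene_tree, index, dup_cost or {}, loss_cost or {})
    return cost


species = (("a", "b"), "c")
gene = (("a", "c"), "b")
print(reconciliation_cost(gene, species))                         # 4.0: 1 duplication + 3 losses
print(reconciliation_cost(gene, species, loss_cost={"c": 3.0}))   # 6.0: the loss in c now costs 3
```

    Costs are keyed by species-tree node, so a per-species cost for an internal species would use the corresponding tuple as the dictionary key.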

    Méthodes et algorithmes pour l’amélioration de l’inférence de l’histoire évolutive des génomes (Methods and algorithms for improving the inference of the evolutionary history of genomes)

    Gene phylogenies provide an ideal framework for the comparative study of genomes. Not only do they incorporate the evolution of species through speciation, they also capture the expansion and contraction of gene families through gene gains and losses. Determining the order and nature of these events amounts to inferring the evolutionary history of gene families, and is a prerequisite for many analyses in comparative genomics. Indeed, it is required to determine orthology relationships between genes efficiently; these relationships matter for predicting protein structure and function and for phylogenetic analyses, to name only these applications. Methods for inferring the evolutionary histories of gene families assume that the phylogenies considered are free of errors. These gene phylogenies, often reconstructed from amino acid or nucleotide sequences, are however only an estimate of the true gene tree and are subject to errors from varied but well-documented sources. To guarantee the accuracy of the inferred histories, one must therefore ensure that gene trees are free of errors. In this thesis, we study this problem from two angles. The first part of the thesis concerns the identification of deviations from the genetic code, one of the causes of annotation errors that then propagate into phylogenies. To this end, we develop a methodology for inferring deviations from the standard genetic code through the analysis of coding sequences and tRNAs. This methodology is centered on an algorithm for predicting codon reassignments, called CoreTracker. We first show the effectiveness of our method, then use it to demonstrate the evolution of the genetic code in the mitochondrial genomes of green algae.
    The second part of the thesis concerns the development of efficient methods for the correction and construction of phylogenetic gene trees. We present two methods that exploit information on species evolution. The first, ProfileNJ, is deterministic and very fast. It corrects gene trees by exclusively targeting subtrees with weak statistical support. Its application to the gene families of Ensembl Compara shows a clear improvement in tree quality compared with the trees proposed by the database. The second, GATC, uses a genetic algorithm and treats the problem as a multi-objective optimization of gene tree topology, given constraints relating to the evolution of gene families by sequence mutation and by gene gain and loss. We show that such an approach is not only effective, but also appropriate for constructing sets of reference trees.

    Gene trees offer a proper framework for comparative genomics. Not only do they provide information about species evolution through speciation events, but they also capture gene family expansion and contraction by gene gains and losses. They are thus used to infer the evolutionary history of gene families and accurately predict the orthologous relationship between genes, on which several biological analyses rely. Methods for inferring gene family evolution explicitly assume that gene trees are known without errors. However, standard phylogenetic methods for tree construction based on sequence data are well documented as error-prone. Gene trees constructed using these methods will usually introduce biases during the inference of gene family histories. In this thesis, we present new methods aiming to improve the quality of phylogenetic gene trees and thereby the accuracy of underlying evolutionary histories of their corresponding gene families.
    We start by providing a framework to study genetic code deviations, one possible cause of annotation errors that could then spread to the phylogeny reconstruction. Our framework is based on analysing coding sequences and tRNAs to predict codon reassignments. We first show its efficiency, then apply it to green plant mitochondrial genomes. The second part of this thesis focuses on the development of efficient species tree aware methods for gene tree construction. We present ProfileNJ, a fast and deterministic correction method that targets weakly supported branches of a gene tree. When applied to the gene families of the Ensembl Compara database, ProfileNJ produces an arguably better set of gene trees compared to the ones available in Ensembl Compara. We later use a different strategy, based on a genetic algorithm, allowing both construction and correction of gene trees. This second method, called GATC, treats the problem as a multi-objective optimisation problem in which we are looking for the set of gene trees optimal for both sequence data and information of gene family evolution through gene gain and loss. We show that this approach yields accurate trees and is suitable for the construction of reference datasets to benchmark other methods.

    Reconstructing the History of Syntenies Through Super-Reconciliation

    Classical gene and species tree reconciliation, used to infer the history of gene gain and loss explaining the evolution of gene families, assumes an independent evolution for each family. While this assumption is reasonable for genes that are far apart in the genome, it is clearly not suited for genes grouped in syntenic blocks, which are more plausibly the result of a concerted evolution. Here, we introduce the Super-Reconciliation model, which extends the traditional Duplication-Loss model to the reconciliation of a set of trees, accounting for segmental duplications and losses. From a complexity point of view, we show that the associated decision problem is NP-hard. We then give an exact exponential-time algorithm for this problem, assess its time efficiency on simulated datasets, and give a proof of concept on the opioid receptor genes.

    GATC: a genetic algorithm for gene tree construction under the Duplication-Transfer-Loss model of evolution

    Background: Several methods have been developed for the accurate reconstruction of gene trees. Some of them use reconciliation with a species tree to correct, a posteriori, errors in gene trees inferred from multiple sequence alignments. Unfortunately, the best fit to sequence information can be lost during this process. Results: We describe GATC, a new algorithm for reconstructing a binary gene tree with branch lengths. GATC returns optimal solutions according to a measure combining both tree likelihood (according to sequence evolution) and a reconciliation score under the Duplication-Transfer-Loss (DTL) model. It can be used either to construct a gene tree from scratch or to correct trees inferred by existing reconstruction methods, making it highly flexible with respect to input data types. The method is based on a genetic algorithm acting on a population of trees at each step. It substantially increases the efficiency of the phylogeny space exploration, reducing the risk of falling into local minima, at a reasonable computational time. We have applied GATC to a dataset of simulated cyanobacterial phylogenies, as well as to an empirical dataset of three reference gene families, and showed that it is able to improve gene tree reconstructions compared with current state-of-the-art algorithms. Conclusion: The proposed algorithm is able to accurately reconstruct gene trees and is highly suitable for the construction of reference trees. Our results also highlight the efficiency of multi-objective optimization algorithms for the gene tree reconstruction problem. GATC is available on Github at: https://github.com/UdeM-LBIT/GATC
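
    To make the multi-objective idea concrete, the sketch below keeps only the candidate trees that no other candidate beats on both criteria at once (a Pareto front). It is an illustration under assumptions, not GATC's actual selection scheme, which the abstract does not detail; the candidate names and scores are invented.

```python
def pareto_front(candidates):
    """Return the names of non-dominated candidates.

    Each candidate is (name, log_likelihood, reconciliation_cost).
    Higher log-likelihood is better; lower reconciliation cost is better.
    """
    front = []
    for name, loglik, cost in candidates:
        dominated = any(
            other_loglik >= loglik
            and other_cost <= cost
            and (other_loglik > loglik or other_cost < cost)
            for _, other_loglik, other_cost in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical scores, for illustration only.
trees = [
    ("t1", -1200.0, 14.0),
    ("t2", -1195.0, 20.0),
    ("t3", -1210.0, 15.0),
    ("t4", -1198.0, 14.0),
]
print(pareto_front(trees))  # ['t2', 't4']
```

    Here t4 dominates t1 and t3 (at least as good on both scores, strictly better on one), while t2 and t4 trade likelihood against reconciliation cost and are both kept.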

    Additional file 1 of GATC: a genetic algorithm for gene tree construction under the Duplication-Transfer-Loss model of evolution

    Contains supplementary information on the effect of operator rates (Figure S1) and errors in the species tree (Figure S2) on reconstruction accuracy. It also contains the original reference tree of the Poyeye family (Figure S3) and the four alternative trees obtained by GATC (Figures S4-S7). (PDF, 315 kb)

    Real-World Molecular Out-Of-Distribution: Specification and Investigation

    This study presents a rigorous framework for investigating Molecular Out-Of-Distribution (MOOD) generalization in drug discovery. The concept of MOOD is first clarified through a problem specification that demonstrates how the covariate shifts encountered during real-world deployment can be characterized by the distribution of sample distances to the training set. We find that these shifts can cause performance to drop by up to 60% and uncertainty calibration by up to 40%. This leads us to propose a splitting protocol that aims to close the gap between deployment and testing. Then, using this protocol, a thorough investigation is conducted to assess the impact of model design, model selection and dataset characteristics on MOOD performance and uncertainty calibration. We find that appropriate representations and algorithms with built-in uncertainty estimation are crucial to improve performance and uncertainty calibration. This study sets itself apart by its exhaustiveness and opens an exciting avenue to benchmark meaningful, algorithmic progress in molecular scoring. All related code can be found on Github at https://github.com/valence-labs/mood-experiments
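
    The idea of characterizing a shift by the distribution of sample distances to the training set can be sketched as follows. The abstract does not say which molecular representation or metric is used, so the feature sets and the Jaccard distance below are assumptions chosen only to keep the example self-contained.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of features (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_to_training_set(sample, training_set):
    """Distance from a sample to its nearest neighbour in the training set."""
    return min(jaccard_distance(sample, t) for t in training_set)


# Hypothetical feature sets standing in for molecular fingerprints.
train = [{1, 2, 3, 4}, {2, 3, 5}]
test = [{1, 2, 3}, {7, 8, 9}]

distances = [distance_to_training_set(s, train) for s in test]
print(distances)  # [0.25, 1.0]
```

    Comparing the distribution of these distances between a held-out test split and the molecules seen at deployment gives one view of how far the two settings are apart.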

    Efficient Gene Tree Correction Guided by Genome Evolution.

    Gene trees inferred solely from multiple alignments of homologous sequences often contain weakly supported and uncertain branches. Information for their full resolution may lie in the dependency between gene families and their genomic context. Integrative methods, using species tree information in addition to sequence information, often rely on a computationally intensive tree space search, which forecloses an application to large genomic databases. We propose a new method, called ProfileNJ, that takes a gene tree with statistical supports on its branches, and corrects its weakly supported parts by using a combination of information from a species tree and a distance matrix. Its low running time enabled us to use it on the whole Ensembl Compara database, for which we propose an alternative, arguably more plausible set of gene trees. This allowed us to perform a genome-wide analysis of duplication and loss patterns on the history of 63 eukaryote species, and predict ancestral gene content and order for all ancestors along the phylogeny. A web interface called RefineTree, including ProfileNJ as well as other gene tree correction methods, which we also test on the Ensembl gene families, is available at: http://www-ens.iro.umontreal.ca/~adbit/polytomysolver.html. The code of ProfileNJ as well as the set of gene trees corrected by ProfileNJ from Ensembl Compara version 73 families are also made available.
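
    One common way to isolate the weakly supported parts of a gene tree is to contract every branch whose support falls below a threshold, which leaves polytomies to be resolved afterwards. The sketch below shows only that contraction step and assumes it as a preprocessing choice; the tree encoding, the function name, and the 0.5 threshold are illustrative, and the resolution step using the species tree and distance matrix is not shown.

```python
def collapse_weak(node, threshold):
    """Contract branches whose support is below the threshold.

    A node is either a leaf name (str) or a (support, children) pair, where
    support refers to the branch above the node. The root's own support is
    ignored because there is no branch above it.
    """
    if isinstance(node, str):
        return node
    support, children = node
    new_children = []
    for child in children:
        child = collapse_weak(child, threshold)
        if not isinstance(child, str) and child[0] < threshold:
            # Weak branch: attach its children directly to the current node.
            new_children.extend(child[1])
        else:
            new_children.append(child)
    return (support, new_children)


# Hypothetical gene tree with supports on internal branches.
tree = (1.0, [(0.4, ["g1", "g2"]), (0.9, ["g3", (0.3, ["g4", "g5"])])])
print(collapse_weak(tree, 0.5))
# (1.0, ['g1', 'g2', (0.9, ['g3', 'g4', 'g5'])])
```

    The two weak branches (supports 0.4 and 0.3) disappear, producing a polytomy at the root and another under the well-supported 0.9 branch.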