237 research outputs found

    Edit distance metrics for measuring dissimilarity between labeled gene trees

    Full text link
    Les arbres phylogénétiques sont des instruments de biologie évolutive offrant de formidables moyens d'étude pour la génomique comparative. Ils fournissent des moyens de représenter des mécanismes permettant de modéliser les relations de parenté entre les espèces ou les membres de familles de gènes en fonction de la diversité taxonomique, ainsi que des observations et des renseignements sur l'histoire évolutive, la structure et la variation des processus biologiques. Cependant, les méthodes traditionnelles d'inférence phylogénétique ont la réputation d'être sensibles aux erreurs. Il est donc indispensable de comparer les arbres phylogénétiques et de les analyser pour obtenir la meilleure interprétation des données biologiques qu'ils peuvent fournir. Nous commençons par aborder les travaux connexes existants pour déduire, comparer et analyser les arbres phylogénétiques, en évaluant leurs bonnes caractéristiques ainsi que leurs défauts, et discuter des pistes d'améliorations futures. La deuxième partie de cette thèse se concentre sur le développement de mesures efficaces et précises pour analyser et comparer des paires d'arbres génétiques avec des nœuds internes étiquetés. Nous montrons que notre extension de la métrique bien connue de Robinson-Foulds donne lieu à une bonne métrique pour la comparaison d'arbres génétiques étiquetés sous divers modèles évolutifs, et qui peuvent impliquer divers événements évolutifs.Phylogenetic trees are instruments of evolutionary biology offering great insight for comparative genomics. They provide mechanisms to model the kinship relations between species or members of gene families as a function of taxonomic diversity. They also provide evidence and insights into the evolutionary history, structure, and variation of biological processes. However, traditional phylogenetic inference methods have the reputation to be prone to errors. Therefore, comparing and analysing phylogenetic trees is indispensable for obtaining the best interpretation of the biological information they can provide. We start by assessing existing related work to infer, compare, and analyse phylogenetic trees, evaluating their advantageous traits and flaws, and discussing avenues for future improvements. The second part of this thesis focuses on the development of efficient and accurate metrics to analyse and compare pairs of gene trees with labeled internal nodes. We show that our attempt in extending the popular Robinson-Foulds metric is useful for the preliminary analysis and comparison of labeled gene trees under various evolutionary models that may involve various evolutionary events

    The generalized Robinson-Foulds metric

    Get PDF
    The Robinson-Foulds (RF) metric is arguably the most widely used measure of phylogenetic tree similarity, despite its well-known shortcomings: For example, moving a single taxon in a tree can result in a tree that has maximum distance to the original one; but the two trees are identical if we remove the single taxon. To this end, we propose a natural extension of the RF metric that does not simply count identical clades but instead, also takes similar clades into consideration. In contrast to previous approaches, our model requires the matching between clades to respect the structure of the two trees, a property that the classical RF metric exhibits, too. We show that computing this generalized RF metric is, unfortunately, NP-hard. We then present a simple Integer Linear Program for its computation, and evaluate it by an all-against-all comparison of 100 trees from a benchmark data set. We find that matchings that respect the tree structure differ significantly from those that do not, underlining the importance of this natural condition.Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013

    A generalized Robinson-Foulds distance for labeled trees

    Get PDF
    Background: The Robinson-Foulds (RF) distance is a well-established measure between phylogenetic trees. Despite a lack of biological justification, it has the advantages of being a proper metric and being computable in linear time. For phylogenetic applications involving genes, however, a crucial aspect of the trees ignored by the RF metric is the type of the branching event (e.g. speciation, duplication, transfer, etc). Results: We extend RF to trees with labeled internal nodes by including a node flip operation, alongside edge contractions and extensions. We explore properties of this extended RF distance in the case of a binary labeling. In particular, we show that contrary to the unlabeled case, an optimal edit path may require contracting “good” edges, i.e. edges shared between the two trees. Conclusions: We provide a 2-approximation algorithm which is shown to perform well empirically. Looking ahead, computing distances between labeled trees opens up a variety of new algorithmic directions. Implementation and simulations available at https://github.com/DessimozLab/pylabeledrf

    Synthesizing species trees from gene trees using the parameterized and graph-theoretic approaches

    Get PDF
    Gene trees describe how parts of the species have evolved over time, and it is assumed that gene trees have evolved along the branches of the species tree. However, some of gene trees are often discordant with the corresponding species tree due to the complicated evolution history of genes. To overcome this obstacle, median problems have emerged as a major tool for synthesizing species trees by reconciling discordance in a given collection of gene trees. Given a collection of gene trees and a cost function, the median problem seeks a tree, called median tree, that minimizes the overall cost to the gene trees. Median tree problems are typically NP-hard, and there is an increased interest in making such median tree problems available for large-scale species tree construction. In this thesis work, we first show that the gene duplication median tree problem satisfied the weaker version of the Pareto property and propose a parameterized algorithm to solve the gene duplication median tree problem. Second, we design two efficient methods to handle the issues of applying the parameterized algorithm to unrooted gene trees which are sampled from the different species. Third, we introduce the graph-theoretic formulation of the Robinson-Foulds median tree problem and a new tree edit operation. Fourth, we propose a new metric between two phylogenetic trees and examine the statistical properties of the metric. Finally, we propose a new clustering criteria in a bipartite network and propose a new NP-hard problem and its ILP formulation

    A generalized Robinson-Foulds distance for labeled trees.

    Get PDF
    The Robinson-Foulds (RF) distance is a well-established measure between phylogenetic trees. Despite a lack of biological justification, it has the advantages of being a proper metric and being computable in linear time. For phylogenetic applications involving genes, however, a crucial aspect of the trees ignored by the RF metric is the type of the branching event (e.g. speciation, duplication, transfer, etc). We extend RF to trees with labeled internal nodes by including a node flip operation, alongside edge contractions and extensions. We explore properties of this extended RF distance in the case of a binary labeling. In particular, we show that contrary to the unlabeled case, an optimal edit path may require contracting "good" edges, i.e. edges shared between the two trees. We provide a 2-approximation algorithm which is shown to perform well empirically. Looking ahead, computing distances between labeled trees opens up a variety of new algorithmic directions.Implementation and simulations available at https://github.com/DessimozLab/pylabeledrf

    The Bourque Distances for Mutation Trees of Cancers

    Get PDF
    Mutation trees are rooted trees of arbitrary node degree in which each node is labeled with a mutation set. These trees, also referred to as clonal trees, are used in computational oncology to represent the mutational history of tumours. Classical tree metrics such as the popular Robinson - Foulds distance are of limited use for the comparison of mutation trees. One reason is that mutation trees inferred with different methods or for different patients often contain different sets of mutation labels. Here, we generalize the Robinson - Foulds distance into a set of distance metrics called Bourque distances for comparing mutation trees. A connection between the Robinson - Foulds distance and the nearest neighbor interchange distance is also presented
    corecore