34 research outputs found

    Partial Homology Relations - Satisfiability in terms of Di-Cographs

    Directed cographs (di-cographs) play a crucial role in the reconstruction of evolutionary histories of genes based on homology relations, which are binary relations between genes. A variety of methods based on pairwise sequence comparisons can be used to infer such homology relations (e.g. orthology, paralogy, xenology). They are satisfiable if the relations can be explained by an event-labeled gene tree, i.e., they can simultaneously co-exist in an evolutionary history of the underlying genes. Every gene tree is equivalently interpreted as a so-called cotree that entirely encodes the structure of a di-cograph. Thus, satisfiable homology relations must necessarily form a di-cograph. The inferred homology relations might not cover each pair of genes and thus provide only partial knowledge of the full set of homology relations. Moreover, for particular pairs of genes, it might be known with a high degree of certainty that they are not orthologs (resp. paralogs, xenologs), which yields forbidden pairs of genes. Motivated by this observation, we characterize (partial) satisfiable homology relations with or without forbidden gene pairs, and provide a quadratic-time algorithm for their recognition and for the computation of a cotree that explains the given relations.
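To illustrate what "explained by an event-labeled gene tree" means, here is a minimal sketch that reads the relation between two genes off the event label of their last common ancestor. The toy tree, the labels "S" (speciation) and "D" (duplication), and the function names are assumptions made for illustration. The sketch covers only the symmetric relations and is not the recognition algorithm from the abstract.

```python
from itertools import combinations

# Hypothetical toy tree: an inner node is (event, children), a leaf is a gene name.
# Assumed event labels: "S" = speciation, "D" = duplication.
tree = ("S", [("D", ["a1", "a2"]), "b1"])

def leaves(node):
    """Return the gene names below a node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [g for child in children for g in leaves(child)]

def relations(node, out=None):
    """Map each unordered gene pair to the event label of its last common ancestor."""
    if out is None:
        out = {}
    if isinstance(node, str):
        return out
    event, children = node
    child_leaves = [leaves(c) for c in children]
    # Genes in different child subtrees have this node as their last common ancestor.
    for left, right in combinations(child_leaves, 2):
        for x in left:
            for y in right:
                out[frozenset((x, y))] = event
    for child in children:
        relations(child, out)
    return out

rel = relations(tree)
```

Under this convention, pairs labeled "S" would be read as orthologs and pairs labeled "D" as paralogs.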

    Reconstructing Gene Trees From Fitch's Xenology Relation

    Two genes are xenologs in the sense of Fitch if they are separated by at least one horizontal gene transfer event. Horizontal gene transfer is asymmetric in the sense that the transferred copy is distinguished from the one that remains within the ancestral lineage. Hence xenology is more precisely thought of as a non-symmetric relation: y is xenologous to x if y has been horizontally transferred at least once since it diverged from the least common ancestor of x and y. We show that xenology relations are characterized by a small set of forbidden induced subgraphs on three vertices. Furthermore, each xenology relation can be derived from a unique least-resolved edge-labeled phylogenetic tree. We provide a linear-time algorithm for the recognition of xenology relations and for the construction of its least-resolved edge-labeled phylogenetic tree. Finally, the fact that being a xenology relation is a heritable graph property has far-reaching consequences for approximation problems associated with xenology relations.
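The non-symmetric definition above can be sketched directly on an edge-labeled tree. The tree encoding (a child-to-parent map with a transfer flag per edge), the gene names, and the arc convention (an arc (x, y) is taken to mean "y is xenologous to x") are all assumptions for this example. The sketch derives the relation from a given tree and does not implement the linear-time recognition algorithm.

```python
# Hypothetical edge-labeled tree: child -> (parent, is_transfer_edge).
parent = {
    "u": ("root", False),
    "c": ("root", False),
    "a": ("u", False),
    "b": ("u", True),  # the lineage leading to b was horizontally transferred
}
genes = ["a", "b", "c"]

def path_to_root(v):
    """Return v followed by all of its ancestors up to the root."""
    path = [v]
    while v in parent:
        v = parent[v][0]
        path.append(v)
    return path

def xenologous_to(y, x):
    """True if a transfer edge lies on the path from lca(x, y) down to y."""
    ancestors_x = set(path_to_root(x))
    v = y
    while v not in ancestors_x:  # climb from y until lca(x, y) is reached
        p, transfer = parent[v]
        if transfer:
            return True
        v = p
    return False

# Assumed convention: arc (x, y) means "y is xenologous to x".
fitch = {(x, y) for x in genes for y in genes if x != y and xenologous_to(y, x)}
```

In this toy tree only b has a transfer on its lineage, so b is xenologous to both a and c, while neither a nor c is xenologous to b.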

    Algorithmes de construction et correction d'arbres de gènes par la réconciliation

    Genes encode the biological functions of all living organisms and are the basic molecular units of heredity. In order to explain the diversity of species that can be observed today, it is essential to understand how genes evolve. To do this, the past has to be recreated by inferring their phylogeny, i.e. a gene tree depicting the parental relationships between the coding regions of living beings. Traditional phylogenetic inference methods have been developed primarily to construct species trees and are based solely on DNA sequences. Genes, however, are rich in information, and only a few known reconstruction methods make use of their specific properties. In particular, the history of a gene family in terms of duplications and losses, obtained by the reconciliation of a gene tree with a species tree, may allow us to detect weaknesses in a tree and improve it. In this thesis, reconciliation is applied to the construction and correction of gene trees from three different angles: 1) We address the problem of resolving a non-binary gene tree. In particular, we present a linear-time algorithm that resolves a polytomy based on reconciliation. 2) We propose a new gene tree correction approach based on orthology and paralogy relations. Polynomial-time algorithms are presented for the following problems: modifying a gene tree so that it contains a given set of orthologous genes, and validating a set of partial orthology and paralogy relations. 3) We show how reconciliation can be used to "combine" multiple gene trees. Specifically, we study the problem of choosing a gene supertree based on its reconciliation cost.

    The matroid structure of representative triple sets and triple-closure computation

    The closure cl(R) of a consistent set R of triples (rooted binary trees on three leaves) provides essential information about tree-like relations that are shown by any supertree that displays all triples in R. In this contribution, we are concerned with representative triple sets, that is, subsets R' of R with cl(R') = cl(R). In this case, R' still contains all information on the tree structure implied by R, although R' might be significantly smaller. We show that representative triple sets that are minimal w.r.t. inclusion form the basis of a matroid. This in turn implies that minimal representative triple sets also have minimum cardinality. In particular, the matroid structure can be used to show that minimum representative triple sets can be computed in polynomial time with a simple greedy approach. For a given triple set R that "identifies" a tree, we provide an exact value for the cardinality of its minimum representative triple sets. In addition, we utilize the latter results to provide a novel and efficient method to compute the closure cl(R) of a consistent triple set R that improves the time complexity O(|R| |L_R|^4) of the currently fastest known method proposed by Bryant and Steel (1995). In particular, if a minimum representative triple set for R is given, it can be shown that the time complexity to compute cl(R) can be improved by a factor up to |R| |L_R|. As it turns out, collections of quartets (unrooted binary trees on four leaves) do not, in general, provide a matroid structure.
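The notion of a consistent triple set can be made concrete with the classic BUILD procedure of Aho et al., which either returns a tree displaying all triples or reports that none exists. This is a minimal sketch of that standard consistency test, not the closure method described in the abstract. The triple encoding ((a, b), c) for ab|c and the nested-tuple tree output are assumptions chosen for this example.

```python
def build(triples, leaves):
    """BUILD sketch: return a nested-tuple tree displaying all triples, or None.

    A triple ((a, b), c) encodes ab|c. All triple leaves are assumed to be in `leaves`.
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))

    # Union-find over the leaves: every triple ab|c joins a and b (the Aho graph).
    comp = {v: v for v in leaves}

    def find(v):
        while comp[v] != v:
            comp[v] = comp[comp[v]]
            v = comp[v]
        return v

    for (a, b), _c in triples:
        comp[find(a)] = find(b)

    groups = {}
    for v in leaves:
        groups.setdefault(find(v), set()).add(v)
    if len(groups) == 1:
        return None  # connected Aho graph: no root split, so the triples conflict

    subtrees = []
    for group in groups.values():
        # Keep only the triples whose three leaves all lie in this component.
        sub = [t for t in triples if {t[0][0], t[0][1], t[1]} <= group]
        subtree = build(sub, group)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)

consistent = build([(("a", "b"), "c")], "abc")
conflicting = build([(("a", "b"), "c"), (("a", "c"), "b")], "abc")
```

Here `consistent` is a tree that groups a and b apart from c, while `conflicting` is None because ab|c and ac|b cannot be displayed by the same tree.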

    Improving Comparative Genomic Studies: Definitions and Algorithms for Syntenic Blocks

    Comparative genomics aims to understand the structure of genomes and the function of various genomic fragments by transferring knowledge gained from well-studied genomes to the new object of study. Rapid and inexpensive high-throughput sequencing is making more and more complete genome sequences available. Despite this significant scientific advance, we still lack good models for the evolution of the genomic architecture; analyzing these genomes therefore presents formidable challenges. Early approaches used pairwise comparisons, but today researchers are attempting to leverage the larger potential of multiway comparisons. Current approaches are based on the identification of so-called syntenic blocks: blocks of sequence that exhibit conserved features across the genomes under study. Syntenic blocks are in many ways analogous to genes; in many cases, the markers used to construct them are genes. Like genes, they can exist in multiple copies, in which case we could define analogs of orthology and paralogy. However, whereas genes are studied at the sequence level, syntenic blocks are too large for that level of detail: it is their structure and function as a unit that makes them valuable for genome-level comparative studies. Syntenic blocks are required for complex computations to scale to the billions of nucleotides present in many genomes; they enable comparisons across broad ranges of genomes because they filter out much of the individual variability; they highlight candidate regions for in-depth studies; and they facilitate whole-genome comparisons through visualization tools. The identification of such blocks is the first step in comparative studies, yet its effect on final results has not been well studied, nor has any formalization of syntenic blocks been proposed. Tools for the identification of syntenic blocks yield quite different results, thereby preventing a systematic assessment of the next steps in an analysis. Current tools do not include measurable quality objectives and thus cannot be benchmarked against themselves. Comparisons among tools have also been neglected: the few results that are given use superficial measures unrelated to quality or consistency. In this thesis we address two major challenges and present: (i) a theoretical model as well as an experimental basis for comparing syntenic blocks, and thus also for improving the design of tools for the identification of syntenic blocks; (ii) a prototype model that serves as a basis for implementing effective synteny mining tools. We offer an overview of the milestones in the literature on the development of concepts and tools related to synteny, and we illustrate the application of the model and the measures by applying them to syntenic blocks produced by different contemporary tools on publicly available data sets. We have taken the first step towards a formal approach to the construction of syntenic blocks by developing a simple quality criterion based on sound evolutionary principles. Our experiments demonstrate widely divergent results among these tools, calling into question the robustness of the basic approach in comparative genomics. Our findings highlight the need for a well-founded, systematic approach to the decomposition of genomes into syntenic blocks and motivate the second part of the work: starting from the proposed model, we extend the concept with data-dependent features and constraints in order to test the concept on cases of interest.

    Best match graphs

    Best match graphs arise naturally as the first processing intermediate in algorithms for orthology detection. Let T be a phylogenetic (gene) tree and σ an assignment of leaves of T to species. The best match graph (G,σ) is a digraph that contains an arc from x to y if the genes x and y reside in different species and y is one of possibly many (evolutionary) closest relatives of x compared to all other genes contained in the species σ(y). Here, we characterize best match graphs and show that it can be decided in cubic time and quadratic space whether (G,σ) is derived from a tree in this manner. If the answer is affirmative, there is a unique least resolved tree that explains (G,σ), which can also be constructed in cubic time.
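The definition above can be sketched as a brute-force construction of the best match graph from a given tree. "Closest relative" is interpreted here as the gene whose last common ancestor with x lies deepest in the tree. The toy tree, the species assignment, and all function names are assumptions for illustration. This is the forward direction only (tree to graph), not the cubic-time recognition algorithm from the abstract.

```python
# Hypothetical gene tree as a child -> parent map, with species assignment sigma.
parent = {"v": "root", "b2": "root", "a1": "v", "b1": "v"}
sigma = {"a1": "A", "b1": "B", "b2": "B"}

def ancestors(v):
    """Return v followed by all of its ancestors up to the root."""
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path

def lca_depth(x, y):
    """Depth (distance from the root) of the last common ancestor of x and y."""
    anc_y = set(ancestors(y))
    for v in ancestors(x):
        if v in anc_y:
            return len(ancestors(v)) - 1

def best_match_graph():
    """Arc (x, y): y is a closest relative of x among the genes of species sigma[y]."""
    arcs = set()
    for x in sigma:
        for species in set(sigma.values()) - {sigma[x]}:
            candidates = [y for y in sigma if sigma[y] == species]
            deepest = max(lca_depth(x, y) for y in candidates)
            arcs |= {(x, y) for y in candidates if lca_depth(x, y) == deepest}
    return arcs

bmg = best_match_graph()
```

In this toy example a1 has b1 as its only best match in species B, because their last common ancestor v lies below the root, whereas b2 still has a1 as a best match since a1 is the only gene of species A.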

    Reconstruction of time-consistent species trees

    Background: The history of gene families, which are equivalent to event-labeled gene trees, can to some extent be reconstructed from empirically estimated evolutionary event-relations containing pairs of orthologous, paralogous or xenologous genes. The question then arises as to whether inferred event-labeled gene trees are "biologically feasible", which is the case if one can find a species tree with which the gene tree can be reconciled in a time-consistent way. Results: In this contribution, we consider event-labeled gene trees that contain speciations, duplications as well as horizontal gene transfer (HGT), and we assume that the species tree is unknown. Although many problems become NP-hard as soon as HGT and time-consistency are involved, we show, in contrast, that the problem of finding a time-consistent species tree for a given event-labeled gene tree can be solved in polynomial time. We provide a cubic-time algorithm to decide whether a "time-consistent" species tree for a given event-labeled gene tree exists and, in the affirmative case, to construct the species tree within the same time complexity.

    Évolution de l’architecture des génomes : modélisation et reconstruction phylogénétique

    Genomes evolve through processes that modify their content and organization at different scales, ranging from the substitution, insertion or deletion of a single nucleotide to the duplication, loss or transfer of a gene and to large-scale chromosomal rearrangements. Extant genomes are the result of a combination of many such processes, which makes it difficult to reconstruct the overall picture of genome evolution. As a result, most models and methods focus on one scale and use only one kind of data, such as gene orders or sequence alignments. Most phylogenetic reconstruction methods focus on the evolution of sequences. Recently, some of these methods have been extended to integrate gene family evolution. Chromosomal rearrangements have also been extensively studied, leading to the development of many models for the evolution of the architecture of genomes. These two ways of modeling genome evolution have had little exchange so far, mainly because of computational issues. In this thesis, I present a new model of evolution for the architecture of genomes that accounts for the evolution of gene families. With this model, one can reconstruct the evolutionary history of gene adjacencies and gene order, accounting for events that modify the gene content of genomes (duplications and losses of genes) and for events that modify the architecture of genomes (chromosomal rearrangements). Integrating these two types of information in a single model yields more accurate evolutionary histories. Moreover, we show that reconstructing ancestral gene orders can provide feedback on the quality of gene trees, thus paving the way for an integrative model and reconstruction method.
    The evolution of genomes can be observed at several scales, each scale revealing different evolutionary processes. At the scale of DNA sequences, these are insertions, deletions and substitutions of nucleotides. At the level of the genes that make up genomes, these are duplications, losses and horizontal transfers of genes. At a larger scale, chromosomal rearrangements modify the arrangement of genes on chromosomes. Reconstructing the evolutionary history of genomes therefore requires understanding and modeling all the processes at work, which remains out of reach. Instead, modeling efforts have explored two main directions. On the one hand, phylogenetic reconstruction methods have focused on the evolution of sequences, some of them integrating the evolution of gene families. On the other hand, chromosomal rearrangements have been very widely studied, giving rise to many models of the evolution of genome architecture. These two modeling approaches had rarely met until recently. During my thesis, I developed a model of the evolution of genome architecture that takes into account the evolution of genes and sequences. This model makes possible a probabilistic reconstruction of the evolutionary history of adjacencies and of the gene order of ancestral genomes, accounting both for events that modify the gene content of genomes (gene duplications and losses) and for events that modify genome architecture (chromosomal rearrangements). Integrating phylogenetic information into the reconstruction of gene orders makes it possible to reconstruct more complete evolutionary histories. Conversely, the reconstruction of ancestral gene orders can also provide complementary information to phylogeny and can be used as a criterion for assessing the quality of gene trees, paving the way for an integrative model and reconstruction.