26 research outputs found

    Generalizations of the genomic rank distance to indels

    MOTIVATION: The rank distance model represents genome rearrangements in multi-chromosomal genomes as matrix operations, which allows the reconstruction of parsimonious histories of evolution by rearrangements. We seek to generalize this model by allowing for genomes with different gene content, to accommodate a broader range of biological contexts. We approach this generalization by using a matrix representation of genomes. This leads to simple distance formulas and sorting algorithms for genomes with different gene contents, but without duplications. RESULTS: We generalize the rank distance to genomes with different gene content in two different ways. The first approach adds insertions, deletions and the substitution of a single extremity to the basic operations. We show how to efficiently compute this distance. To avoid genomes with incomplete markers, our alternative distance, the rank-indel distance, only uses insertions and deletions of entire chromosomes. We construct phylogenetic trees with our distances and the DCJ-Indel distance for simulated data and real prokaryotic genomes, and compare them against reference trees. For simulated data, our distances outperform the DCJ-Indel distance using the Quartet metric as a baseline. This suggests that rank distances are more robust for comparing distantly related species. For real prokaryotic genomes, all rearrangement-based distances yield phylogenetic trees that are topologically distant from the reference (65% similarity with the Quartet metric), but are able to cluster related species within their respective clades and distinguish the Shigella strains as the farthest relatives of the Escherichia coli strains, a feature not seen in the reference tree. AVAILABILITY AND IMPLEMENTATION: Code and instructions are available at https://github.com/meidanis-lab/rank-indel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
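The matrix view mentioned in this abstract can be illustrated for the basic case of two genomes with the same gene content, where the rank distance is usually defined as the rank of the difference of the two genome matrices. The sketch below is not the authors' code and does not cover the indel generalizations; the extremity numbering (gene g has tail 2g and head 2g+1) and the function names `genome_matrix`, `matrix_rank` and `rank_distance` are assumptions made for illustration.

```python
from fractions import Fraction


def genome_matrix(n_extremities, adjacencies):
    """Build the symmetric 0/1 matrix of a genome (illustrative encoding).

    Entry (i, j) is 1 when extremities i and j are adjacent. An extremity
    that appears in no adjacency is treated as a telomere and gets a 1 on
    the diagonal.
    """
    m = [[0] * n_extremities for _ in range(n_extremities)]
    paired = set()
    for i, j in adjacencies:
        m[i][j] = m[j][i] = 1
        paired.update((i, j))
    for i in range(n_extremities):
        if i not in paired:
            m[i][i] = 1
    return m


def matrix_rank(rows):
    """Rank of a matrix by Gaussian elimination over exact fractions."""
    a = [[Fraction(x) for x in row] for row in rows]
    n_rows = len(a)
    n_cols = len(a[0]) if a else 0
    rank = 0
    for col in range(n_cols):
        pivot = next((r for r in range(rank, n_rows) if a[r][col] != 0), None)
        if pivot is None:
            continue
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, n_rows):
            if a[r][col] != 0:
                factor = a[r][col] / a[rank][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[rank])]
        rank += 1
    return rank


def rank_distance(n_extremities, adj_a, adj_b):
    """Rank of A - B for two genomes over the same extremities."""
    a = genome_matrix(n_extremities, adj_a)
    b = genome_matrix(n_extremities, adj_b)
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    return matrix_rank(diff)


# Two genes, extremities 0=1t, 1=1h, 2=2t, 3=2h.
linear_12 = [(1, 2)]       # chromosome (1, 2): 1h joined to 2t
linear_1_minus2 = [(1, 3)]  # chromosome (1, -2): 1h joined to 2h
print(rank_distance(4, linear_12, linear_1_minus2))
```

In this encoding an inversion changes the rank by 2, while a single cut or join changes it by 1, which is why the two example genomes above are at distance 2 rather than 1.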

    Weighted Minimum-Length Rearrangement Scenarios

    We present the first known model of genome rearrangement with an arbitrary real-valued weight function on the rearrangements. It is based on the dominant model for the mathematical and algorithmic study of genome rearrangement, Double Cut and Join (DCJ). Our objective function is the sum or product of the weights of the DCJs in an evolutionary scenario, and the function can be minimized or maximized. If the likelihood of observing an independent DCJ were estimated based on biological conditions, for example, then this objective function could be the likelihood of observing the independent DCJs together in a scenario. We present an O(n^4)-time dynamic programming algorithm solving the Minimum Cost Parsimonious Scenario (MCPS) problem for co-tailed genomes with n genes (or syntenic blocks). Combining this with our previous work on MCPS yields a polynomial-time algorithm for general genomes. The key theoretical contribution is a novel link between the parsimonious DCJ (or 2-break) scenarios and quadrangulations of a regular polygon. To demonstrate that our algorithm is fast enough to treat biological data, we run it on syntenic blocks constructed for Human paired with Chimpanzee, Gibbon, Mouse, and Chicken. We argue that the Human and Gibbon pair is a particularly interesting model for the study of weighted genome rearrangements.
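The objective function described in this abstract (sum or product of per-DCJ weights, minimized or maximized) can be made concrete with a small sketch. It only scores scenarios that are already enumerated as lists of weights; it is not the O(n^4) dynamic program from the paper, and the names `scenario_score` and `best_scenario` as well as the example weights are hypothetical.

```python
import math


def scenario_score(weights, combine="sum"):
    """Combine the weights of the DCJs in one scenario."""
    if combine == "sum":
        return sum(weights)
    if combine == "product":
        return math.prod(weights)
    raise ValueError("combine must be 'sum' or 'product'")


def best_scenario(scenarios, combine="sum", maximize=False):
    """Pick the scenario with the smallest (or largest) combined weight."""
    pick = max if maximize else min
    return pick(scenarios, key=lambda ws: scenario_score(ws, combine))


# Two hypothetical parsimonious scenarios of equal length; each number is
# the weight (for example, an estimated likelihood) of one DCJ.
scenarios = [[0.5, 0.2], [0.4, 0.4]]

# Treating weights as costs: minimize their sum.
print(best_scenario(scenarios, combine="sum"))
# Treating weights as independent likelihoods: maximize their product.
print(best_scenario(scenarios, combine="product", maximize=True))
```

The two criteria need not agree: in this toy input the first scenario has the smaller sum while the second has the larger product.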

    Models and Algorithms for Comparative Genomics

    The deluge of sequenced whole-genome data has motivated the study of comparative genomics, which provides global views on genome evolution and also offers practical solutions for deciphering the functional roles of components of genomes. A fundamental computational problem in whole-genome comparison is to infer the most likely large-scale events (rearrangements and content-modifying events) that the given genomes underwent during their evolutionary history. Based on the principle of parsimony, such inference is usually formulated as so-called edit distance problems (for two genomes) or median problems (for multiple genomes), i.e., computing the minimum number of certain types of large-scale events that can explain the differences between the given genomes. In this dissertation, we develop novel algorithms for edit distance problems and median problems and apply them to analyze and annotate biological datasets. For pairwise whole-genome comparison, we study the most challenging case of edit distance problems, in which the given genomes contain duplicate genes. We propose several exact algorithms and approximation algorithms under various combinations of large-scale events. Specifically, we design the first exact algorithm to compute the edit distance under the DCJ (double-cut-and-join) model, and the first exact algorithm to compute the edit distance under a model including DCJ operations and segmental duplications. We devise a (1.5 + ε)-approximation algorithm to compute the edit distance under a model including DCJ operations, insertions, and deletions. We also propose a very fast exact algorithm to compute the exemplar breakpoint distance. For multiple whole-genome comparison, we study the median problem under the DCJ model. We design a polynomial-time algorithm using a network flow formulation to compute the so-called adequate subgraphs, a central phase in computing the median. We also prove that an existing upper bound on the median distance is tight. The above algorithms determine the correspondence between functional elements (for instance, genes) across genomes, and thus can be used to systematically infer functional relationships and annotate genomes. For example, we apply our methods to infer orthologs and in-paralogs between a pair of genomes, a key step in analyzing the functions of protein-coding genes. On biological whole-genome datasets, our methods run very fast, scale up to whole genomes, and achieve very high accuracy.
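As background for the edit distance problems in this abstract, the sketch below computes the DCJ distance in the easy case that the dissertation goes beyond: two genomes made only of circular chromosomes, with identical gene content and no duplicate genes. In that case the standard formula is the number of genes minus the number of cycles in the adjacency graph. The signed-integer encoding and the function names `adjacency_map` and `dcj_distance_circular` are assumptions for illustration; the duplicate-gene algorithms described above are not reproduced here.

```python
def adjacency_map(genome):
    """Map each gene extremity to its neighbour.

    A genome is a list of circular chromosomes; each chromosome is a list
    of signed integers, where a negative sign means reversed orientation.
    """
    partner = {}
    for chrom in genome:
        for k, gene in enumerate(chrom):
            nxt = chrom[(k + 1) % len(chrom)]
            left = (abs(gene), "h" if gene > 0 else "t")   # exit side of gene
            right = (abs(nxt), "t" if nxt > 0 else "h")    # entry side of next
            partner[left] = right
            partner[right] = left
    return partner


def dcj_distance_circular(genome_a, genome_b):
    """DCJ distance n - c for circular genomes without duplicates."""
    pa, pb = adjacency_map(genome_a), adjacency_map(genome_b)
    if pa.keys() != pb.keys():
        raise ValueError("genomes must have the same gene content")
    seen = set()
    cycles = 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Follow an adjacency of A, then an adjacency of B, until the
        # alternating cycle closes.
        while x not in seen:
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    n_genes = len(pa) // 2
    return n_genes - cycles


# One inversion of gene 2 separates these two circular genomes.
print(dcj_distance_circular([[1, 2, 3]], [[1, -2, 3]]))
```

Once duplicate genes are allowed, the correspondence between gene copies is no longer given, which is what makes the cases studied in the dissertation hard.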

    Algorithms for reconstruction of chromosomal structures


    Évolution des génomes par mutations locales et globales : une approche d’alignement [Genome evolution by local and global mutations: an alignment approach]

    During evolution, genomes accumulate mutations that may affect the genome at different levels, ranging from a single base to the overall gene content. Global mutations affecting gene content and organization are mainly duplications, losses and rearrangements. By comparing gene orders, it is possible to infer the most frequent events, the gene content of the ancestral genomes and the evolutionary histories that led to the observed gene orders. In this thesis, we are interested in developing new algorithmic methods, based on an alignment approach, for comparing two genomes or a set of genomes represented as gene orders and related through a phylogenetic tree, taking global mutations into account. We study the theoretical complexity and the approximability of several variants of the problem of aligning two gene orders by duplications and losses. We then review the existing exact exponential-time algorithms and develop efficient heuristics for these problems. First, we propose DLAlign, a quadratic-time heuristic for the problem of aligning two gene orders by duplications and losses. Then, we present OrthoAlign, an extension of DLAlign that accounts for rearrangements and substitutions in addition to duplications and losses. We also study the phylogenetic alignment problem. First, we adapt the OrthoAlign heuristic to infer ancestral genomes at the internal nodes of a given phylogenetic tree. Finally, we propose MultiOrthoAlign, a more robust heuristic, based on the median problem, for inferring the ancestral genomes and evolutionary histories of a set of extant genomes labeling the leaves of a phylogenetic tree.