5 research outputs found

    Évolution des génomes par mutations locales et globales : une approche d’alignement

    During evolution, genomes accumulate mutations that may affect anything from a single nucleotide to several genes. Changes in the number and organization of genes are caused by global mutations, mainly duplications, losses and rearrangements. By comparing gene orders, it is possible to infer the most frequent evolutionary events, the gene content of ancestral genomes and the evolutionary histories that led to the observed gene orders. In this thesis, we develop new alignment-based algorithmic methods for comparing two genomes, or a set of genomes related through a phylogenetic tree, represented as gene orders and taking global mutations into account. We first study the theoretical complexity and the approximability of several variants of the problem of aligning two gene orders by duplications and losses. We then review the existing exact exponential-time algorithms and develop efficient heuristics for these problems. First, we propose DLAlign, a quadratic-time heuristic for aligning two gene orders by duplications and losses. Next, we present OrthoAlign, an extension of DLAlign that also accounts for rearrangements and substitutions. We also study the phylogenetic alignment problem. First, we adapt OrthoAlign to infer ancestral genomes at the internal nodes of a given phylogenetic tree. Finally, we propose MultiOrthoAlign, a more robust median-based heuristic for inferring the ancestral genomes and evolutionary histories of a set of extant genomes labeling the leaves of a phylogenetic tree.

    Gene family-free genome comparison

    Dörr D. Gene family-free genome comparison. Bielefeld: Universität Bielefeld; 2016. Computational comparative genomics offers valuable insights into the shared and individual evolutionary histories of living and extinct species and expands our understanding of cellular processes in living cells. Comparing genomes means identifying differences that originated from mutational modifications in their evolutionary past. In studying genome evolution, one differentiates between point mutations, genome rearrangements, and content modifications. Point mutations affect one or few consecutive nucleotide bases in the DNA sequence, whereas genome rearrangements operate on larger genomic regions, thereby altering the order and composition of genes in chromosomal sequences. Lastly, content modifications are a result of gene family evolution that causes gene duplications and losses. Genome rearrangement studies commonly assume that evolutionary relationships between all pairs of genes are resolved. Based on the biological concept of homology, the set of genes can be partitioned into gene families. All genes in a gene family are homologous, i.e., they evolved from the same ancestral sequence. Homology information is generally not given, hence gene families are commonly predicted computationally on the basis of sequence similarity or higher order features of their gene products. These predictions are often unreliable, leading to errors in subsequent genome rearrangement studies. In an attempt to avoid errors resulting from incorrect or incomplete gene family assignments, we develop new methods for genome rearrangement studies that do not require prior knowledge of gene family assignments of genes. Our approach, called gene family-free genome comparison, is innovative in that we account for differences between genes caused by point mutations while studying their order and composition in chromosomes. 
In lieu of gene family assignments, our proposed methods rely on pairwise similarities between genes. In practice, we obtain gene similarities from the conservation of their protein sequences. Two genes that are located next to each other on a chromosome are said to be adjacent; their adjoining extremities form an adjacency. The number of conserved adjacencies, i.e., those adjacencies that are common to two genomes, gives rise to a measure of gene order-based genome similarity. If the gene content of both genomes is identical, the number of conserved adjacencies is the dual measure of the well-known breakpoint distance. We study the problem of computing the number of conserved adjacencies in a family-free setting, which relies on pairwise similarities between genes. We analyze its computational complexity and develop exact and heuristic algorithms for its solution in pairwise comparisons. We then advance to the problem of reconstructing ancestral sequences. Given three genomes, we study the problem of constructing a fourth genome, called the median, which maximizes a family-free, pairwise measure of conserved adjacencies between the median and each of the three given genomes. Our model is a family-free generalization of the well-studied mixed multichromosomal breakpoint median. We show that this problem is NP-hard and devise an exact algorithm for its solution. Gene orders become increasingly scrambled over longer evolutionary periods of time. In distant genomes, gene order analyses based on identifying pairs of conserved adjacencies might no longer be informative. Yet, relaxed constraints of gene order conservation are still able to capture weaker, but nonetheless existing, remnants of common ancestral gene order, which leads to the problem of identifying syntenic blocks in two or more genomes. 
Knowing the evolutionary relationships between genes, one can assign a unique character to each gene family and represent a chromosome by a string drawn from the alphabet of gene family characters. Two intervals from two strings are called common intervals if the sets of characters within these intervals are identical. We extend this concept to indeterminate strings, a class of strings that have a non-empty set of characters at every position. We propose several models of common intervals in indeterminate strings and devise efficient algorithms for their corresponding discovery problems. Subsequently, we use the concept of common intervals in indeterminate strings to identify syntenic regions in a gene family-free setting. We evaluate all our proposed models and algorithms on simulated or biological datasets and assess their performance and applicability in gene family-free genome analyses.
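
    The notion of conserved adjacencies described in this abstract can be illustrated with a minimal Python sketch. It covers only the classical, family-based case (each gene occurs exactly once in both genomes, a single linear chromosome, genes written as signed integers), not the family-free measure developed in the thesis. The function names and the simplified breakpoint count, which ignores telomeres, are assumptions made for illustration.

```python
def adjacencies(chromosome):
    """Return the set of adjacencies of one linear chromosome.

    `chromosome` is a list of signed integers (hypothetical encoding):
    +g is gene g read tail-to-head, -g is the same gene reversed.
    An adjacency is the unordered pair of touching gene extremities.
    """
    adj = set()
    for left, right in zip(chromosome, chromosome[1:]):
        # Extremity at which we leave the left gene.
        left_end = (abs(left), "h" if left > 0 else "t")
        # Extremity at which we enter the right gene.
        right_start = (abs(right), "t" if right > 0 else "h")
        adj.add(frozenset((left_end, right_start)))
    return adj


def conserved_adjacencies(g1, g2):
    """Number of adjacencies common to both gene orders."""
    return len(adjacencies(g1) & adjacencies(g2))


def breakpoints(g1, g2):
    """Adjacencies of g1 that are not conserved in g2 (telomeres ignored)."""
    return len(adjacencies(g1)) - conserved_adjacencies(g1, g2)


if __name__ == "__main__":
    original = [1, 2, 3, 4, 5]
    inverted = [1, -3, -2, 4, 5]  # segment (2, 3) reversed
    print(conserved_adjacencies(original, inverted))
    print(breakpoints(original, inverted))
```

    With identical gene content, the two counts always sum to the number of adjacencies of the first genome, which is the duality mentioned in the abstract.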

    Enhance the understanding of whole-genome evolution by designing, accelerating and parallelizing phylogenetic algorithms

    Advances in sequencing technology have increased the speed and reduced the cost of sequencing biological data. Making biological sense of this genomic data is a major challenge for both the algorithm design and high-performance computing communities. Bioinformatics faces many open questions, such as how new functional genes arise, why genes are organized into chromosomes, how species are connected through the evolutionary tree of life, and why gene arrangements are subject to change. Phylogenetic analysis has become essential to research on the evolutionary tree of life. It helps us trace the history of species and the relationships between genes or genomes over millions of years. One of the foundations of phylogenetic construction is the computation of distances between genomes. Because rearrangement events produce complicated combinatorial patterns, distance computation remains an active topic that belongs as much to mathematics as to biology. The problem is especially hard when the two input genomes have unequal gene content (with insertions/deletions and duplications). In this thesis, we discuss our contributions to distance estimation for gene order data with unequal content. Finding the median of three genomes is the key step in building the most parsimonious phylogenetic trees from genome rearrangement data. For genomes with unequal content, to the best of our knowledge, no algorithm exists for finding the median. In this thesis, we contribute to median computation in two respects. 1) On the algorithm engineering side, we harness streaming graph analytics methods to implement an exact DCJ median algorithm that runs as fast as the heuristic algorithm and helps construct better phylogenetic trees. 
2) On the algorithmic side, we formulate the problem of finding the median of genomes with unequal gene content, which leads to the design and implementation of an efficient median algorithm based on the Lin-Kernighan heuristic. Inferring the phylogeny (evolutionary history) of a given set of species is the ultimate goal once the distance and median models are chosen. For more than a decade, biologists and computer scientists have studied how to infer phylogenies by measuring genome rearrangement events in gene order data. While evolution is not an inherently parsimonious process, maximum parsimony (MP) phylogenetic analysis has been widely applied to phylogeny inference to study the evolutionary patterns of genome rearrangements. MP phylogenetic analysis of genome rearrangements generally involves two problems: first, given a set of modern genomes, how to compute the topology of the corresponding phylogenetic tree; second, given the topology of a model tree, how to infer the gene orders of the ancestral species. Assembling an MP phylogenetic tree constructor involves multiple NP-hard problems and, unfortunately, they are layered one on top of another, so solving one NP-hard problem requires solving multiple NP-hard sub-problems. For phylogenetic tree construction from genomes with unequal content, there are three layers of NP-hard problems. In this thesis, we mainly discuss our contributions to the design and implementation of the software package DCJUC (Phylogeny Inference using DCJ model to cope with Unequal Content Genomes), which helps achieve both of these goals. Beyond the biological problems, another concern is how to use parallel computing to accelerate algorithms so that they can handle huge data sets, such as high-resolution gene order data. 
First, all of the methods for tackling these phylogenetic problems are based on branch-and-bound algorithms, which are quite irregular and ill-suited to parallel computing. To parallelize these algorithms, we need to improve the locality of memory access and the load balancing so that each thread can be used to its full potential. Second, a revolution is taking place in computing with the availability of commodity graphics processors such as Nvidia GPUs and many-core processors such as the Cray-XMT or the Intel Xeon Phi coprocessor with 60 cores. These architectures offer a new way to achieve high performance at much lower cost. However, these machines are not easy to program, and scientific code is hard to tune well on them. We explore the potential of these architectures to accelerate branch-and-bound-based phylogenetic algorithms.
    Ph.D
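
    As background for the distance computation discussed in this abstract, the following Python sketch computes the DCJ distance in the simplest setting only: two genomes with the same gene content, each gene occurring once, and all chromosomes circular. In that setting the standard formula is d = n - c, where n is the number of genes and c is the number of cycles in the adjacency graph. It does not reproduce DCJUC or the thesis's handling of unequal content; the function names and the signed-integer encoding are assumptions made for illustration.

```python
def neighbor_map(genome):
    """Map each gene extremity to the extremity it is adjacent to.

    `genome` is a list of circular chromosomes, each a list of signed
    integers (hypothetical encoding): +g is read tail-to-head, -g reversed.
    """
    partner = {}
    for chrom in genome:
        k = len(chrom)
        for i in range(k):
            left, right = chrom[i], chrom[(i + 1) % k]  # wrap around: circular
            a = (abs(left), "h" if left > 0 else "t")
            b = (abs(right), "t" if right > 0 else "h")
            partner[a] = b
            partner[b] = a
    return partner


def dcj_distance_circular(genome_a, genome_b):
    """DCJ distance n - c for circular genomes with identical gene content."""
    pa, pb = neighbor_map(genome_a), neighbor_map(genome_b)
    if pa.keys() != pb.keys():
        raise ValueError("this sketch requires identical gene content")
    seen = set()
    cycles = 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Alternate between an adjacency of A and an adjacency of B
        # until the walk returns to its starting extremity.
        while x not in seen:
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    n_genes = len(pa) // 2
    return n_genes - cycles


if __name__ == "__main__":
    a = [[1, 2, 3, 4]]
    b = [[1, -3, -2, 4]]  # one inversion away from a
    print(dcj_distance_circular(a, b))
```

    Handling linear chromosomes would add paths to the adjacency graph, and unequal gene content requires a different model altogether, which is the harder case this thesis addresses.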