
    MLGO: phylogeny reconstruction and ancestral inference from gene-order data

    Background: The rapid accumulation of whole-genome data has renewed interest in using gene-order data for phylogenetic analysis and ancestral reconstruction. Current software and web servers typically do not support duplication and loss events along with rearrangements. Results: MLGO (Maximum Likelihood for Gene-Order Analysis) is a web tool for the reconstruction of phylogeny and/or ancestral genomes from gene-order data. MLGO is based on likelihood computation and shows advantages over existing methods in terms of accuracy, scalability and flexibility. Conclusions: To the best of our knowledge, it is the first web tool for analysis of large-scale genomic changes including not only rearrangements but also gene insertions, deletions and duplications. The web tool is available from http://www.geneorder.org/server.php

    Phylogeny Analysis from Gene-Order Data with Massive Duplications

    Background: Gene order changes, under rearrangements, insertions, deletions and duplications, have been used as a new type of data source for phylogenetic reconstruction. Because these changes are rare compared to sequence mutations, they allow the inference of phylogeny further back in evolutionary time. There exist many computational methods for the reconstruction of gene-order phylogenies, including widely used maximum parsimony methods and maximum likelihood methods. However, both kinds of methods face challenges in handling large genomes with many duplicated genes, especially in the presence of whole genome duplication. Methods: In this paper, we present three simple yet powerful methods based on maximum-likelihood (ML) approaches that encode multiplicities of both gene adjacency and gene content information for phylogenetic reconstruction. Results: Extensive experiments on simulated data sets show that our new method achieves the most accurate phylogenies compared to existing approaches. We also evaluate our method on real whole-genome data from eleven mammals. The package is publicly accessible at http://www.geneorder.org. Conclusions: Our new encoding schemes successfully incorporate the multiplicity information of gene adjacencies and gene content into an ML framework, and show promising results in reconstructing phylogenies for whole-genome data in the presence of massive duplications.
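
    The idea of encoding multiplicities of gene adjacencies and gene content can be illustrated with a minimal sketch. It is not the paper's actual encoding scheme, which the abstract does not specify: it simply counts how often each gene and each adjacency occurs in a signed gene order and lines the counts up as one vector per genome. The function names `feature_counts` and `encode`, the single-linear-chromosome assumption, and the toy genomes are all hypothetical.

```python
from collections import Counter


def feature_counts(genome):
    """Count gene copies and adjacency copies in one signed gene order.

    Assumption: the genome is a single linear chromosome given as a list of
    signed integers (sign = strand); telomeres are ignored.
    """
    counts = Counter()
    for gene in genome:
        counts[("gene", abs(gene))] += 1
    for a, b in zip(genome, genome[1:]):
        # (a, b) read on the opposite strand is (-b, -a); keep one canonical form.
        counts[("adj",) + min((a, b), (-b, -a))] += 1
    return counts


def encode(genomes):
    """Return (features, rows): one multiplicity vector per genome."""
    per_genome = {name: feature_counts(order) for name, order in genomes.items()}
    features = sorted(set().union(*per_genome.values()))
    rows = {name: [c[f] for f in features] for name, c in per_genome.items()}
    return features, rows


# Toy data (hypothetical): genome B carries a tandem duplication of genes 1 and 2.
genomes = {"A": [1, 2, 3], "B": [1, 2, 1, 2, 3]}
features, rows = encode(genomes)
```

    Each row of `rows` could then be recoded into whatever character states a likelihood-based tree inference program expects; that recoding step is deliberately left out because it depends on the specific scheme.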

    Phylogeny and Ancestral Genome Reconstruction from Gene Order Using Maximum Likelihood and Binary Encoding

    Over the long history of genome evolution, genes are reordered by events such as rearrangements, losses, insertions and duplications, which together change the gene order and content along the genome. Recent progress in genome-scale sequencing renews the challenges of reconstructing phylogenies and ancestral genomes from gene-order data. These problems have proved so interesting that a large number of algorithms, following various principles, have been developed over the past few years in attempts to tackle them. However, difficulties and limitations in performance and scalability largely prevent us from analyzing emerging modern whole-genome data. The study presented in this dissertation focuses on developing appropriate evolutionary models and robust algorithms for solving the phylogenetic and ancestral inference problems using gene-order data under whole-genome evolution, along with their applications. To reconstruct phylogenies from gene-order data, we developed a collection of closely related methods following the principle of likelihood maximization. To the best of our knowledge, it was the first successful attempt to apply maximum likelihood optimization to the gene-order phylogenetic problem. Later we proposed MLWD (in collaboration with Lin and Moret), in which we described an effective transition model to account for the transitions between presence and absence states of a gene adjacency. Besides genome rearrangements, other evolutionary events that modify gene content, such as gene duplications and gene insertions/deletions (indels), can be naturally processed as well. We present our results from extensive testing on simulated data showing that our approach returns very accurate results very quickly. With a known phylogeny, a subsequent problem is to reconstruct the gene order of ancestral genomes from their living descendants.
To solve this problem, we adopted an adjacency-based probabilistic framework and developed a method called PMAG. PMAG decomposes gene orderings into a set of gene adjacencies and then infers the probability of observing each adjacency in the ancestral genome. We conducted extensive simulation experiments and compared PMAG with InferCarsPro, GASTS, GapAdj and SCJ. According to the results, PMAG performed very well in terms of the true positive rate of gene adjacencies. PMAG also achieved running times comparable to those of the other methods, even when the traveling salesman problem (TSP) instances were solved exactly. Although PMAG performs well, it is restricted to datasets that underwent only rearrangements. To infer ancestral genomes under a more general model of evolution with an arbitrary rate of indels, we proposed an enhanced method, PMAG+, based on PMAG. PMAG+ includes a novel approach to infer ancestral gene contents and a detailed description of how to reduce the adjacency assembly problem to an instance of TSP. We designed a series of experiments to validate PMAG+ and compared the results with the most recent and comparable method, GapAdj. According to the results, ancestral gene contents predicted by PMAG+ coincided highly with the actual contents, with error rates of less than 1%. Under various degrees of indels, PMAG+ consistently achieved more accurate prediction of ancestral gene orders and, at the same time, produced contigs very close to the actual chromosomes.
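
The adjacency decomposition and the adjacency true positive rate mentioned above can be sketched as follows. This is an illustrative sketch, not PMAG's code: it uses the common head/tail extremity representation of signed genes, and it assumes the true positive rate means the fraction of true ancestral adjacencies that the inferred genome recovers (the abstract does not define the metric). All function names are hypothetical.

```python
def extremities(gene):
    """Return the (left, right) extremities of a signed gene.

    Gene +3 reads tail-to-head ('3t', '3h'); gene -3 reads head-to-tail.
    """
    name = abs(gene)
    tail, head = f"{name}t", f"{name}h"
    return (tail, head) if gene > 0 else (head, tail)


def adjacencies(chromosome, circular=False):
    """Decompose one chromosome (list of signed ints) into its adjacency set."""
    pairs = list(zip(chromosome, chromosome[1:]))
    if circular and chromosome:
        pairs.append((chromosome[-1], chromosome[0]))
    # An adjacency joins the right end of one gene to the left end of the next.
    return {frozenset((extremities(a)[1], extremities(b)[0])) for a, b in pairs}


def adjacency_true_positive_rate(true_genome, inferred_genome):
    """Fraction of true adjacencies recovered (assumed definition of the metric)."""
    true_adj = adjacencies(true_genome)
    if not true_adj:
        return 0.0
    return len(true_adj & adjacencies(inferred_genome)) / len(true_adj)


# Toy example (hypothetical data): the inferred ancestor inverts the block 3 4.
true_ancestor = [1, 2, 3, 4]
inferred_ancestor = [1, 2, -4, -3]
rate = adjacency_true_positive_rate(true_ancestor, inferred_ancestor)
```

In the toy example the inversion breaks only the adjacency between genes 2 and 3, so two of the three true adjacencies are recovered. A probabilistic method would additionally attach a probability to each candidate adjacency before assembling them into contigs; that step is not shown here.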

    Using Genetic Algorithm to solve Median Problem and Phylogenetic Inference

    Genome rearrangement analysis has attracted a lot of attention in phylogenetic computation and comparative genomics. Solving the median problems based on various distance definitions has been a focus as it provides the building blocks for maximum parsimony analysis of phylogeny and ancestral genomes. The Median Problem (MP) has been proved to be NP-hard, and although several exact and heuristic algorithms are available, all of them have difficulty computing the median of three distant genomes separated by many evolutionary events. Current approaches, such as MGR [1] and GRAPPA [2], are restricted to small collections of genomes and low-resolution gene-order data with a few hundred rearrangement events. In this work, we focus on heuristic algorithms that combine genomic sorting algorithms with genetic algorithms (GA) to produce new methods and directions for whole-genome median solving, ancestor inference and phylogeny reconstruction. For the equal-genome median problem, we propose a genetic algorithm based on DCJ sorting operations, called GA-DCJ. Following the classic genetic algorithm framework, we develop our own algorithm for each step and substitute it for the corresponding traditional genetic algorithm procedure. The final results of our GA-based algorithm are the optimal median genome(s) and the corresponding median score. Under limited time and space, especially on large-scale and distant datasets, our algorithm obtains better results than GRAPPA and AsMedian. Extending the ideas of the equal-genome median solver, we develop another genetic-algorithm-based solver, GaDCJ-Indel, which can solve the unequal-genome median problem (without duplications). In the DCJ-Indel model, one of the key steps is still the sorting operation [3]. The difference from the equal-genome median is that there are two sorting directions: the minimal DCJ operation path or the minimal indel operation path. By following different sorting paths at each step, we obtain various genome structures to fill our population pool.
Besides that, we adopt an adaptive surcharge-triangle inequality instead of the classic triangle inequality in our fitness function in order to fit the unequal-genome restrictions and obtain more efficient results. Our experimental results show that the GaDCJ-Indel method not only converges to an accurate median score, but also infers ancestors that are very close to the true ancestors. An important application of genome rearrangement analysis is to infer ancestral genomes, which is valuable for identifying patterns of evolution and for modeling the evolutionary processes. However, computing ancestral genomes is very difficult and we have to rely on heuristic methods that have various limitations. We propose a GA-Tree algorithm which adapts meta-population [4], co-evolution and repopulation pool methods. In this paper, we describe and illustrate step by step the first genetic algorithm for ancestor inference, which uses fitness scores designed to consider co-evolution and uses sorting-based methods to initialize and evolve populations. Our extensive experiments show that, compared with other existing tools, our method is accurate and can infer ancestors that are much closer to the true ancestors.

    A Hierarchical Framework for Phylogenetic and Ancestral Genome Reconstruction on Whole Genome Data

    Gene order evolves under events such as rearrangements, duplications, and losses, which can change both the order and content along the genome, through the long history of genome evolution. Recently, the accumulation of genomic sequences has given researchers the chance to address long-standing problems about the phylogenies, or evolutionary histories, of sets of species, and about ancestral genomic content and order. Over the past few years, such problems have proven so interesting that a large number of algorithms have been proposed in the attempt to resolve them, following different standards. The work presented in this dissertation focuses on algorithms and models for whole-genome evolution and their applications in phylogeny and ancestor inference from gene order. We developed a flexible ancestor reconstruction method (FARM) within the framework of maximum likelihood and weighted maximum matching. We designed a binary-code-based framework to reconstruct the evolutionary history of whole-genome gene orders. We developed algorithms to estimate or predict missing adjacencies during ancestral reconstruction, in order to restore gene order when the leaf genomes are far from each other. We developed a pipeline involving maximum likelihood, weighted maximum matching and variable-length binary encoding for estimation of ancestral gene content, to reconstruct ancestral genomes under various evolutionary models, including genome rearrangements, additions, losses and duplications, with high accuracy and low time consumption. Phylogenetic analyses of whole-genome data have been limited to small collections of genomes and low-resolution data, or data without massive duplications.
We designed a maximum-likelihood approach to phylogeny analysis (VLWD) based on variable-length binary encoding to reconstruct phylogenies from whole-genome data; it scales up in accuracy and is capable of reconstructing phylogenies from whole-genome data such as triploids and tetraploids. Maximum-likelihood-based approaches have been applied to ancestral reconstruction but remain primitive for whole-genome data. We developed a hierarchical framework for ancestral reconstruction that uses variable-length binary encoding for content estimation, then adjacency fixing and missing-adjacency prediction for adjacency collection, and finally weighted maximum matching for gene-order assembly. This framework substantially improves the performance of ancestral gene-order reconstruction. We designed a series of experiments to validate these methods and compared the results with the most recent and comparable methods. According to the results, our methods are fast and accurate.

    Integration of Alignment and Phylogeny in the Whole-Genome Era

    With the development of new sequencing techniques, whole genomes of many species have become available. This huge amount of data gives rise to new opportunities and challenges. These new sequences provide valuable information on relationships among species, e.g. genome recombination and conservation. One of the principal ways to investigate such information is multiple sequence alignment (MSA). Currently, there is a large amount of MSA data on the internet, such as the UCSC genome database, but how to effectively use this information to solve classical and new problems remains underexplored. In this thesis, we explored how to use this information in four problems: the sequence orthology search problem, the multiple alignment improvement problem, the short read mapping problem, and the genome rearrangement inference problem. For the first problem, we developed an EM algorithm to iteratively align a query with a multiple alignment database using information from a phylogeny relating the query species and the species in the multiple alignment. We also infer the query's location in the phylogeny. We showed that by doing alignment and phylogeny inference together, we can improve the accuracy of both. For the second problem, we developed an optimization algorithm to iteratively refine the multiple alignment quality. Experimental results showed that our algorithm is very stable in terms of the resulting alignments. The results showed that our method is more accurate than existing methods, i.e. Mafft, Clustal-O, and Mavid, on test data from three sets of species from the UCSC genome database. For the third problem, we developed a model, PhyMap, to align a read to a multiple alignment allowing mismatches and indels. PhyMap computes local alignments of a query sequence against a fixed multiple-genome alignment of closely related species.
PhyMap uses a known phylogenetic tree on the species in the multiple alignment to improve the quality of its computed alignments while also estimating the placement of the query on this tree. Both theoretical computation and experimental results show that our model can differentiate between orthologous and paralogous alignments better than other popular short read mapping tools (BWA, BOWTIE and BLAST). For the fourth problem, we gave a simple genome recombination model which can express insertions, deletions, inversions, translocations and inverted translocations on aligned genome segments. We also developed an MCMC algorithm to infer the order of the query segments. We proved that using any Euclidean metric to measure the distance between two sequence orders in the tree optimization goal function will lead to a degenerate solution in which the inferred order is the order of one of the leaf nodes. We also gave a graph-based formulation of the problem which can represent the probability distribution of the order of the query sequences.

    Analysis of local genome rearrangement improves resolution of ancestral genomic maps in plants

    Rubert D, Martinez FHV, Stoye J, Dörr D. Analysis of local genome rearrangement improves resolution of ancestral genomic maps in plants. BMC Genomics. 2020;21(Suppl. 2): 273. Background: Computationally inferred ancestral genomes play an important role in many areas of genome research. We present an improved workflow for the reconstruction from highly diverged genomes such as those of plants. Results: Our work relies on an established workflow in the reconstruction of ancestral plants, but improves several steps of this process. Instead of using gene annotations for inferring the genome content of the ancestral sequence, we identify genomic markers through a process called genome segmentation. This enables us to reconstruct the ancestral genome from hundreds of thousands of markers rather than the tens of thousands of annotated genes. We also introduce the concept of local genome rearrangement, through which we refine syntenic blocks before they are used in the reconstruction of contiguous ancestral regions. With the enhanced workflow at hand, we reconstruct the ancestral genome of eudicots, a major sub-clade of flowering plants, using whole genome sequences of five modern plants. Conclusions: Our reconstructed genome is highly detailed, yet its layout agrees well with that reported in Badouin et al. (2017). Using local genome rearrangement, not only the marker-based, but also the gene-based reconstruction of the eudicot ancestor exhibited increased genome content, evidencing the power of this novel concept.

    Evolutionary histories of legume genomes and mechanisms of genome remodeling

    Evolutionary genomics analysis of plants aims to reveal and help us to understand the history of genome evolution that plants have undergone. So far, many specific topics and questions of genome evolution have been studied and answered. However, there are still a large number of questions to which the answers are unknown or not clear. In this dissertation, I focus on two specific topics of evolutionary genomics: (1) genome size evolution following genomic rearrangements in plants; (2) ancestral genome reconstruction in legumes. Using a model of two wild peanut relatives in which one genome experienced large rearrangements, we find that the main determinant in genome size reduction is a set of inversions which experienced subsequent net sequence removal in the inverted regions. We observe a general pattern in which sequence is lost more rapidly at newly distal (telomeric) regions than it is gained at newly proximal (pericentromeric) regions – resulting in net sequence loss in the inverted regions. The major driver of this process is recombination, determined by the chromosomal location. Any type of genomic rearrangement that exposes proximal regions to higher recombination rates can cause genome size reduction by this mechanism. Sequence loss in those regions was primarily due to removal of transposable elements. Illegitimate recombination is likely the major mechanism responsible for the sequence removal, rather than unequal intrastrand recombination. We also measure the relative rate of genome size reduction in these two Arachis diploids. We also test our model in other plant species and find that it applies in all cases examined, suggesting our model is widely applicable. Inversions occurring in tetraploid cultivated peanut after the polyploidization event provide us an excellent opportunity to examine the model of genome size reduction following genomic rearrangements in polyploidy. 
It is also a good opportunity to understand the genome size reduction process at its early stage, since the inversions are quite recent (likely younger than 10,000 years). We observe that the model of genome size reduction still holds in the recently-derived tetraploid peanut as it does in the much earlier-diverging diploid progenitors. We find that the genome size reduction process starts with differences in very long sequence deletions and then spreads to mid-length sequence deletions later. We measure the relative rate of size reduction of the inverted region in tetraploid peanut, finding that it is higher than the rates calculated in our previous study between Arachis diploids. We argue this is because the rate of size reduction is more rapid in the early generations after the inversion. We describe the reconstruction of a hypothetical ancestral genome for the papilionoid legumes, in order to help us better understand the evolutionary histories of these legumes. We use a novel method for identifying informative markers, to reconstruct the ancestral genomes for selected legume species, including Glycine max, which has a recent exclusive WGD event. We infer that the reconstructed most recent common ancestor of all selected legume species (all within the Papilionoideae) has 9 chromosomes. The model then predicts that chromosome numbers reduced to 8 in Medicago truncatula and Cicer arietinum separately, through two separate single fusion events. In Lotus japonicus, a series of rearrangement events is the major cause of the chromosome number reduction to 6. We infer that the chromosome number increased mostly independently in Cajanus cajan, Glycine max, Phaseolus vulgaris and Vigna radiata. In Arachis (A. duranensis and A. ipaensis), there was an increase in chromosome number prior to their divergence. The chromosome structural evolution described here is consistent with the phylogenetic distribution of a large collection of chromosome counts in the legumes

    Molecular Decay of the Tooth Gene Enamelin (ENAM) Mirrors the Loss of Enamel in the Fossil Record of Placental Mammals

    Vestigial structures occur at both the anatomical and molecular levels, but studies documenting the co-occurrence of morphological degeneration in the fossil record and molecular decay in the genome are rare. Here, we use morphology, the fossil record, and phylogenetics to predict the occurrence of “molecular fossils” of the enamelin (ENAM) gene in four different orders of placental mammals (Tubulidentata, Pholidota, Cetacea, Xenarthra) with toothless and/or enamelless taxa. Our results support the “molecular fossil” hypothesis and demonstrate the occurrence of frameshift mutations and/or stop codons in all toothless and enamelless taxa. We then use a novel method based on selection intensity estimates for codons (ω) to calculate the timing of iterated enamel loss in the fossil record of aardvarks and pangolins, and further show that the molecular evolutionary history of ENAM predicts the occurrence of enamel in basal representatives of Xenarthra (sloths, anteaters, armadillos) even though frameshift mutations are ubiquitous in ENAM sequences of living xenarthrans. The molecular decay of ENAM parallels the morphological degeneration of enamel in the fossil record of placental mammals and provides manifest evidence for the predictive power of Darwin's theory