    MLGO: phylogeny reconstruction and ancestral inference from gene-order data

    Background: The rapid accumulation of whole-genome data has renewed interest in using gene-order data for phylogenetic analyses and ancestral reconstruction. Current software and web servers typically do not support duplication and loss events along with rearrangements. Results: MLGO (Maximum Likelihood for Gene-Order Analysis) is a web tool for the reconstruction of phylogeny and/or ancestral genomes from gene-order data. MLGO is based on likelihood computation and shows advantages over existing methods in terms of accuracy, scalability and flexibility. Conclusions: To the best of our knowledge, it is the first web tool for analysis of large-scale genomic changes including not only rearrangements but also gene insertions, deletions and duplications. The web tool is available from http://www.geneorder.org/server.php
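
    Likelihood methods of this kind work on gene orders encoded as binary characters, typically the presence or absence of each gene adjacency across the input genomes. The sketch below illustrates that encoding only; the function names, the circular-genome assumption and the toy gene orders are ours, not MLGO's actual code or input format.

        # Minimal sketch: turn signed gene orders into a binary adjacency
        # presence/absence matrix of the kind a likelihood method can analyse.
        # Illustrative only; not MLGO's implementation or file format.

        def adjacencies(order):
            """Unordered adjacencies of a signed, circular gene order.

            Each gene g has a tail (g, 't') and a head (g, 'h'); an adjacency joins
            the extremity at the end of one gene to the one starting the next gene."""
            adj = set()
            for i in range(len(order)):
                a, b = order[i], order[(i + 1) % len(order)]   # circular genome
                left = (abs(a), 'h' if a > 0 else 't')          # extremity leaving gene a
                right = (abs(b), 't' if b > 0 else 'h')         # extremity entering gene b
                adj.add(frozenset([left, right]))
            return adj

        def binary_matrix(genomes):
            """genomes: {name: signed gene order}. Returns (columns, {name: 0/1 row})."""
            columns = sorted({a for order in genomes.values() for a in adjacencies(order)},
                             key=sorted)
            rows = {name: [1 if c in adjacencies(order) else 0 for c in columns]
                    for name, order in genomes.items()}
            return columns, rows

        genomes = {"G1": [1, 2, 3, 4], "G2": [1, -3, -2, 4]}    # toy signed gene orders
        cols, rows = binary_matrix(genomes)
        for name, row in rows.items():
            print(name, "".join(map(str, row)))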

    A mixture framework for inferring ancestral gene orders

    Evolutionary histories of legume genomes and mechanisms of genome remodeling

    Evolutionary genomics analysis of plants aims to reveal, and help us understand, the history of genome evolution that plants have undergone. Many specific topics and questions of genome evolution have been studied and answered, yet a large number of questions remain unanswered or unclear. In this dissertation, I focus on two specific topics of evolutionary genomics: (1) genome size evolution following genomic rearrangements in plants; (2) ancestral genome reconstruction in legumes. Using a model of two wild peanut relatives in which one genome experienced large rearrangements, we find that the main determinant of genome size reduction is a set of inversions that experienced subsequent net sequence removal in the inverted regions. We observe a general pattern in which sequence is lost more rapidly at newly distal (telomeric) regions than it is gained at newly proximal (pericentromeric) regions, resulting in net sequence loss in the inverted regions. The major driver of this process is recombination, which is determined by chromosomal location. Any type of genomic rearrangement that exposes proximal regions to higher recombination rates can cause genome size reduction by this mechanism. Sequence loss in those regions was primarily due to removal of transposable elements, and illegitimate recombination, rather than unequal intrastrand recombination, is likely the major mechanism responsible for the sequence removal. We also measure the relative rate of genome size reduction in these two Arachis diploids, and we test our model in other plant species, finding that it applies in all cases examined, which suggests the model is widely applicable. Inversions occurring in tetraploid cultivated peanut after the polyploidization event provide an excellent opportunity to examine the model of genome size reduction following genomic rearrangements in a polyploid, and to observe the process at an early stage, since the inversions are quite recent (likely younger than 10,000 years). We observe that the model of genome size reduction still holds in the recently derived tetraploid peanut as it does in the much earlier-diverging diploid progenitors. We find that the genome size reduction process starts with differences in very long sequence deletions and later spreads to mid-length deletions. We measure the relative rate of size reduction of the inverted region in tetraploid peanut, finding that it is higher than the rates calculated in our previous study between Arachis diploids; we argue this is because the rate of size reduction is more rapid in the early generations after the inversion.
    We also describe the reconstruction of a hypothetical ancestral genome for the papilionoid legumes, to help us better understand the evolutionary histories of these legumes. We use a novel method for identifying informative markers to reconstruct the ancestral genomes of selected legume species, including Glycine max, which has a recent WGD event exclusive to its lineage. We infer that the reconstructed most recent common ancestor of all selected legume species (all within the Papilionoideae) has 9 chromosomes. The model then predicts that the chromosome number was reduced to 8 independently in Medicago truncatula and Cicer arietinum, through two separate single fusion events. In Lotus japonicus, a series of rearrangement events is the major cause of the chromosome number reduction to 6. We infer that the chromosome number increased mostly independently in Cajanus cajan, Glycine max, Phaseolus vulgaris and Vigna radiata, while in Arachis (A. duranensis and A. ipaensis) there was an increase in chromosome number prior to their divergence. The chromosome structural evolution described here is consistent with the phylogenetic distribution of a large collection of chromosome counts in the legumes.

    Reconstructing Yeasts Phylogenies and Ancestors from Whole Genome Data

    Phylogenetic studies aim to discover evolutionary relationships and histories. These studies are based on similarities of morphological characters and molecular sequences. Currently, widely accepted phylogenetic approaches are based on multiple sequence alignments, which analyze shared gene datasets and concatenate/coalesce the results into a final phylogeny with maximum support. However, these approaches still have limitations and often conflict with each other. Reconstructing ancestral genomes helps us understand the mechanisms and consequences of evolution, but most existing genome-level phylogeny and ancestor reconstruction methods can only process simplified real genome datasets or simulated datasets with identical genome content, unique genome markers, and limited types of evolutionary events. Here, we provide an alternative way to resolve phylogenetic problems based on analyses of real genome data. We use phylogenetic signals from all types of genome-level evolutionary events and overcome the conflicts that arise in traditional phylogenetic approaches. Further, we build an automated computational pipeline to reconstruct phylogenies and ancestral genomes for two high-resolution real yeast genome datasets. Comparisons with recent studies and publications show that we reconstruct accurate and robust phylogenies and ancestors. Finally, we identify and analyze the conserved syntenic blocks among the reconstructed ancestral genomes and present-day yeast species.
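
    The conserved syntenic blocks mentioned above can be thought of as maximal runs of genes whose adjacencies survive in both genomes being compared. The following sketch shows that idea on toy data; it is our simplification (equal gene content, a single chromosome, duplicates ignored), not the pipeline used in the thesis.

        # Minimal sketch: report maximal runs of genes in order_a whose consecutive
        # pairs also sit next to each other (in either orientation) in order_b.
        # A simplification for illustration, not the thesis pipeline.

        def syntenic_blocks(order_a, order_b, min_len=2):
            adj_b = set()
            for i in range(len(order_b) - 1):
                pair = (abs(order_b[i]), abs(order_b[i + 1]))
                adj_b.add(pair)
                adj_b.add(pair[::-1])                  # accept either orientation

            blocks, current = [], [order_a[0]]
            for i in range(len(order_a) - 1):
                if (abs(order_a[i]), abs(order_a[i + 1])) in adj_b:
                    current.append(order_a[i + 1])     # adjacency conserved: extend block
                else:
                    if len(current) >= min_len:
                        blocks.append(current)
                    current = [order_a[i + 1]]         # adjacency broken: start a new block
            if len(current) >= min_len:
                blocks.append(current)
            return blocks

        ancestor = [1, 2, 3, 4, 5, 6]
        extant = [1, 2, 3, -6, -5, -4]                 # toy case: one inversion of genes 4-6
        print(syntenic_blocks(ancestor, extant))       # [[1, 2, 3], [4, 5, 6]]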

    Phylogeny and Ancestral Genome Reconstruction from Gene Order Using Maximum Likelihood and Binary Encoding

    Over the long history of genome evolution, genomes are reshaped by events such as rearrangements, losses, insertions and duplications, which together change the gene ordering and content of the genome. Recent progress in genome-scale sequencing renews the challenge of reconstructing phylogenies and ancestral genomes from gene-order data. These problems have proved so interesting that a large number of algorithms following various principles have been developed over the past few years, yet difficulties and limitations in performance and scalability largely prevent them from handling emerging whole-genome data. The study presented in this dissertation focuses on developing appropriate evolutionary models and robust algorithms for solving the phylogenetic and ancestral inference problems from gene-order data under whole-genome evolution, along with their applications.
    To reconstruct phylogenies from gene-order data, we developed a collection of closely related methods following the principle of likelihood maximization. To the best of our knowledge, this was the first successful attempt to apply maximum-likelihood optimization to the gene-order phylogeny problem. Later we proposed MLWD (in collaboration with Lin and Moret), in which we described an effective transition model to account for transitions between the presence and absence states of a gene adjacency. Besides genome rearrangements, other evolutionary events that modify gene content, such as gene duplications and gene insertions/deletions (indels), can be handled naturally as well. Extensive testing on simulated data shows that our approach returns very accurate results very quickly.
    With a known phylogeny, a subsequent problem is to reconstruct the gene orders of ancestral genomes from their living descendants. To solve this problem, we adopted an adjacency-based probabilistic framework and developed a method called PMAG. PMAG decomposes gene orders into a set of gene adjacencies and then infers the probability of observing each adjacency in the ancestral genome. We conducted extensive simulation experiments and compared PMAG with InferCarsPro, GASTS, GapAdj and SCJ. PMAG demonstrated strong performance in terms of the true-positive rate of gene adjacencies, and achieved running times comparable to the other methods even when the traveling salesman problem (TSP) was solved exactly.
    Although PMAG performs well, it is restricted to analyzing datasets that underwent only rearrangements. To infer ancestral genomes under a more general model of evolution with an arbitrary rate of indels, we proposed an enhanced method, PMAG+, based on PMAG. PMAG+ includes a novel approach to infer ancestral gene contents and a detailed description of how to reduce the adjacency assembly problem to an instance of TSP. We designed a series of experiments to validate PMAG+ and compared the results with the most recent comparable method, GapAdj. Ancestral gene contents predicted by PMAG+ coincided closely with the actual contents, with error rates below 1%. Under various indel rates, PMAG+ consistently achieved more accurate predictions of ancestral gene orders and, at the same time, produced contigs very close to the actual chromosomes.
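
    PMAG's final step chains the inferred adjacencies into ancestral gene orders through a reduction to TSP. As a rough illustration of what that assembly step has to achieve, the sketch below greedily accepts candidate adjacencies in decreasing order of probability while keeping every gene's degree at most two and avoiding cycles; the greedy pass and the made-up probabilities are ours, not PMAG's actual TSP reduction.

        # Minimal sketch of adjacency assembly: given probabilities for candidate
        # ancestral adjacencies, greedily chain genes into linear contigs.
        # PMAG/PMAG+ solve this step via a TSP reduction; this greedy pass is
        # only a simplified stand-in with invented probabilities.

        def assemble(genes, adjacency_probs):
            """genes: gene ids; adjacency_probs: {frozenset({g1, g2}): probability}."""
            contigs = {g: [g] for g in genes}                   # every gene starts alone
            for adj, p in sorted(adjacency_probs.items(), key=lambda kv: -kv[1]):
                a, b = tuple(adj)
                ca, cb = contigs[a], contigs[b]
                if ca is cb:
                    continue                                    # would close a cycle
                if ca[0] == a:
                    ca.reverse()                                # put a at the right end of its contig
                if cb[-1] == b:
                    cb.reverse()                                # put b at the left end of its contig
                if ca[-1] != a or cb[0] != b:
                    continue                                    # an endpoint is already buried
                merged = ca + cb
                for g in merged:
                    contigs[g] = merged
            unique = {id(c): c for c in contigs.values()}       # deduplicate shared contigs
            return list(unique.values())

        probs = {frozenset({1, 2}): 0.95, frozenset({2, 3}): 0.90,
                 frozenset({3, 4}): 0.40, frozenset({1, 4}): 0.30}
        print(assemble([1, 2, 3, 4], probs))                    # one contig joining genes 1-4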

    Robustness Evaluation for Phylogenetic Reconstruction Methods and Evolutionary Models Reconstruction of Tumor Progression

    During evolutionary history, genomes evolve through DNA mutation, genome rearrangement, duplication and gene loss events, and there has been a long-standing effort to infer phylogenies and ancestral genomes. Thanks to the rapid development of sequencing and related technologies, information about genomes is increasing exponentially, which makes these problems tractable. They have proved so interesting that a great number of algorithms following different principles have been developed over the past decades. However, difficulties and limits in performance and capacity, as well as low consistency, still prevent us from confidently stating that the problem is solved. To know the detailed evolutionary history, we need to infer both the phylogeny itself (the Big Phylogeny Problem) and the states of its internal nodes (the Small Phylogeny Problem). The work presented in this thesis focuses on assessing methods designed to attack the Small Phylogeny Problem, and on designing algorithms and models to infer the history of genome evolution in cancer from FISH data.
    During recent decades, a number of evolutionary models and related algorithms have been designed to infer ancestral genome sequences or gene orders. Because the true ancestral genomes are rarely known, tools are needed to test the robustness of the adjacencies found by these methods. For the Big Phylogeny Problem, previous work has tested bootstrapping, jackknifing, and isolating to assess the confidence of inferred branches and found them to be suitable resampling tools for the corresponding phylogenetic inference methods. However, no systematic work has yet tackled this question for the small phylogeny setting. We tested the earlier resampling schemes, together with a new inversion-based scheme, on different ancestral genome reconstruction methods and showed which resampling methods are appropriate for which reconstruction methods.
    Cancer is known for its heterogeneity, which develops through an evolutionary process driven by mutations in tumor cells. Rapid, simultaneous linear and branching evolution has been observed and analyzed in earlier research, and such a process can be modeled by a phylogenetic tree using different methods. Previous phylogenetic research has used various kinds of data, such as FISH data, genome sequences, and gene orders. FISH data are quite clean because they come from single cells, and they have been shown to be sufficient for inferring the evolutionary process of cancer development. RSMT was shown to be a good model for phylogenetic analysis of FISH cell-count pattern data, but it needs efficient heuristics because it is an NP-hard problem. To attack this problem, we proposed an iterative approach to approximate the Steiner tree solution in the small phylogeny setting, which gives better results than the earlier method on both real and simulated data.
    In this thesis, we continued this investigation by designing new methods to better approximate the evolutionary process of tumors and by applying our methods to other kinds of data, such as information obtained with high-throughput technologies. Our thesis work can be divided into two parts. First, we designed new algorithms that give the same parsimony tree as the exact method in most situations and generalized them into a general phylogeny-building tool. Second, we applied our methods to different kinds of data, such as copy-number variation information inferred from next-generation sequencing, and predicted key changes during evolution.
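
    One simple way to score the robustness of inferred ancestral adjacencies, in the spirit of the resampling schemes evaluated above, is a jackknife: repeatedly delete a fraction of the genes, rerun the reconstruction, and record how often each originally inferred adjacency reappears when both of its genes survive the deletion. The sketch below assumes a caller-supplied reconstruct_ancestor function returning adjacencies as frozensets of gene ids; that interface and the default parameters are illustrative, not the thesis's actual protocol.

        # Minimal sketch of jackknife support values for ancestral adjacencies.
        # `reconstruct_ancestor` is a placeholder for whichever reconstruction
        # method is being evaluated; it is not a real API.
        import random

        def jackknife_support(genomes, reconstruct_ancestor, drop_frac=0.4, replicates=100):
            """genomes: {name: signed gene order}. Adjacencies are frozensets of gene ids."""
            full = reconstruct_ancestor(genomes)                # adjacencies on the full data
            genes = sorted({abs(g) for order in genomes.values() for g in order})
            seen = {adj: 0 for adj in full}
            testable = {adj: 0 for adj in full}
            for _ in range(replicates):
                kept = set(random.sample(genes, int(len(genes) * (1 - drop_frac))))
                reduced = {name: [g for g in order if abs(g) in kept]
                           for name, order in genomes.items()}
                rep = reconstruct_ancestor(reduced)
                for adj in full:
                    if adj <= kept:                             # both genes survived the deletion
                        testable[adj] += 1
                        seen[adj] += adj in rep
            return {adj: seen[adj] / testable[adj] for adj in full if testable[adj]}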

    Phylogeny, Ancestral Genome, And Disease Diagnoses Models Constructions Using Biological Data

    Studies in bioinformatics develop methods and software tools to analyze biological data and provide insight into the mechanisms of biological processes. Machine learning techniques have been widely used by researchers for disease prediction, disease diagnosis, and biomarker identification. Using machine-learning algorithms to diagnose diseases has several advantages: rather than relying solely on doctors' experience and stereotyped formulas, researchers can use learning algorithms to analyze sophisticated, high-dimensional and multimodal biomedical data and construct prediction/classification models that make decisions even when some information is incomplete, unknown, or contradictory.
    In this study, first of all, we built an automated computational pipeline to reconstruct phylogenies and ancestral genomes for two high-resolution real yeast whole-genome datasets. We compared the results with recent studies and publications to show that we reconstruct very accurate and robust phylogenies as well as ancestors, and we identified and analyzed conserved syntenic blocks among the reconstructed ancestral genomes and present-day yeast species.
    Next, we analyzed metabolite-level data obtained from positive mass spectrometry of human blood samples. We applied machine learning and feature selection algorithms to construct diagnosis models for chronic kidney disease (CKD). We also identified the most critical metabolite features and studied the correlations among the metabolite features and the development of CKD stages. The selected metabolite features provided insights into early-stage CKD diagnosis, pathophysiological mechanisms, CKD treatments, and medicine development.
    Finally, we used deep learning techniques to build accurate Down syndrome (DS) prediction/screening models based on analysis of a newly introduced Illumina human genome genotyping array. We proposed a bi-stream convolutional neural network (CNN) architecture with ten layers and two merged CNN models, which takes two chromosome SNP maps as input in combination. We evaluated and compared the performance of our CNN DS prediction models with conventional machine learning algorithms, visualized the feature maps and trained filter weights from intermediate layers of our trained CNN model, and discussed the advantages of our method and the underlying reasons for the differences in performance.
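
    The bi-stream idea, two convolutional branches whose features are merged before classification, can be sketched in a few lines of PyTorch. The layer sizes, the 1 x 64 x 64 "SNP map" input shape and the two-class output below are our illustrative assumptions, not the ten-layer architecture actually used in the thesis.

        # Minimal sketch of a bi-stream CNN that merges two input maps (PyTorch).
        # Shapes and layer counts are illustrative, not the thesis architecture.
        import torch
        import torch.nn as nn

        class BiStreamCNN(nn.Module):
            def __init__(self, n_classes=2):
                super().__init__()
                def stream():                       # one convolutional branch per SNP map
                    return nn.Sequential(
                        nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                    )
                self.stream_a = stream()
                self.stream_b = stream()
                self.classifier = nn.Sequential(
                    nn.Flatten(),
                    nn.Linear(2 * 32 * 16 * 16, 128), nn.ReLU(),
                    nn.Linear(128, n_classes),
                )

            def forward(self, map_a, map_b):
                # run each map through its own branch, then merge by concatenation
                feats = torch.cat([self.stream_a(map_a), self.stream_b(map_b)], dim=1)
                return self.classifier(feats)

        model = BiStreamCNN()
        a = torch.randn(4, 1, 64, 64)               # a batch of 4 toy maps for each stream
        b = torch.randn(4, 1, 64, 64)
        print(model(a, b).shape)                    # torch.Size([4, 2])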