
    Models and Algorithms for Whole-Genome Evolution and their Use in Phylogenetic Inference

    The rapid accumulation of sequenced genomes offers the chance to resolve longstanding questions about the evolutionary histories, or phylogenies, of groups of organisms. The relatively rare occurrence of large-scale evolutionary events in a whole genome, events such as genome rearrangements, duplications and losses, enables us to extract a strong and robust phylogenetic signal from whole-genome data. The work presented in this dissertation focuses on models and algorithms for whole-genome evolution and their use in phylogenetic inference. We designed algorithms to estimate pairwise genomic distances from large-scale genomic changes. We refined the evolutionary models of whole-genome evolution. We also made use of these results to provide fast and accurate methods for phylogenetic inference that scale up, in both speed and accuracy, to modern high-resolution whole-genome data. We designed algorithms to estimate the true evolutionary distance between two genomes under genome rearrangements, and also under rearrangements plus gains and losses. We refined the evolutionary model to be the first mathematical model to preserve the structural dichotomy in genomic organization between most prokaryotes and most eukaryotes. Those models and associated distance estimators provide a basis for studying facets of possible mechanisms of evolution through simulation and application to real genomes. Phylogenetic analyses from whole-genome data have been limited to small collections of genomes and low-resolution data; they have also lacked an effective assessment of robustness. We developed an approach that combines our distance estimator, any standard distance-based reconstruction algorithm, and a novel bootstrapping method based on resampling genomic adjacencies. 
The resulting tool overcomes a serious and long-standing impediment to the use of whole-genome data in phylogenetic inference and provides results comparable in accuracy and robustness to distance-based methods for sequence data. Maximum-likelihood approaches have been successfully applied to phylogenetic inference for aligned sequences, but such applications remain primitive for whole-genome data. We developed a maximum-likelihood approach to phylogenetic analysis from whole-genome data. In combination with our bootstrap scheme, this new approach yields the first reliable phylogenetic tool for the analysis of whole-genome data at the level of syntenic blocks.

    TIBA: a tool for phylogeny inference from rearrangement data with bootstrap analysis

    Summary: TIBA is a tool to reconstruct phylogenetic trees from rearrangement data that consist of ordered lists of synteny blocks (or genes), where each synteny block is shared with all of its homologues in the input genomes. The evolution of these synteny blocks, through rearrangement operations, is modelled by the uniform Double-Cut-and-Join model. Using a true distance estimate under this model and simple distance-based methods, TIBA reconstructs a phylogeny of the input genomes. Unlike any previous tool for inferring phylogenies from rearrangement data, TIBA uses novel methods of robustness estimation to provide support values for the edges in the inferred tree. Availability: http://lcbb.epfl.ch/softwares/tiba.html.
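
To make the distance step concrete, here is a minimal Python sketch of the minimum (edit) DCJ distance between two genomes. It rests on simplifying assumptions that are not taken from the abstract: each genome is a single circular chromosome, both genomes contain exactly the same synteny blocks, and each block appears once, written as a signed integer. In that setting the distance is n - c, where n is the number of blocks and c is the number of cycles obtained by alternating between the adjacencies of the two genomes. This is the edit distance, not the true-distance estimate that TIBA applies on top of it, and the function names are illustrative rather than part of TIBA.

```python
def extremity_partners(genome):
    """Map each block extremity to the extremity it touches in a circular genome.

    A block g has a tail (g, "t") and a head (g, "h"); a positive block is
    read tail-to-head, a negative block head-to-tail.
    """
    partner = {}
    n = len(genome)
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        right_of_a = (abs(a), "h" if a > 0 else "t")
        left_of_b = (abs(b), "t" if b > 0 else "h")
        partner[right_of_a] = left_of_b
        partner[left_of_b] = right_of_a
    return partner


def dcj_distance_circular(genome_a, genome_b):
    """Edit DCJ distance n - c for two circular genomes with equal content (assumed)."""
    if sorted(map(abs, genome_a)) != sorted(map(abs, genome_b)):
        raise ValueError("this sketch assumes identical block content")
    partners_a = extremity_partners(genome_a)
    partners_b = extremity_partners(genome_b)
    seen = set()
    cycles = 0
    for start in partners_a:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Walk one alternating cycle: an adjacency of A, then an adjacency of B.
        while x not in seen:
            seen.add(x)
            y = partners_a[x]
            seen.add(y)
            x = partners_b[y]
    return len(genome_a) - cycles


if __name__ == "__main__":
    print(dcj_distance_circular([1, 2, 3], [1, -2, 3]))
```

A matrix of such pairwise values, once corrected to an estimate of the true number of events, is the kind of input a standard distance-based reconstruction method expects.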

    Phylogeny and Ancestral Genome Reconstruction from Gene Order Using Maximum Likelihood and Binary Encoding

    Over the long history of genome evolution, genes are reordered by events such as rearrangements, losses, insertions and duplications, which together change the ordering and content along the genome. Recent progress in genome-scale sequencing renews the challenge of reconstructing phylogenies and ancestral genomes from gene-order data. Such problems have proved so interesting that a large number of algorithms, following various principles, have been developed over the past few years in attempts to tackle them. However, difficulties and limitations in performance and scalability largely prevent us from analyzing emerging modern whole-genome data. The study presented in this dissertation focuses on developing appropriate evolutionary models and robust algorithms for solving the phylogenetic and ancestral inference problems using gene-order data under whole-genome evolution, along with their applications. To reconstruct phylogenies from gene-order data, we developed a collection of closely related methods following the principle of likelihood maximization. To the best of our knowledge, it was the first successful attempt to apply maximum-likelihood optimization to the gene-order phylogenetic problem. Later we proposed MLWD (in collaboration with Lin and Moret), in which we described an effective transition model to account for the transitions between presence and absence states of a gene adjacency. Besides genome rearrangements, other evolutionary events that modify gene content, such as gene duplications and gene insertions/deletions (indels), can be naturally processed as well. We present results from extensive testing on simulated data showing that our approach returns very accurate results very quickly. With a known phylogeny, a subsequent problem is to reconstruct the gene order of ancestral genomes from their living descendants. 
To solve this problem, we adopted an adjacency-based probabilistic framework and developed a method called PMAG. PMAG decomposes gene orderings into a set of gene adjacencies and then infers the probability of observing each adjacency in the ancestral genome. We conducted extensive simulation experiments and compared PMAG with InferCarsPro, GASTS, GapAdj and SCJ. According to the results, PMAG performed well in terms of the true positive rate of gene adjacencies. PMAG also achieved running times comparable to those of the other methods, even when the traveling salesman problem (TSP) was solved exactly. Although PMAG performs well, it is restricted to datasets that have undergone only rearrangements. To infer ancestral genomes under a more general model of evolution with an arbitrary rate of indels, we proposed an enhanced method, PMAG+, based on PMAG. PMAG+ includes a novel approach to infer ancestral gene contents and a detailed description of how to reduce the adjacency assembly problem to an instance of TSP. We designed a series of experiments to validate PMAG+ and compared the results with the most recent comparable method, GapAdj. According to the results, ancestral gene contents predicted by PMAG+ coincided closely with the actual contents, with error rates below 1%. Under various degrees of indels, PMAG+ consistently achieved more accurate prediction of ancestral gene orders and, at the same time, produced contigs very close to the actual chromosomes.
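
The decomposition of gene orders into adjacencies, and the presence/absence states mentioned for MLWD, can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the encoding used by MLWD or PMAG: genomes are taken to be single circular chromosomes given as lists of signed integers, and the names `adjacencies` and `binary_encode` are hypothetical. Each genome becomes a 0/1 row recording which of the adjacencies seen anywhere in the dataset it contains.

```python
def adjacencies(genome):
    """Return the set of adjacencies of a circular signed gene order (assumed format).

    Each adjacency is a frozenset of two gene extremities, where (g, "t") is
    the tail and (g, "h") the head of gene g.
    """
    result = set()
    n = len(genome)
    for i in range(n):
        a, b = genome[i], genome[(i + 1) % n]
        right_of_a = (abs(a), "h" if a > 0 else "t")
        left_of_b = (abs(b), "t" if b > 0 else "h")
        result.add(frozenset((right_of_a, left_of_b)))
    return result


def binary_encode(genomes):
    """Encode each genome as a 0/1 vector over all adjacencies seen in the input."""
    adjacency_sets = {name: adjacencies(order) for name, order in genomes.items()}
    # Fix a deterministic column order so rows are comparable across genomes.
    columns = sorted(set().union(*adjacency_sets.values()), key=lambda adj: sorted(adj))
    matrix = {
        name: [1 if column in present else 0 for column in columns]
        for name, present in adjacency_sets.items()
    }
    return columns, matrix


if __name__ == "__main__":
    columns, matrix = binary_encode({"A": [1, 2, 3], "B": [1, -2, 3]})
    for name, row in matrix.items():
        print(name, row)
```

Rows of this kind resemble a binary character matrix, which is one plausible way to hand gene-order data to likelihood machinery built for two-state characters.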

    Phylogenetic Reconstruction Analysis on Gene Order and Copy Number Variation

    Genome rearrangement is known as one of the main evolutionary mechanisms at the genomic level. Phylogenetic analysis based on rearrangement has played a crucial role in biological research in the past decades, especially with the increasing availability of fully sequenced genomes. In general, phylogenetic analysis aims to solve two problems: the Small Parsimony Problem (SPP) and the Big Parsimony Problem (BPP). Maximum parsimony is a popular approach for SPP and BPP which relies on iteratively solving an NP-hard problem, the median problem. As a result, current median solvers and phylogenetic inference methods based on the median problem all face serious scalability problems and cannot be applied to datasets with large and distant genomes. In this thesis, we propose a new median solver for gene order data that combines double-cut-join (DCJ) sorting with the Simulated Annealing algorithm (SA-Median). Based on this median solver, we built a new phylogenetic inference method to solve both the SPP and the BPP. Our experimental results show that the new median solver achieves excellent performance on simulated datasets and that the phylogenetic inference tool built on it performs better than other existing methods. Cancer is known for its heterogeneity and is regarded as an evolutionary process driven by somatic mutations and clonal expansions. This evolutionary process can be modeled by a phylogenetic tree, and phylogenetic analysis of multiple subclones of cancer cells can facilitate the study of the progression of tumor variants. Copy-number aberrations occur frequently in many types of tumors in the form of segmental amplifications and deletions. In this thesis, we developed a distance-based method for reconstructing phylogenies from copy-number profiles of cancer cells. We demonstrate the importance of distance correction from the edit (minimum) distance to the estimated actual number of events. 
Experimental results show that our approaches provide accurate and scalable results in estimating the actual number of evolutionary events between copy number profiles and in reconstructing phylogenies. High-throughput sequencing of tumor samples has reported various degrees of genetic heterogeneity between primary tumors and their distant subpopulations. The clonal theory of cancer evolution holds that tumor cells are descended from a common origin cell. This origin cell carries an advantageous mutation that causes a clonal expansion, producing a large population of cells descended from it. To further investigate cancer progression, phylogenetic analysis of the tumor cells is imperative. In this thesis, we developed a novel approach to infer phylogenies from both Next-Generation Sequencing and Long-Read Sequencing data. Experimental results show that our proposed method can infer the entire phylogenetic progression very accurately on both Next-Generation Sequencing and Long-Read Sequencing data. In this thesis, we focused on phylogenetic analysis of both gene order sequences and copy number variations. Our thesis work can be categorized into three parts. First, we developed a new median solver to solve the median problem and phylogeny inference under the DCJ model and applied our method to both simulated data and real yeast data. Second, we explored a new approach to infer the phylogeny of copy number profiles for a wide range of parameters (e.g., different numbers of leaf genomes, different numbers of positions in the genome, and different tree diameters). Third, we concentrated on phylogeny inference from high-throughput sequencing data and proposed a novel approach to investigate and phylogenetically analyze the entire expansion process of cancer cells on both Next-Generation Sequencing and Long-Read Sequencing data.

    Shigella sonnei genome sequencing and phylogenetic analysis indicate recent global dissemination from Europe

    Shigella are human-adapted Escherichia coli that have gained the ability to invade the human gut mucosa and cause dysentery [1,2], spreading efficiently via low-dose fecal-oral transmission [3,4]. Historically, S. sonnei has been predominantly responsible for dysentery in developed countries, but is now emerging as a problem in the developing world, apparently replacing the more diverse S. flexneri in areas undergoing economic development and improvements in water quality [4-6]. Classical approaches have shown S. sonnei is genetically conserved and clonal [7]. We report here whole-genome sequencing of 132 globally-distributed isolates. Our phylogenetic analysis shows that the current S. sonnei population descends from a common ancestor that existed less than 500 years ago and has diversified into several distinct lineages with unique characteristics. Our analysis suggests the majority of this diversification occurred in Europe, followed by more recent establishment of local pathogen populations in other continents predominantly due to the pandemic spread of a single, rapidly-evolving, multidrug resistant lineage.

    Rec-DCM-Eigen: Reconstructing a Less Parsimonious but More Accurate Tree in Shorter Time

    Maximum parsimony (MP) methods aim to reconstruct the phylogeny of extant species by finding the most parsimonious evolutionary scenario using the species' genome data. MP methods are considered to be accurate, but they are also computationally expensive, especially for a large number of species. Several disk-covering methods (DCMs), which decompose the input species into multiple overlapping subgroups (or disks), have been proposed to solve the problem in a divide-and-conquer way.

    Robustness Evaluation for Phylogenetic Reconstruction Methods and Evolutionary Models Reconstruction of Tumor Progression

    Over evolutionary history, genomes evolve through DNA mutations, genome rearrangements, duplications and gene loss events. Considerable effort has gone into the study of phylogenetic and ancestral genome inference. Thanks to advances in technology, the amount of information about genomes is increasing exponentially, which makes it possible to address the problem. The problem has proved so interesting that a great number of algorithms, following different principles, have been developed over the past decades in attempts to tackle it. However, difficulties and limits in performance and capacity, as well as low consistency, largely prevent us from confidently stating that the problem is solved. To know the detailed evolutionary history, we need to infer the phylogeny (the Big Phylogeny Problem) and also infer the information at the internal nodes (the Small Phylogeny Problem). The work presented in this thesis focuses on assessing methods designed for the Small Phylogeny Problem and on designing algorithms and models for inferring the history of genome evolution from FISH data for cancer. In recent decades, a number of evolutionary models and related algorithms have been designed to infer ancestral genome sequences or gene orders. Because the true ancestral genomes are difficult to know, tools are needed to test the robustness of the adjacencies found by various methods. For the Big Phylogeny Problem, previous work has tested bootstrapping, jackknifing, and isolating as ways to assess confidence in the inferred branches and found them to be good resampling tools for the corresponding phylogenetic inference methods. However, until now no systematic work has been done to tackle this problem for the small phylogeny. 
We tested the earlier resampling schemes and a new method, inversion, on different ancestral genome reconstruction methods and showed that different resampling methods are appropriate for different reconstruction methods. Cancer is known for its heterogeneity, which develops through an evolutionary process driven by mutations in tumor cells. Rapid, simultaneous linear and branching evolution has been observed and analyzed in earlier research. Such a process can be modeled by a phylogenetic tree using different methods. Previous phylogenetic research used various kinds of datasets, such as FISH data, genome sequences, and gene orders. FISH data are quite clean because they come from single cells, and they have been shown to be sufficient to infer the evolutionary process of cancer development. RSMT was shown to be a good model for phylogenetic analysis using FISH cell count pattern data, but it needs efficient heuristics because it is an NP-hard problem. To attack this problem, we proposed an iterative approach to approximate solutions to the Steiner tree in the small phylogeny tree. It is shown to give better results than the earlier method on both real and simulated data. In this thesis, we continued the investigation by designing new methods to better approximate the evolutionary process of tumors and applying our methods to other kinds of data, such as data from high-throughput technologies. Our thesis work can be divided into two parts. First, we designed new algorithms that give the same parsimony tree as the exact method in most situations and modified them into a general phylogeny building tool. Second, we applied our methods to different kinds of data, such as copy number variation information inferred from next-generation sequencing technology, and predicted key changes during evolution.

    CNETML: maximum likelihood inference of phylogeny from copy number profiles of multiple samples

    Phylogenetic trees based on copy number profiles from multiple samples of a patient are helpful for understanding cancer evolution. Here, we develop a new maximum likelihood method, CNETML, to infer phylogenies from such data. CNETML is the first program to jointly infer the tree topology, node ages, and mutation rates from total copy numbers of longitudinal samples. Our extensive simulations suggest CNETML performs well on copy numbers relative to ploidy and under slight violation of model assumptions. The application of CNETML to real data generates results consistent with previous discoveries and provides novel early copy number events for further investigation.

    Streaming Breakpoint Graph Analytics for Accelerating and Parallelizing the Computation of DCJ Median of Three Genomes

    The problem of finding the median of three genomes is the key process in building the most parsimonious phylogenetic trees from genome rearrangement data. The median problem using Double-Cut-and-Join (DCJ) distance is NP-hard, and the best exact algorithm is based on a branch-and-bound best-first search strategy to explore sub-graph patterns in the Multiple BreakPoint Graph (MBG). In this paper, by taking advantage of the “streaming” property of the MBG, we introduce the “footprint-based” data structure to reduce the space requirement of a single search node from O(v^2) to O(v) and to minimize the redundant computation in counting cycles/paths to update bounds, which dramatically decreases the workload of a single search node. An additional branching heuristic is introduced to help reduce the search space. Last but not least, a multi-threaded shared-memory parallel algorithm with two load-balancing strategies brings additional benefit by distributing search work efficiently among different processors. We conduct extensive experiments on simulated datasets, and our results show significant improvement on all datasets. We also test our DCJ median algorithm with GASTS, a state-of-the-art software package for phylogenetic tree construction. On the real high-resolution Drosophila data set, our exact algorithm runs as fast as the heuristic algorithm and helps construct a better phylogenetic tree.