22 research outputs found

    Approximating the DCJ distance of balanced genomes in linear time

    Get PDF
    Rubert D, Feij茫o P, Dias Vieira Braga M, Stoye J, Martinez FHV. Approximating the DCJ distance of balanced genomes in linear time. Algorithms for Molecular Biology. 2017;12(1): 3.Background Rearrangements are large-scale mutations in genomes, responsible for complex changes and structural variations. Most rearrangements that modify the organization of a genome can be represented by the double cut and join (DCJ) operation. Given two balanced genomes, i.e., two genomes that have exactly the same number of occurrences of each gene in each genome, we are interested in the problem of computing the rearrangement distance between them, i.e., finding the minimum number of DCJ operations that transform one genome into the other. This problem is known to be NP-hard. Results We propose a linear time approximation algorithm with approximation factor O(k) for the DCJ distance problem, where k is the maximum number of occurrences of any gene in the input genomes. Our algorithm works for linear and circular unichromosomal balanced genomes and uses as an intermediate step an O(k)-approximation for the minimum common string partition problem, which is closely related to the DCJ distance problem. Conclusions Experiments on simulated data sets show that our approximation algorithm is very competitive both in efficiency and in quality of the solutions

    The Distance and Median Problems in the Single-Cut-Or-Join Model with Single-Gene Duplications

    Get PDF
    Background. In the field of genome rearrangement algorithms, models accounting for gene duplication lead often to hard problems. For example, while computing the pairwise distance is tractable in most duplication-free models, the problem is NP-complete for most extensions of these models accounting for duplicated genes. Moreover, problems involving more than two genomes, such as the genome median and the Small Parsimony problem, are intractable for most duplication-free models, with some exceptions, for example the Single-Cut-or-Join (SCJ) model. Results. We introduce a variant of the SCJ distance that accounts for duplicated genes, in the context of directed evolution from an ancestral genome to a descendant genome where orthology relations between ancestral genes and their descendant are known. Our model includes two duplication mechanisms: single-gene tandem duplication and the creation of single-gene circular chromosomes. We prove that in this model, computing the directed distance and a parsimonious evolutionary scenario in terms of SCJ and single-gene duplication events can be done in linear time. We also show that the directed median problem is tractable for this distance, while the rooted median problem, where we assume that one of the given genomes is ancestral to the median, is NP-complete. We also describe an Integer Linear Program for solving this problem. We evaluate the directed distance and rooted median algorithms on simulated data. Conclusion. Our results provide a simple genome rearrangement model, extending the SCJ model to account for single-gene duplications, for which we prove a mix of tractability and hardness results. For the NP-complete rooted median problem, we design a simple Integer Linear Program. Our publicly available implementation of these algorithms for the directed distance and median problems allow to solve efficiently these problems on large instances

    Sampling and counting genome rearrangement scenarios

    Get PDF
    Even for moderate size inputs, there are a tremendous number of optimal rearrangement scenarios, regardless what the model is and which specific question is to be answered. Therefore giving one optimal solution might be misleading and cannot be used for statistical inferring. Statistically well funded methods are necessary to sample uniformly from the solution space and then a small number of samples are sufficient for statistical inferring

    Representing and decomposing genomic structural variants as balanced integer flows on sequence graphs

    Get PDF
    The study of genomic variation has provided key insights into the functional role of mutations. Predominantly, studies have focused on single nucleotide variants (SNV), which are relatively easy to detect and can be described with rich mathematical models. However, it has been observed that genomes are highly plastic, and that whole regions can be moved, removed or duplicated in bulk. These structural variants (SV) have been shown to have significant impact on the phenotype, but their study has been held back by the combinatorial complexity of the underlying models. We describe here a general model of structural variation that encompasses both balanced rearrangements and arbitrary copy-numbers variants (CNV). In this model, we show that the space of possible evolutionary histories that explain the structural differences between any two genomes can be sampled ergodically

    A Unifying Model of Genome Evolution Under Parsimony

    Get PDF
    We present a data structure called a history graph that offers a practical basis for the analysis of genome evolution. It conceptually simplifies the study of parsimonious evolutionary histories by representing both substitutions and double cut and join (DCJ) rearrangements in the presence of duplications. The problem of constructing parsimonious history graphs thus subsumes related maximum parsimony problems in the fields of phylogenetic reconstruction and genome rearrangement. We show that tractable functions can be used to define upper and lower bounds on the minimum number of substitutions and DCJ rearrangements needed to explain any history graph. These bounds become tight for a special type of unambiguous history graph called an ancestral variation graph (AVG), which constrains in its combinatorial structure the number of operations required. We finally demonstrate that for a given history graph GG, a finite set of AVGs describe all parsimonious interpretations of GG, and this set can be explored with a few sampling moves.Comment: 52 pages, 24 figure

    Computing the family-free DCJ similarity

    Get PDF
    Rubert D, Hoshino EA, Dias Vieira Braga M, Stoye J, Martinez FHV. Computing the family-free DCJ similarity. BMC Bioinformatics. 2018;19(Suppl. 6): 152.Background The genomic similarity is a large-scale measure for comparing two given genomes. In this work we study the (NP-hard) problem of computing the genomic similarity under the DCJ model in a setting that does not assume that the genes of the compared genomes are grouped into gene families. This problem is called family-free DCJ similarity. Results We propose an exact ILP algorithm to solve the family-free DCJ similarity problem, then we show its APX-hardness and present four combinatorial heuristics with computational experiments comparing their results to the ILP. Conclusions We show that the family-free DCJ similarity can be computed in reasonable time, although for larger genomes it is necessary to resort to heuristics. This provides a basis for further studies on the applicability and model refinement of family-free whole genome similarity measures

    Models and Algorithms for Whole-Genome Evolution and their Use in Phylogenetic Inference

    Get PDF
    The rapid accumulation of sequenced genomes offers the chance to resolve longstanding questions about the evolutionary histories, or phylogenies, of groups of organisms. The relatively rare occurrence of large-scale evolutionary events in a whole genome, events such as genome rearrangements, duplications and losses, enables us to extract a strong and robust phylogenetic signal from whole-genome data. The work presented in this dissertation focuses on models and algorithms for whole-genome evolution and their use in phylogenetic inference. We designed algorithms to estimate pairwise genomic distances from large-scale genomic changes. We refined the evolutionary models on whole-genome evolution. We also made use of these results to provide fast and accurate methods for phylogenetic inference, that scales up, in both speed and accuracy, to modern high-resolution whole-genome data. We designed algorithms to estimate the true evolutionary distance between two genomes under genome rearrangements, and also under rearrangements, plus gains and losses. We refined the evolutionary model to be the first mathematical model to preserve the structural dichotomy in genomic organization between most prokaryotes and most eukaryotes. Those models and associated distance estimators provide a basis for studying facets of possible mechanisms of evolution through simulation and application to real genomes. Phylogenetic analyses from whole-genome data have been limited to small collections of genomes and low-resolution data; they have also lacked an effective assessment of robustness. We developed an approach that combines our distance estimator, any standard distance-based reconstruction algorithm, and a novel bootstrapping method based on resampling genomic adjacencies. The resulting tool overcomes a serious and long-standing impediment to the use of whole-genome data in phylogenetic inference and provides results comparable in accuracy and robustness to distance-based methods for sequence data. Maximum-likelihood approaches have been successfully applied to phylogenetic inferences for aligned sequences, but such applications remain primitive for whole-genome data. We developed a maximum-likelihood approach to phylogenetic analysis from whole-genome data. In combination with our bootstrap scheme, this new approach yields the first reliable phylogenetic tool for the analysis of whole-genome data at the level of syntenic blocks

    Applications of heuristic search on phylogeny reconstruction problems

    Get PDF
    Phylogenies or evolutionary trees for a given family of species show the evolutionary relationships between these species. The leaves denote the given species, the internal nodes denote their common ancestors and the edges denote the genetic relationships. Species can be identified by their whole genomes and the evolutionary relations between species can be measured by the number of rearrangement events (i.e. mutations) that transform one genome into another. One approach to infer phylogeny from genomic data is by solving median genome problems for three genomes, or the genome rearrangement problem for pairs of genomes, while trying to minimize the total evolutionary distance among the given species. In this thesis, we have developed and implemented two search based algorithms for phylogeny reconstruction problem based on solving median genome problems for circular genomes of the same length without gene duplication. In order to show applicability and effectiveness of our algorithms, we have tested them with randomly generated instances and two real data sets: mitochondrial genomes of Metazoa and chloroplast genomes of Campanulaceae
    corecore