21 research outputs found

    DCJ-indel and DCJ-substitution distances with distinct operation costs

    BACKGROUND: Classical approaches to compute the genomic distance are usually limited to genomes with the same content and take into consideration only rearrangements that change the organization of the genome (i.e. positions and orientation of pieces of DNA, number and type of chromosomes, etc.), such as inversions, translocations, fusions and fissions. These operations are generically represented by the double-cut and join (DCJ) operation. The distance between two genomes, in terms of the number of DCJ operations, can be computed in linear time. To handle genomes with distinct contents, insertions and deletions of fragments of DNA, named indels, must also be allowed. More powerful than an indel is a substitution of a fragment of DNA by another fragment of DNA. Indels and substitutions are called content-modifying operations. It has been shown that both the DCJ-indel and the DCJ-substitution distances can also be computed in linear time, assuming that the same cost is assigned to any DCJ or content-modifying operation. RESULTS: In the present study we extend the DCJ-indel and the DCJ-substitution models, considering that the content-modifying cost is distinct from and upper bounded by the DCJ cost, and show that the distance in both models can still be computed in linear time. Although the triangle inequality can be violated in both models, we also show how to efficiently fix this problem a posteriori.
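
The linear-time DCJ distance mentioned in this abstract can be illustrated for the equal-content case only. The sketch below is not taken from the paper; it assumes the well-known adjacency-graph formula d = N - (C + I/2), where N is the number of genes, C the number of cycles and I the number of odd paths. The genome encoding (a list of `(signed_genes, is_circular)` pairs) and the function names are hypothetical choices made for this example, and indels or substitutions are not handled.

```python
def extremities(gene):
    """Return (left, right) extremities of a signed gene as it is read."""
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def partner_map(genome):
    """Map every extremity to its adjacent extremity, or None at a telomere.

    genome: list of (signed_genes, is_circular) pairs (assumed encoding).
    """
    partner = {}
    for genes, circular in genome:
        ends = [extremities(g) for g in genes]
        for left, right in ends:
            partner[left] = None
            partner[right] = None
        # Adjacencies between consecutive genes.
        for i in range(len(ends) - 1):
            a, b = ends[i][1], ends[i + 1][0]
            partner[a], partner[b] = b, a
        # A circular chromosome closes with one extra adjacency.
        if circular and ends:
            a, b = ends[-1][1], ends[0][0]
            partner[a], partner[b] = b, a
    return partner


def dcj_distance(genome_a, genome_b):
    """DCJ distance for two genomes with the same genes, no duplicates."""
    pa, pb = partner_map(genome_a), partner_map(genome_b)
    if pa.keys() != pb.keys():
        raise ValueError("genomes must have the same gene content")
    seen = set()
    cycles = odd_paths = 0
    for start in pa:
        if start in seen:
            continue
        # Explore one connected component of the graph whose vertices are
        # extremities and whose edges are the adjacencies of A and of B.
        stack = [start]
        seen.add(start)
        size = 0
        is_cycle = True
        while stack:
            x = stack.pop()
            size += 1
            for nxt in (pa[x], pb[x]):
                if nxt is None:
                    is_cycle = False  # a telomere ends a path
                elif nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if is_cycle:
            cycles += 1
        elif size % 2 == 1:
            odd_paths += 1  # odd number of extremities: odd path
    n_genes = len(pa) // 2
    return n_genes - (cycles + odd_paths // 2)
```

Each extremity is visited once, so the running time is linear in the number of genes. For example, `dcj_distance([([1, 2, 3], False)], [([1, -2, 3], False)])` compares two linear chromosomes that differ by a single inversion.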

    Generalizations of the genomic rank distance to indels

    MOTIVATION: The rank distance model represents genome rearrangements in multi-chromosomal genomes as matrix operations, which allows the reconstruction of parsimonious histories of evolution by rearrangements. We seek to generalize this model by allowing for genomes with different gene content, to accommodate a broader range of biological contexts. We approach this generalization by using a matrix representation of genomes. This leads to simple distance formulas and sorting algorithms for genomes with different gene contents, but without duplications. RESULTS: We generalize the rank distance to genomes with different gene content in two different ways. The first approach adds insertions, deletions and the substitution of a single extremity to the basic operations. We show how to efficiently compute this distance. To avoid genomes with incomplete markers, our alternative distance, the rank-indel distance, only uses insertions and deletions of entire chromosomes. We construct phylogenetic trees with our distances and the DCJ-Indel distance for simulated data and real prokaryotic genomes, and compare them against reference trees. For simulated data, our distances outperform the DCJ-Indel distance using the Quartet metric as baseline. This suggests that rank distances are more robust for comparing distantly related species. For real prokaryotic genomes, all rearrangement-based distances yield phylogenetic trees that are topologically distant from the reference (65% similarity with Quartet metric), but are able to cluster related species within their respective clades and distinguish the Shigella strains as the farthest relative of the Escherichia coli strains, a feature not seen in the reference tree. AVAILABILITY AND IMPLEMENTATION: Code and instructions are available at https://github.com/meidanis-lab/rank-indel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online
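
As a rough illustration of the matrix representation this abstract refers to, the sketch below computes the basic rank distance for genomes with equal gene content, assuming the usual definition from the rank-distance literature: a genome over n extremities is a symmetric 0/1 matrix with a 1 at (i, j) and (j, i) for each adjacency and a 1 at (i, i) for each telomere, and the distance is rank(A - B). It does not implement the indel generalizations of the paper, and the function names and the integer indexing of extremities are assumptions made for this example.

```python
from fractions import Fraction


def genome_matrix(adjacencies, telomeres, n):
    """Build the n x n genome matrix from adjacency pairs and telomeres."""
    m = [[0] * n for _ in range(n)]
    for i, j in adjacencies:
        m[i][j] = m[j][i] = 1
    for i in telomeres:
        m[i][i] = 1
    return m


def matrix_rank(rows):
    """Rank over the rationals by Gaussian elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in rows]
    ncols = len(m[0]) if m else 0
    rank = 0
    for col in range(ncols):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the column entry in all rows below the pivot row.
        for r in range(rank + 1, len(m)):
            if m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def rank_distance(a, b):
    """Rank of the difference of two genome matrices of the same size."""
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    return matrix_rank(diff)


# Extremities indexed 0..5 as 1t, 1h, 2t, 2h, 3t, 3h.
# Genome A is the linear chromosome (1, 2, 3); genome B is (1, -2, 3).
A = genome_matrix([(1, 2), (3, 4)], [0, 5], 6)
B = genome_matrix([(1, 3), (2, 4)], [0, 5], 6)
```

Exact fractions are used instead of floating point so that the rank is not affected by rounding; for large genomes a numerical library would be the more practical choice.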

    On the Inversion-Indel Distance

    Willing E, Zaccaria S, Dias Vieira Braga M, Stoye J. On the Inversion-Indel Distance. BMC Bioinformatics. 2013;14(Suppl 15: Proc. of RECOMB-CG 2013): S3. BACKGROUND: The inversion distance, that is, the distance between two unichromosomal genomes with the same content allowing only inversions of DNA segments, can be computed thanks to a pioneering approach of Hannenhalli and Pevzner in 1995. In 2000, El-Mabrouk extended the inversion model to allow the comparison of unichromosomal genomes with unequal contents, thus allowing insertions and deletions of DNA segments besides inversions. However, an exact algorithm was presented only for the case in which we have insertions alone and no deletions (or vice versa), while a heuristic was provided for the symmetric case, which allows both insertions and deletions and is called the inversion-indel distance. In 2005, Yancopoulos, Attie and Friedberg started a new branch of research by introducing the generic double cut and join (DCJ) operation, which can represent several genome rearrangements (including inversions). Among others, the DCJ model gave rise to two important results. First, it has been shown that the inversion distance can be computed in a simpler way with the help of the DCJ operation. Second, the DCJ operation originated the DCJ-indel distance, which allows the comparison of genomes with unequal contents, considering DCJ, insertions and deletions, and can be computed in linear time. RESULTS: In the present work we put these two results together to solve an open problem, showing that, when the graph that represents the relation between the two compared genomes has no bad components, the inversion-indel distance is equal to the DCJ-indel distance. We also give a lower and an upper bound for the inversion-indel distance in the presence of bad components.

    On Distance and Sorting of the Double Cut-and-Join and the Inversion-*indel* Model

    Willing E. On Distance and Sorting of the Double Cut-and-Join and the Inversion-*indel* Model. Bielefeld: Universität Bielefeld; 2018. In comparative genomics, two or more genomes are compared with respect to their degree of relatedness. The aim of this thesis is to study mathematical models that can determine, on the one hand, the evolutionary *distance* and, on the other hand, the evolutionary events between two genomes. Besides methods that operate at a low level, e.g. on bases (or base pairs), more abstract models that compare genomes at the level of individual genes or even larger segments are also well established. While at the lower level it is individual bases that are inserted, deleted or substituted, at the higher level it is, for example, entire genes. At the higher level, the results of so-called *genome rearrangements* can be observed, which are described in a *sorting scenario*. When one genome is compared with another, these can include inversions and translocations, but also insertions or deletions of large regions. A well-known model is the *inversion model*, which determines the relatedness of two genomes by inversions alone. Another is the *double cut-and-join (DCJ)* model, which, in addition to inversions, also allows translocations, chromosome fusions and fissions, as well as the integration and excision of small circular carriers. The distance here is the number of steps in a sorting scenario of minimum length. This dissertation is divided into two parts. The first part deals with randomly drawing a sorting scenario within the DCJ model. Besides some naive approaches, we are mainly interested in drawing every scenario with equal probability, i.e. uniformly distributed. To this end, we not only consider the entire sorting space but also present measures for efficient computation. The presented algorithm is implemented in a software suite and is evaluated with respect to its generation of random scenarios. The second part of the thesis deals with the inversion-*indel* model. This little-studied model allows inversions as well as insertions and deletions (*indels* for short). Its distance is to be expressed in terms of the DCJ distance or the DCJ-*indel* distance. We extend well-known data structures of the inversion model so that they can represent insertions and deletions. For this we use, among others, approaches from two other models: the extension of the DCJ model by indels, and the determination of the relationship between the DCJ model and the inversion model. To determine the minimum number of inversions, insertions and deletions, one has to take into account that inversions can merge two or more separate regions that are marked for deletion. These can then be deleted in a single step. The same holds for insertions. We first consider instances in which the DCJ-indel distance and the inversion-indel distance are identical. We then move on to computing hard instances, i.e. those that require more steps than the DCJ(-indel) distance. To this end, the different properties of the instances and their effects have to be identified. By cleverly reducing the solution space, we arrive at a set of base cases that we can solve by exhaustive enumeration. Overall, the steps taken provide not only a solution for the inversion-indel distance in terms of the DCJ-indel distance, but also a way of sorting. The search for an exact solution to the distance and sorting problems in the inversion-indel model remained unanswered for a long time. The main contribution of this thesis is to settle these two questions.

    A Unifying Model of Genome Evolution Under Parsimony

    We present a data structure called a history graph that offers a practical basis for the analysis of genome evolution. It conceptually simplifies the study of parsimonious evolutionary histories by representing both substitutions and double cut and join (DCJ) rearrangements in the presence of duplications. The problem of constructing parsimonious history graphs thus subsumes related maximum parsimony problems in the fields of phylogenetic reconstruction and genome rearrangement. We show that tractable functions can be used to define upper and lower bounds on the minimum number of substitutions and DCJ rearrangements needed to explain any history graph. These bounds become tight for a special type of unambiguous history graph called an ancestral variation graph (AVG), which constrains in its combinatorial structure the number of operations required. We finally demonstrate that for a given history graph G, a finite set of AVGs describes all parsimonious interpretations of G, and this set can be explored with a few sampling moves. Comment: 52 pages, 24 figures

    Models and Algorithms for Comparative Genomics

    The deluge of sequenced whole-genome data has motivated the study of comparative genomics, which provides global views on genome evolution, and also offers practical solutions in deciphering the functional roles of components of genomes. A fundamental computational problem in whole-genome comparison is to infer the most likely large-scale events (rearrangements and content-modifying events) of given genomes during their history of evolution. Based on the principle of parsimony, such inference is usually formulated as the so-called edit distance problems (for two genomes) or median problems (for multiple genomes), i.e., to compute the minimum number of certain types of large-scale events that can explain the differences between the given genomes. In this dissertation, we develop novel algorithms for edit distance problems and median problems and also apply them to analyze and annotate biological datasets. For pairwise whole-genome comparison, we study the most challenging cases of edit distance problems, in which the given genomes contain duplicate genes. We proposed several exact algorithms and approximation algorithms under various combinations of large-scale events. Specifically, we designed the first exact algorithm to compute the edit distance under the DCJ (double-cut-and-join) model, and the first exact algorithm to compute the edit distance under a model including DCJ operations and segmental duplications. We devised a (1.5 + ε)-approximation algorithm to compute the edit distance under a model including DCJ operations, insertions, and deletions. We also proposed a very fast and exact algorithm to compute the exemplar breakpoint distance. For multiple whole-genome comparison, we study the median problem under the DCJ model. We designed a polynomial-time algorithm using a network flow formulation to compute the so-called adequate subgraphs, a central phase in computing the median. We also proved that an existing upper bound of the median distance is tight. These algorithms determine the correspondence between functional elements (for instance, genes) across genomes, and thus can be used to systematically infer functional relationships and annotate genomes. For example, we applied our methods to infer orthologs and in-paralogs between a pair of genomes, a key step in analyzing the functions of protein-coding genes. On biological whole-genome datasets, our methods run very fast, scale up to whole genomes, and also achieve very high accuracy.

    Algorithms for reconstruction of chromosomal structures


    Algorithms and Data Structures for Sequence Analysis in the Pan-Genomic Era

    This thesis is motivated by two important processes in bioinformatics, namely variation calling and haplotyping. The contributions range from basic algorithms for sequence analysis to the implementation of pipelines to deal with real data. Variation calling characterizes an individual's genome by identifying how it differs from a reference genome. It uses reads (small DNA fragments) extracted from a biological sample, and aligns them to the reference to identify the genetic variants present in the donor's genome. A related procedure is haplotype phasing. Sexually reproducing organisms have their genome organized in two sets of chromosomes with equivalent functions. One set is inherited from the mother and the other from the father, and each set is called a haplotype. The haplotype phasing problem is, once genetic variants are discovered, to attribute them to either of the haplotypes. The first problem we consider is to efficiently index large collections of genomes. Lempel-Ziv compression algorithms are a useful tool for this. We focus on two members of this family, namely the RLZ and LZ77 algorithms. We analyze the first, and propose some modifications to both, to finally develop a scalable index for large and repetitive collections. Then, using that index, we propose a novel pipeline for variation calling that replaces the single reference by thousands of them. We test our variation calling pipeline on a mutation-rich subsequence of a Finnish population genome. Our approach consistently outperforms the single-reference approach to variation calling. The second part of this thesis revolves around the haplotype phasing problem. First, we propose a generalization of sequence alignment for diploid genomes. Next we extend this model to offer a solution for the haplotype phasing problem in the family-trio setting (that is, when we know the variants present in an individual and in her parents). Finally, in the context of an existing read-based approach to haplotyping, we go back to basic algorithms, where we model the problem of pruning a set of reads aligned to a reference as an interval scheduling problem. We propose an exact solution that runs in subquadratic time and a 2-approximation algorithm that runs in linearithmic time.