On the Inversion-Indel Distance
Willing E, Zaccaria S, Dias Vieira Braga M, Stoye J. On the Inversion-Indel Distance. BMC Bioinformatics. 2013;14(Suppl 15: Proc. of RECOMB-CG 2013): S3.
Background
The inversion distance, that is, the distance between two unichromosomal genomes with the same content when only inversions of DNA segments are allowed, can be computed thanks to a pioneering approach of Hannenhalli and Pevzner in 1995. In 2000, El-Mabrouk extended the inversion model to allow the comparison of unichromosomal genomes with unequal contents, thus considering insertions and deletions of DNA segments in addition to inversions. However, an exact algorithm was presented only for the case in which there are insertions alone and no deletions (or vice versa), while a heuristic was provided for the symmetric case, which allows both insertions and deletions and is called the inversion-indel distance. In 2005, Yancopoulos, Attie and Friedberg started a new branch of research by introducing the generic double cut and join (DCJ) operation, which can represent several genome rearrangements (including inversions). Among other things, the DCJ model gave rise to two important results. First, it has been shown that the inversion distance can be computed in a simpler way with the help of the DCJ operation. Second, the DCJ operation gave rise to the DCJ-indel distance, which allows the comparison of genomes with unequal contents, considering DCJ, insertions and deletions, and can be computed in linear time.
Results
In the present work we put these two results together to solve an open problem: we show that, when the graph that represents the relation between the two compared genomes has no bad components, the inversion-indel distance is equal to the DCJ-indel distance. We also give a lower and an upper bound for the inversion-indel distance in the presence of bad components.
Bridging Disparate Views on the DCJ-Indel Model for a Capping-Free Solution to the Natural Distance Problem
One of the most fundamental problems in genome rearrangement is the (genomic) distance problem. It is typically formulated as finding the minimum number of rearrangements under a model that are needed to transform one genome into the other. A powerful multi-chromosomal model is the Double Cut and Join (DCJ) model.
While the DCJ model cannot deal with some situations that occur in practice, such as duplicated or lost regions, it was extended over time to handle these cases. First, it was extended to the DCJ-indel model, solving the issue of lost markers. Later, ILP solutions were developed for so-called natural genomes, in which each genomic region may occur an arbitrary number of times, making it possible, in theory, to solve the distance problem for any pair of genomes. However, some theoretical and practical issues remained unsolved.
On the theoretical side of things, there exist two disparate views of the DCJ-indel model, motivated in the same way, but with different conceptualizations that could not be reconciled so far.
On the practical side, while the solutions for natural genomes typically perform well on telomere-to-telomere resolved genomes, they have been shown in recent years to quickly lose performance on genomes with a large number of contigs or linear chromosomes. This has been linked to a particular technique named capping, which increases the solution space superexponentially.
Recently, we introduced a new conceptualization of the DCJ-indel model within the context of another rearrangement problem. In this manuscript, we will apply this new conceptualization to the distance problem. In doing this, we uncover the relation between the disparate conceptualizations of the DCJ-indel model. We are also able to derive an ILP solution to the distance problem that does not rely on capping and therefore significantly improves upon the performance of previous solutions for genomes with high numbers of contigs while still solving the problem exactly. To the best of our knowledge, our approach is the first allowing for an exact computation of the DCJ-indel distance for natural genomes with large numbers of linear chromosomes.
We demonstrate the performance advantage, as well as the limitations, of our approach in comparison to an existing solution on simulated genomes, and show its practical usefulness in an analysis of 11 Drosophila genomes.
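As a point of reference for the plain DCJ model that this line of work builds on, the following Python sketch computes the DCJ distance of two genomes with identical, duplicate-free gene content using the well-known formula d = N - (C + I/2), where N is the number of genes, C the number of cycles and I the number of odd paths of the adjacency graph. It does not implement the capping-free ILP or the indel extension described above. The genome encoding (non-empty chromosomes given as lists of signed integers plus a circularity flag) and the function names are assumptions made for illustration.

```python
def neighbour_map(genome):
    """Map each gene extremity to its adjacent extremity (None at a telomere).

    `genome` is a list of (genes, circular) pairs; `genes` is a non-empty list
    of signed integers. This encoding is an illustrative assumption.
    """
    nbr = {}
    for genes, circular in genome:
        # Each gene contributes (left end, right end) in reading direction.
        ends = [((abs(g), "t"), (abs(g), "h")) if g > 0
                else ((abs(g), "h"), (abs(g), "t")) for g in genes]
        for (_, right), (left, _) in zip(ends, ends[1:]):
            nbr[right], nbr[left] = left, right
        first, last = ends[0][0], ends[-1][1]
        if circular:
            nbr[last], nbr[first] = first, last
        else:
            nbr[first] = nbr[last] = None
    return nbr


def dcj_distance(a, b):
    """DCJ distance N - (C + I/2) for genomes with equal, unique gene content."""
    na, nb = neighbour_map(a), neighbour_map(b)
    if na.keys() != nb.keys():
        raise ValueError("genomes must have the same gene content")
    seen, cycles, odd_paths = set(), 0, 0
    for start in na:
        if start in seen:
            continue
        # Explore one component; every extremity has one slot per genome.
        stack, size, open_ends = [start], 0, 0
        seen.add(start)
        while stack:
            x = stack.pop()
            size += 1
            for y in (na[x], nb[x]):
                if y is None:
                    open_ends += 1          # telomere: the component is a path
                elif y not in seen:
                    seen.add(y)
                    stack.append(y)
        if open_ends == 0:
            cycles += 1
        elif size % 2 == 1:
            # A path over an odd number of extremities is an odd path
            # of the adjacency graph.
            odd_paths += 1
    n_genes = len(na) // 2
    return n_genes - cycles - odd_paths // 2


if __name__ == "__main__":
    linear = [([1, 2, 3], False)]
    inverted = [([1, -2, 3], False)]
    print(dcj_distance(linear, inverted))
```

In this encoding a single inversion of gene 2 gives distance 1. Components are counted on gene extremities rather than on adjacency-graph vertices, which yields the same cycle and odd-path counts.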
Generalizations of the genomic rank distance to indels
MOTIVATION: The rank distance model represents genome rearrangements in multi-chromosomal genomes as matrix operations, which allows the reconstruction of parsimonious histories of evolution by rearrangements. We seek to generalize this model by allowing for genomes with different gene content, to accommodate a broader range of biological contexts. We approach this generalization by using a matrix representation of genomes. This leads to simple distance formulas and sorting algorithms for genomes with different gene contents, but without duplications. RESULTS: We generalize the rank distance to genomes with different gene content in two different ways. The first approach adds insertions, deletions and the substitution of a single extremity to the basic operations. We show how to efficiently compute this distance. To avoid genomes with incomplete markers, our alternative distance, the rank-indel distance, only uses insertions and deletions of entire chromosomes. We construct phylogenetic trees with our distances and the DCJ-Indel distance for simulated data and real prokaryotic genomes, and compare them against reference trees. For simulated data, our distances outperform the DCJ-Indel distance using the Quartet metric as baseline. This suggests that rank distances are more robust for comparing distantly related species. For real prokaryotic genomes, all rearrangement-based distances yield phylogenetic trees that are topologically distant from the reference (65% similarity with Quartet metric), but are able to cluster related species within their respective clades and distinguish the Shigella strains as the farthest relative of the Escherichia coli strains, a feature not seen in the reference tree. AVAILABILITY AND IMPLEMENTATION: Code and instructions are available at https://github.com/meidanis-lab/rank-indel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online
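The base rank distance that this work generalizes can be sketched in a few lines, assuming the usual matrix encoding: two genomes over the same 2n gene extremities are written as symmetric 0/1 matrices (a 1 at (i, j) and (j, i) for an adjacency, a 1 on the diagonal for a telomere), and the distance is the rank of their difference. The sketch covers equal gene content only, not the indel generalizations of the paper. The helper names are illustrative assumptions, and exact rational arithmetic replaces a linear-algebra library to keep the example dependency-free.

```python
from fractions import Fraction


def genome_matrix(adjacencies, n_ext):
    """Build the genome matrix for `n_ext` extremities indexed 0..n_ext-1.

    `adjacencies` is an iterable of index pairs; any extremity not covered
    by a pair is treated as a telomere (1 on the diagonal).
    """
    m = [[0] * n_ext for _ in range(n_ext)]
    paired = set()
    for i, j in adjacencies:
        m[i][j] = m[j][i] = 1
        paired.update((i, j))
    for i in range(n_ext):
        if i not in paired:
            m[i][i] = 1
    return m


def rank(matrix):
    """Matrix rank by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][c] != 0:
                f = rows[i][c] / rows[r][c]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


def rank_distance(a, b):
    """Rank distance: rank(A - B) for two genome matrices of equal size."""
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    return rank(diff)


if __name__ == "__main__":
    # Extremities 0..5 stand for 1t, 1h, 2t, 2h, 3t, 3h.
    a = genome_matrix([(1, 2), (3, 4)], 6)   # linear chromosome 1 2 3
    b = genome_matrix([(1, 3), (2, 4)], 6)   # linear chromosome 1 -2 3
    print(rank_distance(a, b))
```

For the example above the inversion has rank distance 2, whereas a single cut or join changes the rank of the difference by 1.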
On Distance and Sorting of the Double Cut-and-Join and the Inversion-*indel* Model
Willing E. On Distance and Sorting of the Double Cut-and-Join and the Inversion-*indel* Model. Bielefeld: Universität Bielefeld; 2018.
In comparative genomics, two or more genomes are compared with respect to their degree of relatedness. The goal of this thesis is to investigate mathematical models that determine, on the one hand, the evolutionary *distance* and, on the other hand, the evolutionary events between two genomes.
Besides methods that operate at a low level, e.g. on bases (or base pairs), more abstract models that compare genomes at the level of individual genes or even larger segments are also established. Whereas at the lower level single bases are inserted, deleted or substituted, at the higher level these are, for example, entire genes. At the higher level, the results of so-called *genome rearrangements* can be observed, which are described in a *sorting scenario*. When one genome is compared with another, these can include inversions and translocations, but also insertions or deletions of large regions. A well-known model is the *inversion model*, which determines the degree of relatedness of two genomes exclusively through inversions. Another is the *double cut-and-join (DCJ)* model, which, besides inversions, also allows translocations, chromosome fusions and fissions, as well as the integration and extraction of small circular elements. The distance here is the number of intermediate steps of a sorting scenario of minimum length.
This dissertation is divided into two parts. The first part deals with randomly drawing a sorting scenario within the DCJ model. Besides some naive approaches, we are mainly interested in drawing each scenario with equal probability, i.e. uniformly distributed. To this end, not only is the entire sorting space considered, but measures for efficient computation are also presented. The presented algorithm is implemented in a software suite and is evaluated with respect to its generation of random scenarios.
The second part of the thesis deals with the inversion-*indel* model. This little-studied model allows inversions as well as insertions and deletions (*indels* for short). Its distance is to be expressed in terms of the DCJ distance and the DCJ-*indel* distance, respectively. We extend well-known data structures of the inversion model so that they can represent insertions and deletions. For this we use, among other things, approaches from two other models: the extension of the DCJ model by indels, and the determination of the relationship between the DCJ model and the inversion model.
To determine the minimum number of inversions, insertions and deletions, it must be taken into account that inversions can merge two or more separate regions that are marked for deletion. These can then be deleted in a single step. The same holds for insertions. We first consider instances in which the DCJ-indel distance and the inversion-indel distance are identical. We then move on to computing hard instances, i.e. those that require more steps than the DCJ(-indel) distance. To this end, the different properties of the instances and their effects have to be identified. By cleverly reducing the solution space, we arrive at a set of base cases, which we can solve by exhaustive enumeration. Overall, the steps taken provide not only the solution of the inversion-indel distance in terms of the DCJ-indel distance, but also a way of sorting. The search for an exact solution to the distance problem and the sorting problem in the inversion-indel model long remained unanswered. The main contribution of this thesis is to settle these two questions.
The Distance and Median Problems in the Single-Cut-Or-Join Model with Single-Gene Duplications
Background.
In the field of genome rearrangement algorithms, models accounting for gene duplication often lead to hard problems. For example, while computing the pairwise distance is tractable in most duplication-free models, the problem is NP-complete for most extensions of these models accounting for duplicated genes. Moreover, problems involving more than two genomes, such as the genome median and the Small Parsimony problem, are intractable for most duplication-free models, with some exceptions, for example the Single-Cut-or-Join (SCJ) model.
Results.
We introduce a variant of the SCJ distance that accounts for duplicated genes, in the context of directed evolution from an ancestral genome to a descendant genome where orthology relations between ancestral genes and their descendants are known. Our model includes two duplication mechanisms: single-gene tandem duplication and the creation of single-gene circular chromosomes. We prove that in this model, computing the directed distance and a parsimonious evolutionary scenario in terms of SCJ and single-gene duplication events can be done in linear time. We also show that the directed median problem is tractable for this distance, while the rooted median problem, where we assume that one of the given genomes is ancestral to the median, is NP-complete. We also describe an Integer Linear Program for solving this problem. We evaluate the directed distance and rooted median algorithms on simulated data.
Conclusion.
Our results provide a simple genome rearrangement model, extending the SCJ model to account for single-gene duplications, for which we prove a mix of tractability and hardness results. For the NP-complete rooted median problem, we design a simple Integer Linear Program. Our publicly available implementation of these algorithms for the directed distance and median problems allows these problems to be solved efficiently on large instances.
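For orientation, the duplication-free SCJ distance that this model extends can be sketched as the size of the symmetric difference of the two adjacency sets, since each adjacency present in only one genome costs one cut or one join. The Python sketch below assumes equal, duplicate-free gene content and does not cover the single-gene duplication events introduced in the paper. The genome encoding (non-empty chromosomes given as signed-integer lists plus a circularity flag) and the function names are illustrative assumptions.

```python
def adjacency_set(genome):
    """Return the set of adjacencies of a genome.

    `genome` is a list of (genes, circular) pairs; `genes` is a non-empty
    list of signed integers. Each adjacency is a frozenset of two gene
    extremities, e.g. {(1, "h"), (2, "t")}.
    """
    adj = set()
    for genes, circular in genome:
        # (left end, right end) of each gene in reading direction.
        ends = [((abs(g), "t"), (abs(g), "h")) if g > 0
                else ((abs(g), "h"), (abs(g), "t")) for g in genes]
        for (_, right), (left, _) in zip(ends, ends[1:]):
            adj.add(frozenset((right, left)))
        if circular:
            adj.add(frozenset((ends[-1][1], ends[0][0])))
    return adj


def scj_distance(a, b):
    """SCJ distance without duplications: cuts plus joins needed."""
    # Adjacencies only in `a` must be cut, those only in `b` must be joined.
    return len(adjacency_set(a) ^ adjacency_set(b))


if __name__ == "__main__":
    ancestor = [([1, 2, 3], False)]
    descendant = [([1, -2, 3], False)]
    print(scj_distance(ancestor, descendant))
```

In this sketch an inversion of one gene costs 4 (two cuts and two joins), which illustrates why SCJ distances are larger than DCJ distances on the same pair of genomes.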