    On genome rearrangement models (Sobre modelos de rearranjo de genomas)

    Advisor: João Meidanis. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação.
Abstract: Genome rearrangements are events where large blocks of DNA exchange places during evolution. With the growing availability of whole genome data, the analysis of these events can be an important tool for understanding evolutionary genomics. Several mathematical models of genome rearrangement have been proposed in the last 20 years. In this thesis, we propose two new rearrangement models. The first was introduced as an alternative definition of the breakpoint distance. The breakpoint distance is one of the simplest genome comparison measures, but there is still no consensus on how to define it for multichromosomal genomes: Pevzner and Tesler gave a definition in a 2003 paper, and Tannier et al. defined it differently in 2008. In this thesis we provide yet another alternative, called single-cut-or-join (SCJ). We show that, besides the distance, several classic genome rearrangement problems, such as genome median, genome halving and small parsimony, become easy under SCJ, and we provide polynomial time algorithms for them. The second model we introduce is the Adjacency Algebraic Theory, an extension of the Algebraic Formalism proposed by Meidanis and Dias that allows the modeling of linear chromosomes. This was the main limitation of the original formalism, which could deal with circular chromosomes only. We believe that the algebraic formalism is an interesting alternative for solving rearrangement problems, with a different perspective that could complement the more commonly used combinatorial graph-theoretic approach. We present polynomial time algorithms to compute the algebraic distance and to find rearrangement scenarios between two genomes. We show how to compute the algebraic distance from the adjacency graph, for an easier comparison with other rearrangement distances.
Finally, we show how all classic rearrangement operations can be modeled using the algebraic theory. Degree: Doctorate in Computer Science (Doutor em Ciência da Computação).
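
As an illustration of why the SCJ distance is easy to compute, here is a minimal Python sketch. It assumes the usual formulation of the SCJ model, in which a genome is a set of adjacencies between gene extremities and the distance is the number of adjacencies present in exactly one of the two genomes. The encoding (signed integers for genes, `"h"`/`"t"` for head and tail) and the function names `adjacencies` and `scj_distance` are illustrative choices, not taken from the thesis.

```python
from typing import FrozenSet, List, Set, Tuple

Extremity = Tuple[int, str]          # (gene id, "h" for head or "t" for tail)
Chromosome = Tuple[List[int], bool]  # (signed gene order, is_circular)


def adjacencies(genome: List[Chromosome]) -> Set[FrozenSet[Extremity]]:
    """Collect the adjacencies (pairs of neighboring extremities) of a genome."""
    adj = set()
    for genes, circular in genome:
        pairs = list(zip(genes, genes[1:]))
        if circular and genes:
            pairs.append((genes[-1], genes[0]))  # close the circle
        for a, b in pairs:
            # A gene read forward (+g) goes tail -> head; reversed (-g) goes head -> tail.
            right = (abs(a), "h" if a > 0 else "t")
            left = (abs(b), "t" if b > 0 else "h")
            adj.add(frozenset((right, left)))
    return adj


def scj_distance(a: List[Chromosome], b: List[Chromosome]) -> int:
    """Cuts needed (adjacencies only in a) plus joins needed (adjacencies only in b)."""
    return len(adjacencies(a) ^ adjacencies(b))


if __name__ == "__main__":
    linear = [([1, 2, 3], False)]
    inverted = [([1, -2, 3], False)]
    print(scj_distance(linear, inverted))  # two cuts plus two joins
```

In this sketch an inversion costs four operations (two cuts and two joins), which reflects how SCJ trades biological realism for computational simplicity.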

    Generalizations of the genomic rank distance to indels

    MOTIVATION: The rank distance model represents genome rearrangements in multi-chromosomal genomes as matrix operations, which allows the reconstruction of parsimonious histories of evolution by rearrangements. We seek to generalize this model by allowing for genomes with different gene content, to accommodate a broader range of biological contexts. We approach this generalization by using a matrix representation of genomes. This leads to simple distance formulas and sorting algorithms for genomes with different gene contents, but without duplications. RESULTS: We generalize the rank distance to genomes with different gene content in two different ways. The first approach adds insertions, deletions and the substitution of a single extremity to the basic operations. We show how to efficiently compute this distance. To avoid genomes with incomplete markers, our alternative distance, the rank-indel distance, only uses insertions and deletions of entire chromosomes. We construct phylogenetic trees with our distances and the DCJ-Indel distance for simulated data and real prokaryotic genomes, and compare them against reference trees. For simulated data, our distances outperform the DCJ-Indel distance using the Quartet metric as baseline. This suggests that rank distances are more robust for comparing distantly related species. For real prokaryotic genomes, all rearrangement-based distances yield phylogenetic trees that are topologically distant from the reference (65% similarity with Quartet metric), but are able to cluster related species within their respective clades and distinguish the Shigella strains as the farthest relative of the Escherichia coli strains, a feature not seen in the reference tree. AVAILABILITY AND IMPLEMENTATION: Code and instructions are available at https://github.com/meidanis-lab/rank-indel. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online

    Algorithms and methods for large-scale genome rearrangements identification

    This thesis by compendium addresses the formal definition of synteny blocks (SBs), starting from High-scoring Segment Pairs (HSPs), which are well known and widely accepted. The first objective focused on detecting SBs as a combination of HSPs, including repeats, which increased the complexity of the model. The result was a more accurate method that improves on the quality of state-of-the-art results. This method applies rules based on the adjacency of SBs, which also makes it possible to detect large-scale genome rearrangements (LSGRs) and classify them as inversions, translocations or duplications, yielding a framework able to work with LSGRs in single-chromosome organisms. Later, in a second article, this framework was used to refine SB borders. In our novel proposal, the repeats flanking the SBs are used to refine the borders by exploiting the redundancy that those repeats introduce. From a multiple alignment of these repeats, the identity vectors of the SB and of the consensus sequence of the aligned repeats are computed. A finite state machine designed to detect transition points in the difference between the two vectors then determines the start and end points of the refined SBs. This method also proved useful for detecting break points (BPs). These points appear as the region between two adjacent SBs. The method does not force a BP to be either a region or a point; that depends on the alignments of the repeats and of the SB in question. The method is applied in a third work, which tackles a metagenome analysis use case. It is well known that the information stored in databases does not necessarily correspond to the uncultured samples contained in a metagenome, and it is conceivable that the assignment of a metagenome sample is hampered by a rearrangement event.
The article shows that metagenome samples mapping to the exclusive regions of a genome (those it does not share with other genomes) support the presence of that genome in the metagenome. These exclusive regions are easily derived from a multiple genome comparison, as the regions that are not part of any SB. A definition within a multiple genome comparison space is more precise than definitions built from pairwise comparisons because, among other things, it allows refinement through a procedure similar to the one described in the second article (using SBs instead of repeats). This definition also resolves the contradiction in the definition of BPs (mentioned in the second publication), whereby the same region of a genome may be detected as a BP or as part of an SB depending on the genome it is compared with. The multiple-comparison definition of SB also provides precise information for reconstructing LSGRs, with a view to approximating the true common ancestor of species. In addition, it offers a solution to the granularity problem in SB detection: we start from small, well-conserved SBs and gradually increase the size of the blocks through LSGR reconstruction.
The results expected from this line of work point toward a metric for obtaining more precise inter-genome distances, combining sequence similarity and LSGR frequencies. This thesis is a compendium of three articles recently published in high-impact journals, in which we show the process that led us to propose the definition of Elementary Conservation Units (conserved regions between genomes that are detected after a multiple comparison), as well as some basic operations such as inversions, transpositions and duplications. The three articles are connected throughout by the detection of Synteny Blocks (SB) and large-scale genome rearrangements (LSGR) (see section 2), and they support the need for the framework described in the section "Systems And Methods". Indeed, the intellectual work carried out in this thesis and the conclusions of the publications have been essential to understanding that an appropriate SB definition is the key to many comparative genomics methods. DNA rearrangement events are one of the main causes of evolution, and their effects can be observed in new species, new biological functions, and so on. Small-scale rearrangements such as insertions, deletions or substitutions have been widely studied, and accepted models exist for detecting them. However, methods for identifying large-scale rearrangements still suffer from limitations and lack of precision, mainly because there is still no accepted definition of SB. The concept of SB refers to conserved regions between two genomes that keep the same order and strand. Although methods exist to detect them, they avoid dealing with repeats or restrict the search to coding regions only, for the sake of a simpler model.
Refining the borders of these blocks remains an unsolved problem to this day.

    A new algebraic approach to genome rearrangement models

    We present a unified framework for modelling genomes and their rearrangements in a genome algebra, as elements that simultaneously incorporate all physical symmetries. Building on previous work utilising the group algebra of the symmetric group, we explicitly construct the genome algebra for the case of unsigned circular genomes with dihedral symmetry and show that the maximum likelihood estimate (MLE) of genome rearrangement distance can be validly and more efficiently performed in this setting. We then construct the genome algebra for the general case, that is, for genomes represented by elements of an arbitrary group and symmetry group, and show that the MLE computations can be performed entirely within this framework. There is no prescribed model in this framework; that is, it allows any choice of rearrangements with arbitrary weights. Further, since the likelihood function is built from path probabilities (a generalisation of path counts), the framework may be utilised for any distance measure that is based on path probabilities. Comment: 35 pages
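
To make the dihedral symmetry of unsigned circular genomes concrete, the following Python sketch enumerates all rotations and reflections of a circular gene order and picks one canonical representative. It only illustrates the symmetry itself; it is not the genome algebra construction from the paper, and the names `dihedral_orbit` and `canonical_form` are hypothetical.

```python
from typing import Sequence, Set, Tuple


def dihedral_orbit(genome: Sequence[int]) -> Set[Tuple[int, ...]]:
    """All gene orders that describe the same unsigned circular genome."""
    n = len(genome)
    images = set()
    # The dihedral group acts by the n rotations of the sequence and of its mirror image.
    for seq in (tuple(genome), tuple(reversed(genome))):
        for r in range(n):
            images.add(seq[r:] + seq[:r])
    return images


def canonical_form(genome: Sequence[int]) -> Tuple[int, ...]:
    """Lexicographically smallest member of the orbit, used as a representative."""
    return min(dihedral_orbit(genome))


if __name__ == "__main__":
    # Two readings of the same circular genome share a canonical form.
    print(canonical_form([3, 4, 1, 2]) == canonical_form([2, 1, 4, 3]))
    # A genuinely different gene order does not.
    print(canonical_form([1, 3, 2, 4]) == canonical_form([1, 2, 3, 4]))
```

Treating each orbit as a single object is what lets a symmetry-aware model avoid counting the same physical genome several times.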

    Genome reconstruction and combinatoric analyses of rearrangement evolution

    Cancer is often associated with a high number of large-scale structural rearrangements. In a highly selective environment, some 'driver' mutations conferring a clonal growth advantage will be positively selected, accounting for further cancer development. Clarifying their nature, as well as their contribution to the pathology, is a major current focus of biomedical research. Next-generation sequencing technologies can now be used to generate high-resolution datasets of these alterations in cancer genomes. This project has been developed along two main lines: 1) the reconstruction of aberrant cancer karyotypes, together with their underlying evolutionary history; 2) the elucidation of some combinatorial properties associated with gene duplications. We applied graph theory to the problem of reconstructing the final cancer genome sequence; additionally, we developed an algorithmic approach for reconstructing a multi-step evolution consistent with read coverage and paired-end data, giving insights into the possible molecular mechanisms underlying rearrangements. Looking at the combinatorics of both tandem and inverted duplication, we developed an algebraic formalism for the representation of these processes. This allowed us both to explore the geometric properties of sequences arising by Tandem Duplication (TD) and to obtain a recursion for the number of tandem duplication evolutions after n events. Such results are missing for inverted duplications, whose combinatorial properties have nevertheless been explored in depth. Our results have allowed: 1) the identification, through an original approach, of potential rearrangement mechanisms associated with cancer development, and 2) the definition and mathematical description of the complete evolutionary space of specific rearrangement classes.

    On the distribution of cycles and paths in multichromosomal breakpoint graphs and the expected value of rearrangement distance

    Feijão P, Martinez F, Thévenin A. On the distribution of cycles and paths in multichromosomal breakpoint graphs and the expected value of rearrangement distance. BMC Bioinformatics. 2015;16(Suppl 19):S1.
Finding the smallest sequence of operations to transform one genome into another is an important problem in comparative genomics. The breakpoint graph is a discrete structure that has proven to be effective in solving distance problems, and the number of cycles in a cycle decomposition of this graph is one of the remarkable parameters to help in the solution of related problems. For a fixed k, the number of linear unichromosomal genomes (signed or unsigned) with n elements such that the induced breakpoint graphs have k disjoint cycles, known as the Hultman number, has already been determined. In this work we extend these results to multichromosomal genomes, providing formulas to compute the number of multichromosomal genomes having a fixed number of cycles and/or paths. We obtain an explicit formula for circular multichromosomal genomes and recurrences for general multichromosomal genomes, and discuss how these series can be used to calculate the distribution and expected value of the rearrangement distance between random genomes.
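
The role of the cycle count can be sketched in Python for the simplest case covered by the abstract: genomes made only of circular chromosomes over the same n genes. The sketch assumes the standard double-cut-and-join (DCJ) result that, with no paths in the graph, the distance equals n minus the number of alternating cycles. The helper names `matching`, `count_cycles` and `dcj_distance_circular` are illustrative, and the enumeration formulas of the paper are not reproduced here.

```python
from typing import Dict, List, Tuple

Extremity = Tuple[int, str]  # (gene id, "h" for head or "t" for tail)


def matching(chromosomes: List[List[int]]) -> Dict[Extremity, Extremity]:
    """Map each extremity to its neighbor in a genome of circular chromosomes."""
    partner = {}
    for genes in chromosomes:
        for a, b in zip(genes, genes[1:] + genes[:1]):
            right = (abs(a), "h" if a > 0 else "t")
            left = (abs(b), "t" if b > 0 else "h")
            partner[right] = left
            partner[left] = right
    return partner


def count_cycles(pa: Dict[Extremity, Extremity], pb: Dict[Extremity, Extremity]) -> int:
    """Count alternating cycles in the graph whose edges are the adjacencies of both genomes."""
    seen = set()
    cycles = 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        v = start
        while v not in seen:
            seen.add(v)
            w = pa[v]   # follow an adjacency of the first genome
            seen.add(w)
            v = pb[w]   # then an adjacency of the second genome
    return cycles


def dcj_distance_circular(a: List[List[int]], b: List[List[int]]) -> int:
    """n minus the cycle count; assumes both genomes are circular with equal gene sets."""
    n = sum(len(genes) for genes in a)
    return n - count_cycles(matching(a), matching(b))


if __name__ == "__main__":
    print(dcj_distance_circular([[1, 2, 3]], [[1, -2, 3]]))  # a single inversion
```

Because the distance is a simple function of the cycle count, knowing how many genomes yield each cycle count gives the distribution of the distance between random genomes, which is the link the abstract draws.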

    A cubic algorithm for the generalized rank median of three genomes

    The area of genome rearrangements has given rise to a number of interesting biological, mathematical and algorithmic problems. Among these, one of the most intractable has been that of finding the median of three genomes, a special case of the ancestral reconstruction problem. In this work we re-examine our recently proposed way of measuring genome rearrangement distance, namely the rank distance between the matrix representations of the corresponding genomes, and show that the median of three genomes can be computed exactly in polynomial time O(n^ω), where ω ≤ 3, with respect to this distance, when the median is allowed to be an arbitrary orthogonal matrix.
Results: We define the five fundamental subspaces depending on three input genomes, and use their properties to show that a particular action on each of these subspaces produces a median. In the process we introduce the notion of M-stable subspaces. We also show that the median found by our algorithm is always orthogonal, symmetric, and conserves any adjacencies or telomeres present in at least 2 out of 3 input genomes.
Conclusions: We test our method on both simulated and real data. We find that the majority of the realistic inputs result in genomic outputs, and for those that do not, our two heuristics perform well in terms of reconstructing a genomic matrix attaining a score close to the lower bound, while running in a reasonable amount of time. We conclude that the rank distance is not only theoretically intriguing, but also practically useful for median-finding, and potentially for ancestral genome reconstruction.
Funding: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP), grant 2016/01511-
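
The rank distance itself (not the median algorithm) can be sketched in a few lines of Python. The sketch assumes the usual matrix representation: a symmetric 0/1 matrix indexed by gene extremities, with a 1 linking each pair of adjacent extremities and a 1 on the diagonal for each telomere, and the distance defined as the rank of the difference of two such matrices. The function names and the extremity indexing are illustrative, and exact rational arithmetic is used only to keep the example free of external dependencies.

```python
from fractions import Fraction
from typing import List, Tuple

Chromosome = Tuple[List[int], bool]  # (signed gene order, is_circular), assumed non-empty


def genome_matrix(chromosomes: List[Chromosome], n_genes: int) -> List[List[int]]:
    """Symmetric 0/1 matrix over extremities; genes are numbered 1..n_genes."""
    def idx(gene: int, end: str) -> int:
        return 2 * (gene - 1) + (1 if end == "h" else 0)

    size = 2 * n_genes
    m = [[0] * size for _ in range(size)]
    for genes, circular in chromosomes:
        left_ends = [idx(abs(g), "t" if g > 0 else "h") for g in genes]
        right_ends = [idx(abs(g), "h" if g > 0 else "t") for g in genes]
        for k in range(len(genes) - 1):
            i, j = right_ends[k], left_ends[k + 1]
            m[i][j] = m[j][i] = 1          # adjacency
        i, j = right_ends[-1], left_ends[0]
        if circular:
            m[i][j] = m[j][i] = 1          # closing adjacency
        else:
            m[i][i] = m[j][j] = 1          # telomeres map to themselves
    return m


def rank(matrix: List[List[int]]) -> int:
    """Matrix rank by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    r = 0
    for c in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][c] != 0:
                f = rows[i][c] / rows[r][c]
                rows[i] = [x - f * y for x, y in zip(rows[i], rows[r])]
        r += 1
    return r


def rank_distance(a: List[Chromosome], b: List[Chromosome], n_genes: int) -> int:
    ma, mb = genome_matrix(a, n_genes), genome_matrix(b, n_genes)
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(ma, mb)]
    return rank(diff)


if __name__ == "__main__":
    print(rank_distance([([1, 2, 3], False)], [([1, -2, 3], False)], 3))
```

Under these assumptions a single cut or join changes the rank by one, while an inversion changes it by two.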