235 research outputs found

    Evaluating synteny for improved comparative studies

    Motivation: Comparative genomics aims to understand the structure and function of genomes by translating knowledge gained about some genomes to the object of study. Early approaches used pairwise comparisons, but today researchers are attempting to leverage the larger potential of multi-way comparisons. Comparative genomics relies on the structuring of genomes into syntenic blocks: blocks of sequence that exhibit conserved features across the genomes. Syntenic blocks are required for complex computations to scale to the billions of nucleotides present in many genomes; they enable comparisons across broad ranges of genomes because they filter out much of the individual variability; they highlight candidate regions for in-depth studies; and they facilitate whole-genome comparisons through visualization tools. However, the concept of a syntenic block remains loosely defined. Tools for the identification of syntenic blocks yield quite different results, thereby preventing a systematic assessment of the next steps in an analysis. Current tools do not include measurable quality objectives and thus cannot be benchmarked against themselves. Comparisons among tools have also been neglected; what few results are given use superficial measures unrelated to quality or consistency. Results: We present a theoretical model as well as an experimental basis for comparing syntenic blocks and thus also for improving or designing tools for the identification of syntenic blocks. We illustrate the application of the model and the measures by applying them to syntenic blocks produced by three different contemporary tools (DRIMM-Synteny, i-ADHoRe and Cyntenator) on a dataset of eight yeast genomes. Our findings highlight the need for a well-founded, systematic approach to the decomposition of genomes into syntenic blocks. Our experiments demonstrate widely divergent results among these tools, throwing into question the robustness of the basic approach in comparative genomics. 
We have taken the first step towards a formal approach to the construction of syntenic blocks by developing a simple quality criterion based on sound evolutionary principles.

    Algorithms and methods for large-scale genome rearrangements identification

    This thesis by compendium addresses the formal definition of SBs, starting from High-scoring Segment Pairs (HSPs), which are well known and widely accepted. The first objective focused on detecting SBs as a combination of HSPs, including repeats, which increased the complexity of the model. The result was a more accurate method that improves on the quality of state-of-the-art results. This method applies rules based on the adjacency of SBs, which also makes it possible to detect LSGRs and classify them as inversions, translocations or duplications, yielding a framework capable of working with LSGRs in single-chromosome organisms. Later, in a second article, this framework was used to refine SB boundaries. In our novel proposal, the repeats flanking the SBs were used to refine the boundaries by exploiting the redundancy those repeats introduce. Through a multiple alignment of these repeats, identity vectors are computed for the SB and for the consensus sequence of the aligned repeats. A finite-state machine designed to detect the transition points in the difference between the two vectors then determines the start and end points of the refined SBs. This method also proved useful for detecting break points (BPs). These points appear as the region between two adjacent SBs. The method does not force a BP to be either a region or a point; rather, this depends on the alignments of the repeats and of the SB in question. The method is applied in a third work, which tackles a metagenome analysis use case. It is well known that the information stored in databases does not necessarily correspond to the uncultured samples contained in a metagenome, and it is conceivable that the assignment of a metagenome sample is hindered by a rearrangement event. 
The article shows that metagenome samples mapping to the exclusive regions of a genome (those it does not share with other genomes) support the presence of that genome in the metagenome. These exclusive regions are easily derived from a multiple genome comparison as the regions that are not part of any SB. A definition within a multiple genome comparison space is more precise than definitions built from pairwise comparisons because, among other things, it allows refinement following a procedure similar to the one described in the second article (using SBs instead of repeats). This definition also resolves the contradiction in the definition of BPs (mentioned in the second publication), whereby the same region of a genome may be detected as a BP or form part of an SB depending on the genome it is compared with. This multiple-comparison definition of SB also provides precise information for reconstructing LSGRs, with a view to approximating the true common ancestor of the species. In addition, it offers a solution to the granularity problem in SB detection: we start with small, well-conserved SBs, and the size of these blocks is gradually increased through the reconstruction of LSGRs. 
The results expected from this line of work point towards the definition of a metric for obtaining more precise inter-genome distances, combining sequence similarity and LSGR frequencies. This thesis is a compendium of three articles recently published in high-impact journals, in which we show the process that led us to propose the definition of Elementary Conservation Units (conserved regions between genomes that are detected after a multiple comparison), together with some basic operations such as inversions, transpositions and duplications. The three articles are transversally connected by the detection of Synteny Blocks (SBs) and large-scale genome rearrangements (LSGRs) (see section 2), and they support the need to develop the framework described in the "Systems And Methods" section. Indeed, the intellectual work carried out in this thesis and the conclusions drawn from the publications have been essential to understanding that an appropriate definition of SB is the key to many comparative genomics methods. DNA rearrangement events are one of the main drivers of evolution, and their effects can be observed in new species, new biological functions, and so on. Small-scale rearrangements such as insertions, deletions or substitutions have been extensively studied, and accepted models exist for detecting them. However, methods for identifying large-scale rearrangements still suffer from limitations and a lack of precision, mainly because there is still no accepted definition of SB. The concept of SB refers to conserved regions between two genomes that keep the same order and strand. Although methods exist to detect them, they avoid dealing with repeats or restrict the search to coding regions only, for the sake of a simpler model. 
Refining the boundaries of these blocks remains an unsolved problem to this day

    Improving Comparative Genomic Studies:Definitions and Algorithms for Syntenic Blocks

    Comparative genomics aims to understand the structure of genomes and the function of various genomic fragments by transferring knowledge gained from well-studied genomes to the new object of study. Rapid and inexpensive high-throughput sequencing is making available more and more complete genome sequences. Despite this significant scientific advance, we still lack good models for the evolution of the genomic architecture, so analyzing these genomes presents formidable challenges. Early approaches used pairwise comparisons, but today researchers are attempting to leverage the larger potential of multiway comparisons. Current approaches are based on the identification of so-called syntenic blocks: blocks of sequence that exhibit conserved features across the genomes under study. Syntenic blocks are in many ways analogous to genes: in many cases, the markers used to construct them are genes. Like genes, they can exist in multiple copies, in which case we could define analogs of orthology and paralogy. However, whereas genes are studied at the sequence level, syntenic blocks are too large for that level of detail - it is their structure and function as a unit that makes them valuable for genome-level comparative studies. Syntenic blocks are required for complex computations to scale to the billions of nucleotides present in many genomes; they enable comparisons across broad ranges of genomes because they filter out much of the individual variability; they highlight candidate regions for in-depth studies; and they facilitate whole-genome comparisons through visualization tools. The identification of such blocks is the first step in comparative studies, yet its effect on final results has not been well studied, nor has any formalization of syntenic blocks been proposed. Tools for the identification of syntenic blocks yield quite different results, thereby preventing a systematic assessment of the next steps in an analysis. 
Current tools do not include measurable quality objectives and thus cannot be benchmarked against themselves. Comparisons among tools have also been neglected - what few results are given use superficial measures unrelated to quality or consistency. In this thesis we address two major challenges, and present: (i) a theoretical model as well as an experimental basis for comparing syntenic blocks and thus also for improving the design of tools for the identification of syntenic blocks; (ii) a prototype model that serves as a basis for implementing effective synteny mining tools. We offer an overview of the milestones in the literature on the development of concepts and tools related to synteny; we illustrate the application of the model and the measures by applying them to syntenic blocks produced by different contemporary tools on publicly available data sets. We have taken the first step towards a formal approach to the construction of syntenic blocks by developing a simple quality criterion based on sound evolutionary principles. Our experiments demonstrate widely divergent results among these tools, throwing into question the robustness of the basic approach in comparative genomics. Our findings highlight the need for a well-founded, systematic approach to the decomposition of genomes into syntenic blocks and motivate the second part of the work - starting from the proposed model, we extend the concept with data-dependent features and constraints, in order to test the concept on cases of interest

    Progressive Cactus is a multiple-genome aligner for the thousand-genome era

    New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies(1-3). For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database(4) increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies(5) are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus(6), a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes, and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far

    Ice-Age Climate Adaptations Trap the Alpine Marmot in a State of Low Genetic Diversity.

    Some species responded successfully to prehistoric changes in climate [1, 2], while others failed to adapt and became extinct [3]. The factors that determine successful climate adaptation remain poorly understood. We constructed a reference genome and studied physiological adaptations in the Alpine marmot (Marmota marmota), a large ground-dwelling squirrel exquisitely adapted to the "ice-age" climate of the Pleistocene steppe [4, 5]. Since the disappearance of this habitat, the rodent persists in large numbers in the high-altitude Alpine meadow [6, 7]. Genome and metabolome showed evidence of adaptation consistent with cold climate, affecting white adipose tissue. Conversely, however, we found that the Alpine marmot has levels of genetic variation that are among the lowest for mammals, such that deleterious mutations are less effectively purged. Our data rule out typical explanations for low diversity, such as high levels of consanguineous mating, or a very recent bottleneck. Instead, ancient demographic reconstruction revealed that genetic diversity was lost during the climate shifts of the Pleistocene and has not recovered, despite the current high population size. We attribute this slow recovery to the marmot's adaptive life history. The case of the Alpine marmot reveals a complicated relationship between climatic changes, genetic diversity, and conservation status. It shows that species of extremely low genetic diversity can be very successful and persist over thousands of years, but also that climate-adapted life history can trap a species in a persistent state of low genetic diversity. This work was supported by the Francis Crick Institute which receives its core funding from Cancer Research UK (FC001134), the UK Medical Research Council (FC001134), and the Wellcome Trust (FC001134). 
CB and AC are supported by the Agence Nationale de la Recherche (project ANR-13-JSV7-0005) and the Centre National de la Recherche Scientifique (CNRS), CB is supported by the Rhône-Alpes region (Grant 15.005146.01). LD is supported by Agence Nationale de la Recherche (project ANR-12-ADAP-0009). TIG is supported by a Leverhulme Early Career Fellowship (Grant ECF-2015-453) and a NERC grant (NE/N013832/1). JMG is supported by a Hertha Finberg Fellowship (FWF T703). LDR is supported by the Diabetes UK RD Lawrence Fellowship (16/0005382)

    Two Rapidly Evolving Genes Contribute to Male Fitness in Drosophila

    Purifying selection often results in conservation of gene sequence and function. The most functionally conserved genes are also thought to be among the most biologically essential. These observations have led to the use of sequence conservation as a proxy for functional conservation. Here we describe two genes that are exceptions to this pattern. We show that lack of sequence conservation among orthologs of CG15460 and CG15323 – herein named jean-baptiste (jb) and karr respectively – does not necessarily predict lack of functional conservation. These two Drosophila melanogaster genes are among the most rapidly evolving protein-coding genes in this species, being nearly as diverged from their D. yakuba orthologs as random sequences are. jb and karr are both expressed at an elevated level in larval males and adult testes, but they are not accessory gland proteins and their loss does not affect male fertility. Instead, knockdown of these genes in D. melanogaster via RNA interference caused male-biased viability defects. These viability effects occur prior to the third instar for jb and during late pupation for karr. We show that putative orthologs to jb and karr are also expressed strongly in the testes of other Drosophila species and have similar gene structure across species despite low levels of sequence conservation. While standard molecular evolution tests could not reject neutrality, other data hint at a role for natural selection. Together these data provide a clear case where a lack of sequence conservation does not imply a lack of conservation of expression or function

    Long-read sequencing improves assembly of Trichinella genomes 10-fold, revealing substantial synteny between lineages diverged over 7 million years

    Genome assemblies can form the basis of comparative analyses fostering insight into the evolutionary genetics of a parasite’s pathogenicity, host–pathogen interactions, environmental constraints and invasion biology; however, the length and complexity of many parasite genomes has hampered the development of well-resolved assemblies. In order to improve Trichinella genome assemblies, the genome of the sylvatic encapsulated species Trichinella murrelli was sequenced using third-generation, long-read technology and, using syntenic comparisons, scaffolded to a reference genome assembly of Trichinella spiralis, markedly improving both. A high-quality draft assembly for T. murrelli was achieved that totalled 63·2 Mbp, half of which was condensed into 26 contigs each longer than 571 000 bp. When compared with previous assemblies for parasites in the genus, ours required 10-fold fewer contigs, which were five times longer, on average. Better assembly across repetitive regions also enabled resolution of 8 Mbp of previously indeterminate sequence. Furthermore, syntenic comparisons identified widespread scaffold misassemblies in the T. spiralis reference genome. The two new assemblies, organized for the first time into three chromosomal scaffolds, will be valuable resources for future studies linking phenotypic traits within each species to their underlying genetic bases