4 research outputs found

    NovoGraph: Human genome graph construction from multiple long-read de novo assemblies

    Genome graphs are emerging as an important novel approach to the analysis of high-throughput human sequencing data. By explicitly representing genetic variants and alternative haplotypes in a mappable data structure, they can enable the improved analysis of structurally variable and hyperpolymorphic regions of the genome. In most existing approaches, graphs are constructed from variant call sets derived from short-read sequencing. As long-read sequencing becomes more cost-effective and enables de novo assembly for increasing numbers of whole genomes, a method for the direct construction of a genome graph from sets of assembled human genomes would be desirable. Such assembly-based genome graphs would encompass the wide spectrum of genetic variation accessible to long-read-based de novo assembly, including large structural variants and divergent haplotypes. Here we present NovoGraph, a method for the construction of a human genome graph directly from a set of de novo assemblies. NovoGraph constructs a genome-wide multiple sequence alignment of all input contigs and creates a graph by merging the input sequences at positions that are both homologous and sequence-identical. NovoGraph outputs resulting graphs in VCF format that can be loaded into third-party genome graph toolkits. To demonstrate NovoGraph, we construct a genome graph with 23,478,835 variant sites and 30,582,795 variant alleles from de novo assemblies of seven ethnically diverse human genomes (AK1, CHM1, CHM13, HG003, HG004, HX1, NA19240). Initial evaluations show that mapping against the constructed graph reduces the average mismatch rate of reads from sample NA12878 by approximately 0.2%, albeit at a slightly increased rate of reads that remain unmapped
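    The merging step described in this abstract (collapsing input sequences wherever they are homologous and sequence-identical) can be illustrated with a toy sketch. It is not NovoGraph's implementation: it assumes the multiple sequence alignment has already been computed and is passed in as equal-length strings with `-` for gaps, and the function name `msa_to_segments` is hypothetical. Runs of identical columns become shared segments, and everything in between becomes a variant site with its distinct alleles.

```python
from itertools import groupby


def _is_shared(column):
    """A column is shared when every row carries the same non-gap base."""
    return len(set(column)) == 1 and column[0] != "-"


def msa_to_segments(rows):
    """Split an alignment into alternating shared and variant segments.

    rows: equal-length aligned strings; '-' marks a gap (assumed input format).
    Returns a list of (kind, alleles) tuples. kind is "shared" or "variant";
    alleles is the sorted list of distinct gap-stripped sequences that the
    input rows spell across that segment.
    """
    if not rows or len({len(r) for r in rows}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")

    columns = list(zip(*rows))
    segments = []
    # Consecutive columns with the same shared/variant status form one segment.
    for shared, group in groupby(columns, key=_is_shared):
        cols = list(group)
        alleles = sorted(
            {"".join(col[i] for col in cols).replace("-", "")
             for i in range(len(rows))}
        )
        segments.append(("shared" if shared else "variant", alleles))
    return segments


if __name__ == "__main__":
    for kind, alleles in msa_to_segments(["ACGT-A", "ACCTTA", "ACGT-A"]):
        print(kind, alleles)
```

    In this sketch each shared segment corresponds to a single graph node traversed by every input, and each variant segment corresponds to a bubble with one branch per allele (an empty allele denotes a deletion relative to the other rows). Converting such segments to VCF records would additionally require reference coordinates, which are omitted here.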

    NovoGraph: Human genome graph construction from multiple long-read de novo assemblies [version 2; referees: 2 approved]


    Approximate, simultaneous comparison of microbial genome architectures via syntenic anchoring of quiver representations

    Motivation: A long-standing limitation in comparative genomic studies is the dependency on a reference genome, which limits the spectrum of genetic diversity that can be identified across a population of organisms. This is especially true in the microbial world, where genome architectures can vary significantly. There is therefore a need for computational methods that can simultaneously analyze the architectures of multiple genomes without introducing bias from a reference. Results: In this article, we present Ptolemy: a novel method for studying the diversity of genome architectures - such as structural variation and pan-genomes - across a collection of microbial assemblies without the need for a reference. Ptolemy is a 'top-down' approach to comparing whole genome assemblies. Genomes are represented as labeled multi-directed graphs - known as quivers - which are then merged into a single, canonical quiver by identifying 'gene anchors' via synteny analysis. The canonical quiver represents an approximate, structural alignment of all genomes in a given collection, encoding structural variation across (sub-)populations within the collection. We highlight various applications of Ptolemy by analyzing structural variation and the pan-genomes of different datasets composed of Mycobacterium, Saccharomyces, Escherichia and Shigella species. Our results show that Ptolemy is flexible and can handle both conserved and highly dynamic genome architectures. Ptolemy is user-friendly - it requires only a FASTA-formatted assembly along with a corresponding GFF-formatted file - and resource-friendly - it can align 24 genomes in ∼10 min with four CPUs and <2 GB of RAM. Availability and implementation: GitHub: https://github.com/AbeelLab/ptolemy Supplementary information: Supplementary data are available at Bioinformatics online.
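    The idea of merging per-genome graphs into one canonical quiver can be sketched in a few lines. This is an illustration under strong assumptions, not Ptolemy's algorithm: it assumes gene anchors have already been assigned (the synteny analysis is not shown), represents each genome as an ordered list of anchor labels, and ignores strand and circularity. The names `merge_quivers` and `variable_edges` are hypothetical.

```python
from collections import defaultdict


def merge_quivers(genomes):
    """Merge per-genome anchor paths into one edge-labeled directed graph.

    genomes: dict mapping genome name -> ordered list of anchor labels
             (assumed to come from a prior anchoring step).
    Returns a dict mapping each directed edge (a, b) to the set of genome
    names whose path goes directly from anchor a to anchor b.
    """
    edges = defaultdict(set)
    for name, anchors in genomes.items():
        for a, b in zip(anchors, anchors[1:]):
            edges[(a, b)].add(name)
    return dict(edges)


def variable_edges(edges, n_genomes):
    """Return the edges that are not traversed by every genome."""
    return {e: names for e, names in edges.items() if len(names) < n_genomes}


if __name__ == "__main__":
    genomes = {
        "g1": ["A", "B", "C", "D"],
        "g2": ["A", "C", "D"],  # anchor B is absent in this genome
    }
    merged = merge_quivers(genomes)
    for edge, names in sorted(variable_edges(merged, len(genomes)).items()):
        print(edge, sorted(names))
```

    Edges carried by every genome correspond to conserved gene order, while edges carried by only a subset mark candidate structural differences, such as the absence of anchor B in `g2` above.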

    Taxogenómica en Rhodobacteraceae [Taxogenomics in Rhodobacteraceae]

    The family Rhodobacteraceae is one of the main subdivisions of the class Alphaproteobacteria and at times accounts for more than 25% of the total bacterial community in ocean surface waters. Most species in this family are of marine origin and show high phenotypic diversity. The main objective of this doctoral thesis was to broaden our knowledge of the members of this family by studying their genomes, and to clarify the taxonomic relationships among them on the basis of a phylogenomic analysis of those genomes together with the appropriate genomic similarity indices. To this end, 36 type strains belonging to Rhodobacteraceae whose genomes were not available in public databases of genomes, assemblies or genomic reads were sequenced de novo, assembled, annotated and deposited. The general objective was broken down into the following specific objectives: I) to perform a phenotypic inference through genomic analysis of each sequenced genome; II) to confirm the inferred traits, as far as possible, by experimentation with the corresponding strains; and III) to carry out a phylogenomic exploration and taxonomic revision of the family Rhodobacteraceae. This study made it possible to deposit the genome sequences of 36 type strains of Rhodobacteraceae whose genomes had not previously been sequenced, which has helped to reduce the sequencing bias present in public databases.
    In addition, exploration of the sequenced and annotated genomes enabled a broad phenotypic inference covering genome organization and content; carbohydrate metabolism; strategies for alternative energy production; the metabolism of nitrogen, phosphorus, iron and C1 compounds; the degradation of aromatic compounds; adhesion capacity and biofilm production; flagellar motility and chemotaxis; the production of antimicrobial substances; and resistance to toxic compounds or adverse physicochemical conditions. Finally, this work represents the most extensive taxonomic study carried out to date in the family Rhodobacteraceae, applying a taxogenomic approach that resolved part of the existing conflict between the phylogeny and the systematics of this lineage. The phylogenomic analysis carried out in this study, together with the calculated genomic similarity indices, allowed the formal description of one synonymy at the species level and 30 reclassifications at the genus level, as well as a refinement of the cutoff range proposed for the circumscription of genera within the family Rhodobacteraceae.