435 research outputs found

    Computational pan-genomics: status, promises and challenges

    Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

    Comparing De Novo Genome Assembly: The Long and Short of It

    Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) and a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-off between contig quality and contig size. For this purpose, most of the publicly available major sequence assemblers – both for low-coverage long (Sanger) and high-coverage short (Illumina) read technologies – are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read-lengths, coverages, accuracies, and with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing “next-generation” assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.
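The abstract above names N50 as one of the standard, size-oriented assembly metrics. As a minimal sketch of why such a metric says nothing about accuracy, the following computes N50 under its common definition (the length of the contig at which the running sum of contig lengths, sorted longest first, reaches half the total assembly length). The function name `n50` and the sample lengths are illustrative assumptions, not taken from the paper or from the AMOS tools.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Common definition (assumed here): sort contigs from longest to
    shortest and return the length of the contig at which the running
    total first reaches at least half of the total assembly length.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")

    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(n50([80, 70, 50, 40, 30, 20]))
```

The input is only a list of lengths, so a misassembled contig contributes to N50 exactly as a correct one of the same size would, which is the limitation the abstract points to.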

    Genome assembly and quality control for non-model organisms

    This thesis presents my work in genome assembly between 2010 and 2019. Chapter 1 is an introduction to the status of the field, presenting the challenges and opportunities in generating de novo genome assemblies. Chapter 2 presents the development of k-mer spectra validation for assembly completeness, from its beginnings as unique sequence coverage analyses, through its implementation in the Kmer Analysis Toolkit, up to its use to assess consensus accuracy on hybrid assemblies. Chapter 3 describes a series of objective-guided de novo assembly strategies applied to non-model genomes, starting with the assembly of the medicinal plant C. roseus to investigate its biosynthesis pathways, continuing with the chromosome-scale assembly of the ash dieback fungus during the UK outbreak, and concluding with my work assembling the hexaploid wheat genome from whole genome shotgun short read data. Chapter 4 describes the creation of haplotype-collapsed assemblies for 16 specimens of Heliconius butterflies to enable evolutionary analyses, and presents the Sequence Distance Graph framework to work with genome graphs and multi-technology data integration as a step towards haplotype-specific assemblies. Finally, Chapter 5 discusses this research and its impact in the context of the present and future of the field.
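The thesis abstract above refers to k-mer spectra. As a rough sketch of the underlying idea only, the code below counts every k-mer in a set of sequences and then tallies how many distinct k-mers occur once, twice, and so on. The function names are hypothetical, and this is not the Kmer Analysis Toolkit's implementation: it ignores reverse complements and quality filtering, which real tools typically handle.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count occurrences of every k-mer across the given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def kmer_spectrum(sequences, k):
    """Map each multiplicity to the number of distinct k-mers seen that often."""
    return Counter(kmer_counts(sequences, k).values())


if __name__ == "__main__":
    # Toy input, for illustration only.
    spectrum = kmer_spectrum(["ACGTACGT"], k=4)
    for multiplicity in sorted(spectrum):
        print(multiplicity, spectrum[multiplicity])
```

Comparing such a spectrum computed from reads with the k-mers present in an assembly is one way the completeness question described in the abstract can be framed, though the thesis itself should be consulted for the actual method.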