21 research outputs found

    HGGA : hierarchical guided genome assembler

    Background: De novo genome assembly typically produces a set of contigs instead of the complete genome. Thus additional data such as genetic linkage maps, optical maps, or Hi-C data are needed to resolve the complete structure of the genome. Most previous work uses the additional data to order and orient contigs. Results: Here we introduce a framework to guide genome assembly with additional data. Our approach is based on clustering the reads such that the reads in each cluster originate from nearby positions in the genome according to the additional data. These clusters are then assembled independently and the resulting contigs are further assembled in a hierarchical manner. We implemented our approach for genetic linkage maps in a tool called HGGA. Conclusions: Our experiments on simulated and real Pacific Biosciences long reads and genetic linkage maps show that HGGA produces a more contiguous assembly with fewer contigs and from 1.2 to 9.8 times higher NGA50 or N50 than a plain assembly of the reads, and 1.03 to 6.5 times higher NGA50 or N50 than a previous approach integrating genetic linkage maps with contig assembly. Furthermore, the correctness of the assembly remains similar or improves compared to an assembly using only the read data. Peer reviewed
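
    The N50 statistic quoted in this abstract can be illustrated with a minimal sketch. The function name `n50` and the sample contig lengths are illustrative assumptions, not taken from HGGA; NGA50 additionally requires aligning contigs to a reference and is not shown.

```python
def n50(contig_lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty input: no contigs, so no N50


# Hypothetical contig lengths before and after joining two contigs.
fragmented = [80, 70, 50, 40, 30, 20]
joined = [150, 50, 40, 30, 20]  # the 80 and 70 contigs merged
print(n50(fragmented), n50(joined))
```

    Joining contigs leaves the total assembly size unchanged but raises N50, which is why the statistic is used as a contiguity measure.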

    Variant Genotyping with Gap Filling

    Although recent developments in DNA sequencing have allowed for great leaps in both the quality and quantity of genome assembly projects, de novo assemblies still lack the efficiency and accuracy required for studying individual genomes. Thus, efficient and accurate methods for calling and genotyping structural variations are still needed. Structural variations are variations between genomes that are longer than a single nucleotide, i.e. they affect the structure of a genome as opposed to affecting only the content. Structural variations come in many different types. By finding the structural variations between a donor genome and a high-quality reference genome, genotyping the variations becomes the only required genome assembly step. The hardest structural variation to genotype is the insertion variant, which requires assembly to genotype; genotyping the other variants requires different transformations of the reference genome. The methods currently used for constructing insertion variants are fairly basic; they are mostly linked to variation calling methods and are only able to construct small insertions. A subproblem in genome assembly, the gap filling problem, provides techniques that are very applicable to insertion genotyping. Yet there are currently no tools that take full advantage of the solution space. Gap filling takes the context and length of a missing sequence in a genome assembly and attempts to assemble the sequence. This thesis shows how gap filling can be used to assemble insertion variants by modeling the problem of insertion genotyping as finding a path in a de Bruijn graph that has approximately the estimated length of the insertion.
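
    The path-finding formulation described in this abstract can be sketched as follows. This is a toy depth-first search under stated assumptions, not the thesis's algorithm: the names `build_de_bruijn` and `fill_gap`, the tolerance parameter `tol`, and the example reads are all hypothetical, and error handling and read coverage are ignored.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def fill_gap(graph, left, right, est_len, tol):
    """Search for a path from node `left` to node `right` whose spelled
    sequence, minus the two flanks, has length within `tol` of `est_len`.
    Returns the inserted sequence, or None if no such path exists."""
    flank = len(left)
    min_len = max(0, est_len - tol)
    max_len = est_len + tol
    stack = [(left, left)]  # (current node, sequence spelled so far)
    while stack:
        node, seq = stack.pop()
        inserted = len(seq) - 2 * flank
        if node == right and min_len <= inserted <= max_len:
            return seq[flank:len(seq) - flank]
        if inserted < max_len:  # extend only while within the length bound
            for nxt in sorted(graph.get(node, ())):
                stack.append((nxt, seq + nxt[-1]))
    return None


# Hypothetical reads covering an insertion between flanks ACG and AGC.
reads = ["ACGTACCG", "ACCGGTTA", "GGTTAGC"]
graph = build_de_bruijn(reads, k=4)
print(fill_gap(graph, "ACG", "AGC", est_len=8, tol=1))
```

    The length bound both prunes the search and rejects paths that connect the flanks but disagree with the estimated insertion size.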

    Kermit: Linkage map guided long read assembly

    Background: With long reads getting even longer and cheaper, large-scale sequencing projects can be accomplished without short reads at an affordable cost. Due to the high error rates and less mature tools, de novo assembly of long reads is still challenging and often results in a large collection of contigs. Dense linkage maps are collections of markers whose location on the genome is approximately known. Therefore they provide long-range information that has the potential to greatly aid de novo assembly. Previously, linkage maps have been used to detect misassemblies and to manually order contigs. However, no fully automated tools exist to incorporate linkage maps into assembly; instead, a large amount of manual labour is needed to order the contigs into chromosomes. Results: We formulate the genome assembly problem in the presence of linkage maps and present the first method for guided genome assembly using linkage maps. Our method is based on an additional cleaning step added to the assembly. We show that it can simplify the underlying assembly graph, resulting in more contiguous assemblies and reducing the number of misassemblies when compared to de novo assembly. Conclusions: We present the first method to integrate linkage maps directly into genome assembly. With a modest increase in runtime, our method improves contiguity and correctness of genome assembly. Peer reviewed

    Space-Efficient Indexing of Spaced Seeds for Accurate Overlap Computation of Raw Optical Mapping Data

    A key problem in processing raw optical mapping data (Rmaps) is finding Rmaps originating from the same genomic region. These sets of related Rmaps can be used to correct errors in Rmap data, and to find overlaps between Rmaps to assemble consensus optical maps. Previous Rmap overlap aligners are computationally very expensive and do not scale to large eukaryotic datasets. We present SELKIE, an Rmap overlap aligner based on a spaced (l,k)-mer index, which was pioneered in the Rmap error correction tool ELMERI. Here we present a space-efficient version of the index which is twice as fast as prior art while using just a quarter of the memory on a human dataset. Moreover, our index can be used for filtering candidates for Rmap overlap computation, whereas ELMERI used the index only for error correction of Rmaps. By combining our filtering of Rmaps with the exhaustive, but highly accurate, algorithm of Valouev et al. (2006), SELKIE maintains or increases the accuracy of finding overlapping Rmaps on a bacterial dataset while being at least four times faster. Furthermore, for finding overlaps in a human dataset, SELKIE is up to two orders of magnitude faster than previous methods. Peer reviewed

    Kermit: Guided Long Read Assembly using Coloured Overlap Graphs

    With long reads getting even longer and cheaper, large-scale sequencing projects can be accomplished without short reads at an affordable cost. Due to the high error rates and less mature tools, de novo assembly of long reads is still challenging and often results in a large collection of contigs. Dense linkage maps are collections of markers whose location on the genome is approximately known. Therefore they provide long-range information that has the potential to greatly aid de novo assembly. Previously, linkage maps have been used to detect misassemblies and to manually order contigs. However, no fully automated tools exist to incorporate linkage maps into assembly; instead, a large amount of manual labour is needed to order the contigs into chromosomes. We formulate the genome assembly problem in the presence of linkage maps and present the first method for guided genome assembly using linkage maps. Our method is based on an additional cleaning step added to the assembly. We show that it can simplify the underlying assembly graph, resulting in more contiguous assemblies and reducing the number of misassemblies when compared to de novo assembly.
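
    The abstract does not spell out the cleaning rule, so the following is only one plausible reading of a colour-based cleaning step, not Kermit's actual implementation. It assumes each read may carry an integer colour (for example, a linkage-map bin index) and drops overlap edges whose endpoints have colours further apart than `max_gap`. The function name, the `max_gap` parameter, and the data are all hypothetical.

```python
def clean_overlap_graph(edges, colours, max_gap=1):
    """Keep an overlap edge unless both reads are coloured and their
    colours differ by more than `max_gap` (assumed rule, see lead-in)."""
    kept = []
    for u, v in edges:
        cu, cv = colours.get(u), colours.get(v)
        # Uncoloured reads carry no map evidence, so their edges are kept.
        if cu is None or cv is None or abs(cu - cv) <= max_gap:
            kept.append((u, v))
    return kept


# Hypothetical overlaps; r4 maps far from r1, r5 has no colour.
edges = [("r1", "r2"), ("r2", "r3"), ("r1", "r4"), ("r3", "r5")]
colours = {"r1": 0, "r2": 0, "r3": 1, "r4": 7}
print(clean_overlap_graph(edges, colours))
```

    Removing edges between distant map positions is how such a step could untangle repeat-induced branches before the assembler walks the graph.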

    Kermit : Guided Long Read Assembly using Coloured Overlap Graphs

    Peer reviewed

    Improving Contiguity and Accuracy in Genome Assembly

    Though genome analysis is used in other places, understanding the effects genes have on humans is arguably its most significant use. A fundamental roadblock to genome analysis is the fact that genomes cannot be sequenced in their entirety. Instead, only short, error-filled sequences, called reads, can be read from genomes. An important step in analyzing genomes is then assembling the reads into the full genome. This thesis looks at both the problems of correcting errors and assembling the genomes. Error correction on the reads can be done by constructing a multiple sequence alignment over the set of reads. Multiple sequence alignment has to be approximated in order to efficiently correct the errors. Guided genome assembly is a variation on genome assembly, where we are additionally given data describing some structural information on the genome. This thesis describes two guided genome assembly methods. One is based on the idea of using linear location information and the other uses a more general framework based on clustering the reads. Finally, this thesis reconsiders the problems of genome analysis from the perspective of optical maps. Specifically, the problem of efficient indexing is evaluated in the context of optical maps, as the data looks fundamentally different. Optical maps represent the genomes as lengths between cuts, rather than nucleotides.

    The effect of genes is among the most important applications of genome analysis. A central problem in genome analysis is that genomes cannot be sequenced in their entirety. Instead, only short, error-filled fragments can be read. An important step in genome analysis is to build a complete genome from these fragments. This dissertation considers both genome assembly and error correction. Error correction is essential because the sequenced reads can contain many errors. Guided genome assembly is a variation of genome assembly that uses additional data describing the structure of the genome. This dissertation describes two methods for guided genome assembly. One is based on the use of linear data and the other is a more general solution that divides the sequences into clusters. The dissertation also examines genome analysis from the perspective of optical maps. In particular, efficient indexing is studied in the context of optical maps, where the data looks markedly different from that of traditional genome analysis. Optical maps represent genomes by the distances between cuts rather than by nucleotides.
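
    The optical-map representation described above, lengths between cuts rather than nucleotides, can be made concrete with a small sketch. The function name `to_fragment_lengths` and the cut positions are illustrative assumptions; real Rmaps also contain sizing errors and missing or spurious cuts, which are not modelled here.

```python
def to_fragment_lengths(cut_positions, molecule_length):
    """Convert cut-site positions on a molecule into the list of
    fragment lengths between consecutive cuts (and the two ends)."""
    bounds = [0] + sorted(cut_positions) + [molecule_length]
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Hypothetical molecule of 12,000 bp with three cut sites.
print(to_fragment_lengths([1200, 4500, 9800], 12000))
```

    Because the data are sequences of numeric lengths rather than strings over a four-letter alphabet, exact-match indexes built for nucleotides do not apply directly, which is the indexing problem the thesis addresses.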