
21 research outputs found

HGGA: hierarchical guided genome assembler

Authors: Salmela Leena; Walve Riku
Publication date: 07/05/2022

Background: De novo genome assembly typically produces a set of contigs instead of the complete genome. Thus additional data such as genetic linkage maps, optical maps, or Hi-C data are needed to resolve the complete structure of the genome. Most previous work uses the additional data to order and orient contigs. Results: Here we introduce a framework to guide genome assembly with additional data. Our approach is based on clustering the reads such that the reads in each cluster originate from nearby positions in the genome according to the additional data. These clusters are then assembled independently, and the resulting contigs are further assembled in a hierarchical manner. We implemented our approach for genetic linkage maps in a tool called HGGA. Conclusions: Our experiments on simulated and real Pacific Biosciences long reads and genetic linkage maps show that HGGA produces a more contiguous assembly with fewer contigs and 1.2 to 9.8 times higher NGA50 or N50 than a plain assembly of the reads, and 1.03 to 6.5 times higher NGA50 or N50 than a previous approach integrating genetic linkage maps with contig assembly. Furthermore, the correctness of the assembly remains similar or improves compared to an assembly using only the read data.

Peer reviewed
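The abstract above reports contiguity as N50 and NGA50. As a reminder of what the simpler of the two measures, here is a minimal sketch of the standard N50 definition: the largest length L such that contigs of length at least L together cover at least half of the total assembled length. The function name `n50` is illustrative and is not taken from HGGA. NGA50 additionally relies on alignment to a reference genome and is not shown.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if it is empty)."""
    # Walk through the contigs from longest to shortest.
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total to at least
        # half of the assembly length defines the N50.
        if running >= half:
            return length
    return 0
```

For example, `n50([2, 2, 2, 3, 3, 4, 8, 8])` is 8, because the two longest contigs already cover 16 of the 32 total bases.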

PubMed Central

Helsingin yliopiston digitaalinen arkisto

Variant Genotyping with Gap Filling

Author: Walve Riku
Publication venue: Helsingin yliopisto
Publication date: 01/01/2017

Although recent developments in DNA sequencing have allowed for great leaps in both the quality and quantity of genome assembly projects, de novo assemblies still lack the efficiency and accuracy required for studying individual genomes. Thus, efficient and accurate methods for calling and genotyping structural variations are still needed. Structural variations are variations between genomes that are longer than a single nucleotide, i.e. they affect the structure of a genome as opposed to affecting only the content. Structural variations come in many different types. By finding the structural variations between a donor genome and a high-quality reference genome, genotyping the variations becomes the only required genome assembly step. The hardest of the structural variations to genotype is the insertion variant, which requires assembly to genotype; genotyping the other variants requires different transformations of the reference genome. The methods currently used for constructing insertion variants are fairly basic; they are mostly linked to variation calling methods and are only able to construct small insertions. A subproblem in genome assembly, the gap filling problem, provides techniques that are very applicable to insertion genotyping. Yet there are currently no tools that take full advantage of the solution space. Gap filling takes the context and length of a missing sequence in a genome assembly and attempts to assemble the sequence. This thesis shows how gap filling can be used to assemble the insertion variants by modeling the problem of insertion genotyping as finding a path in a de Bruijn graph that has approximately the estimated length of the insertion.
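The path formulation in the last sentence of the abstract above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the thesis's algorithm: the names `build_de_bruijn` and `fill_gap`, the exhaustive depth-first search, and the `tolerance` parameter are all hypothetical choices made here for illustration. Nodes are (k-1)-mers, and the search looks for a path from the left flank to the right flank whose spelled insertion is within `tolerance` of the estimated length.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def fill_gap(graph, left, right, est_len, tolerance):
    """Search for a path from node `left` to node `right` whose inserted
    sequence (the part between the two flanks) has a length within
    est_len +/- tolerance. Returns that sequence, or None if none is found."""
    min_len = est_len - tolerance
    max_len = est_len + tolerance
    stack = [(left, left)]  # (current node, sequence spelled so far)
    while stack:
        node, seq = stack.pop()
        inserted = len(seq) - len(left) - len(right)
        if node == right and min_len <= inserted <= max_len:
            return seq[len(left):len(seq) - len(right)]
        if inserted >= max_len:
            continue  # any extension would exceed the allowed length
        for nxt in sorted(graph.get(node, ())):
            stack.append((nxt, seq + nxt[-1]))
    return None
```

The exhaustive search is exponential in the worst case, so it only suits small examples. It does show how the length estimate bounds the search and how it rules out paths that reach the right flank too early or too late.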

Helsingin yliopiston digitaalinen arkisto

Kermit: Linkage map guided long read assembly

Authors: Rastas Pasi; Salmela Leena; Walve Riku
Publication date: 01/03/2019

Background: With long reads getting even longer and cheaper, large scale sequencing projects can be accomplished without short reads at an affordable cost. Due to the high error rates and less mature tools, de novo assembly of long reads is still challenging and often results in a large collection of contigs. Dense linkage maps are collections of markers whose location on the genome is approximately known. Therefore they provide long range information that has the potential to greatly aid in de novo assembly. Previously, linkage maps have been used to detect misassemblies and to manually order contigs. However, no fully automated tools exist to incorporate linkage maps in assembly; instead, a large amount of manual labour is needed to order the contigs into chromosomes. Results: We formulate the genome assembly problem in the presence of linkage maps and present the first method for guided genome assembly using linkage maps. Our method is based on an additional cleaning step added to the assembly. We show that it can simplify the underlying assembly graph, resulting in more contiguous assemblies and reducing the number of misassemblies when compared to de novo assembly. Conclusions: We present the first method to integrate linkage maps directly into genome assembly. With a modest increase in runtime, our method improves contiguity and correctness of genome assembly.

Peer reviewed

Directory of Open Access Journals

Helsingin yliopiston digitaalinen arkisto

Space-Efficient Indexing of Spaced Seeds for Accurate Overlap Computation of Raw Optical Mapping Data

Authors: Puglisi Simon; Salmela Leena; Walve Riku
Publication date: 01/08/2022

A key problem in processing raw optical mapping data (Rmaps) is finding Rmaps originating from the same genomic region. These sets of related Rmaps can be used to correct errors in Rmap data, and to find overlaps between Rmaps to assemble consensus optical maps. Previous Rmap overlap aligners are computationally very expensive and do not scale to large eukaryotic data sets. We present SELKIE, an Rmap overlap aligner based on a spaced (l,k)-mer index which was pioneered in the Rmap error correction tool ELMERI. Here we present a space-efficient version of the index which is twice as fast as prior art while using just a quarter of the memory on a human data set. Moreover, our index can be used for filtering candidates for Rmap overlap computation, whereas ELMERI used the index only for error correction of Rmaps. By combining our filtering of Rmaps with the exhaustive, but highly accurate, algorithm of Valouev et al. (2006), SELKIE maintains or increases the accuracy of finding overlapping Rmaps on a bacterial data set while being at least four times faster. Furthermore, for finding overlaps in a human data set, SELKIE is up to two orders of magnitude faster than previous methods.

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Variant genotyping with gap filling

Authors: Mäkinen Veli; Salmela Leena; Walve Riku
Publication date: 01/01/2017

Peer reviewed

Crossref

Directory of Open Access Journals

Helsingin yliopiston digitaalinen arkisto

Kermit: Guided Long Read Assembly using Coloured Overlap Graphs

Authors: Rastas Pasi; Salmela Leena; Walve Riku
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)
Publication date: 01/01/2018

With long reads getting even longer and cheaper, large scale sequencing projects can be accomplished without short reads at an affordable cost. Due to the high error rates and less mature tools, de novo assembly of long reads is still challenging and often results in a large collection of contigs. Dense linkage maps are collections of markers whose location on the genome is approximately known. Therefore they provide long range information that has the potential to greatly aid in de novo assembly. Previously, linkage maps have been used to detect misassemblies and to manually order contigs. However, no fully automated tools exist to incorporate linkage maps in assembly; instead, a large amount of manual labour is needed to order the contigs into chromosomes. We formulate the genome assembly problem in the presence of linkage maps and present the first method for guided genome assembly using linkage maps. Our method is based on an additional cleaning step added to the assembly. We show that it can simplify the underlying assembly graph, resulting in more contiguous assemblies and reducing the number of misassemblies when compared to de novo assembly.

Dagstuhl Research Online Publication Server

Accurate self-correction of errors in long reads using de Bruijn graphs

Authors: Rivals Eric; Salmela Leena; Ukkonen Esko; Walve Riku
Publication date: 01/01/2016

Peer reviewed

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

PubMed Central

Helsingin yliopiston digitaalinen arkisto

Kermit: Guided Long Read Assembly using Coloured Overlap Graphs

Authors: Rastas Pasi Miikka Antero; Salmela Leena Maija; Walve Riku Mikael
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2018

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Improving Contiguity and Accuracy in Genome Assembly

Author: Walve Riku
Publication venue: University of Helsinki Libraries
Publication date: 31/03/2023

Though genome analysis is used in other places, understanding the effects genes have on humans is arguably its most significant use. A fundamental roadblock to genome analysis is the fact that genomes cannot be sequenced in their entirety. Instead, only short, error-filled sequences, called reads, can be read from genomes. An important step in analyzing genomes is then assembling the reads into the full genome. This thesis looks at both the problem of correcting errors and the problem of assembling genomes. Error correction on the reads can be done by constructing a multiple sequence alignment over the set of reads. Multiple sequence alignment has to be approximated in order to efficiently correct the errors. Guided genome assembly is a variation on genome assembly, where we are additionally given data describing some structural information on the genome. This thesis describes two guided genome assembly methods. One is based on the idea of using linear location information and the other uses a more general framework based on clustering the reads. Finally, this thesis reconsiders the problems of genome analysis from the perspective of optical maps. Specifically, the problem of efficient indexing is evaluated in the context of optical maps, as the data looks fundamentally different. Optical maps represent the genomes as lengths between cuts, rather than nucleotides.

Finnish abstract, translated: The effect of genes is among the most important applications of genome analysis. A central problem in genome analysis is that genomes cannot be sequenced in their entirety. Instead, only short fragments full of errors can be read. An important step in genome analysis is to build a complete genome from these fragments. This thesis considers both genome assembly and error correction. Error correction is essential, because the read sequences may contain many errors. Guided genome assembly is a variation of genome assembly that uses additional data describing the structure of the genome. This thesis describes two methods for guided genome assembly. One is based on the use of linear data and the other is a more general solution that divides the sequences into clusters. The thesis also examines genome analysis from the perspective of optical maps. In particular, efficient indexing is studied in the context of optical maps, where the data looks markedly different from that of traditional genome analysis. Optical maps represent genomes by the distances between cuts instead of nucleotides.

Helsingin yliopiston digitaalinen arkisto

Accurate self-correction of errors in long reads using de Bruijn graphs

Authors: Eric Rivals; Esko Ukkonen; Leena Salmela; Riku Walve
Publication venue: Oxford University Press (OUP)

Crossref