109 research outputs found

    Lep-Anchor : automated construction of linkage map anchored haploid genomes

    Motivation: Linkage mapping provides a practical way to anchor de novo genome assemblies into chromosomes and to detect chimeric or otherwise erroneous contigs. Such anchoring improves with a higher number of markers and individuals, as long as the mapping software can handle all the information. The recent software Lep-MAP3 can robustly construct linkage maps for millions of genotyped markers and thousands of individuals, providing optimal maps for genome anchoring. For such large datasets, an automated and robust genome anchoring tool is especially valuable and can significantly reduce the intensive computational and manual work involved. Results: Here, we present the software Lep-Anchor (LA), which anchors genome assemblies automatically using dense linkage maps. As its main novelty, it takes into account the uncertainty of linkage map positions caused by low-recombination regions, cross type or poor mapping data quality. Furthermore, it can automatically detect and cut chimeric contigs, and use contig-contig, single-read or alternative genome assembly alignments as additional information on contig order and orientation and to collapse haplotype contigs. We demonstrate the performance of LA using real data and show that it outperforms ALLMAPS in anchoring completeness and speed. In accuracy, LA and ALLMAPS are about equal, but ALLMAPS achieves this at the expense of lower completeness. The software Chromonomer was faster than the other two methods but has major limitations and is lower in accuracy. We also show that with additional information, such as contig-contig and read alignments, the anchoring completeness can be improved by up to 70% without significant loss in accuracy. Based on simulated data, we conclude that the anchoring accuracy can be improved by utilizing information about map position uncertainty. Accuracy is the rate of contigs in correct orientation, and completeness is the number of contigs with inferred orientation. Peer reviewed.
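The two evaluation measures defined at the end of the abstract above can be illustrated with a minimal Python sketch. The `Contig` record and its fields are hypothetical and are not part of Lep-Anchor. The sketch also assumes that accuracy is computed only over contigs that received an orientation, since the abstract does not state the denominator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Contig:
    """Hypothetical record for one anchored contig (not a Lep-Anchor type)."""
    name: str
    true_orientation: str                # '+' or '-'
    inferred_orientation: Optional[str]  # None if the tool left it unoriented


def completeness(contigs: List[Contig]) -> int:
    """Number of contigs with an inferred orientation."""
    return sum(1 for c in contigs if c.inferred_orientation is not None)


def accuracy(contigs: List[Contig]) -> float:
    """Rate of oriented contigs whose orientation is correct.

    Assumption: the denominator is the set of oriented contigs.
    """
    oriented = [c for c in contigs if c.inferred_orientation is not None]
    if not oriented:
        return 0.0
    correct = sum(
        1 for c in oriented if c.inferred_orientation == c.true_orientation
    )
    return correct / len(oriented)


example = [
    Contig("ctg1", "+", "+"),
    Contig("ctg2", "-", "+"),
    Contig("ctg3", "+", None),
    Contig("ctg4", "-", "-"),
]
print(completeness(example), accuracy(example))
```

Under these assumptions a tool can trade one measure against the other: orienting fewer contigs lowers completeness but may raise accuracy, which is the trade-off the abstract describes for ALLMAPS.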

    Lep-MAP3 : robust linkage mapping even for low-coverage whole genome sequencing data

    Motivation: Accurate and dense linkage maps are useful in family-based linkage and association studies, quantitative trait locus mapping, analysis of genome synteny and other genomic data analyses. Moreover, linkage mapping is one of the best ways to detect errors in de novo genome assemblies, as well as to orient and place assembly contigs within chromosomes. A small mapping cross of tens of individuals will detect many errors where distant parts of the genome are erroneously joined together. With more individuals and markers, even more local errors can be detected and more contigs can be oriented. However, the tools that are currently available for constructing linkage maps are not well suited for large, possibly low-coverage, whole-genome sequencing datasets. Results: Here we present the linkage mapping software Lep-MAP3, capable of mapping high-throughput whole-genome sequencing datasets. Such data allow cost-efficient genotyping of millions of single nucleotide polymorphisms (SNPs) for thousands of individual samples, enabling, among other analyses, comprehensive validation and refinement of de novo genome assemblies. The algorithms of Lep-MAP3 can analyse low-coverage datasets and reduce data filtering and curation on any data. This yields more markers in the final maps with less manual work, even on problematic datasets. We demonstrate that Lep-MAP3 obtains very good performance already at 5x sequencing coverage and outperforms the fastest available software on simulated data in accuracy and often in speed. We also construct de novo linkage maps from 7-12x whole-genome data on the Red postman butterfly (Heliconius erato) with almost 3 million markers. Peer reviewed.

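    Computational methods for haplotype inference and for assessing the significance of local alignments (Laskennallisia menetelmiä haplotyypien ennustamiseen ja paikallisten rinnastusten merkittävyyden arviointiin)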

    This thesis, which consists of an introduction and four peer-reviewed original publications, studies the problems of haplotype inference (haplotyping) and local alignment significance. The problems studied here belong to the broad area of bioinformatics and computational biology. The presented solutions are computationally fast and accurate, which makes them practical in high-throughput sequence data analysis. Haplotype inference is a computational problem where the goal is to estimate haplotypes from a sample of genotypes as accurately as possible. This problem is important because the direct measurement of haplotypes is difficult, whereas genotypes are easier to quantify. Haplotypes are key players when studying, for example, the genetic causes of diseases. In this thesis, three methods, referred to as HaploParser, HIT, and BACH, are presented for the haplotype inference problem. HaploParser is based on a combinatorial mosaic model and hierarchical parsing that together mimic recombinations and point mutations in a biologically plausible way. In this mosaic model, the current population is assumed to have evolved from a small founder population. Thus, the haplotypes of the current population are recombinations of the (implicit) founder haplotypes with some point mutations. HIT (Haplotype Inference Technique) uses a hidden Markov model for haplotypes, and efficient algorithms are presented to learn this model from genotype data. The model structure of HIT is analogous to the mosaic model of HaploParser with founder haplotypes. Therefore, it can be seen as a probabilistic model of recombinations and point mutations. BACH (Bayesian Context-based Haplotyping) utilizes a context tree weighting algorithm to efficiently sum over all variable-length Markov chains to evaluate the posterior probability of a haplotype configuration. Algorithms are presented that find haplotype configurations with high posterior probability. 
BACH is the most accurate method presented in this thesis and has comparable performance to the best available software for haplotype inference. Local alignment significance is a computational problem where one is interested in whether the local similarities in two sequences arise because the sequences are related or just by chance. Similarity of sequences is measured by their best local alignment score, and from that a p-value is computed. This p-value is the probability of picking two sequences from the null model whose best local alignment score is at least as good. Local alignment significance is used routinely, for example, in homology searches. In this thesis, a general framework is sketched that allows one to compute a tight upper bound for the p-value of a local pairwise alignment score. Unlike previous methods, the presented framework is not affected by so-called edge effects and can handle gaps (deletions and insertions) without troublesome sampling and curve fitting.
This thesis presents new, accurate and efficient computational methods for predicting the haplotypes of a population from genotypes and for assessing the significance of local sequence alignments. The methods are based on, among other things, dynamic programming, in which the smallest subproblems are solved first and their solutions are combined into solutions of larger subproblems. The genome of an organism is usually encoded in DNA inside the cell, put simply as a string (sequence) of the bases A, C, G and T. The genome is organized into chromosomes, which contain variations at particular positions, called markers. In a diploid organism such as a human, the chromosomes (autosomes) occur in pairs. An individual inherits one chromosome of each pair from the father and the other from the mother. A haplotype is the sequence of markers occurring at particular positions on one chromosome of an individual's chromosome pair. Measuring haplotypes directly is difficult, but genotypes are easier to measure. Genotypes tell which two markers occur at corresponding positions of the chromosome pair. Haplotype data are widely used, for example, in studying genetic diseases. For this reason, computational prediction of haplotypes from genotypes is an important research problem. The input of the problem is thus a sample of genotypes from a given population, from which the haplotypes should be predicted for each individual in the sample. Predicting haplotypes from genotypes is possible because haplotypes are similar between individuals. The similarity arises from evolutionary processes such as inheritance, natural selection, migration and isolation. This thesis presents three methods for haplotype determination. The most accurate of them, called BACH, uses a variable-order Markov model and Bayesian statistics to predict haplotypes from genotype data. The model can accurately capture genetic linkage, that is, the dependence between markers located physically close to each other. This linkage shows up as local similarity of haplotype sequences. Local alignment is used, for example, when searching for similar regions, such as corresponding genes, in the genome sequences of different organisms. Local alignment search algorithms find only the most similar region but do not tell whether the finding is statistically significant. A common way to determine the statistical significance of an alignment is to compute a p-value for the alignment's goodness (score). The method of the thesis for computing the significance of local alignments computes an expected value for the local alignment of the sequences, which gives a tight upper bound for the commonly used p-value. Although the model is simple, in empirical tests a simple derivative of the expected value given by the method turns out to be a rather accurate estimate of the p-value. An advantage of the approach is that alignment gaps (deletions and insertions) can be modelled in a straightforward way.
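The abstract above assesses the significance of the best local alignment score of two sequences. As a rough illustration of that score only, the following Python sketch computes it with the standard Smith-Waterman dynamic programming recurrence. It does not implement the p-value bound of the thesis. The function name and the scoring parameters (match, mismatch and a linear gap penalty) are assumptions chosen for the example.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2,
                          mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best local alignment score of a and b (Smith-Waterman, linear gaps).

    The scoring parameters are illustrative assumptions, not values
    taken from the thesis.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may start anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("TTACGTAA", "GGACGTCC"))
```

Only two rows of the table are kept in memory, because the score depends only on the previous row. A significance method such as the one described above would then ask how likely a score this high is for two sequences drawn from the null model.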

    Kermit: Guided Long Read Assembly using Coloured Overlap Graphs

    With long reads getting even longer and cheaper, large-scale sequencing projects can be accomplished without short reads at an affordable cost. Due to high error rates and less mature tools, de novo assembly of long reads is still challenging and often results in a large collection of contigs. Dense linkage maps are collections of markers whose location on the genome is approximately known. Therefore, they provide long-range information that has the potential to greatly aid de novo assembly. Previously, linkage maps have been used to detect misassemblies and to manually order contigs. However, no fully automated tools exist to incorporate linkage maps in assembly; instead, large amounts of manual labour are needed to order the contigs into chromosomes. We formulate the genome assembly problem in the presence of linkage maps and present the first method for guided genome assembly using linkage maps. Our method is based on an additional cleaning step added to the assembly. We show that it can simplify the underlying assembly graph, resulting in more contiguous assemblies and reducing the number of misassemblies when compared to de novo assembly.

    Kermit: Linkage map guided long read assembly

    Background: With long reads getting even longer and cheaper, large-scale sequencing projects can be accomplished without short reads at an affordable cost. Due to high error rates and less mature tools, de novo assembly of long reads is still challenging and often results in a large collection of contigs. Dense linkage maps are collections of markers whose location on the genome is approximately known. Therefore, they provide long-range information that has the potential to greatly aid de novo assembly. Previously, linkage maps have been used to detect misassemblies and to manually order contigs. However, no fully automated tools exist to incorporate linkage maps in assembly; instead, large amounts of manual labour are needed to order the contigs into chromosomes. Results: We formulate the genome assembly problem in the presence of linkage maps and present the first method for guided genome assembly using linkage maps. Our method is based on an additional cleaning step added to the assembly. We show that it can simplify the underlying assembly graph, resulting in more contiguous assemblies and reducing the number of misassemblies when compared to de novo assembly. Conclusions: We present the first method to integrate linkage maps directly into genome assembly. With a modest increase in runtime, our method improves the contiguity and correctness of genome assembly. Peer reviewed.

    Predicting recombination frequency from map distance

    Map distance is one of the key measures in genetics and indicates the expected number of crossovers between two loci. Map distance is estimated from the observed recombination frequency using mapping functions; the most widely used of these, Haldane's and Kosambi's, were developed at a time when the number of markers was low and unobserved crossovers had a substantial effect on the recombination fractions. In contemporary high-density marker data, the probability of multiple crossovers between adjacent loci is negligible and different mapping functions yield the same result, that is, the recombination frequency between adjacent loci is equal to the map distance in Morgans. However, high-density linkage maps pose an interpretation problem: the map distance over a long interval is additive, and its association with recombination frequency is not defined. Here, we demonstrate with high-density linkage maps from humans and stickleback fishes that the inverses of Haldane's and Kosambi's mapping functions systematically underpredict recombination frequencies from map distance. To remedy this, we formulate a piecewise function that yields more accurate predictions of recombination frequency from map distance. Our results demonstrate that the association between map distance and recombination frequency is context-dependent and without a universal solution. Peer reviewed.
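For reference, the textbook forms of the two mapping functions named in the abstract above, and their inverses, can be sketched in Python as follows. Map distance `d` is in Morgans and `r` is the recombination frequency. The function names are chosen for this example, and the piecewise function proposed in the paper is not reproduced here because the abstract does not specify it.

```python
import math


def haldane_r(d: float) -> float:
    """Inverse Haldane: recombination frequency from map distance d (Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def kosambi_r(d: float) -> float:
    """Inverse Kosambi: recombination frequency from map distance d (Morgans)."""
    return 0.5 * math.tanh(2.0 * d)


def haldane_d(r: float) -> float:
    """Haldane mapping function: map distance (Morgans) from r, 0 <= r < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * r)


def kosambi_d(r: float) -> float:
    """Kosambi mapping function: map distance (Morgans) from r, 0 <= r < 0.5."""
    return 0.25 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


# Compare the predicted recombination frequencies over a range of distances.
for d in (0.01, 0.1, 0.5, 1.0):
    print(f"d={d:.2f} M  Haldane r={haldane_r(d):.4f}  Kosambi r={kosambi_r(d):.4f}")
```

For very short intervals both functions give a recombination frequency close to the map distance itself, which matches the abstract's remark about adjacent loci in high-density data. The two predictions diverge as the interval grows.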

    Kermit : Guided Long Read Assembly using Coloured Overlap Graphs

    Peer reviewed.

    Construction of Ultradense Linkage Maps with Lep-MAP2 : Stickleback F2 Recombinant Crosses as an Example

    High-density linkage maps are important tools for genome biology and evolutionary genetics because they quantify the extent of recombination, linkage disequilibrium, and chromosomal rearrangements across chromosomes, sexes, and populations. They provide one of the best ways to validate and refine de novo genome assemblies, with the power to identify errors in assemblies increasing with marker density. However, assembly of high-density linkage maps is still challenging due to software limitations. We describe Lep-MAP2, software for ultradense genome-wide linkage map construction. Lep-MAP2 can handle various family structures and can account for achiasmatic meiosis to gain linkage map accuracy. Simulations show that Lep-MAP2 outperforms other available mapping software in both computational efficiency and accuracy. When applied to two large F2-generation recombinant crosses between two nine-spined stickleback (Pungitius pungitius) populations, it produced two high-density (~6 markers/cM) linkage maps containing 18,691 and 20,054 single nucleotide polymorphisms. The two maps showed a high degree of synteny, but female maps were 1.5-2 times longer than male maps in all linkage groups, suggesting genome-wide recombination suppression in males. Comparison with the genome sequence of the three-spined stickleback (Gasterosteus aculeatus) revealed a high degree of interspecific synteny with a low frequency (…) Peer reviewed.

    First High-Density Linkage Map and Single Nucleotide Polymorphisms Significantly Associated With Traits of Economic Importance in Yellowtail Kingfish Seriola lalandi

    The genetic resources available for the commercially important fish species Yellowtail kingfish (YTK) (Seriola lalandi) are relatively sparse. To overcome this, we aimed (1) to develop a linkage map for this species, and (2) to identify markers/variants associated with economically important traits in kingfish (with an emphasis on body weight). Genetic and genomic analyses were conducted using 13,898 single nucleotide polymorphisms (SNPs) generated from a new high-throughput genotyping-by-sequencing platform, Diversity Arrays Technology (DArTseq™), in a pedigreed population comprising 752 animals. The linkage analysis mapped about 4,000 markers to 24 linkage groups (LGs), with an average density of 3.4 SNPs per cM. The linkage map was integrated into a genome-wide association study (GWAS), which identified six variants/SNPs associated with body weight (P < 5 × 10⁻⁸) when a multi-locus mixed model was used. Two of the six significant markers were mapped to LGs 17 and 23, and collectively they explained 5.8% of the total genetic variance. It is concluded that the newly developed linkage map and the markers significantly associated with body weight provide fundamental information for characterizing the genetic architecture of growth-related traits in this population of YTK S. lalandi. Peer reviewed.

    A Linkage-Based Genome Assembly for the Mosquito Aedes albopictus and Identification of Chromosomal Regions Affecting Diapause

    The Asian tiger mosquito, Aedes albopictus, is an invasive vector mosquito of substantial public health concern. The large genome size (~1.19-1.28 Gb by cytofluorometric estimates), comprised of ~68% repetitive DNA sequences, has made it difficult to produce a high-quality genome assembly for this species. We constructed a high-density linkage map for Ae. albopictus based on 111,328 informative SNPs obtained by RNAseq. We then performed a linkage-map anchored reassembly of AalbF2, the genome assembly produced by Palatini et al. (2020). Our reassembled genome sequence, AalbF3, represents several improvements relative to AalbF2. First, the size of the AalbF3 assembly is 1.45 Gb, almost half the size of AalbF2. Furthermore, relative to AalbF2, AalbF3 contains a higher proportion of complete and single-copy BUSCO genes (84.3%) and a higher proportion of aligned RNAseq reads that map concordantly to a single location of the genome (46%). We demonstrate the utility of AalbF3 by using it as a reference for a bulk-segregant-based comparative genomics analysis that identifies chromosomal regions with clusters of candidate SNPs putatively associated with photoperiodic diapause, a crucial ecological adaptation underpinning the rapid range expansion and climatic adaptation of Ae. albopictus. Peer reviewed.