2,112 research outputs found

    Strain-aware assembly of genomes from mixed samples using flow variation graphs

    The goal of strain-aware genome assembly is to reconstruct all individual haplotypes from a mixed sample at the strain level and to provide abundance estimates for the strains.

    STRONG: metagenomics strain resolution on assembly graphs

    We introduce STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo from multiple metagenome samples. STRONG performs coassembly and binning into metagenome-assembled genomes (MAGs), and stores the coassembly graph prior to variant simplification. This enables the subgraphs, and their per-sample unitig coverages, to be extracted for individual single-copy core genes (SCGs) in each MAG. A Bayesian algorithm, BayesPaths, determines the number of strains present, their haplotypes or sequences on the SCGs, and their abundances. STRONG is validated using synthetic communities and, for a real anaerobic digester time series, generates haplotypes that match those observed from long Nanopore reads.

    Structural variation affecting genetic environmental adaptation in salmonids (Strukturell variasjon som påvirker genetisk miljøtilpasning i laksefisk)

    Structural variations (SVs), e.g. deletions, insertions, inversions and duplications of sequences, are a major source of genomic variation, affecting more base pairs in the genome than single nucleotide polymorphisms (SNPs). Despite their increasingly recognised importance in adaptive evolution and species diversification, SVs are vastly understudied in most species. Long-read sequencing, together with recently developed bioinformatic tools, has provided step-change improvements in the precision and recall of SV detection and allows us to increase the number of detected SVs manyfold across the species range. In addition, long reads represent a major shift in our ability to build continuous genome assemblies as fundamental resources for most genome-wide studies. The work in this thesis utilises long-read data to generate multiple genome sequences for two salmonid species, Atlantic salmon (Salmo salar) and lake whitefish (Coregonus clupeaformis). We present the first pan-genome for Atlantic salmon, comprising 11 long-read-based assemblies across the species range. Among these, the highest-quality genome has 2.55 Gbp assembled into chromosome sequences, 259 Mbp more sequence than in the previous Atlantic salmon reference genome. The genome has highly improved continuity, with contig N50 increasing from 58 kbp to 28.06 Mbp (484-fold). SV detection in these 11 individuals revealed 1,061,452 SVs, with an average of ~77.4 Mbp of sequence differing per sample. The Atlantic salmon has adapted to different river environments across a large geographical distribution. To investigate the genomic variation underlying these adaptations, we associated SVs with environmental data in a dataset of 366 short-read samples genotyped using genome graph analyses. These analyses highlighted multiple SVs contributing to environmental adaptations, including an 18 kbp deletion encompassing a polymorphic segmental duplication of three genes associated with annual precipitation.
Next, we use the Atlantic salmon pan-genome to study the emergence of supergenes. Because supergenes can be maintained over millions of years by balancing selection and typically exhibit strong recombination suppression, their underlying functional variants and how they are formed are largely unknown. Inversions are a type of rearrangement commonly associated with supergenes, and by directly comparing multiple highly continuous genome assemblies we were able to detect a number of large inversions in Atlantic salmon. A 3 Mbp inversion, estimated to be ~15,000 years old and segregating in North American populations, displayed supergene signatures, with adaptive variation captured within the standard arrangement of the inversion as well as other adaptive variation accumulating after the inversion occurred. Characterization of other inversions with matched repeat structures at the breakpoints did not show any supergene signatures, suggesting that shared breakpoint repeats may obstruct supergene formation. Lastly, we created long-read-based genome assemblies for sympatric species pairs (Dwarf and Normal) of lake whitefish (Coregonus clupeaformis). The species pairs offer a suitable model system for studying genomic patterns of differentiation and, in particular, the role of SVs in speciation. By combining long-read, direct assembly, and short-read methods, we detect 89,909 high-confidence SVs in the species pair across two lakes, covering five times more sequence in the genome than SNPs. In the study, we highlight shared outliers of differentiation between the lakes, indicating that they contribute to speciation. Interestingly, we find that more than 70% of the SVs differentiating the Normal and Dwarf species pairs of lake whitefish overlap transposable elements.
This work demonstrates that SVs may play an important role in the differentiation and speciation of sympatric species pairs in lake whitefish.
[Norwegian summary] Structural variation (SVs), for example deletions, insertions, inversions and duplications of sequence, is an important source of genomic variation that in total affects more base pairs in the genome than point mutations (SNPs). Despite growing recognition that SVs play an important role in genetic adaptation to different environments and in speciation, this type of variation has been little studied in many species. New DNA sequencing technology with longer read lengths (long-read sequencing), together with the development of new bioinformatic tools, has led to drastic improvements in the detection of SVs. Long-read sequencing also makes it possible to produce more complete and contiguous genome sequences than before. In this thesis we use long-read data to produce several high-quality genome sequences for two different salmonid species: Atlantic salmon (Salmo salar) and a North American type of whitefish, lake whitefish (Coregonus clupeaformis). Here we report the first pan-genome for Atlantic salmon. It consists of 11 assemblies based on long-read sequencing of individuals from four different phylogeographic groups of wild salmon. The highest-quality assembly includes 2.55 Gbp of sequence in chromosomes, 259 Mbp more than the previous Atlantic salmon reference genome. In addition, the proportion of contiguous sequence, measured as contig N50, increased from 58 kbp to 28.06 Mbp (484 times higher). We found 1,061,452 SVs across the 11 individuals, with ~77.4 Mbp average sequence difference per sample. Over time, the Atlantic salmon has adapted to the environment in different rivers. To study the genetic variation underlying this adaptation, we associated SVs with various environmental variables in a dataset of 366 short-read sequenced samples using a genome graph.
Using these analyses we found several SVs that contribute to environmental adaptation, including an 18 kbp deletion containing three genes associated with the amount of precipitation in the area. We then used the Atlantic salmon pan-genome to study the formation of 'supergenes'. Supergenes are a coupling of genetic variation in linkage disequilibrium that can arise, for example, through large inversions. Here we used 11 genome assemblies to identify and characterise a number of large inversions in Atlantic salmon. One of the inversions, 3 Mbp long and estimated to be ~15,000 years old, showed signatures of development as a supergene. For the other inversions, which were flanked by repeated DNA, we did not find features characteristic of supergenes, which suggests that the repetitive DNA prevents supergene formation. Finally, we produced genome sequences for different forms ('Normal' and 'Dwarf') of lake whitefish (Coregonus clupeaformis) that live in the same lakes in North America. The genome sequences enable studies of the genomic mechanisms behind speciation in this salmonid. By combining long-read data, direct comparison of assemblies, and short-read data, we found 89,909 SVs that distinguished the two forms of lake whitefish in two lakes. The SVs encompass more than five times as many base pairs in the genome as SNPs. In the study we found several SVs with deviating occurrence ('outliers') in the two forms of lake whitefish, which indicates that these SVs contribute to speciation. Furthermore, we found that 70% of the SVs overlapped a form of repeated DNA called transposable elements. This work underlines that SVs can play an important role in speciation in lake whitefish.
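
The contig N50 quoted in the abstract above is a standard continuity statistic: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches half of the total assembly length. The following is a minimal sketch of that definition, not the thesis authors' code; the function name `n50` and the toy contig lengths are illustrative assumptions. The last line only re-checks the reported fold change from the two N50 values given in the abstract.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (hypothetical helper)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that brings the running total to half the assembly is the N50.
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


# Toy assembly (made-up lengths): total 300, so half is 150.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)  # 80 + 70 reaches 150, so the N50 is 70

# Fold change between the two contig N50 values reported in the abstract.
fold_change = 28.06e6 / 58e3  # about 483.8, which rounds to the reported 484
```

A single very long contig can dominate N50, which is why the statistic is usually reported alongside total assembled length, as the abstract does.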

    Development of efficient De Bruijn graph-based algorithms for genome assembly

    Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract] During the last two decades, thanks to the development of new sequencing techniques, the study of the genome has become very popular as a way to discover the genetic variation present in humans and other organisms. The predominant mode of genome analysis is the assembly of reads into one or more sequences that are as long as possible. The most traditional form of assembly involves reads from a single genome. In this field, third-generation reads have emerged in the last decade, bringing new challenges for which there are no efficient solutions. The first contribution of this thesis is Compact-Flye, a tool for the efficient assembly of third-generation reads built on the Flye algorithm. This tool relies on the ingenious use of compact data structures to improve typical assembly steps such as counting and indexing k-mers. Apart from the assembly of a single genome, there are techniques that seek to assemble all the genomes contained in a given sample. This kind of assembly is known as multiple sequence assembly or haplotype reconstruction, a subject also treated in this thesis. Our first approach to this problem is viaDBG, the first solution based on de Bruijn graphs that offers results comparable to current techniques in viral genome assembly while maintaining the efficiency of these graphs. Our second contribution is ViQUF, a natural improvement on its predecessor.
ViQUF completely changes the algorithm of viaDBG but continues to be based on the same structures, with some variations that allow it not only to improve results in terms of time and quality, but also to provide additional information such as an estimate of the relative abundance of each species in the sample.
[Resumen] Over the last two decades, thanks to the development of new sequencing techniques, the study of the genome has gained great popularity as a means of understanding the genetic variation present in humans and other organisms. The predominant mode of genome analysis is the assembly of reads into one or more sequences that are as long as possible. The most traditional form of assembly involves reads coming from a single genome. In this field, third-generation reads have emerged over the last decade, with new challenges for which no efficient solutions exist. The first contribution of this thesis is Compact-Flye, a tool for the efficient assembly of third-generation reads built on the Flye algorithm. This tool is based on the ingenious use of compact data structures to improve typical assembly stages such as k-mer counting and indexing. Beyond the assembly of a single genome, there are techniques that seek to assemble all the genomes contained in a given sample. This assembly is known as multiple sequence assembly or haplotype reconstruction, a topic also addressed in this thesis. Our first approach to solving it was viaDBG, the first solution based on de Bruijn graphs that offers results comparable to current techniques in viral genome assembly while maintaining the efficiency of these graphs. Our second contribution is ViQUF, a natural improvement on its predecessor. ViQUF completely changes the algorithm of viaDBG but still builds on the same structures, with some variations that allow it not only to improve results in time and quality but also to provide more information, such as relative estimates of each species in the sample.
[Resumo] Over the last two decades, thanks to the development of new sequencing techniques, the study of the genome has become very popular for discovering the genetic variation present in humans and other organisms. The predominant mode of genome analysis is the assembly of reads into one or several sequences that are as long as possible. The most traditional form of assembly involves reads from a single genome. In this field, third-generation reads have emerged over the last decade, with new challenges for which no efficient solutions exist. The first contribution of this thesis is Compact-Flye, a tool for the efficient assembly of third-generation reads built on the Flye algorithm. This tool is based on the clever use of compact data structures to improve typical assembly steps such as counting and indexing k-mers. In addition to the assembly of a single genome, there are techniques that seek to assemble all the genomes contained in a given sample. This assembly is known as multiple sequence assembly or haplotype reconstruction, a topic also addressed in this thesis. Our first approach to solving it was viaDBG, the first solution based on de Bruijn graphs that offers results comparable to current viral genome assembly techniques while maintaining the efficiency of these graphs. Our second contribution is ViQUF, a natural improvement on its predecessor. ViQUF completely changes the algorithm of viaDBG but is still based on the same structures, with some variations that allow it not only to improve results in time and quality but also to provide more information, such as relative estimates of each species in the sample.
Funding: Xunta de Galicia; ED431G 2019/01. Xunta de Galicia; ED431C 2021/53. Xunta de Galicia; IG240.2020.1.185. Xunta de Galicia; IN852A 2018/14.
I would like to thank the Centro de Investigación de Galicia "CITIC", funded by the Xunta de Galicia and the European Union (European Regional Development Fund, Galicia 2014-2020 Program), through grant ED431G 2019/01. Thanks also to the Xunta de Galicia/FEDER-UE, which funded this thesis through grants [ED431C 2021/53; IG240.2020.1.185; IN852A 2018/14]; to the Ministerio de Ciencia e Innovación through grants [TIN2016-78011-C4-1-R; FPU17/02742; PID2019-105221RB-C41; PID2020-114635RB-I00]; and to the Academy of Finland [grants 308030 and 323233 (LS)].
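
The abstract above names k-mer counting and de Bruijn graphs as the core steps these tools accelerate. The sketch below illustrates those two steps in their textbook form only; it uses an ordinary `Counter` and `dict` rather than the compact data structures the thesis describes, and it is not the Compact-Flye, viaDBG or ViQUF implementation. The function names, the `min_count` parameter and the toy reads are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_edges(kmer_counts, min_count=1):
    """Build a de Bruijn graph: each k-mer links its (k-1)-prefix to its (k-1)-suffix.

    k-mers seen fewer than min_count times are skipped, a simple stand-in
    for the error filtering that real assemblers perform.
    """
    graph = defaultdict(list)
    for kmer, count in kmer_counts.items():
        if count >= min_count:
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


# Two overlapping toy reads (made up for this example).
reads = ["ACGTAC", "CGTACG"]
kmer_counts = count_kmers(reads, 3)
graph = de_bruijn_edges(kmer_counts)
```

Because each distinct k-mer contributes one edge regardless of how many reads contain it, the graph size depends on sequence content rather than read depth; the per-k-mer counts are what abundance estimates of the kind mentioned for ViQUF would typically start from.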

    Strainline: Full-length de novo viral haplotype reconstruction from noisy long reads

    Haplotype-resolved de novo assembly of highly diverse virus genomes is critical in the prevention, control and treatment of viral diseases. Current methods either can handle only relatively accurate short-read data or collapse haplotype-specific variations into a consensus sequence. Here, we present Strainline, a novel approach to assemble viral haplotypes from noisy long reads without a reference genome. Strainline is the first approach to provide strain-resolved, full-length de novo assemblies of viral quasispecies from noisy third-generation sequencing data. Benchmarking on simulated and real datasets of varying complexity and diversity confirms this novelty and demonstrates the superiority of Strainline.