2,112 research outputs found
Strain-aware assembly of genomes from mixed samples using flow variation graphs
The goal of strain-aware genome assembly is to reconstruct all individual haplotypes from a mixed sample at the strain level and to provide abundance estimates for the strains.
STRONG: metagenomics strain resolution on assembly graphs
We introduce STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo from multiple metagenome samples. STRONG performs coassembly and binning into metagenome-assembled genomes (MAGs), and stores the coassembly graph prior to variant simplification. This enables the subgraphs, and their per-sample unitig coverages, to be extracted for individual single-copy core genes (SCGs) in each MAG. A Bayesian algorithm, BayesPaths, determines the number of strains present, their haplotypes or sequences on the SCGs, and their abundances. STRONG is validated using synthetic communities and, for a real anaerobic digestor time series, generates haplotypes that match those observed from long Nanopore reads.
Examining bacterial variation with genome graphs and Nanopore sequencing
A bacterial species' genetic content can be remarkably fluid. The collection of genes found within a given species is called the pan-genome and is generally much larger than the gene repertoire of a single cell. A consequence of this pan-genome is that bacterial genomes are highly adaptable and thus variable.
The dominant paradigm for analysing genetic variation relies on a central idea: all genomes in a species can be described as minor differences from a single reference genome, which serves as a coordinate system. As an introduction to this thesis, we outline why this approach is inadequate for bacteria and describe a new approach using genome graphs.
In the first chapter, we present algorithms for de novo variant discovery within such genome graphs and evaluate their performance with empirical data. The remaining chapters address a question relating to a critical bacterial pathogen: can Nanopore sequencing of Mycobacterium tuberculosis provide high-quality public health information? We collect data from Madagascar, South Africa, and England to help answer this question. First, we assess outbreaks identified using single-reference and genome graph methods. Second, we evaluate antimicrobial resistance predictions and introduce a framework for using genome graphs to improve current methods. Lastly, we train an M. tuberculosis-specific Nanopore basecalling model with considerable accuracy improvement.
Together, this thesis provides general methods for uncovering bacterial variation and applies them to an important global public health question.
EMBL International PhD Programm
Structural variation affecting genetic environmental adaptation in salmonids
Structural variations (SVs), e.g. deletions, insertions, inversions and duplications of sequences, are a major source of genomic variation, affecting more base pairs in the genome than single nucleotide polymorphisms (SNPs). Despite their increasingly recognised importance in adaptive evolution and species diversification, SVs are vastly understudied in most species. Long-read sequencing, together with recently developed bioinformatic tools, has provided step-change improvements in the precision and recall of SV detection and allows us to increase the number of detected SVs manyfold across the species range. In addition, long reads represent a major shift in our ability to build continuous genome assemblies as fundamental resources for most genome-wide studies. The work in this thesis utilises long-read data to generate multiple genome sequences for the two salmonid species Atlantic salmon (Salmo salar) and lake whitefish (Coregonus clupeaformis).
We present the first pan-genome for Atlantic salmon, comprising 11 long-read-based assemblies across the species range. Among these, the highest-quality genome has 2.55 Gbp assembled into chromosome sequences, 259 Mbp more sequence than in the previous Atlantic salmon reference genome. The genome has greatly improved continuity, with contig N50 increasing from 58 kbp to 28.06 Mbp (484-fold). The detection of SVs in these 11 individuals revealed 1,061,452 SVs, with an average of ~77.4 Mbp of sequence differing per sample. The Atlantic salmon has adapted to different river environments across a large geographical distribution. To investigate the genomic variation underlying these adaptations, we associated SVs with environmental data in a dataset of 366 short-read samples genotyped using genome graph analyses. These analyses highlighted multiple SVs contributing to environmental adaptations, including an 18 kbp deletion encompassing a polymorphic segmental duplication of three genes associated with annual precipitation.
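As a rough illustration of the contig N50 statistic quoted above, the sketch below computes N50 under its usual definition (the length of the contig at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total assembly length) and checks the reported fold change. The function name `n50` and the toy contig lengths are assumptions made for illustration; they are not taken from the thesis.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length of all contigs.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly, not data from the thesis.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)

# Fold change implied by the two N50 values reported in the abstract:
# 28.06 Mbp for the new assembly versus 58 kbp for the previous reference.
fold_change = 28.06e6 / 58e3
```

For the toy lengths the total is 300 bp, and the two longest contigs (80 + 70) already reach half of that, so the N50 is 70. The ratio of the two reported N50 values is about 483.8, which is consistent with the 484-fold figure in the abstract.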
Next, we use the Atlantic salmon pan-genome to study the emergence of supergenes. Because supergenes can be maintained over millions of years by balancing selection and typically exhibit strong recombination suppression, their underlying functional variants and how they are formed are largely unknown. Inversions are a type of rearrangement commonly associated with supergenes, and by directly comparing multiple highly continuous genome assemblies we were able to detect a number of large inversions in Atlantic salmon. A 3 Mb inversion, estimated to be ~15,000 years old and segregating in North American populations, displayed supergene signatures, with adaptive variation captured within the standard arrangement of the inversion as well as other adaptive variation accumulating after the inversion occurred. Characterization of other inversions with matched repeat structures at the breakpoints did not show any supergene signatures, suggesting that shared breakpoint repeats may obstruct supergene formation.
Lastly, we created long-read-based genome assemblies for sympatric species pairs (Dwarf and Normal) belonging to lake whitefish (Coregonus clupeaformis). The species pairs offer a suitable model system for studying genomic patterns of differentiation and, in particular, the role of SVs in speciation. By combining long reads, direct assembly, and short-read methods we detect 89,909 high-confidence SVs in the species pair across two lakes, covering five times more sequence in the genome than SNPs. In the study, we highlight shared outliers of differentiation between the lakes, indicating that they contribute to speciation. Interestingly, we find that more than 70% of the SVs differentiating the Normal and Dwarf species pairs of lake whitefish overlap transposable elements. This work demonstrates that SVs may play an important role in the differentiation and speciation of sympatric species pairs in lake whitefish.
Structural variants (SVs), for example deletions, insertions, inversions and duplications of sequence, are an important source of genomic variation that, taken together, affects more base pairs in the genome than point mutations (SNPs). Despite growing recognition that SVs play an important role in genetic adaptation to different environments and in speciation, this type of variation has been little studied in many species. New DNA sequencing technology with longer read lengths (long-read sequencing), together with the development of new bioinformatic tools, has led to drastic improvements in the detection of SVs. Long-read sequencing also makes it possible to produce more complete and contiguous genome sequences than before. In this thesis we use long-read data to produce several high-quality genome sequences for two salmonid species: Atlantic salmon (Salmo salar) and a North American whitefish, lake whitefish (Coregonus clupeaformis).
Here we report the first pan-genome for Atlantic salmon. It consists of 11 assemblies based on long-read sequencing of individuals from four different phylogeographic groups of wild salmon. The highest-quality assembly includes 2.55 Gbp of sequence in chromosomes, 259 Mbp more than the previous Atlantic salmon reference genome. In addition, the contiguity of the sequence, measured as contig N50, increased from 58 kbp to 28.06 Mbp (484 times higher).
We found 1,061,452 SVs across the 11 individuals, with an average sequence difference of ~77.4 Mbp per sample. Over time, Atlantic salmon has adapted to the environment in different rivers. To study the genetic variation underlying this adaptation, we associated SVs with various environmental variables in a dataset of 366 short-read-sequenced samples using a genome graph. These analyses identified several SVs that contribute to environmental adaptation, including an 18 kbp deletion, containing three genes, that is associated with the amount of precipitation in the area.
We then used the Atlantic salmon pan-genome to study the formation of supergenes. Supergenes are a coupling of genetic variation in linkage disequilibrium that can arise, for example, through large inversions. Here we used 11 genome assemblies to identify and characterise a number of large inversions in Atlantic salmon. One of the inversions, 3 Mbp long and estimated to be ~15,000 years old, showed signatures of developing into a supergene. For the other inversions, which were flanked by repeated DNA, we found no characteristic features of supergenes, which suggests that the repetitive DNA prevents the formation of supergenes.
Finally, we produced genome sequences for different forms (Normal and Dwarf) of lake whitefish (Coregonus clupeaformis) that live in the same lakes in North America. The genome sequences enable studies of the genomic mechanisms behind speciation in this salmonid. By combining long-read data, direct comparison of assemblies, and short-read data we found 89,909 SVs that distinguished the two forms of lake whitefish in two lakes. The SVs cover more than five times as many base pairs in the genome as SNPs. In the study we found several SVs with deviating occurrence (outliers) in the two forms of lake whitefish, which indicates that these SVs contribute to speciation. We further found that 70% of the SVs overlapped a form of repeated DNA called transposable elements. This work underlines that SVs may play an important role in speciation in lake whitefish.
Development of efficient De Bruijn graph-based algorithms for genome assembly
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract] During the last two decades, thanks to the development of new sequencing techniques, the study of the genome has become very popular as a way to discover the genetic variation present in both humans and other organisms. The predominant mode of genome analysis is the assembly of reads into one or multiple sequences that are as long as possible. The most traditional form of assembly involves reads from a single genome. In this field, third-generation reads have emerged in the last decade, bringing new challenges for which there are no efficient solutions. The first contribution of this thesis is Compact-Flye, a tool for the efficient assembly of third-generation reads built on the Flye algorithm. This tool is based on the ingenious use of compact data structures to improve typical assembly steps such as counting and indexing k-mers. Apart from the assembly of a single genome, there are techniques that seek to assemble all the genomes contained in a given sample. This kind of assembly is known as multiple sequence assembly or haplotype reconstruction, a subject also treated in this thesis. Our first approach to solving it is viaDBG, the first solution based on de Bruijn graphs that offers results comparable to current techniques in viral genome assembly while maintaining the efficiency of these graphs. Our second contribution is ViQUF, a natural improvement on its predecessor. ViQUF completely changes the algorithm of viaDBG but continues to be based on the same structures, with some variations that allow it not only to improve results in terms of time and quality, but also to provide additional information such as an estimate of the relative presence of each species in the sample.
[Resumen] During the last two decades, thanks to the development of new sequencing techniques, the study of the genome has gained a great deal of popularity as a means of understanding the genetic variation present in both humans and other organisms. The predominant mode of genome analysis is the assembly of reads into one or multiple sequences that are as long as possible. The most traditional form of assembly is the one that involves reads originating from a single genome. In this field, third-generation reads have emerged in the last decade, with new challenges for which there are no efficient solutions. The first contribution made in this thesis is Compact-Flye, a tool for the efficient assembly of third-generation reads on top of the Flye algorithm. This tool is based on the ingenious use of compact data structures to improve typical assembly stages such as the counting and indexing of k-mers. Beyond the assembly of a single genome, there are techniques that seek to assemble all the genomes contained in a given sample. This kind of assembly is known as multiple sequence assembly or haplotype reconstruction, a topic also addressed in this thesis. Our first approach to solving it has been viaDBG, which is the first solution based on de Bruijn graphs that offers results comparable to current techniques in viral genome assembly while maintaining the efficiency of these graphs. Our second contribution is ViQUF, which is a natural improvement on its predecessor. ViQUF completely changes the algorithm of viaDBG but remains founded on the same structures, with some variations that allow it not only to improve results in time and quality but also to provide more information, such as relative estimates for each species in the sample.
[Resumo] During the last two decades, thanks to the development of new sequencing techniques, the study of the genome has become very popular as a way to discover the genetic variation present in both humans and other organisms. The predominant mode of genome analysis is the assembly of reads into one or several sequences that are as long as possible. The most traditional way of assembling is the one that involves reads from a single genome. In this field, third-generation reads have emerged in the last decade, with new challenges for which there are no efficient solutions. The first contribution made in this thesis is Compact-Flye, a tool for the efficient assembly of third-generation reads on top of the Flye algorithm. This tool is based on the clever use of compact data structures to improve typical assembly steps, such as counting and indexing k-mers. In addition to the assembly of a single genome, there are techniques that seek to assemble all the genomes contained in a given sample. This kind of assembly is known as multiple sequence assembly or haplotype reconstruction, a topic also addressed in this thesis. Our first approach to solving it was viaDBG, which is the first solution based on de Bruijn graphs that offers results comparable to current viral genome assembly techniques while maintaining the efficiency of these graphs. Our second contribution is ViQUF, which is a natural improvement on its predecessor. ViQUF completely changes the algorithm of viaDBG but remains based on the same structures, with some variations that allow it not only to improve results in time and quality but also to provide more information, such as relative estimates for each species in the sample.
Xunta de Galicia; ED431G 2019/01
Xunta de Galicia; ED431C 2021/53
Xunta de Galicia; IG240.2020.1.185
Xunta de Galicia; IN852A 2018/14
I would like to thank the Centro de Investigación de Galicia "CITIC", funded by the Xunta de Galicia and the European Union (European Regional Development Fund - Galicia 2014-2020 Program), through grant ED431G 2019/01. I also thank the Xunta de Galicia/FEDER-UE, which funded this thesis through grants [ED431C 2021/53; IG240.2020.1.185; IN852A 2018/14]; the Ministerio de Ciencia e Innovación, through grants [TIN2016-78011-C4-1-R; FPU17/02742; PID2019-105221RB-C41; PID2020-114635RB-I00]; and the Academy of Finland [grants 308030 and 323233 (LS)]
Strainline: Full-length de novo viral haplotype reconstruction from noisy long reads
Haplotype-resolved de novo assembly of highly diverse virus genomes is critical in the prevention, control and treatment of viral diseases. Current methods can either handle only relatively accurate short-read data or collapse haplotype-specific variations into a consensus sequence. Here, we present Strainline, a novel approach to assemble viral haplotypes from noisy long reads without a reference genome. Strainline is the first approach to provide strain-resolved, full-length de novo assemblies of viral quasispecies from noisy third-generation sequencing data. Benchmarking on simulated and real datasets of varying complexity and diversity confirms this novelty and demonstrates the superiority of Strainline.
Computational methods for single cell RNA and genome assembly resolution using genetic variation
Genetic variation and natural selection have driven the evolutionary history on this planet and are responsible for creating us and all other life as we know it. Over the past several decades, the genomic revolution has allowed us to assess population variation across humans and other species and use that to link genotypes with phenotypes and infer evolutionary histories. In this thesis, I explore computational methods for using genetic variation to demultiplex and disambiguate complex data.
In single cell RNAseq, batch effects, doublets, and ambient RNA are each sources of noise that impede our ability to infer the functional states of cells and compare them between experiments. One popular new experimental design that promises to solve each of these while also reducing experimental costs is mixing multiple individuals' cells into a single experiment. In chapter 2, I present a method for clustering cells by genotype, calling doublets, and using the cross-genotype signal in singletons to estimate and remove ambient RNA. I compare this method to other existing methods, including one that requires a priori information about the genotypes and two that do not. I find that my method outperforms each of these methods across a wide range of data parameters and sample types.
In genome assembly, the recent higher throughput and lower cost of long-read sequencing have revolutionized our ability to create reference-quality genomes and have revitalized the assembly community. Now, massive efforts are taking place in the Darwin Tree of Life project and the Earth BioGenome Project to create reference genomes for all multicellular eukaryotic life. This will create a scientific resource for the next generation of biological science, will preserve data that could otherwise be lost in this time of mass extinction, and will allow for a much broader understanding of evolution and the evolutionary history of life on Earth. While much progress has been made in data quality and assembly algorithms, some problems still exist. Until recently, the DNA input requirements for long-read sequencing technologies made it impossible to sequence single individuals of small-bodied species with long reads. Also, high heterozygosity makes assembly more difficult because of the inherent ambiguity between heterozygous and paralogous sequence when confronted with inexact homology. One solution to the DNA input requirements would be to pool individuals, but this only increases the heterozygosity of the sample and reduces assembly quality. In chapter 3, we present the first high-quality assembly of a single mosquito, using new library preparation methods with reduced DNA requirements. This reduces the number of haplotypes to two, improving the assembly quality. In chapter 4, we further address the problems brought on by heterozygosity in assembly. I present a suite of tools that use the phasing consistency of multiple heterozygous sequences as a signal for physical linkage, thus using genetic variation to our advantage rather than as a challenge to overcome. This suite creates phased, linked assemblies and performs phasing-aware scaffolding. Further, I provide a tool for phasing-aware scaffolding on existing assemblies.
This includes a novel haplotype phasing algorithm with some unique beneficial properties. It is robust to non-heterozygous variants as input and can detect and correct those genotypes. It also extends naturally to polyploid genomes.
Wellcome Trus