283 research outputs found

    Full-length de novo viral quasispecies assembly through variation graph construction

    Get PDF
    International audienceMotivation: Viruses populate their hosts as a viral quasispecies: a collection of genetically related mutant strains.Viral quasispecies assembly refers to reconstructing the strain-specific haplotypes from read data, and predicting their relative abundances within the mix of strains, an important step for various treatment-related reasons. Reference-genome-independent ("de novo") approaches have yielded benefits over reference-guided approaches, because reference-induced biases can become overwhelming when dealing with divergent strains. While being very accurate, extant de novo methods only yield rather short contigs. It remains to reconstruct full-length haplotypes together with their abundances from such contigs. Method: We first construct a variation graph, a recently popular, suitable structure for arranging and integrating several related genomes, from the short input contigs, without making use of a reference genome. To obtain paths through the variation graph that reflect the original haplotypes, we solve a minimization problem that yields a selection of maximal-length paths that is optimal in terms of being compatible with the read coverages computed for the nodes of the variation graph. We output the resulting selection of maximal length paths as the haplotypes, together with their abundances. Results: Benchmarking experiments on challenging simulated data sets show significant improvements in assembly contiguity compared to the input contigs, while preserving low error rates. As a consequence, our method outperforms all state-of-the-art viral quasispecies assem-blers that aim at the construction of full-length haplotypes, in terms of various relevant assembly measures. Our tool, Virus-VG, is publicly available at https://bitbucket.org/jbaaijens/ virus-vg

    De novo assembly of viral quasispecies using overlap graphs

    Get PDF
    Baaijens JA, Aabidine AZE, Rivals E, Schönhuth A. De novo assembly of viral quasispecies using overlap graphs. Genome Research. 2017;27(5):835-848

    Development of efficient De Bruijn graph-based algorithms for genome assembly

    Get PDF
    Programa Oficial de Doutoramento en Computación. 5009V01[Abstract] During the last two decades, thanks to the development of new sequencing techniques, the study of the genome has become very popular in order to discover the genetic variation present in both humans and other organisms. The predominant mode of genome analysis is through the assembly of reads in one or multiple chains for as long as possible. The most traditional way of assembly is the one that involves reads from a single genome. In this field, in the last decade, third-generation readings have emerged with new challenges for which there are no efficient solutions. The first contribution that has been made in this thesis is Compact-Flye, a tool for the efficient assembly of third-generation reads on the Flye algorithm. This tool is based on the ingenious use of compact data structures to improve typical assembly steps such as counting and indexing k-mers. Apart from the assembly of a genome, there are techniques that seek to assemble all the genomes contained in a given sample. This assembly is known as multiple sequence assembly or haplotype reconstruction, a subject also treated in this thesis. Our first approach to solving this has been viaDBG, which is the first solution based on de Bruijn graphs that offers results comparable to current techniques in viral genome assembly while maintaining the efficiency of these graphs. Our second contribution is ViQUF, which is a natural improvement on its predecessor. ViQUF completely changes the algorithm of viaDBG but continues to be based on the same structures, although with some variations that allow it not only to improve results in terms of time and quality, but also to provide additionalinformation such as an estimate of the relative presence of each species in the sample.[Resumen] Durante las últimas dos décadas, gracias al desarrollo de nuevas técnias secuenciación, el estudio del genoma ha ganado mucha popularidad de cara a conocer la variación genética presente tanto seres humanos como otros organismos. El modo predominante de análisis del genoma es a través del ensamblaje de lecturas en una o múltiples cadenas lo más largas posibles. La manera más tradicional de ensamblaje es el que implica lecturas provenientes de un solo genoma. En este campo, en la última década han surgido las lecturas de tercera generación con nuevos retos para los que no existen soluciones eficientes. La primera aportación que se ha realizado en esta tesis es Compact-Flye una herramienta para el ensamblaje eficiente de lecturas de tercera generación sobre el algoritmo Flye. Esta herramienta está basada en el uso igenioso de estructuras compactas de datos para mejorar etapas típicas del ensamblaje como el conteo e indexación de k-mers. Al margen del ensamblaje de un genoma existen técnicas que buscan ensamblar todos los genomas contenidos en una muestra determinada. Este ensamblaje es conocido como ensamblaje múltiple de secuencias o reconstrucción de haplotipos, tema también tratado en esta tesis. Nuestra primera aproximación para la resolución de este ha sido viaDBG, que es la primera solución basada en grafos de de Bruijn que ofrece resultados comparables a las técnicas vigentes en ensamblaje de genomas víricos, mientras que mantiene la eficiencia de estos grafos. Nuestra segunda aportación es ViQUF, que es una mejora natural de su predecesor. ViQUF cambia totalmente la algoritmia de viaDBG, pero sigue cimentándose en las mismas estructuras aunque con alguna variación que le permite no solo mejorar resultados en tiempo y calidad. Sino que además le permite aportar más información como estimaciones relativa de cada especie en la muestra.[Resumo] Durante as dúas últimas décadas, grazas ao desenvolvemento de novas técnicas de secuenciación, o estudo do xenoma fíxose moi popular para descubrir a variación xenética presente tanto nos humanos como noutros organismos. O modo predominante de análise do xenoma é a través da ensamblaxe de lecturas nunha ou varias cadeas o maior tempo posible. A forma máis tradicional de ensamblar é a que implica lecturas dun só xenoma. Neste campo, na última década xurdiron lecturas de terceira xeración con novos retos para os que non existen solucións eficientes. A primeira contribución que se fixo nesta tese é Compact-Flye, unha ferramenta para a montaxe eficiente de lecturas de terceira xeración sobre o algoritmo Flye. Esta ferramenta baséase no uso intelixente de estruturas de datos compactas para mellorar os pasos típicos de montaxe, como contar e indexar k-mers. Ademais da montaxe dun xenoma, existen técnicas que buscan ensamblar todos os xenomas contidos nunha determinada mostra. Este conxunto coñécese como conxunto de secuencias múltiples ou reconstrución de haplotipos, tema tamén tratado nesta tesis. O noso primeiro enfoque para resolver isto foi viaDBG, que é a primeira solución baseada en gráficos de Bruijn que ofrece resultados comparables ás técnicas actuais de ensamblaxe de xenoma viral, mantendo a eficiencia destes gráficos. A nosa segunda incorporación é ViQUF, que é unha mellora natural con respecto ao seu predecesor. ViQUF cambia completamente o algoritmo de viaDBG pero segue baseándose nas mesmas estruturas, aínda que con algunha variación que lle permite non só mellorar os resultados en tempo e calidade. Pero tamén permite achegar máis información como estimacións relativas de cada especie da mostra.Xunta de Galicia; ED431G 2019/01Xunta de Galicia; ED431C 2021/53Xunta de Galicia; IG240.2020.1.185Xunta de Galicia; IN852A 2018/14Quiero agradecer al Centro de Investigación de Galicia “CITIC”, financiado por la Xunta de Galicia y la Unión Europea (European Regional Development Fund- Galicia 2014-2020 Program), con la beca ED431G 2019/01. También agradecer a la Xunta de Galicia/FEDER-UE que ha financiado esta tesis a través de las becas [ED431C 2021/53; IG240.2020.1.185; IN852A 2018/14]; al Ministerio de Ciencia e Innovación con las becas [TIN2016- 78011-C4-1-R; FPU17/02742; PID2019-105221RB-C41; PID2020-114635RB-I00]; y a la academia de Finlandia [grants 308030 and 323233 (LS)]

    Accurate Viral Population Assembly From Ultra-Deep Sequencing Data

    Get PDF
    Motivation: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors. Results: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation–maximization algorithm to estimate abundances of the assembled viral variants in the population. Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads

    Strain-aware assembly of genomes from mixed samples using flow variation graphs

    Get PDF
    The goal of strain-aware genome assembly is to reconstruct all individual haplotypes from a mixed sample at the strain level and to provide abundance estimates for the strains

    BMC Genomics

    Get PDF
    BackgroundViruses have high mutation rates and generally exist as a mixture of variants in biological samples. Next-generation sequencing (NGS) approaches have surpassed Sanger for generating long viral sequences, yet how variants affect NGS de novo assembly remains largely unexplored.ResultsOur results from >\u200915,000 simulated experiments showed that presence of variants can turn an assembly of one genome into tens to thousands of contigs. This \u201cvariant interference\u201d (VI) is highly consistent and reproducible by ten commonly-used de novo assemblers, and occurs over a range of genome length, read length, and GC content. The main driver of VI is pairwise identities between viral variants. These findings were further supported by in silico simulations, where selective removal of minor variant reads from clinical datasets allow the \u201crescue\u201d of full viral genomes from fragmented contigs.ConclusionsThese results call for careful interpretation of contigs and contig numbers from de novo assembly in viral deep sequencing.2020-06-22T00:00:00Z32571214PMC7306937821

    Viral Quasispecies Reconstruction Using Next Generation Sequencing Reads

    Get PDF
    The genomic diversity of viral quasispecies is a subject of great interest, especially for chronic infections. Characterization of viral diversity can be addressed by high-throughput sequencing technology (454 Life Sciences, Illumina, SOLiD, Ion Torrent, etc.). Standard assembly software was originally designed for single genome assembly and cannot be used to assemble and estimate the frequency of closely related quasispecies sequences. This work focuses on parsimonious and maximum likelihood models for assembling viral quasispecies and estimating their frequencies from 454 sequencing data. Our methods have been applied to several RNA viruses (HCV, IBV) as well as DNA viruses (HBV), genotyped using 454 Life Sciences amplicon and shotgun methods
    corecore