6 research outputs found

    Progressive Cactus is a multiple-genome aligner for the thousand-genome era

    New genome assemblies have been arriving at a rapidly increasing pace, thanks to decreases in sequencing costs and improvements in third-generation sequencing technologies(1-3). For example, the number of vertebrate genome assemblies currently in the NCBI (National Center for Biotechnology Information) database(4) increased by more than 50% to 1,485 assemblies in the year from July 2018 to July 2019. In addition to this influx of assemblies from different species, new human de novo assemblies(5) are being produced, which enable the analysis of not only small polymorphisms, but also complex, large-scale structural differences between human individuals and haplotypes. This coming era and its unprecedented amount of data offer the opportunity to uncover many insights into genome evolution but also present challenges in how to adapt current analysis methods to meet the increased scale. Cactus(6), a reference-free multiple genome alignment program, has been shown to be highly accurate, but the existing implementation scales poorly with increasing numbers of genomes and struggles in regions of highly duplicated sequences. Here we describe progressive extensions to Cactus to create Progressive Cactus, which enables the reference-free alignment of tens to thousands of large vertebrate genomes while maintaining high alignment quality. We describe results from an alignment of more than 600 amniote genomes, which is to our knowledge the largest multiple vertebrate genome alignment created so far.

    Graphical pangenomics

    Completely sequencing genomes is expensive, and to save costs we often analyze new genomic data in the context of a reference genome. This approach distorts our image of the inferred genome, an effect which we describe as reference bias. To mitigate reference bias, I repurpose graphical models previously used in genome assembly and alignment to serve as a reference system in resequencing. To do so, I formalize the concept of a variation graph to link genomes to a graphical model of their mutual alignment that is capable of representing any kind of genomic variation, both small and large. As this model combines both sequence and variation information in one structure, it serves as a natural basis for resequencing. By indexing the topology, sequence space, and haplotype space of these graphs and developing generalizations of sequence alignment suitable to them, I am able to use them as reference systems in the analysis of a wide array of genomic systems, from large vertebrate genomes to microbial pangenomes. To demonstrate the utility of this approach, I use my implementation to solve resequencing and alignment problems in the context of Homo sapiens and Saccharomyces cerevisiae. I use graph visualization techniques to explore variation graphs built from a variety of sources, including diverged human haplotypes, a gut microbiome, and a freshwater viral metagenome. I find that variation-aware read alignment can eliminate reference bias at known variants, and this is of particular importance in the analysis of ancient DNA, where existing approaches result in significant bias towards the reference genome and concomitant distortion of population genetics results. I validate that the variation graph model can be applied to align RNA sequencing data to a splicing graph. Finally, I show that a classical pangenomic inference problem in microbiology can be solved using a resequencing approach based on variation graphs. Wellcome Trust PhD fellowship
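    The variation graph idea in this abstract (sequence-labelled nodes, edges between them, and genomes embedded as paths) can be illustrated with a toy data structure. The sketch below is only an illustration under stated assumptions: the class name `VariationGraph`, its methods, and the example sequences are hypothetical and are not taken from the thesis or its implementation.

```python
from dataclasses import dataclass, field


@dataclass
class VariationGraph:
    """Toy variation graph (hypothetical names, not the thesis's API)."""
    nodes: dict = field(default_factory=dict)   # node id -> DNA sequence
    edges: set = field(default_factory=set)     # directed (from_id, to_id) pairs
    paths: dict = field(default_factory=dict)   # path name -> list of node ids

    def add_node(self, node_id, seq):
        self.nodes[node_id] = seq

    def add_edge(self, src, dst):
        self.edges.add((src, dst))

    def add_path(self, name, node_ids):
        # A path is only valid if every consecutive pair is joined by an edge.
        for a, b in zip(node_ids, node_ids[1:]):
            if (a, b) not in self.edges:
                raise ValueError(f"no edge {a}->{b} for path {name!r}")
        self.paths[name] = list(node_ids)

    def spell(self, name):
        # Concatenate node sequences along the path to recover the genome.
        return "".join(self.nodes[n] for n in self.paths[name])


def build_toy_graph():
    """Two haplotypes that differ by a single T/C substitution (made-up data)."""
    g = VariationGraph()
    g.add_node(1, "ACG")
    g.add_node(2, "T")    # reference allele
    g.add_node(3, "C")    # alternate allele
    g.add_node(4, "GGA")
    for src, dst in [(1, 2), (1, 3), (2, 4), (3, 4)]:
        g.add_edge(src, dst)
    g.add_path("ref", [1, 2, 4])
    g.add_path("alt", [1, 3, 4])
    return g


if __name__ == "__main__":
    graph = build_toy_graph()
    for name in graph.paths:
        print(name, graph.spell(name))
```

    Storing both haplotypes as paths through shared nodes is what lets a read carrying either allele align without a penalty; a linear reference would hold only one of the two.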

    New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly

    Great efforts have been devoted to deciphering the sequence composition of the genomes and transcriptomes of diverse organisms. Continuing advances in high-throughput sequencing technologies have led to a decline in associated costs, facilitating a rapid increase in the amount of available genetic data. In particular, genome studies have undergone a fundamental paradigm shift where genome projects are no longer limited by sequencing costs, but rather by computational problems associated with assembly. There is an urgent demand for more efficient and more accurate methods. Most recently, “hybrid” methods that integrate short- and long-read data have been devised to address this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph. By design, unitigs are both unique and almost free of assembly errors. As a consequence, only a few spurious overlaps are introduced into the graph. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts subgraphs whose global properties approach a disjoint union of paths in multiple steps, utilizing properties of proper interval graphs. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast, fruit fly, and human genomes compared to state-of-the-art pipelines, but also requires much less computational effort. An optimized C++ implementation dubbed MuCHSALSA further significantly reduces resource demands. Advances in RNA-seq have facilitated tremendous insights into the role of both coding and non-coding transcripts. Yet, the complete and accurate annotation of the transcriptomes of even model organisms has remained elusive. 
RNA-seq produces reads significantly shorter than the average distance between related splice events and presents high noise levels and other biases. The computational reconstruction remains a critical bottleneck. Ryūtō implements an extension of common splice graphs facilitating the integration of reads spanning multiple splice sites and paired-end reads bridging distant transcript parts. The decomposition of read coverage patterns is modeled as a minimum-cost flow problem. Using phasing information from multi-splice and paired-end reads, nodes with uncertain connections are decomposed step-wise via linear programming. Ryūtō's performance compares favorably with state-of-the-art methods on both simulated and real-life datasets. Despite ongoing research and our own contributions, progress on traditional single-sample assembly has brought no major breakthrough. Multi-sample RNA-seq experiments provide more information which, however, is challenging to utilize due to the large amount of accumulating errors. An extension to Ryūtō enables the reconstruction of consensus transcriptomes from multiple RNA-seq data sets, incorporating consensus calling at low-level features. Benchmarks show stable improvements already at 3 replicates. Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō consistently improves assembly on replicates, demonstrable also when mixing conditions or time series and for differential expression analysis. Ryūtō's approach towards guided assembly is equally unique. 
It allows users to adjust results based on the quality of the guide, even for multi-sample assembly.

    1 Preface
      1.1 Assembly: A vast and fast evolving field
      1.2 Structure of this Work
      1.3 Available
    2 Introduction
      2.1 Mathematical Background
      2.2 High-Throughput Sequencing
      2.3 Assembly
      2.4 Transcriptome Expression
    3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly
      3.1 Background
      3.2 Strategy
      3.3 Data preprocessing
      3.4 Processing of the overlap graph
      3.5 Post Processing of the Path Decomposition
      3.6 Benchmarking
      3.7 MuCHSALSA – Moving towards the future
    4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly
      4.1 Background
      4.2 Strategy
      4.3 The Ryūtō core algorithm
      4.4 Improved Multi-sample transcript assembly with Ryūtō
    5 Conclusion & Future Work
      5.1 Discussion and Outlook
      5.2 Summary and Conclusion

    Computational haplotyping : theory and practice

    Genomics has paved a new way to comprehend life and its evolution, and also to investigate causes of diseases and their treatment. One of the important problems in genomic analyses is haplotype assembly. Constructing complete and accurate haplotypes plays an essential role in understanding population genetics and how species evolve. In this thesis, we focus on computational approaches to haplotype assembly from third-generation sequencing technologies. This involves huge amounts of sequencing data, and such data contain errors due to the single-molecule sequencing protocols employed. Taking advantage of combinatorial formulations helps to correct for these errors to solve the haplotyping problem. Various computational techniques such as dynamic programming, parameterized algorithms, and graph algorithms are used to solve this problem. This thesis presents several contributions concerning the area of haplotyping. First, a novel algorithm based on dynamic programming is proposed to provide approximation guarantees for phasing a single individual. Second, an integrative approach is introduced to combine multiple sequencing datasets to generate complete and accurate haplotypes. The effectiveness of this integrative approach is demonstrated on a real human genome. Third, we provide a novel efficient approach to phasing pedigrees and demonstrate its advantages in comparison to phasing a single individual. Fourth, we present a generalized graph-based framework for performing haplotype-aware de novo assembly. Specifically, this generalized framework consists of a hybrid pipeline for generating accurate and complete haplotypes from data stemming from multiple sequencing technologies, one that provides accurate reads and another that provides long reads.
    Genomics has opened new ways to understand the evolution of living organisms, to investigate the causes of numerous diseases, and to develop new therapies. One important problem is the assembly of an individual's haplotypes. This reconstruction of haplotypes plays a central role in understanding population genetics and the evolution of a species. This thesis presents algorithms for haplotype assembly based on third-generation sequencing data. This requires large amounts of data, which in turn contain errors introduced by the underlying sequencing protocols. Combinatorial formulations of the problem nevertheless make haplotype reconstruction possible, because the errors can be corrected successfully. Various computational methods, such as dynamic programming, parameterized algorithms, and graph algorithms, can be used to solve this problem. This thesis presents several approaches to haplotype reconstruction. First, a novel algorithm is presented that, based on the principle of dynamic programming, provides approximation guarantees for haplotyping a single individual. Second, an integrative approach is presented for combining multiple sequencing datasets and thereby generating accurate haplotypes. The effectiveness of this method is demonstrated on a real human dataset. Third, a new, efficient algorithm is described for simultaneously constructing the haplotypes of related individuals, and its advantages over considering individuals separately are shown. Fourth, we present a graph-based method for performing de novo assembly using haplotype information. This method combines data from different sequencing technologies that provide either accurate or long sequencing reads.
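    The combinatorial view of haplotype assembly described in this abstract can be illustrated with a small example. The sketch below assumes the minimum error correction (MEC) objective, a common formulation that the abstract itself does not name, and solves it by brute-force enumeration rather than by the dynamic programming the thesis proposes. The function names and the read data are hypothetical.

```python
from itertools import product


def mismatches(read, haplotype):
    """Count positions where a read disagrees with a haplotype.

    `read` maps variant position -> observed allele (0 or 1); positions the
    read does not cover are simply absent from the mapping.
    """
    return sum(1 for pos, allele in read.items() if haplotype[pos] != allele)


def mec_cost(reads, haplotype):
    """MEC cost: each read is assigned to whichever of the two complementary
    haplotypes it fits better, and the leftover mismatches are summed."""
    complement = tuple(1 - a for a in haplotype)
    return sum(min(mismatches(r, haplotype), mismatches(r, complement))
               for r in reads)


def phase_brute_force(reads, n_variants):
    """Try every haplotype; exponential in n_variants, so toy inputs only."""
    best_hap, best_cost = None, None
    for hap in product((0, 1), repeat=n_variants):
        cost = mec_cost(reads, hap)
        if best_cost is None or cost < best_cost:
            best_hap, best_cost = hap, cost
    return best_hap, best_cost


# Made-up reads over four heterozygous variants; the last read carries one
# sequencing error, so no haplotype pair explains all reads perfectly.
EXAMPLE_READS = [
    {0: 0, 1: 0, 2: 1},
    {0: 1, 1: 1, 2: 0},
    {1: 0, 2: 1, 3: 1},
    {1: 1, 2: 0, 3: 0},
    {0: 0, 1: 1},
]

if __name__ == "__main__":
    hap, cost = phase_brute_force(EXAMPLE_READS, 4)
    print("haplotype:", hap, "complement:", tuple(1 - a for a in hap))
    print("MEC cost:", cost)
```

    A haplotype and its complement always have the same cost, so the enumeration keeps whichever of the pair it meets first. Practical methods avoid the exponential search, for example with dynamic programming over read coverage.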