
    The Parallelism Motifs of Genomic Data Analysis

    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some genomic data analysis problems require large-scale computational platforms to meet both their memory and computational requirements. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today, and they place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns, or motifs, that help inform parallelization strategies, and we compare our motifs to established lists, arguing that at least two key patterns, sorting and hashing, are missing.
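    A minimal sketch (not from the paper) of the hashing motif named above: k-mer counting with a hash table, where in a distributed setting the hash of each k-mer would also select the process that owns its counter, producing exactly the irregular, asynchronous updates to shared data structures that the abstract describes. All names and values below are illustrative assumptions.

    # Sketch of the "hashing" motif in genomic data analysis.
    from collections import defaultdict

    def count_kmers(reads, k=21):
        counts = defaultdict(int)           # shared structure updated per k-mer
        for read in reads:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1  # hash-based lookup and update
        return counts

    def owner(kmer, nprocs):
        # In an MPI/PGAS implementation, this hash would route the update to the
        # remote process that owns the k-mer's counter (an asynchronous update).
        return hash(kmer) % nprocs

    if __name__ == "__main__":
        reads = ["ACGTACGTACGTACGTACGTAGG", "GGACGTACGTACGTACGTACGTA"]
        counts = count_kmers(reads, k=21)
        print(len(counts), owner("ACGTACGTACGTACGTACGTA", nprocs=4))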

    Host suppression and bioinformatics for sequence-based characterization of unknown pathogens.

    Bioweapons and emerging infectious diseases pose formidable and growing threats to our national security. Rapid advances in biotechnology and the increasing efficiency of global transportation networks virtually guarantee that the United States will face potentially devastating infectious disease outbreaks caused by novel ('unknown') pathogens either intentionally or accidentally introduced into the population. Unfortunately, our nation's biodefense and public health infrastructure is primarily designed to handle previously characterized ('known') pathogens. While modern DNA assays can identify known pathogens quickly, identifying unknown pathogens currently depends upon slow, classical microbiological methods of isolation and culture that can take weeks to produce actionable information. In many scenarios that delay would be costly in terms of casualties and economic damage; indeed, it can mean the difference between a manageable public health incident and a full-blown epidemic. To close this gap in our nation's biodefense capability, we will develop, validate, and optimize a system to extract nucleic acids from unknown pathogens present in clinical samples drawn from infected patients. This system will extract nucleic acids from a clinical sample and amplify pathogen and specific host-response nucleic acid sequences. These sequences will then be suitable for ultra-high-throughput sequencing (UHTS) carried out by a third party. The data generated from UHTS will then be processed through a new data assimilation and bioinformatic analysis pipeline that will allow us to characterize an unknown pathogen in hours to days instead of weeks to months. Our methods will require no a priori knowledge of the pathogen, and no isolation or culturing; therefore, they will circumvent many of the major roadblocks confronting a clinical microbiologist or virologist when presented with an unknown or engineered pathogen.

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining extensive amounts of short or long DNA sequence reads from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computing capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and little attention has been paid to non-standard mapping approaches. Here, we propound so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which maintains statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments to disk. Metagenomic classification of NGS reads is another major topic studied in this thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly to estimate the relative abundance of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve classification accuracy. We provide Seed-Kraken, a spaced-seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, which yields a much smaller and more informative index than Kraken. We also provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
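    As a rough illustration of the spaced-seed idea evaluated with Seed-Kraken, the hedged sketch below applies a hypothetical seed pattern (the thesis's actual patterns are not reproduced here): only positions marked '1' contribute to the lookup key, so a sequencing error at a '0' position leaves the key, and hence the classification lookup, unchanged.

    # Hypothetical spaced seed of length 16 and weight 11 ('1' = match position).
    SEED = "1101101101101101"

    def spaced_key(window, seed=SEED):
        # Keep only the bases at '1' positions; '0' positions are "don't care".
        assert len(window) == len(seed)
        return "".join(b for b, s in zip(window, seed) if s == "1")

    if __name__ == "__main__":
        ref  = "ACGTACGTACGTACGT"
        read = "ACGTAAGTACGTACGT"   # one mismatch, at position 5, a '0' in the seed
        print(spaced_key(ref) == spaced_key(read))   # True: same key despite the error
        print(ref == read)                           # False: a contiguous seed would miss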

    Efficient Alignment Algorithms for DNA Sequencing Data

    Next-Generation Sequencing (NGS) technologies produce DNA data at low cost, enabling their application to many ambitious fields such as cancer research, disease control, and personalized medicine. However, even after a decade of research, modern aligners and assemblers are far from providing efficient and error-free genome alignments and assemblies, respectively. This is due to the inherent nature of the genome alignment and assembly problems, which involve many complexities. Many algorithms to address these problems have been proposed over the years, but there is still considerable scope for improvement in this research space. New genome alignment algorithms continue to be proposed, and one of the key differentiators among them is the efficiency of the alignment process. I present a new algorithm for efficiently finding Maximal Exact Matches (MEMs) between two genomes: E-MEM (Efficient computation of maximal exact matches for very large genomes). Computing MEMs is one of the most time-consuming steps of the alignment process. E-MEM can be used to find MEMs that serve as seeds in a genome aligner, increasing its efficiency. The E-MEM program is, as of today, the most efficient algorithm for computing MEMs, surpassing all competition by large margins. There are many genome assembly algorithms available, but none produces perfect assemblies. It is important that the assemblies produced by these algorithms are evaluated accurately and efficiently, so that the right genome assembler is chosen for downstream research and analysis. A fast genome assembly evaluator is also a key factor when a new genome assembler is developed, to quickly evaluate the outcome of the algorithm. I present a fast and efficient genome assembly evaluator called LASER (Large genome ASsembly EvaluatoR), which is based on the leading genome assembly evaluator QUAST but is significantly more efficient in terms of both memory and run time. NGS technologies limit the potential of genome assembly algorithms because of short read lengths and non-uniform coverage. Recently, third-generation sequencing technologies have been introduced, promising very long reads and uniform coverage. However, this technology comes with its own drawback: a high error rate of 10-15%, consisting mostly of indels. Long-read sequencing data are useful only after error correction, which is obtained using self read alignment (or read overlapping) techniques. I propose a new self read alignment algorithm for Pacific Biosciences sequencing data: HISEA (Hierarchical SEed Aligner), which has very high sensitivity and precision compared to other state-of-the-art aligners. HISEA is also integrated into the Canu assembly pipeline. Canu+HISEA produces better assemblies than Canu with its default aligner, MHAP, at a much lower coverage.
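    The following is a deliberately simple sketch of MEM finding by k-mer anchoring and bidirectional extension, intended only to make the concept concrete; E-MEM itself relies on far more memory- and time-efficient techniques than this quadratic toy version.

    # Toy MEM finder: anchor on exact k-mer hits, extend both ways, deduplicate.
    from collections import defaultdict

    def find_mems(ref, qry, min_len=10):
        index = defaultdict(list)
        for i in range(len(ref) - min_len + 1):
            index[ref[i:i + min_len]].append(i)

        mems = set()
        for j in range(len(qry) - min_len + 1):
            for i in index.get(qry[j:j + min_len], []):
                a, b = i, j
                while a > 0 and b > 0 and ref[a - 1] == qry[b - 1]:   # extend left
                    a, b = a - 1, b - 1
                x, y = i + min_len, j + min_len
                while x < len(ref) and y < len(qry) and ref[x] == qry[y]:  # extend right
                    x, y = x + 1, y + 1
                mems.add((a, b, x - a))   # (ref_pos, qry_pos, length), maximal both ways
        return sorted(mems)

    if __name__ == "__main__":
        print(find_mems("ACGTACGTTTGGACGT" * 3, "TTGGACGTAACGTACG", min_len=8))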

    Algorithm-Hardware Co-Design for Performance-driven Embedded Genomics

    Genomics includes the development of techniques for the diagnosis, prognosis, and therapy of over 6,000 known genetic disorders. It is a major driver in the transformation of medicine from its reactive form to the personalized, predictive, preventive, and participatory (P4) form. The availability of a genome is an essential prerequisite for genomics and is obtained from the sequencing and analysis pipelines of whole genome sequencing (WGS). The advent of second-generation sequencing (SGS) significantly reduced sequencing costs, leading to voluminous research in genomics. SGS technologies, however, generate massive volumes of data in the form of reads, which are fragments of the real genome. The performance requirements associated with mapping reads to the reference genome (RG), in order to reassemble the original genome, now stand disproportionate to the available computational capabilities. Conventionally, the hardware resources used are homogeneous many-core architectures employing complex general-purpose CPU cores. Although these cores provide high performance, a data-centric approach is required to identify alternative hardware systems more suitable for affordable and sustainable genome analysis. Most state-of-the-art genomic tools are performance oriented and do not address the crucial aspect of energy consumption. Although algorithmic innovations have reduced runtime on conventional hardware, energy consumption has scaled poorly. The associated monetary and environmental costs have made it a major bottleneck to translational genomics. This thesis is concerned with the development and validation of read mappers for the embedded genomics paradigm, aiming to provide a portable and energy-efficient hardware solution to the reassembly pipeline. It applies the algorithm-hardware co-design approach to bridge the saturation point reached in algorithmic innovations with emerging low-power, energy-efficient heterogeneous embedded platforms. Essential to the embedded paradigm is the ability to use heterogeneous hardware resources. Graphics processing units (GPUs) are often available in modern devices alongside CPUs, but state-of-the-art read mappers are conventionally not tuned to use both together. The first part of the thesis develops a Cross-platfOrm Read mApper using opencL (CORAL) that can distribute the workload across all available devices for high performance. The OpenCL framework mitigates the need to design separate kernels for CPU and GPU. CORAL implements a verification-aware filtration algorithm for rapid pruning and identification of candidate locations for mapping reads to the RG. Mapping reads on embedded platforms decreases performance due to architectural differences such as limited on-chip/off-chip memory, smaller bandwidths, and simpler cores. To mitigate this performance degradation, the second part of the thesis proposes a REad maPper for heterogeneoUs sysTEms (REPUTE), which uses an efficient dynamic programming (DP) based filtration methodology. Using algorithm-hardware co-design and kernel-level optimizations to reduce its memory footprint, REPUTE demonstrated significant energy savings on the HiKey970 embedded platform with acceptable performance. The third part of the thesis concentrates on mapping the whole genome on an embedded platform. We propose a Pyopencl based tooL for gEnomic workloaDs tarGeting Embedded platfoRms (PLEDGER), which includes two novel contributions. The first is a preprocessing strategy that generates a low-memory-footprint (LMF) data structure able to fit all human chromosomes, at some cost in performance. The second is an LMF DP-based filtration method that works in conjunction with the proposed data structures. To mitigate performance degradation, the kernel employs several optimisations, including extensive use of bit-vector operations. Extensive experiments using real human reads were carried out with state-of-the-art read mappers on five different platforms for CORAL, REPUTE, and PLEDGER. The results show that embedded genomics provides significant energy savings with performance similar to conventional CPU-based platforms.
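    As a hedged illustration of the bit-vector filtration mentioned above (not the actual CORAL/REPUTE/PLEDGER kernels, which are written in OpenCL/PyOpenCL), the sketch below packs bases two bits apiece and uses an XOR plus a population count to reject candidate mapping locations whose mismatch count exceeds an error budget.

    # Bit-vector mismatch filter: XOR of 2-bit-packed sequences + popcount.
    ENC = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

    def pack(seq):
        v = 0
        for base in seq:
            v = (v << 2) | ENC[base]
        return v

    def mismatches(a, b, length):
        x = pack(a) ^ pack(b)
        # Collapse each 2-bit base pair to one bit, then count the set bits.
        low_bits = int("01" * length, 2)
        collapsed = (x | (x >> 1)) & low_bits
        return bin(collapsed).count("1")

    def passes_filter(read, window, max_errors=2):
        # Candidate locations above the error budget are discarded before alignment.
        return mismatches(read, window, len(read)) <= max_errors

    if __name__ == "__main__":
        print(passes_filter("ACGTACGT", "ACGTTCGT"))   # 1 mismatch -> True
        print(passes_filter("ACGTACGT", "TTTTACGT"))   # 3 mismatches -> False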

    Insights into genomics and gene expression of oleaginous yeasts of the genus Rhodotorula

    The Rhodotorula glutinis cluster includes oleaginous yeast species with high biotechnological potential as cell factories for lipid, carotenoid, and oleochemical production. They can grow on various low-cost residues, such as lignocellulose and crude glycerol. The accumulated microbial lipids have a fatty acid composition similar to vegetable oils, representing a sustainable alternative in the food, feed, and biofuel industries. Strain manipulation could further increase their economic attractiveness. A basic prerequisite for this is a detailed understanding of the genome organization of Rhodotorula spp. and of lipid and carotenoid metabolism on residual products. This doctoral thesis makes a substantial contribution to that goal. Hybrid genome assemblies of R. toruloides, R. glutinis, and R. babjevae were built de novo using a combination of short- and long-read sequencing. In this way, high-quality genomes could be achieved, showing high completeness and contiguity (near-chromosome level). Among other things, this enabled the prediction of ploidy and the discovery of extrachromosomal structures. Phylogenetic analyses revealed high intraspecific divergence in R. babjevae. Using RNA-seq data generated from different growth conditions, the gene annotation for R. toruloides could be significantly expanded, and comprehensive gene expression data could be obtained. These data allowed conclusions to be drawn about the activation and mechanism of lipid accumulation in R. toruloides grown on residual products such as crude glycerol and hemicellulosic hydrolysate.

    Single-cell strand sequencing for structural variant analysis and genome assembly

    Rapid advances in DNA sequencing technologies and the development of computational tools to analyze sequencing data have started a revolution in the field of genetics. DNA sequencing has applications in medical research, disease diagnosis and treatment, and population genetic studies. Different sequencing techniques have their own advantages and limitations, and they can be used together for genome assembly and genetic variant detection. The focus of this thesis is on a specific single-cell sequencing technology called strand sequencing (Strand-seq). With its chromosome- and haplotype-specific strand information, this technique provides very powerful signals for the discovery of genomic structural variations, haplotype phasing, and chromosome clustering. We developed statistical and computational tools to exploit this information from the strand sequencing technology. I first present a computational framework for detecting structural variations in single cells using strand sequencing data. The presented tool is able to detect different types of structural variations in single cells, including copy number variations, inversions, and inverted duplications, as well as more complex biological events such as translocations and breakage-fusion-bridge (BFB) cycles. These variations and genomic rearrangements have been observed in cancer; therefore the discovery of such events within cell populations can lead to a more accurate picture of cancer genomes and help in diagnosis. In the remainder of this thesis, I elaborate on two computational pipelines for clustering long DNA sequences by their original chromosome and haplotype in the absence of a reference genome. These pipelines were developed to facilitate genome assembly and de novo haplotype phasing in a fast and accurate manner. The resulting haplotype assemblies can be useful for studying genomic variations without reference bias, gaining insights into population genetics, and detecting compound heterozygosity.
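    A toy sketch of the Strand-seq signal the thesis builds on (not its actual statistical model): reads in a single cell inherit the orientation of the template strand, so within, for example, a heterozygous inversion the fraction of forward-mapping reads in a genomic bin shifts away from the chromosome-wide strand state. All bin sizes and thresholds below are illustrative assumptions.

    # Per-bin strand-state signal from single-cell strand sequencing reads.
    def strand_fractions(reads, bin_size=1_000_000):
        # reads: iterable of (position, is_forward) for one chromosome in one cell
        bins = {}
        for pos, is_forward in reads:
            b = pos // bin_size
            fwd, total = bins.get(b, (0, 0))
            bins[b] = (fwd + int(is_forward), total + 1)
        return {b: fwd / total for b, (fwd, total) in bins.items()}

    def flag_state_switches(fractions, expected=1.0, tolerance=0.2):
        # Flag bins whose strand state departs from the expected chromosome-wide state.
        return [b for b, f in sorted(fractions.items()) if abs(f - expected) > tolerance]

    if __name__ == "__main__":
        import random
        random.seed(0)
        reads = [(random.randrange(0, 10_000_000), True) for _ in range(2000)]
        # Simulate a heterozygous inversion between 4 and 6 Mb: ~half of those reads flip.
        reads += [(random.randrange(4_000_000, 6_000_000), random.random() < 0.5)
                  for _ in range(500)]
        print(flag_state_switches(strand_fractions(reads)))   # bins 4 and 5 stand out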

    Computational Methods for Structural Variation Analysis in Populations

    Recent advances in long-read sequencing have given us an unprecedented view of structural variants (SVs). However, much of their role in disease and evolution remains unknown due to a number of technical and biological challenges, including the high error rate of most long-read sequencing data, the additional complexity of aligning around large variants, and biological differences in how the same SV can manifest in different individuals. In this thesis, we introduce novel methods for structural variant analysis and demonstrate how they overcome many of these obstacles. First, we apply recent advances in data structures to the substring search problem and show how learned index structures can enable accelerated alignment of genomic reads. Next, we present an optimized SV calling pipeline that integrates improvements to existing software alongside two novel SV-processing methods, Iris and Jasmine, which improve the accuracy of SV breakpoints and sequences in individual samples and compare and integrate SV calls from multiple samples. Finally, we show how the introduction of CHM13, the first gap-free telomere-to-telomere human reference genome, enables for the first time variant calling in over 100 Mbp of newly resolved sequence and mitigates long-standing issues in variant calling that were attributed to gaps, errors, and minor alleles in the prior GRCh38 reference. We demonstrate the broad applicability of our advancements in SV inference by uncovering novel associations with gene expression in 444 human individuals from the 1000 Genomes Project, by detecting SVs in the tomato genome that affect fruit size and yield, and by comparing SVs between tumor and normal cells in organoids derived from the SKBR3 breast cancer cell line.
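    As a hedged sketch of the learned-index idea mentioned above (much simpler than the structure used for accelerated alignment in the thesis), the code below encodes every k-mer of a reference as an integer, fits a crude linear model from k-mer value to rank in the sorted list, and then answers substring queries by searching only a small window around the model's prediction.

    # Learned-index sketch for k-mer lookup: predict rank, then do a bounded search.
    import bisect

    ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

    def encode(kmer):
        v = 0
        for b in kmer:
            v = v * 4 + ENC[b]
        return v

    def build(ref, k=8):
        entries = sorted((encode(ref[i:i + k]), i) for i in range(len(ref) - k + 1))
        keys = [e[0] for e in entries]
        n = len(keys)
        # Least-squares fit of rank ~ a*key + b (a crude stand-in for a learned model).
        mean_k = sum(keys) / n
        mean_r = (n - 1) / 2
        var = sum((x - mean_k) ** 2 for x in keys) or 1
        a = sum((x - mean_k) * (r - mean_r) for r, x in enumerate(keys)) / var
        b = mean_r - a * mean_k
        err = max(abs(r - (a * x + b)) for r, x in enumerate(keys))
        return entries, keys, (a, b, int(err) + 2)   # +2: slack for rounding

    def lookup(query_kmer, index):
        entries, keys, (a, b, err) = index
        q = encode(query_kmer)
        guess = int(a * q + b)
        lo = max(0, guess - err)
        hi = min(len(keys), guess + err + 1)
        j = bisect.bisect_left(keys, q, lo, hi)   # search only near the prediction
        if j < len(keys) and keys[j] == q:
            return [pos for val, pos in entries[j:] if val == q]
        return []

    if __name__ == "__main__":
        ref = "ACGTACGTGGACGTTACGAAACGTACGT"
        idx = build(ref, k=8)
        print(lookup("ACGTACGT", idx))   # reference positions of this 8-mer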

    Probabilistic methods for quality improvement in high-throughput sequencing data

    Advances in high-throughput next-generation sequencing (NGS) technologies have enabled the determination of millions of nucleotide sequences in massive parallelism at affordable cost. Many studies have shown that sequencing data produced by mainstream next-generation sequencing platforms have higher error rates than Sanger sequencing, and have demonstrated the negative impact of sequencing errors on a wide range of NGS applications. Thus, it is critically important for the primary analysis of sequencing data to produce accurate, high-quality nucleotides for downstream bioinformatics pipelines. Two bioinformatics problems are dedicated to the direct removal of sequencing errors: base-calling and error correction. However, existing error correction methods are mostly algorithmic and heuristic. Few methods can address insertion and deletion errors, the dominant error type produced by many platforms. On the other hand, most base-callers do not model the underlying genome structure of the sequencing data, which is necessary for improving base-calling quality, especially in low-quality regions. The sequential application of a base-caller and an error corrector does not fully offset their shortcomings. In recognition of these issues, in this dissertation we propose a probabilistic framework that closely emulates the sequencing-by-synthesis (SBS) process adopted by many NGS platforms. The core idea is to model sequencing data (individual reads, or fluorescent intensities) as independent emissions from a hidden Markov model (HMM), with transition distributions that model local and double-stranded dependence in the genome, and emission distributions that model the subtle error characteristics of the sequencers. Building on this backbone, we develop three novel methods for improving the data quality of high-throughput sequencing: 1) PREMIER, an accurate probabilistic corrector of substitution errors in Illumina data; 2) PREMIER-bc, an integrated base-caller and error corrector that significantly improves base-calling quality; and 3) PREMIER-indel, an extended error correction method that addresses substitution, insertion, and deletion errors for SBS-based sequencers with good empirical performance. Our foray into probabilistic methods for base-calling and error correction provides immediate benefits to downstream analyses through increased sequencing data quality and, more importantly, a flexible and fully probabilistic basis for going beyond primary analysis.
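    The sketch below is a toy instance of the probabilistic backbone described above: a Viterbi decoder over an HMM whose states are true bases, whose transitions come from a (here, uniform) genome model, and whose emissions encode an assumed substitution-error rate. PREMIER's actual models use k-mer states and learned, position-dependent error distributions; every number here is an illustrative assumption.

    # Toy Viterbi decoding of the most likely true sequence underlying a read.
    import math

    BASES = "ACGT"
    ERR = 0.01                                   # assumed substitution error rate

    def emission(true_base, observed_base):
        return 1 - ERR if true_base == observed_base else ERR / 3

    def viterbi(read, transition):
        # transition[(a, b)] = P(next true base is b | current true base is a)
        score = {b: math.log(0.25) + math.log(emission(b, read[0])) for b in BASES}
        back = []
        for obs in read[1:]:
            new_score, pointers = {}, {}
            for b in BASES:
                prev, s = max(
                    ((a, score[a] + math.log(transition[(a, b)])) for a in BASES),
                    key=lambda t: t[1])
                new_score[b] = s + math.log(emission(b, obs))
                pointers[b] = prev
            score, back = new_score, back + [pointers]
        # Backtrack the most likely sequence of true bases.
        state = max(score, key=score.get)
        path = [state]
        for pointers in reversed(back):
            state = pointers[state]
            path.append(state)
        return "".join(reversed(path))

    if __name__ == "__main__":
        uniform = {(a, b): 0.25 for a in BASES for b in BASES}
        # With uniform transitions the decoder simply returns the read; a genome-aware
        # transition model is what lets the HMM override unreliable observations.
        print(viterbi("ACGTACGT", uniform))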

    Toward reliable biomarker signatures in the age of liquid biopsies - how to standardize the small RNA-Seq workflow

    Small RNA-Seq has emerged as a powerful tool in transcriptomics, gene expression profiling and biomarker discovery. Sequencing cell-free nucleic acids, particularly microRNA (miRNA), from liquid biopsies additionally provides exciting possibilities for molecular diagnostics, and might help establish disease-specific biomarker signatures. The complexity of the small RNA-Seq workflow, however, bears challenges and biases that researchers need to be aware of in order to generate high-quality data. Rigorous standardization and extensive validation are required to guarantee reliability, reproducibility and comparability of research findings. Hypotheses based on flawed experimental conditions can be inconsistent and even misleading. Comparable to the well-established MIQE guidelines for qPCR experiments, this work aims at establishing guidelines for experimental design and pre-analytical sample processing, standardization of library preparation and sequencing reactions, as well as facilitating data analysis. We highlight bottlenecks in small RNA-Seq experiments, point out the importance of stringent quality control and validation, and provide a primer for differential expression analysis and biomarker discovery. Following our recommendations will encourage better sequencing practice, increase experimental transparency and lead to more reproducible small RNA-Seq results. This will ultimately enhance the validity of biomarker signatures, and allow reliable and robust clinical predictions.