
    De Novo Assembly of Nucleotide Sequences in a Compressed Feature Space

    Sequencing technologies allow for an in-depth analysis of biological species, but the size of the generated datasets introduces a number of analytical challenges. Recently, we demonstrated the application of numerical sequence representations and data transformations for the alignment of short reads to a reference genome. Here, we expand our approach to de novo assembly of short reads. Our results demonstrate that highly compressed data can encapsulate the signal sufficiently well to accurately assemble reads into large contigs or complete genomes.
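
    To make the idea of a "compressed feature space" concrete, here is a minimal sketch of one way reads can be encoded numerically and compressed with a transform. The nucleotide-to-number encoding, the discrete cosine transform, and the function names are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch of compressing reads into a numeric feature space.
# The A/C/G/T encoding and the DCT are illustrative assumptions,
# not necessarily the transform used in the paper.
import numpy as np
from scipy.fft import dct

ENCODING = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}

def compress_read(read: str, n_coeffs: int = 16) -> np.ndarray:
    """Encode a read numerically and keep the lowest-frequency
    DCT coefficients as a compressed feature vector."""
    signal = np.array([ENCODING[b] for b in read], dtype=float)
    coeffs = dct(signal, norm="ortho")
    return coeffs[:n_coeffs]

def feature_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance in the compressed space, used as a cheap
    proxy for sequence similarity when comparing reads."""
    return float(np.linalg.norm(a - b))

if __name__ == "__main__":
    r1 = "ACGTACGTACGGTTAACCGGTT"
    r2 = "ACGTACGTACGGTTAACCGGAA"  # differs only in the last two bases
    print(feature_distance(compress_read(r1), compress_read(r2)))
```

    Similar reads yield nearby feature vectors, so overlap candidates for assembly can be found by comparing short coefficient vectors instead of full sequences.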

    Long read mapping at scale: Algorithms and applications

    The capability to sequence DNA has been around for four decades now, providing ample time to explore its myriad applications and the concomitant development of bioinformatics methods to support them. Nevertheless, disruptive technological changes in sequencing often upend prevailing protocols and the characteristics of what can be sequenced, necessitating a new direction of development for bioinformatics algorithms and software. We are now at the cusp of the next revolution in sequencing due to the development of long and ultra-long read sequencing technologies by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). Long reads are attractive because they narrow the scale gap between the sizes of genomes and the sizes of sequenced reads, with the promise of avoiding the assembly errors and repeat-resolution challenges that plague short-read assemblers. However, long reads themselves have error rates in the vicinity of 10-15%, compared to the high accuracy of short reads (< 1%). There is an urgent need to develop bioinformatics methods to fully realize the potential of long-read sequencers. Mapping and alignment of reads to a reference is typically the first step in genomics applications. Though long-read technologies are still evolving, research efforts in bioinformatics have already produced many alignment-based and alignment-free read-mapping algorithms. Yet much work lies ahead in designing provably efficient algorithms, formally characterizing the quality of results, and developing methods that scale to larger input datasets and growing reference databases. While the current model that represents the reference as a collection of linear genomes is still favored for its simplicity, mapping to graph-based representations, where the graph encodes the genetic variation of a human population, is also becoming imperative. This dissertation focuses on provably good and scalable algorithms for mapping long reads to both linear and graph references. We make the following contributions:
    1. We develop fast and approximate algorithms for end-to-end and split mapping of long reads to reference genomes. Our work is the first to demonstrate scaling to the entire NCBI database, the collection of all curated and non-redundant genomes.
    2. We generalize the mapping algorithm to accelerate the related problems of computing pairwise whole-genome comparisons, shedding light on two fundamental biological questions concerning genomic duplications and the delineation of microbial species boundaries.
    3. We provide new complexity results for aligning reads to graphs under the Hamming and edit distance models, classifying the problem variants for which a polynomial-time solution is unlikely to exist. In contrast to prior results that assume an alphabet whose size grows with the problem size, we prove that the variants allowing edits in the graph remain NP-complete even for constant-sized alphabets, thereby resolving the computational complexity of the problem for DNA and protein sequence-to-graph alignments.
    4. Finally, we propose a new parallel algorithm to optimally align long reads to large variation graphs derived from human genomes. It demonstrates near-linear scaling on multi-core CPUs, reducing run time from multiple days to three hours when aligning a long-read set to an MHC human variation graph.
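
    A common building block for fast, approximate long-read mapping of the kind described here is alignment-free seeding with minimizers. The sketch below illustrates that general technique under assumed parameters (k-mer length, window size, Python's built-in hash); it is a simplified stand-in, not the dissertation's actual algorithm.

```python
# Hedged sketch of minimizer-based seeding for approximate read mapping.
# k, w, and the hash function are illustrative choices; real mappers use
# robust hashes and estimate similarity from winnowed sketches.
from typing import List, Set

def kmers(seq: str, k: int) -> List[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def minimizers(seq: str, k: int = 5, w: int = 4) -> Set[int]:
    """In every window of w consecutive k-mers, keep the k-mer with
    the smallest hash; the resulting set is a compact sketch."""
    hashes = [hash(km) for km in kmers(seq, k)]
    mins = set()
    for i in range(len(hashes) - w + 1):
        mins.add(min(hashes[i:i + w]))
    return mins

def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two minimizer sets; a high value suggests
    the read likely maps to this reference window."""
    return len(a & b) / len(a | b) if a | b else 0.0

if __name__ == "__main__":
    ref_window = "ACGTTGCAACGTTGCAACGTTGCA"
    read = "ACGTTGCAACGTTGCA"
    print(jaccard(minimizers(ref_window), minimizers(read)))
```

    Because the sketch is much smaller than the sequence, candidate reference windows can be scored quickly even against databases as large as all of NCBI, with full alignment reserved for the best candidates.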

    Entropy-scaling search of massive biological data

    Many datasets exhibit a well-defined structure that can be exploited to design faster search tools, but it is not always clear when such acceleration is possible. Here, we introduce a framework for similarity search based on characterizing a dataset's entropy and fractal dimension. We prove that search scales in time with metric entropy (the number of covering hyperspheres) if the fractal dimension of the dataset is low, and scales in space with the sum of metric entropy and information-theoretic entropy (the randomness of the data). Using these ideas, we present accelerated versions of standard tools, with no loss in specificity and little loss in sensitivity, for use in three domains: high-throughput drug screening (Ammolite, 150x speedup), metagenomics (MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search (esFragBag, 10x speedup of FragBag). Our framework can be used to achieve "compressive omics," and the general theory can be readily applied to data science problems outside of biology.
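
    The core mechanism behind entropy-scaling search, a coarse pass over covering hyperspheres followed by a fine pass inside candidate spheres, can be sketched in a few lines. The greedy covering construction, radii, and Euclidean metric below are illustrative assumptions chosen for brevity.

```python
# Sketch of coarse-then-fine search over a metric covering: cluster
# points into balls of radius cluster_r, then scan only clusters whose
# center lies within query_r + cluster_r of the query. The triangle
# inequality guarantees no true hit is missed.
import numpy as np

def build_covering(points: np.ndarray, cluster_r: float):
    """Greedy covering: each point joins the first center within
    cluster_r of it, otherwise it becomes a new center."""
    centers, members = [], []
    for idx, p in enumerate(points):
        for c_idx, c in enumerate(centers):
            if np.linalg.norm(p - c) <= cluster_r:
                members[c_idx].append(idx)
                break
        else:
            centers.append(p)
            members.append([idx])
    return centers, members

def range_search(query, query_r, points, centers, members, cluster_r):
    """Coarse pass over centers, fine pass inside candidate balls."""
    hits = []
    for c_idx, c in enumerate(centers):
        if np.linalg.norm(query - c) <= query_r + cluster_r:
            for idx in members[c_idx]:
                if np.linalg.norm(query - points[idx]) <= query_r:
                    hits.append(idx)
    return hits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.normal(size=(1000, 8))
    centers, members = build_covering(pts, cluster_r=2.0)
    print(len(centers), range_search(pts[0], 0.5, pts, centers, members, 2.0))
```

    The coarse pass touches one entry per covering sphere, which is why query time tracks the metric entropy of the dataset rather than its raw size.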

    Hobbes: optimized gram-based methods for efficient read alignment

    Recent advances in sequencing technology have enabled the rapid generation of billions of bases at relatively low cost. A crucial first step in many sequencing applications is to map those reads to a reference genome. However, when the reference genome is large, finding accurate mappings poses a significant computational challenge due to the sheer number of reads and because many reads map to the reference sequence approximately but not exactly. We introduce Hobbes, a new gram-based program for aligning short reads that supports both Hamming and edit distance. Hobbes implements two novel techniques that yield substantial performance improvements: an optimized gram-selection procedure for reads and a cache-efficient filter for pruning candidate mappings. We systematically tested the performance of Hobbes on both real and simulated data with read lengths varying from 35 to 100 bp, and compared its performance with several state-of-the-art read-mapping programs, including Bowtie, BWA, mrsFast and RazerS. Hobbes is faster than all other read-mapping programs we tested while maintaining high mapping quality. When asked to find all mapping locations of a read in the human genome within a given Hamming or edit distance, Hobbes is about five times faster than Bowtie and about 2–10 times faster than BWA, depending on read length and error rate. Hobbes supports the SAM output format and is publicly available at http://hobbes.ics.uci.edu.
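
    The filtering principle underlying gram-based mappers like Hobbes is the pigeonhole argument: a read that aligns with at most k errors must match at least one of k+1 disjoint grams exactly. The sketch below shows that general filter with a naive, evenly spaced gram choice; Hobbes' optimized gram selection and cache-efficient pruning are more sophisticated than this.

```python
# Pigeonhole filter sketch: exact hits of disjoint q-grams generate
# anchored candidate positions; everything else is pruned before the
# expensive alignment verification step. Gram choice here is naive.
from collections import defaultdict
from typing import Dict, List

def index_reference(ref: str, q: int) -> Dict[str, List[int]]:
    """Inverted index from every q-gram of the reference to its positions."""
    index = defaultdict(list)
    for i in range(len(ref) - q + 1):
        index[ref[i:i + q]].append(i)
    return index

def candidate_positions(read: str, k: int, index, q: int) -> set:
    """Split the read into k+1 disjoint grams; any exact gram hit
    proposes a candidate read start (verified afterwards)."""
    candidates = set()
    for j in range(k + 1):
        offset = j * q
        gram = read[offset:offset + q]
        if len(gram) < q:
            break
        for pos in index.get(gram, []):
            candidates.add(pos - offset)  # may be negative; verification discards
    return candidates

if __name__ == "__main__":
    ref = "GATTACAGATTACAGATTACA"
    read = "GATTACAGATTACA"
    q, k = 4, 2  # gram length 4, up to 2 errors; read must span (k+1)*q bases
    idx = index_reference(ref, q)
    print(sorted(candidate_positions(read, k, idx, q)))
```

    Each surviving candidate is then verified with a Hamming- or edit-distance check, so the gram selection directly controls how many expensive verifications are performed.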

    Technology dictates algorithms: Recent developments in read alignment

    Massively parallel sequencing techniques have revolutionized the biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Modern sequencing platforms generate enormous amounts of genomic data in the form of nucleotide sequences, or reads. Aligning reads onto reference genomes enables the identification of individual-specific genetic variants and is an essential step of the majority of genomic analysis pipelines. Aligned reads underpin important biological questions, such as detecting mutations that drive various human diseases and complex traits, and identifying the species present in metagenomic samples. The read alignment problem is extremely challenging due to the large size of the analyzed datasets and the numerous technological limitations of sequencing platforms, and researchers have developed novel bioinformatics algorithms to tackle these difficulties. Importantly, computational algorithms have evolved and diversified in accordance with technological advances, leading to today's diverse array of bioinformatics tools. Our review surveys the algorithmic foundations and methodologies of 107 alignment methods published between 1988 and 2020, for both short and long reads. We provide a rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on the speed and efficiency of read aligners. We separately discuss how longer read lengths bring unique advantages and limitations to read alignment techniques. We also discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole-transcriptome, adaptive immune repertoire, and human microbiome studies.
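
    Most of the aligners such surveys cover, whatever their indexing strategy, share a common verification core: a dynamic-programming edit-distance computation between a read and a candidate reference window. Below is a minimal, unbanded version of that recurrence for illustration; production aligners use banded, bit-parallel, or SIMD variants.

```python
# Classic Wagner-Fischer edit distance with two rolling rows:
# d[i][j] = minimum edits to align read[:i] with window[:j].
def edit_distance(read: str, window: str) -> int:
    m, n = len(read), len(window)
    prev = list(range(n + 1))  # aligning an empty read costs j insertions
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if read[i - 1] == window[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion from the read
                          curr[j - 1] + 1,     # insertion into the read
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[n]

if __name__ == "__main__":
    print(edit_distance("ACGTACGT", "ACGTTCGT"))  # one substitution -> 1
```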