    Parallelized short read assembly of large genomes using de Bruijn graphs

    BACKGROUND: Next-generation sequencing technologies have driven an explosive increase in DNA sequencing throughput and have promoted the recent development of de novo short read assemblers. However, existing assemblers require long execution times and a large amount of compute resources to assemble large genomes from large quantities of short reads. RESULTS: We present PASHA, a parallelized short read assembler using de Bruijn graphs, which takes advantage of hybrid computing architectures consisting of both shared-memory multi-core CPUs and distributed-memory compute clusters to gain efficiency and scalability. Evaluation using three small-scale real paired-end datasets shows that PASHA is able to produce more contiguous high-quality assemblies in shorter time compared to three leading assemblers: Velvet, ABySS and SOAPdenovo. PASHA's scalability for large genome datasets is demonstrated with human genome assembly. Compared to ABySS, PASHA achieves competitive assembly quality with faster execution speed on the same compute resources, yielding an NG50 contig size of 503 with the longest correct contig size of 18,252, and an NG50 scaffold size of 2,294. Moreover, the human assembly is completed in about 21 hours with only modest compute resources. CONCLUSIONS: The development of parallel assemblers for large genomes has attracted significant research effort due to the explosive growth in the size of high-throughput short read datasets. By employing hybrid parallelism consisting of multi-threading on multi-core CPUs and message passing on compute clusters, PASHA is able to assemble the human genome with high quality and in reasonable time using modest compute resources.
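
The NG50 values reported in this abstract are a contiguity metric: the length L such that contigs (or scaffolds) of length at least L together cover at least half of the estimated genome size. The following is a minimal sketch of that calculation in Python, not code from PASHA; the function name `ng50` and the example lengths are illustrative assumptions.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    NG50 is the length L such that contigs of length >= L together
    cover at least half of genome_size. Returns 0 when the whole
    assembly covers less than half of the genome.
    """
    covered = 0
    # Walk the contigs from longest to shortest, accumulating coverage.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_size:
            return length
    return 0


# Hypothetical contig lengths, used only to illustrate the metric.
example_lengths = [80, 70, 50, 40, 30, 20]
print(ng50(example_lengths, genome_size=300))
```

Passing the total assembly length as `genome_size` gives the related N50 statistic, which is why NG50 is preferred when assemblies of different total size are compared against the same genome.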

    Extreme Scale De Novo Metagenome Assembly

    Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into an accurate representation of the underlying microbiome's genomes. State-of-the-art tools require large shared-memory machines and cannot handle contemporary metagenome datasets that exceed terabytes in size. In this paper, we introduce the MetaHipMer pipeline, a high-quality and high-performance metagenome assembler that employs an iterative de Bruijn graph approach. MetaHipMer leverages a specialized scaffolding algorithm that produces long scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is end-to-end parallelized using the Unified Parallel C language and therefore can run seamlessly on shared and distributed-memory systems. Experimental results show that MetaHipMer matches or outperforms the state-of-the-art tools in terms of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and is able to assemble previously intractable grand challenge metagenomes. We demonstrate the unprecedented capability of MetaHipMer by computing the first full assembly of the Twitchell Wetlands dataset, consisting of 7.5 billion reads (2.6 TBytes in size). Comment: Accepted to SC1

    MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

    A major challenge in next-generation genome sequencing (NGS) is to assemble massive numbers of overlapping short reads that are randomly sampled from DNA fragments. Assembly depends on a fundamental task shared by many leading assembly algorithms: counting the number of occurrences of k-mers (length-k substrings in sequences). The counting results are critical for many components of assembly (e.g. variant detection and read error correction). For large genomes, the k-mer counting task can easily consume a huge amount of memory, making large-scale parallel assembly on commodity servers impossible. In this paper, we develop MSPKmerCounter, a disk-based approach, to efficiently perform k-mer counting for large genomes using a small amount of memory. Our approach is based on a novel technique called Minimum Substring Partitioning (MSP). MSP breaks short reads into multiple disjoint partitions such that each partition can be loaded into memory and processed individually. By leveraging the overlaps among the k-mers derived from the same short read, MSP can achieve an astonishing compression ratio, so that the I/O cost can be significantly reduced. For the task of k-mer counting, MSPKmerCounter offers a very fast and memory-efficient solution. Experimental results on large real-life short read data sets demonstrate that MSPKmerCounter can achieve better overall performance than state-of-the-art k-mer counting approaches. MSPKmerCounter is available at http://www.cs.ucsb.edu/~yangli/MSPKmerCounte
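
The partitioning idea can be sketched as follows. This is a simplified in-memory illustration in Python, not the MSPKmerCounter implementation: it assumes that consecutive k-mers sharing the same minimum length-p substring are merged into one longer string and routed to a partition chosen from that substring. The parameter names (`p`, `num_partitions`), the CRC32-based `bucket` function, and the use of in-memory lists in place of disk files are all assumptions made for the example, and reverse complements are ignored.

```python
import zlib
from collections import Counter, defaultdict


def minimizer(kmer, p):
    """Return the lexicographically smallest length-p substring of kmer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def bucket(substring, num_partitions):
    """Deterministically map a minimum substring to a partition index."""
    return zlib.crc32(substring.encode()) % num_partitions


def partition_reads(reads, k, p, num_partitions):
    """Cut reads into runs of k-mers sharing a minimizer; group runs by partition."""
    partitions = defaultdict(list)
    for read in reads:
        start, current = 0, None
        for i in range(len(read) - k + 1):
            m = minimizer(read[i:i + k], p)
            if current is None:
                current = m
            elif m != current:
                # k-mers start..i-1 share one minimizer: store them as one string.
                partitions[bucket(current, num_partitions)].append(read[start:i - 1 + k])
                start, current = i, m
        if current is not None:
            partitions[bucket(current, num_partitions)].append(read[start:])
    return partitions


def count_kmers(reads, k, p=3, num_partitions=4):
    """Count k-mers one partition at a time and combine the results."""
    total = Counter()
    for chunk in partition_reads(reads, k, p, num_partitions).values():
        local = Counter()  # only one partition's k-mers are counted at once
        for s in chunk:
            for j in range(len(s) - k + 1):
                local[s[j:j + k]] += 1
        total.update(local)
    return total
```

Because a given k-mer always has the same minimum substring, all of its occurrences land in the same partition, so each partition can be counted independently. Storing a run of overlapping k-mers as one string rather than as separate k-mers is the source of the compression the abstract describes.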

    Next Generation Sequencing and the application of parallelized de novo assemblies to characterize non-model species

    Over the past decade significant advancements have been made in the field of genetics and genomics in terms of Next Generation Sequencing (NGS) and analysis. Massively parallel sequencing platforms are able to generate millions of short (50-100bp) reads, resulting in gigabytes of raw data that can expand exponentially during the analysis process. Due to these advancements, biologists have been able to conduct complex experiments ranging from the characterization of causative genetic mutations conferring diseases to the characterization of pathogen resistance genes in economically valuable crops. The generation of massive amounts of data has resulted in a demand on computer scientists to answer critical questions regarding data storage, management, and manipulation. Computational Problem: One utilization of NGS data allows researchers to develop reference genomes and transcriptomes that serve as a roadmap to biological experiments, illustrating various sequence variants, mutations, and genes. De novo assembly is very computationally intensive: it is an NP-hard problem, as we are searching for the shortest common sequence among a set of reads. The general tree-based computational approach for the construction of a de novo reference assembly includes breaking reads down into short kmer fragments, developing a graph of overlapping kmers, and traversing that graph to find the optimal path based on a variety of metrics. The first assemblers introduced to reconstruct de novo genomes were based on a variety of approaches, including prefix tree-based (2007), overlap-extension (2008) and the de Bruijn graph representation (2001) for assembly (Simpson et al. 2009). However, all of these approaches suffered from the computational time and memory limitations associated with single-threaded processes running on a single processor (Simpson et al. 2009). These approaches have been modified and parallelized in a variety of ways in hopes of finding the most accurate, time-efficient, and space-efficient algorithm. We have developed a massively parallelized random traversal approach that searches for the longest path of overlapping kmers along the graph, representing the most contiguous assembly.
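
The three generic steps named in this abstract (split reads into kmers, link overlapping kmers into a graph, traverse the graph) can be sketched in a few lines of Python. This is a single-threaded, deterministic illustration rather than the parallelized random traversal the authors describe; the function names `build_de_bruijn` and `walk`, the example reads, and the choice of start node are assumptions made for the example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some kmer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # A kmer contributes one edge: its prefix -> its suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk(graph, start):
    """Extend a path from start while the successor is unique and unvisited."""
    path = [start]
    seen = {start}
    node = start
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop instead of looping around a cycle
            break
        path.append(nxt)
        seen.add(nxt)
        node = nxt
    # Spell the sequence: first node, then the last base of each later node.
    return path[0] + "".join(n[-1] for n in path[1:])


# Two hypothetical overlapping reads.
reads = ["ACGTTG", "GTTGCA"]
graph = build_de_bruijn(reads, k=4)
print(walk(graph, "ACG"))  # prints "ACGTTGCA"
```

The walk stops at the first branch, which is where real assemblers must apply the "variety of metrics" mentioned above (coverage, paired-end information, and so on) to choose among alternative paths.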

    Safe and complete contig assembly via omnitigs

    Contig assembly is the first stage that most assemblers solve when reconstructing a genome from a set of reads. Its output consists of contigs -- a set of strings that are promised to appear in any genome that could have generated the reads. From the introduction of contigs 20 years ago, assemblers have tried to obtain longer and longer contigs, but the following question was never solved: given a genome graph G (e.g. a de Bruijn, or a string graph), what are all the strings that can be safely reported from G as contigs? In this paper we finally answer this question, and also give a polynomial time algorithm to find them. Our experiments show that these strings, which we call omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of dbSNP locations have more neighbors in omnitigs than in unitigs. Comment: Full version of the paper in the proceedings of RECOMB 201
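
For context on the baseline used in this comparison, unitigs are commonly defined as the maximal non-branching paths of the assembly graph. The Python sketch below extracts them from a directed graph given as a list of distinct edges. It illustrates unitigs only and is not the omnitig algorithm from the paper; the function name `unitigs` and the toy edge list are assumptions, and isolated cycles are deliberately not reported.

```python
from collections import defaultdict


def unitigs(edges):
    """Return maximal non-branching paths of a directed graph.

    edges is a list of distinct (u, v) pairs. Each returned path is a
    list of nodes whose interior nodes all have exactly one incoming
    and one outgoing edge. Isolated cycles are not reported.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))

    def is_inner(n):
        # An inner node can only sit in the middle of a unitig.
        return indeg[n] == 1 and len(succ[n]) == 1

    paths = []
    for n in nodes:
        if is_inner(n):
            continue
        # Every edge leaving a branching or terminal node starts a unitig.
        for v in list(succ[n]):
            path = [n, v]
            while is_inner(path[-1]):
                path.append(succ[path[-1]][0])
            paths.append(path)
    return paths


# Toy graph: A -> B -> C, then C branches to D and E.
example_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
print(sorted(unitigs(example_edges)))
```

In this toy graph the branch at C ends the first unitig, which illustrates why unitigs tend to be short in repetitive regions.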