
    A new strategy for better genome assembly from very short reads

    Abstract
    Background: With the rapid development of next-generation sequencing (NGS) technology, large quantities of genome sequencing data have been generated. Because of repetitive regions of genomes, among other factors, assembly of very short reads remains a challenging problem.
    Results: A novel strategy for improving genome assembly from very short reads is proposed. It increases the accuracy of assemblies by integrating de novo contigs, and it produces comparative contigs by allowing multiple references, without being limited to genomes of closely related strains. The comparative contigs are then used to scaffold the de novo contigs (a minimal sketch of this step is given below). Using simulated and real datasets, we show that the strategy effectively improves the quality of assemblies of isolated microbial genomes and of metagenomes.
    Conclusions: With more and more reference genomes becoming available, this strategy will be useful for improving the quality of genome assemblies from very short reads. Scripts making the strategy applicable are provided at http://code.google.com/p/cd-hybrid/.
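    The abstract does not spell out how the comparative contigs drive the scaffolding, so the following is only a hedged illustration of the general idea, not the cd-hybrid scripts themselves: de novo contigs are placed along a comparative contig by their alignment coordinates, then ordered and oriented. The Hit record, the min_len cutoff and the scaffold_order function are hypothetical names introduced for this sketch.

        # Hypothetical sketch: order and orient de novo contigs along one
        # comparative contig, given alignment coordinates (e.g. parsed from
        # an aligner's tabular report). Illustration only.
        from dataclasses import dataclass

        @dataclass
        class Hit:
            contig: str      # de novo contig name
            ref_start: int   # start on the comparative contig
            ref_end: int     # end on the comparative contig
            forward: bool    # alignment orientation

        def scaffold_order(hits, min_len=200):
            """Order and orient de novo contigs by their positions on the
            comparative contig, dropping short spurious alignments."""
            kept = [h for h in hits if abs(h.ref_end - h.ref_start) >= min_len]
            kept.sort(key=lambda h: min(h.ref_start, h.ref_end))
            return [(h.contig, '+' if h.forward else '-') for h in kept]

        hits = [Hit("ctg3", 5000, 9000, True),
                Hit("ctg1", 0, 2100, True),
                Hit("ctg7", 2600, 4800, False)]
        print(scaffold_order(hits))
        # [('ctg1', '+'), ('ctg7', '-'), ('ctg3', '+')]

    With several references, the same ordering logic applies per comparative contig after first grouping hits by reference.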

    Computational investigations in eukaryotic genome de novo assembly using short reads.

    Recently, new technologies in molecular biology have enormously improved sequencing data production, making it possible to generate billions of short reads, totaling gigabases of data, per experiment. Prices for sequencing are decreasing rapidly, and experiments that were impossible in the past because of cost are now being executed. Computational methodologies that successfully solved the genome assembly problem for data obtained by the shotgun strategy are now inefficient, and efforts are under way to develop new programs. At present, one established practice for producing quality assemblies is to use paired-end reads to virtually increase the read length, but many other points remain controversial. The works described in the literature basically use two strategies: one based on high coverage [1], the other based on incremental assembly, using the mate pairs with shorter inserts first [2]. Independently of the strategy used, the computational resources demanded are very high. The present computational solutions for de novo genome assembly involve the generation of a graph of some kind [3]; because those graphs use whole reads or k-mers as nodes, and the number of reads is very large, the memory capacity of the computational system is critical. Works in the literature corroborate this idea, showing that multiprocessor computational systems with at least 512 GB of main memory were used in de novo projects on eukaryotes [1,2,3].

    As an example and benchmark source, it is possible to use the Panda project, executed by a research group consortium in China, which generated the de novo genome of the giant panda (Ailuropoda melanoleuca). The project initially produced 231 Gb of raw data, which was reduced to 176 Gb after removing low-quality and duplicated reads; only 134 Gb were used in the de novo assembly process. Those bases were distributed across approximately 3 billion short reads. After assembly, 200,604 contigs were generated, and 5,701 multi-contig scaffolds were obtained using 124,336 of those contigs. The N50 was 36,728 bp for contigs and 1.22 Mb for scaffolds.

    The present work investigated the computational demands of de novo assembly of eukaryotic genomes by reproducing the results of the Panda project. The strategy used was incremental, as implemented in the SOAPdenovo software, which divides the assembly process into four steps: pregraph, to construct the k-mer graph; contig, to eliminate errors and output contigs; map, to map reads onto the contigs; and scaff, to scaffold the contigs. A NUMA (non-uniform memory access) computational system with 8 six-core processors with hyper-threading technology and 512 GB of RAM (random access memory) was used, and the consumption of resources such as memory and processor time was recorded for every step of the process. The incremental strategy seems practical and can produce effective results. Work is now in progress investigating a new methodology that groups short reads using the concept of entropy; because this methodology starts from the more informative reads, assemblies of better quality may be generated (a minimal sketch of such an entropy ranking follows below).

    References
    [1] Gnerre et al.; High-quality draft assemblies of mammalian genomes from massively parallel sequence data, Proceedings of the National Academy of Sciences USA, v. 108, n. 4, p. 1513-1518, 2010
    [2] Li et al.; The sequence and de novo assembly of the giant panda genome, Nature, v. 463, p. 311-317, 2010
    [3] Schatz et al.; Assembly of large genomes using second-generation sequencing, Genome Research, v. 20, p. 1165-1173, 2010
    X-MEETING 2011
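    Since the entropy methodology is only announced above, the following is a hedged sketch of one plausible reading: rank short reads by the Shannon entropy of their base composition, so that more informative (less repetitive) reads are considered first. The read_entropy function is a name introduced for this illustration.

        # Rank reads by Shannon entropy of base composition (illustration).
        import math
        from collections import Counter

        def read_entropy(read: str) -> float:
            """Shannon entropy, in bits, of a read's base composition."""
            counts = Counter(read.upper())
            n = len(read)
            return -sum((c / n) * math.log2(c / n) for c in counts.values())

        reads = ["AAAAAAAAAA", "ACGTACGTAC", "AACCGGTTAA"]
        for r in sorted(reads, key=read_entropy, reverse=True):
            print(f"{r}  entropy={read_entropy(r):.3f} bits")
        # A homopolymer read scores 0 and sorts last; reads with a balanced
        # base composition sort first.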

    Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions

    Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and to provide portability. However, the technology's high error rates pose a challenge for generating accurate genome assemblies. The tools used for nanopore sequence analysis are therefore of critical importance, as they must overcome these high error rates. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages, and performance bottlenecks; it is important to understand where current tools fall short in order to develop better ones. To this end, we 1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and 2) provide guidelines for determining the appropriate tools for each step. We analyze various combinations of different tools and expose the tradeoffs between accuracy, performance, memory usage and scalability (a hedged sketch of per-step benchmarking is given below). We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of the bottlenecks we have identified, developers can improve the current tools or build new ones that are both accurate and fast, in order to overcome the high error rates of nanopore sequencing technology.
    Comment: To appear in Briefings in Bioinformatics (BIB), 201
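    The abstract does not list the measured tools, so the following is only a hedged sketch of how per-step wall time and memory could be recorded for such a pipeline; the step names and echo commands are placeholders to be replaced with real assembler invocations (Unix only, because of the resource module).

        # Measure wall time and peak child memory for each pipeline step.
        import resource
        import subprocess
        import time

        def run_step(name: str, cmd: list) -> None:
            """Run one pipeline step as a subprocess and report wall time
            and the peak resident set size of child processes."""
            t0 = time.perf_counter()
            subprocess.run(cmd, check=True)
            wall = time.perf_counter() - t0
            # ru_maxrss: maximum RSS over all children so far (KiB on Linux)
            peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
            print(f"{name}: {wall:.2f}s wall, peak child RSS so far ~{peak} KiB")

        steps = [  # hypothetical pipeline; substitute real tools and flags
            ("overlap",  ["echo", "find read overlaps"]),
            ("assembly", ["echo", "build draft assembly"]),
            ("polish",   ["echo", "polish the consensus"]),
        ]
        for name, cmd in steps:
            run_step(name, cmd)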

    MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

    A major challenge in next-generation sequencing (NGS) is to assemble massive numbers of overlapping short reads that are randomly sampled from DNA fragments. To complete assembly, one must finish a task fundamental to many leading assembly algorithms: counting the number of occurrences of k-mers (length-k substrings of the sequences). The counting results are critical for many components of assembly (e.g. variant detection and read error correction). For large genomes, the k-mer counting task can easily consume a huge amount of memory, making large-scale parallel assembly on commodity servers impossible. In this paper, we develop MSPKmerCounter, a disk-based approach that performs k-mer counting for large genomes efficiently using a small amount of memory. Our approach is based on a novel technique called Minimum Substring Partitioning (MSP). MSP breaks short reads into multiple disjoint partitions such that each partition can be loaded into memory and processed individually (a simplified sketch of the partitioning follows below). By leveraging the overlaps among the k-mers derived from the same short read, MSP achieves a high compression ratio, so the I/O cost can be significantly reduced. For the task of k-mer counting, MSPKmerCounter offers a very fast and memory-efficient solution. Experimental results on large real-life short-read data sets demonstrate that MSPKmerCounter achieves better overall performance than state-of-the-art k-mer counting approaches. MSPKmerCounter is available at http://www.cs.ucsb.edu/~yangli/MSPKmerCounte
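    As a hedged illustration of the partitioning idea (in memory, without the disk layout, super-k-mer compression or reverse complements of the real MSPKmerCounter), the sketch below groups the k-mers of each read by their minimum p-substring and counts each partition separately; min_substring, partition_reads and count_kmers are names introduced for this sketch.

        # Toy Minimum Substring Partitioning: k-mers sharing a minimizer
        # land in the same partition, so partitions can be counted one at
        # a time with bounded memory.
        from collections import Counter, defaultdict

        def min_substring(s: str, p: int) -> str:
            """Lexicographically smallest p-substring of s (the minimizer)."""
            return min(s[i:i + p] for i in range(len(s) - p + 1))

        def partition_reads(reads, k=9, p=4):
            parts = defaultdict(list)   # one partition per minimizer
            for read in reads:
                for i in range(len(read) - k + 1):
                    kmer = read[i:i + k]
                    parts[min_substring(kmer, p)].append(kmer)
            return parts

        def count_kmers(reads, k=9, p=4):
            total = Counter()
            for kmers in partition_reads(reads, k, p).values():
                total.update(kmers)     # each partition counted separately
            return total

        reads = ["ACGTACGTACGTAAG", "CGTACGTACGTAAGT"]
        for kmer, n in count_kmers(reads).most_common(3):
            print(kmer, n)

    In the real method, consecutive k-mers of a read that share a minimizer are written to disk once as a single super k-mer, which is where the compression and I/O savings come from.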

    Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly

    Motivation: Eugene Myers, in his string graph paper (Myers, 2005), suggested that in a string graph, or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of the reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of the reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs. Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of the information in the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome (a toy sketch of such alignment-based calling is given below). Applying the method to 35-fold human resequencing data, we show that, in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling, with higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. On the methodological side, we propose the FMD-index for forward-backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches, and one-pass construction of unitigs from an FMD-index. Availability: http://github.com/lh3/fermi Contact: [email protected]
    Comment: Rev2: submitted version with minor improvements; 7 page
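    As a hedged toy of the alignment-based calling described above (not fermi's actual algorithm), the sketch below walks a unitig-to-reference alignment expressed as CIGAR-style operations and emits SNPs and INDELs; call_variants and its tuple encoding are invented for this illustration.

        # Emit variants from one unitig aligned to a reference.
        def call_variants(ref: str, unitig: str, ref_start: int, cigar):
            """cigar: list of (op, length) with op in {'M', 'I', 'D'}.
            'M' may contain mismatches (reported as SNPs); 'I' is an
            insertion in the unitig; 'D' a deletion from the reference."""
            variants = []
            r, q = ref_start, 0
            for op, n in cigar:
                if op == 'M':
                    for j in range(n):
                        if ref[r + j] != unitig[q + j]:
                            variants.append(('SNP', r + j, ref[r + j], unitig[q + j]))
                    r, q = r + n, q + n
                elif op == 'I':
                    variants.append(('INS', r, '', unitig[q:q + n]))
                    q += n
                elif op == 'D':
                    variants.append(('DEL', r, ref[r:r + n], ''))
                    r += n
            return variants

        ref    = "ACGTACGTACGT"
        unitig = "ACGTTCGTCGT"   # one substitution and one 1-bp deletion
        print(call_variants(ref, unitig, 0, [('M', 8), ('D', 1), ('M', 3)]))
        # [('SNP', 4, 'A', 'T'), ('DEL', 8, 'A', '')]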

    Cerulean: A hybrid assembly using high throughput short and long reads

    Genome assembly from high-throughput short-read data arguably remains an unresolved task in repetitive genomes: when the length of a repeat exceeds the read length, it becomes difficult to unambiguously connect the flanking regions. The emergence of third-generation sequencing (Pacific Biosciences) with long reads offers the opportunity to resolve complicated repeats that cannot be resolved with short-read data alone. However, these long reads have a high error rate, and it is an uphill task to assemble a genome from them without using additional high-quality short reads. Recently, Koren et al. (2012) proposed using high-quality short-read data to correct the long reads and thus make assembly from long reads possible. However, due to the large size of both datasets (short and long reads), error correction of the long reads requires excessively high computational resources, even for small bacterial genomes. In this work, instead of error-correcting the long reads, we first assemble the short reads and then map the long reads onto the assembly graph to resolve repeats (a toy sketch of this linking step is given below). Contribution: We present a hybrid assembly approach that is both computationally efficient and produces high-quality assemblies. Our algorithm first operates on a simplified version of the assembly graph consisting only of long contigs, and gradually improves the assembly by adding smaller contigs in each iteration. In contrast to the state-of-the-art long-read error correction technique, which requires high computational resources and long running times on a supercomputer even for bacterial genome datasets, our software can produce a comparable assembly on a standard desktop in a short running time.
    Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013
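    The abstract leaves the graph details to the paper, so the following is only a hedged toy of the linking step: each long read that aligns across two contigs in order votes for an adjacency between them, and adjacencies with enough votes can guide repeat resolution; contig_links and min_votes are names invented for this sketch.

        # Infer contig adjacencies from long-read alignments to contigs.
        from collections import Counter

        def contig_links(alignments, min_votes=2):
            """alignments: {read: [(position_on_read, contig, strand), ...]}.
            Returns contig adjacencies supported by >= min_votes reads."""
            votes = Counter()
            for hits in alignments.values():
                hits = sorted(hits)                 # order hits along the read
                for (_, c1, s1), (_, c2, s2) in zip(hits, hits[1:]):
                    votes[(c1, s1, c2, s2)] += 1
            return [link for link, n in votes.items() if n >= min_votes]

        alns = {
            "read1": [(100, "ctgA", "+"), (5200, "ctgB", "+")],
            "read2": [(40, "ctgA", "+"), (5100, "ctgB", "+")],
            "read3": [(0, "ctgC", "-"), (3000, "ctgA", "+")],
        }
        print(contig_links(alns))
        # [('ctgA', '+', 'ctgB', '+')]  -- supported by two long reads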