
    A new strategy for better genome assembly from very short reads

    Abstract
    Background: With the rapid development of next-generation sequencing (NGS) technology, large quantities of genome sequencing data have been generated. Because of repetitive regions of genomes, among other factors, assembly of very short reads remains a challenging problem.
    Results: A novel strategy for improving genome assembly from very short reads is proposed. It increases the accuracy of assemblies by integrating de novo contigs, and it produces comparative contigs by allowing multiple references, without being limited to genomes of closely related strains. The comparative contigs are then used to scaffold the de novo contigs (a minimal sketch of this step is given below). Using simulated and real datasets, we show that the strategy effectively improves the quality of assemblies of isolated microbial genomes and of metagenomes.
    Conclusions: With more and more reference genomes becoming available, this strategy will be useful for improving the quality of genome assemblies from very short reads. Scripts making the strategy applicable are provided at http://code.google.com/p/cd-hybrid/.
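    The abstract does not spell out how the comparative contigs drive the scaffolding, so the following is only a hedged illustration of the general idea, not the cd-hybrid scripts themselves: de novo contigs are placed along a comparative contig by their alignment coordinates, then ordered and oriented. The Hit record, the min_len cutoff and the scaffold_order function are hypothetical names introduced for this sketch.

        # Hypothetical sketch: order and orient de novo contigs along one
        # comparative contig, given alignment coordinates (e.g. parsed from
        # an aligner's tabular report). Illustration only.
        from dataclasses import dataclass

        @dataclass
        class Hit:
            contig: str      # de novo contig name
            ref_start: int   # start on the comparative contig
            ref_end: int     # end on the comparative contig
            forward: bool    # alignment orientation

        def scaffold_order(hits, min_len=200):
            """Order and orient de novo contigs by their positions on the
            comparative contig, dropping short spurious alignments."""
            kept = [h for h in hits if abs(h.ref_end - h.ref_start) >= min_len]
            kept.sort(key=lambda h: min(h.ref_start, h.ref_end))
            return [(h.contig, '+' if h.forward else '-') for h in kept]

        hits = [Hit("ctg3", 5000, 9000, True),
                Hit("ctg1", 0, 2100, True),
                Hit("ctg7", 2600, 4800, False)]
        print(scaffold_order(hits))
        # [('ctg1', '+'), ('ctg7', '-'), ('ctg3', '+')]

    With several references, the same ordering logic applies per comparative contig after first grouping hits by reference.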

    Computational investigations in eukaryotic genome de novo assembly using short reads.

    Recently, new technologies in molecular biology have enormously improved sequencing data production, making it possible to generate billions of short reads, totaling gigabases of data, per experiment. Prices for sequencing are decreasing rapidly, and experiments that were impossible in the past because of cost are now being executed. Computational methodologies that successfully solved the genome assembly problem for data obtained by the shotgun strategy are now inefficient, and efforts are under way to develop new programs. At present, one established practice for producing quality assemblies is to use paired-end reads to virtually increase the read length, but many other points remain controversial. The works described in the literature basically use two strategies: one based on high coverage [1], the other based on incremental assembly, using the mate pairs with shorter inserts first [2]. Independently of the strategy used, the computational resources demanded are very high. The present computational solutions for de novo genome assembly involve the generation of a graph of some kind [3]; because those graphs use whole reads or k-mers as nodes, and the number of reads is very large, the memory capacity of the computational system is critical. Works in the literature corroborate this idea, showing that multiprocessor computational systems with at least 512 GB of main memory were used in de novo projects on eukaryotes [1,2,3].

    As an example and benchmark source, it is possible to use the Panda project, executed by a research group consortium in China, which generated the de novo genome of the giant panda (Ailuropoda melanoleuca). The project initially produced 231 Gb of raw data, which was reduced to 176 Gb after removing low-quality and duplicated reads; only 134 Gb were used in the de novo assembly process. Those bases were distributed across approximately 3 billion short reads. After assembly, 200,604 contigs were generated, and 5,701 multi-contig scaffolds were obtained using 124,336 of those contigs. The N50 was 36,728 bp for contigs and 1.22 Mb for scaffolds.

    The present work investigated the computational demands of de novo assembly of eukaryotic genomes by reproducing the results of the Panda project. The strategy used was incremental, as implemented in the SOAPdenovo software, which divides the assembly process into four steps: pregraph, to construct the k-mer graph; contig, to eliminate errors and output contigs; map, to map reads onto the contigs; and scaff, to scaffold the contigs. A NUMA (non-uniform memory access) computational system with 8 six-core processors with hyper-threading technology and 512 GB of RAM (random access memory) was used, and the consumption of resources such as memory and processor time was recorded for every step of the process. The incremental strategy seems practical and can produce effective results. Work is now in progress investigating a new methodology that groups short reads using the concept of entropy; because this methodology starts from the more informative reads, assemblies of better quality may be generated (a minimal sketch of such an entropy ranking follows below).

    References
    [1] Gnerre et al.; High-quality draft assemblies of mammalian genomes from massively parallel sequence data, Proceedings of the National Academy of Sciences USA, v. 108, n. 4, p. 1513-1518, 2010
    [2] Li et al.; The sequence and de novo assembly of the giant panda genome, Nature, v. 463, p. 311-317, 2010
    [3] Schatz et al.; Assembly of large genomes using second-generation sequencing, Genome Research, v. 20, p. 1165-1173, 2010
    X-MEETING 2011
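    Since the entropy methodology is only announced above, the following is a hedged sketch of one plausible reading: rank short reads by the Shannon entropy of their base composition, so that more informative (less repetitive) reads are considered first. The read_entropy function is a name introduced for this illustration.

        # Rank reads by Shannon entropy of base composition (illustration).
        import math
        from collections import Counter

        def read_entropy(read: str) -> float:
            """Shannon entropy, in bits, of a read's base composition."""
            counts = Counter(read.upper())
            n = len(read)
            return -sum((c / n) * math.log2(c / n) for c in counts.values())

        reads = ["AAAAAAAAAA", "ACGTACGTAC", "AACCGGTTAA"]
        for r in sorted(reads, key=read_entropy, reverse=True):
            print(f"{r}  entropy={read_entropy(r):.3f} bits")
        # A homopolymer read scores 0 and sorts last; reads with a balanced
        # base composition sort first.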

    Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions

    Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and to provide portability. However, the technology's high error rates pose a challenge for generating accurate genome assemblies. The tools used for nanopore sequence analysis are therefore of critical importance, as they must overcome these high error rates. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages, and performance bottlenecks; it is important to understand where current tools fall short in order to develop better ones. To this end, we 1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and 2) provide guidelines for determining the appropriate tools for each step. We analyze various combinations of different tools and expose the tradeoffs between accuracy, performance, memory usage and scalability (a hedged sketch of per-step benchmarking is given below). We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of the bottlenecks we have identified, developers can improve the current tools or build new ones that are both accurate and fast, in order to overcome the high error rates of nanopore sequencing technology.
    Comment: To appear in Briefings in Bioinformatics (BIB), 201
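    The abstract does not list the measured tools, so the following is only a hedged sketch of how per-step wall time and memory could be recorded for such a pipeline; the step names and echo commands are placeholders to be replaced with real assembler invocations (Unix only, because of the resource module).

        # Measure wall time and peak child memory for each pipeline step.
        import resource
        import subprocess
        import time

        def run_step(name: str, cmd: list) -> None:
            """Run one pipeline step as a subprocess and report wall time
            and the peak resident set size of child processes."""
            t0 = time.perf_counter()
            subprocess.run(cmd, check=True)
            wall = time.perf_counter() - t0
            # ru_maxrss: maximum RSS over all children so far (KiB on Linux)
            peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
            print(f"{name}: {wall:.2f}s wall, peak child RSS so far ~{peak} KiB")

        steps = [  # hypothetical pipeline; substitute real tools and flags
            ("overlap",  ["echo", "find read overlaps"]),
            ("assembly", ["echo", "build draft assembly"]),
            ("polish",   ["echo", "polish the consensus"]),
        ]
        for name, cmd in steps:
            run_step(name, cmd)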

    MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

    A major challenge in next-generation sequencing (NGS) is to assemble massive numbers of overlapping short reads that are randomly sampled from DNA fragments. To complete assembly, one must finish a task fundamental to many leading assembly algorithms: counting the number of occurrences of k-mers (length-k substrings of the sequences). The counting results are critical for many components of assembly (e.g. variant detection and read error correction). For large genomes, the k-mer counting task can easily consume a huge amount of memory, making large-scale parallel assembly on commodity servers impossible. In this paper, we develop MSPKmerCounter, a disk-based approach that performs k-mer counting for large genomes efficiently using a small amount of memory. Our approach is based on a novel technique called Minimum Substring Partitioning (MSP). MSP breaks short reads into multiple disjoint partitions such that each partition can be loaded into memory and processed individually (a simplified sketch of the partitioning follows below). By leveraging the overlaps among the k-mers derived from the same short read, MSP achieves a high compression ratio, so the I/O cost can be significantly reduced. For the task of k-mer counting, MSPKmerCounter offers a very fast and memory-efficient solution. Experimental results on large real-life short-read data sets demonstrate that MSPKmerCounter achieves better overall performance than state-of-the-art k-mer counting approaches. MSPKmerCounter is available at http://www.cs.ucsb.edu/~yangli/MSPKmerCounte
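    As a hedged illustration of the partitioning idea (in memory, without the disk layout, super-k-mer compression or reverse complements of the real MSPKmerCounter), the sketch below groups the k-mers of each read by their minimum p-substring and counts each partition separately; min_substring, partition_reads and count_kmers are names introduced for this sketch.

        # Toy Minimum Substring Partitioning: k-mers sharing a minimizer
        # land in the same partition, so partitions can be counted one at
        # a time with bounded memory.
        from collections import Counter, defaultdict

        def min_substring(s: str, p: int) -> str:
            """Lexicographically smallest p-substring of s (the minimizer)."""
            return min(s[i:i + p] for i in range(len(s) - p + 1))

        def partition_reads(reads, k=9, p=4):
            parts = defaultdict(list)   # one partition per minimizer
            for read in reads:
                for i in range(len(read) - k + 1):
                    kmer = read[i:i + k]
                    parts[min_substring(kmer, p)].append(kmer)
            return parts

        def count_kmers(reads, k=9, p=4):
            total = Counter()
            for kmers in partition_reads(reads, k, p).values():
                total.update(kmers)     # each partition counted separately
            return total

        reads = ["ACGTACGTACGTAAG", "CGTACGTACGTAAGT"]
        for kmer, n in count_kmers(reads).most_common(3):
            print(kmer, n)

    In the real method, consecutive k-mers of a read that share a minimizer are written to disk once as a single super k-mer, which is where the compression and I/O savings come from.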

    Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly

    Motivation: Eugene Myers, in his string graph paper (Myers, 2005), suggested that in a string graph, or equivalently a unitig graph, any path spells a valid assembly. As a string/unitig graph also encodes every valid assembly of the reads, such a graph, provided that it can be constructed correctly, is in fact a lossless representation of the reads. In principle, every analysis based on whole-genome shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) calling, can also be achieved with unitigs. Results: To explore the feasibility of using de novo assembly in the context of resequencing, we developed a de novo assembler, fermi, that assembles Illumina short reads into unitigs while preserving most of the information in the input reads. SNPs and INDELs can be called by mapping the unitigs against a reference genome (a toy sketch of such alignment-based calling is given below). Applying the method to 35-fold human resequencing data, we show that, in comparison to the standard pipeline, our approach yields similar accuracy for SNP calling and better results for INDEL calling, with higher sensitivity than other de novo assembly based methods for variant calling. Our work suggests that variant calling with de novo assembly can be a beneficial complement to the standard variant calling pipeline for whole-genome resequencing. On the methodological side, we propose the FMD-index for forward-backward extension of DNA sequences, a fast algorithm for finding all super-maximal exact matches, and one-pass construction of unitigs from an FMD-index. Availability: http://github.com/lh3/fermi Contact: [email protected]
    Comment: Rev2: submitted version with minor improvements; 7 page
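    As a hedged toy of the alignment-based calling described above (not fermi's actual algorithm), the sketch below walks a unitig-to-reference alignment expressed as CIGAR-style operations and emits SNPs and INDELs; call_variants and its tuple encoding are invented for this illustration.

        # Emit variants from one unitig aligned to a reference.
        def call_variants(ref: str, unitig: str, ref_start: int, cigar):
            """cigar: list of (op, length) with op in {'M', 'I', 'D'}.
            'M' may contain mismatches (reported as SNPs); 'I' is an
            insertion in the unitig; 'D' a deletion from the reference."""
            variants = []
            r, q = ref_start, 0
            for op, n in cigar:
                if op == 'M':
                    for j in range(n):
                        if ref[r + j] != unitig[q + j]:
                            variants.append(('SNP', r + j, ref[r + j], unitig[q + j]))
                    r, q = r + n, q + n
                elif op == 'I':
                    variants.append(('INS', r, '', unitig[q:q + n]))
                    q += n
                elif op == 'D':
                    variants.append(('DEL', r, ref[r:r + n], ''))
                    r += n
            return variants

        ref    = "ACGTACGTACGT"
        unitig = "ACGTTCGTCGT"   # one substitution and one 1-bp deletion
        print(call_variants(ref, unitig, 0, [('M', 8), ('D', 1), ('M', 3)]))
        # [('SNP', 4, 'A', 'T'), ('DEL', 8, 'A', '')]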

    Cerulean: A hybrid assembly using high throughput short and long reads

    Genome assembly from high-throughput short-read data arguably remains an unresolved task in repetitive genomes: when the length of a repeat exceeds the read length, it becomes difficult to unambiguously connect the flanking regions. The emergence of third-generation sequencing (Pacific Biosciences) with long reads offers the opportunity to resolve complicated repeats that cannot be resolved with short-read data alone. However, these long reads have a high error rate, and it is an uphill task to assemble a genome from them without using additional high-quality short reads. Recently, Koren et al. (2012) proposed using high-quality short-read data to correct the long reads and thus make assembly from long reads possible. However, due to the large size of both datasets (short and long reads), error correction of the long reads requires excessively high computational resources, even for small bacterial genomes. In this work, instead of error-correcting the long reads, we first assemble the short reads and then map the long reads onto the assembly graph to resolve repeats (a toy sketch of this linking step is given below). Contribution: We present a hybrid assembly approach that is both computationally efficient and produces high-quality assemblies. Our algorithm first operates on a simplified version of the assembly graph consisting only of long contigs, and gradually improves the assembly by adding smaller contigs in each iteration. In contrast to the state-of-the-art long-read error correction technique, which requires high computational resources and long running times on a supercomputer even for bacterial genome datasets, our software can produce a comparable assembly on a standard desktop in a short running time.
    Comment: Peer-reviewed and presented as part of the 13th Workshop on Algorithms in Bioinformatics (WABI2013
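    The abstract leaves the graph details to the paper, so the following is only a hedged toy of the linking step: each long read that aligns across two contigs in order votes for an adjacency between them, and adjacencies with enough votes can guide repeat resolution; contig_links and min_votes are names invented for this sketch.

        # Infer contig adjacencies from long-read alignments to contigs.
        from collections import Counter

        def contig_links(alignments, min_votes=2):
            """alignments: {read: [(position_on_read, contig, strand), ...]}.
            Returns contig adjacencies supported by >= min_votes reads."""
            votes = Counter()
            for hits in alignments.values():
                hits = sorted(hits)                 # order hits along the read
                for (_, c1, s1), (_, c2, s2) in zip(hits, hits[1:]):
                    votes[(c1, s1, c2, s2)] += 1
            return [link for link, n in votes.items() if n >= min_votes]

        alns = {
            "read1": [(100, "ctgA", "+"), (5200, "ctgB", "+")],
            "read2": [(40, "ctgA", "+"), (5100, "ctgB", "+")],
            "read3": [(0, "ctgC", "-"), (3000, "ctgA", "+")],
        }
        print(contig_links(alns))
        # [('ctgA', '+', 'ctgB', '+')]  -- supported by two long reads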