
186 research outputs found

SEED: efficient clustering of next-generation sequences.

Author: Bao Ergude
Girke Thomas
Jiang Tao
Kaloshian Isgouhi
Publication venue: eScholarship, University of California
Publication date: 02/08/2011

Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
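
The hash-table idea in the abstract above can be illustrated with a small sketch. This is not the SEED implementation: it ignores overhanging residues, uses plain contiguous blocks rather than the paper's block spaced seeds, and takes the first read of each cluster as its center. The function names (`block_keys`, `cluster_reads`) are hypothetical. It relies only on the pigeonhole principle: if two equal-length reads differ by at most three mismatches and each is cut into four blocks, at least one block must match exactly, so a hash lookup on blocks finds every candidate center.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def block_keys(seq: str, n_blocks: int) -> List[Tuple[int, int, str]]:
    """Cut seq into n_blocks contiguous blocks; key = (length, block index, block)."""
    size, extra = divmod(len(seq), n_blocks)
    keys, start = [], 0
    for i in range(n_blocks):
        end = start + size + (1 if i < extra else 0)
        keys.append((len(seq), i, seq[start:end]))
        start = end
    return keys


def cluster_reads(reads: List[str], max_mismatches: int = 3) -> Dict[str, List[str]]:
    """Greedy clustering: join a read to the first center within max_mismatches."""
    n_blocks = max_mismatches + 1  # pigeonhole: one block must match exactly
    index: Dict[Tuple[int, int, str], List[str]] = defaultdict(list)
    clusters: Dict[str, List[str]] = {}
    for read in reads:
        keys = block_keys(read, n_blocks)
        # Collect candidate centers that share at least one exact block.
        candidates: List[str] = []
        for key in keys:
            for center in index.get(key, ()):
                if center not in candidates:
                    candidates.append(center)
        for center in candidates:
            if hamming(center, read) <= max_mismatches:
                clusters[center].append(read)
                break
        else:
            # No center is close enough: this read opens a new cluster.
            clusters[read] = [read]
            for key in keys:
                index[key].append(read)
    return clusters
```

For example, `cluster_reads(["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGAACGA"])` groups the three ACG-like reads under the center `ACGTACGT` and leaves `TTTTTTTT` in a cluster of its own.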

PubMed Central

eScholarship - University of California

Evaluating the Fidelity of De Novo Short Read Metagenomic Assembly Using Simulated Data

Author: A Brady
A Charuvaka
A Lopez-Bueno
AE Darling
Andrés Moya
D Hernandez
DB Jaffe
DB Rusch
DC Richter
DD Sommer
DH Haft
DH Huson
DR Zerbino
EW Myers
F Meyer
GG Sutton
GW Tyson
I Letunic
I Maccallum
J Laserson
J Qin
JA Huber
JC Dohm
JC Dohm
JC Wooley
JO Korbel
Jonathan H. Badger
JR Miller
JR Miller
JT Simpson
K Liolios
K Mavromatis
KE Wommack
L Krause
M de la Bastide
M Margulies
M Pop
M Stark
M Wu
Miguel Pignatelli
MJ Chaisson
NN Diaz
OU Nalbantoglu
PJ Turnbaugh
PJ Turnbaugh
R Li
R Seshadri
RD Finn
RL Tatusov
RL Warren
S Batzoglou
S Levy
S Yooseph
SM Huse
SR Gill
T Schoenfeld
TS Ghosh
VM Markowitz
WJ Kent
WR Jeck
X Huang
X Huang
Y Ye
Publication venue: Public Library of Science
Publication date: 23/05/2011

A frequent step in metagenomic data analysis is the assembly of the sequenced reads. Many assembly tools targeting data from next-generation sequencing (NGS) technologies have been published in recent years, but these assemblers have not been designed for or tested in the multi-genome scenarios that characterize metagenomic studies. Here we provide a critical assessment of current de novo short-read assembly tools in multi-genome scenarios using complex simulated metagenomic data. With this approach we tested the fidelity of different assemblers in metagenomic studies, demonstrating that even under the simplest compositions the number of chimeric contigs involving different species is noticeable. We further showed that the assembly process reduces the accuracy of the functional classification of the metagenomic data and that these errors can be overcome by raising the coverage of the studied metagenome. The results presented here highlight the particular difficulties that de novo genome assemblers face in multi-genome scenarios, demonstrating that these difficulties, which often compromise the functional classification of the analyzed data, can be overcome with a high sequencing effort.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Evaluation of next-generation sequencing software in mapping and assembly

Author: A Bashir
A Bateman
AC McHardy
AD Smith
B Langmead
BinBin Wang
C Trapnell
CA Tilford
D Campagna
D Hernandez
D Weese
DR Bentley
DR Zerbino
DS Horner
DW Bryant Jr
ER Mardis
ER Mardis
ES Lander
EW Myers
F Sanger
H Jiang
H Li
H Li
H Li
H Lin
HL Eaves
J Butler
JC Dohm
JC Venter
JO Korbel
JR Miller
JR Miller
JT Simpson
JT Simpson
K Chen
KE Holt
L Engstrand
L Noe
M Margulies
M Pop
M Pop
MC Schatz
MJ Chaisson
ML Metzker
MS Hossain
N Homer
N Malhis
NL Clement
O Morozova
O Morozova
P Flicek
P Flicek
P Medvedev
PA Pevzner
PJ Campbell
PJ Hurd
R Staden
RF Service
RL Warren
RQ Li
RQ Li
Rui Jiang
SC Schuster
SM Rumble
Suying Bao
WingKeung Kwan
WJ Ansorge
WR Jeck
Xu Ma
Y Chen
YJ Kim
You-Qiang Song
Z Ning
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2011

Next-generation high-throughput DNA sequencing technologies have advanced progressively in sequence-based genomic research and novel biological applications with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies feature several advantages when compared with traditional sequencing methods in terms of higher sequencing speed, lower per run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, thereby restricting the applications of NGS platforms in genome assembly and annotation. We presented an overview of the challenges that these novel technologies meet and particularly illustrated various bioinformatics attempts on mapping and assembly for problem solving. We then compared the performance of several programs in these two fields, and further provided advice on selecting suitable tools for specific biological applications.

Crossref

HKU Scholars Hub

Targeted Assembly of Short Sequence Reads

Author: H Li
H Li
H Li
JD Freeman
JT Simpson
LD Stein
M Rasmussen
Olivier Lespinet
R Goya
R Li
R Li
R Morin
René L. Warren
RK Nam
RL Warren
RL Warren
RM Durbin
Robert A. Holt
S Nacu
SP Shah
WR Jeck
Publication date: 01/01/2011

As next-generation sequence (NGS) production continues to increase, analysis is becoming a significant bottleneck. However, in situations where information is required only for specific sequence variants, it is not necessary to assemble or align whole genome data sets in their entirety. Rather, NGS data sets can be mined for the presence of sequence variants of interest by localized assembly, which is a faster, easier, and more accurate approach. We present TASR, a streamlined assembler that interrogates very large NGS data sets for the presence of specific variants, by only considering reads within the sequence space of input target sequences provided by the user. The NGS data set is searched for reads with an exact match to all possible short words within the target sequence, and these reads are then assembled stringently to generate a consensus of the target and flanking sequence. Typically, variants of a particular locus are provided as different target sequences, and the presence of the variant in the data set being interrogated is revealed by a successful assembly outcome. However, TASR can also be used to find unknown sequences that flank a given target. We demonstrate that TASR has utility in finding or confirming genomic mutations, polymorphism, fusion and integration events. Targeted assembly is a powerful method for interrogating large data sets for the presence of sequence variants of interest. TASR is a fast, flexible and easy to use tool for targeted assembly.
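
The read-recruitment step described in the abstract above (keeping only reads that exactly match a short word of the target) can be sketched as follows. This is an illustrative simplification, not TASR's code: the word length `k`, the choice to also search the reverse-complement strand, and the function names are assumptions, and the subsequent stringent assembly step is not shown.

```python
from typing import List, Set

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA sequence over the alphabet ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct words of length k in seq (empty set if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_reads(target: str, reads: List[str], k: int = 15) -> List[str]:
    """Keep reads sharing at least one exact k-mer with the target on either strand."""
    # Assumption: both strands of the target are indexed.
    words = kmers(target, k) | kmers(revcomp(target), k)
    return [read for read in reads if not words.isdisjoint(kmers(read, k))]
```

Only the recruited reads would then be passed to an assembler, which is what keeps the search space small compared with assembling the whole data set.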

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Simon Fraser University Institutional Repository

Nature Precedings

Gene-Boosted Assembly of a Novel Bacterial Genome from Very Short Reads

Author: A Camilli
AL Delcher
CK Stover
D Hernandez
D Zerbino
Daniel D. Sommer
Daniela Puiu
DB Jaffe
DD Sommer
DG Lee
DL Kasper
E Drenkard
ER Mardis
EW Myers
G Robertson
H Kulasakara
J Butler
LR Hoffman
LW Hillier
M Margulies
M Merighi
M Pop
MG Smith
MJ Chaisson
ML Metzker
N Dasgupta
N Whiteford
P Rice
RL Warren
S Batzoglou
S Kurtz
SF Altschul
SM Goldberg
Steven L. Salzberg
U Romling
Vincent T. Lee
VT Lee
William Stafford Noble
Publication venue: Public Library of Science
Publication date: 01/01/2008

Recent improvements in technology have made DNA sequencing dramatically faster and more efficient than ever before. The new technologies produce highly accurate sequences, but one drawback is that the most efficient technology produces the shortest read lengths. Short-read sequencing has been applied successfully to resequence the human genome and those of other species but not to whole-genome sequencing of novel organisms. Here we describe the sequencing and assembly of a novel clinical isolate of Pseudomonas aeruginosa, strain PAb1, using very short read technology. From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides. Our method includes a novel gene-boosting algorithm that uses amino acid sequences from predicted proteins to build a better assembly. This study demonstrates the feasibility of very short read sequencing for the sequencing of bacterial genomes, particularly those for which a related species has been sequenced previously, and expands the potential application of this new technology to most known prokaryotic species

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

QSRA – a quality-value guided de novo short read assembler

Author: D Hernandez
Douglas W Bryant
DR Zerbino
J Butler
J Dohm
J Kent
MJ Chaisson
NG de Bruijn
R Cronn
R Warren
Todd C Mockler
W Jeck
Weng-Keen Wong
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: New rapid high-throughput sequencing technologies have sparked the creation of a new class of assembler. Since all high-throughput sequencing platforms incorporate errors in their output, short-read assemblers must be designed to account for this error while utilizing all available data. Results: We have designed and implemented an assembler, Quality-value guided Short Read Assembler, created to take advantage of quality-value scores as a further method of dealing with error. Compared to previously published algorithms, our assembler shows significant improvements not only in speed but also in output quality. Conclusion: QSRA generally produced the highest genomic coverage, while being faster than VCAKE. QSRA is extremely competitive in its longest contig and N50/N80 contig lengths, producing results of similar quality to those of EDENA and VELVET. QSRA provides a step closer to the goal of de novo assembly of complex genomes, improving upon the original VCAKE algorithm by not only drastically reducing runtimes but also increasing the viability of the assembly algorithm through further error handling capabilities.
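
Several of the abstracts in this listing compare assemblies by N50 (and, here, N80) contig lengths. As a reminder of how that statistic is usually computed, the sketch below uses the common definition: the length of the contig at which the running total of contig lengths, sorted from longest to shortest, first reaches the given percentage of all assembled bases. The function name `nx` is hypothetical, and individual papers may use slightly different conventions.

```python
from typing import Iterable


def nx(contig_lengths: Iterable[int], percent: int = 50) -> int:
    """Return the Nx contig length (N50 by default) for a set of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return lengths[-1]  # only reached if percent > 100
```

For contigs of lengths 80, 70, 50, 40, 30, 20 and 10 (300 bases in total), `nx(..., 50)` is 70 because the two longest contigs already cover 150 bases, and `nx(..., 80)` is 40.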

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Comparison of DNA sequence assembly algorithms using mixed data sources

Author: Bamidele-Abegunde Tejumoluwa
Publication venue: University of Saskatchewan Library

DNA sequence assembly is one of the fundamental areas of bioinformatics. It involves the correct formation of a genome sequence from its DNA fragments ("reads") by aligning and merging the fragments. There are different sequencing technologies -- some support long DNA reads and the others, shorter DNA reads. There are sequence assembly programs specifically designed for these different types of raw sequencing data. This work explores and experiments with these different types of assembly software in order to compare their performance on the type of data for which they were designed, as well as their performance on data for which they were not designed, and on mixed data. Such results are useful for establishing good procedures and tools for sequence assembly in the current genomic environment where read data of different lengths are available. This work also investigates the effect of the presence or absence of quality information on the results produced by sequence assemblers. Five strategies were used in this research for assembling mixed data sets and the testing was done using a collection of real and artificial data sets for six bacterial organisms. The results show that there is a broad range in the ability of some DNA sequence assemblers to handle data from various sequencing technologies, especially data other than the kind they were designed for. For example, the long-read assemblers PHRAP and MIRA produced good results from assembling 454 data. The results also show the importance of having an effective methodology for assembling mixed data sets. It was found that combining contiguous sequences obtained from short-read assemblers with long DNA reads, and then assembling this combination using long-read assemblers was the most appropriate approach for assembling mixed short and long reads. 
It was found that the results from assembling the mixed data sets were better than the results obtained from separately assembling individual data from the different sequencing technologies. DNA sequence assemblers which do not depend on the availability of quality information were used to test the effect of the presence of quality values when assembling data. The results show that regardless of the availability of quality information, good results were produced in most of the assemblies. In more general terms, this work shows that the approach or methodology used to assemble DNA sequences from mixed data sources makes a lot of difference in the type of results obtained, and that a good choice of methodology can help reduce the amount of effort spent on a DNA sequence assembly project

eCommons@USASK

University of Saskatchewan Research Archive

Assembly algorithms for next-generation sequencing data

Author: Koren Sergey
Miller Jason R.
Sutton Granger
Publication venue: Elsevier Inc.
Publication date: 01/06/2010

The emergence of next-generation sequencing platforms led to a resurgence of research in whole-genome shotgun assembly algorithms and software. DNA sequencing data from the Roche 454, Illumina/Solexa, and ABI SOLiD platforms typically present shorter read lengths, higher coverage, and different error profiles compared with Sanger sequencing data. Since 2005, several assembly software packages have been created or revised specifically for de novo assembly of next-generation sequencing data. This review summarizes and compares the published descriptions of packages named SSAKE, SHARCGS, VCAKE, Newbler, Celera Assembler, Euler, Velvet, ABySS, AllPaths, and SOAPdenovo. More generally, it compares the two standard methods known as the de Bruijn graph approach and the overlap/layout/consensus approach to assembly.
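
For readers unfamiliar with the de Bruijn graph approach named in the abstract above, the following minimal sketch shows only the graph-construction step: every distinct k-mer in the reads becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. It is a textbook illustration under simplifying assumptions (error-free reads, a single strand, no coverage counts), not the method of any specific assembler listed in the review; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_de_bruijn(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    if k < 2:
        raise ValueError("k must be at least 2")
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge: prefix node -> suffix node, one edge per distinct k-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)
```

With reads `ACGT` and `CGTA` and `k = 3`, the overlapping k-mer `CGT` collapses into a single edge, giving the path AC → CG → GT → TA, which spells the merged sequence `ACGTA`. An overlap/layout/consensus assembler would instead compare the reads pairwise to find that overlap.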

Elsevier - Publisher Connector

PubMed Central

Crystallizing short-read assemblies around seeds

Author: Azimi Navid
Hossain Mohammad Sajjad
Skiena Steven
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: New short-read sequencing technologies produce enormous volumes of 25–30 base paired-end reads. The resulting reads have vastly different characteristics than those produced by Sanger sequencing, and require different approaches than the previous generation of sequence assemblers. In this paper, we present a short-read de novo assembler particularly targeted at the new ABI SOLiD sequencing technology. Results: This paper presents what we believe to be the first de novo sequence assembly results on real data from the emerging SOLiD platform, introduced by Applied Biosystems. Our assembler SHORTY augments short-paired reads using a trivially small number (5–10) of seeds of length 300–500 bp. These seeds enable us to produce significant assemblies using short-read coverage no more than 100×, which can be obtained in a single run of these high-capacity sequencers. SHORTY exploits two ideas which we believe to be of interest to the short-read assembly community: (1) using single seed reads to crystallize assemblies, and (2) estimating intercontig distances accurately from multiple spanning paired-end reads. Conclusion: We demonstrate effective assemblies (N50 contig sizes ~40 kb) of three different bacterial species using simulated SOLiD data. Sequencing artifacts limit our performance on real data; however, our results on this data are substantially better than those achieved by competing assemblers.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central