27 research outputs found

QSRA – a quality-value guided de novo short read assembler

Author: D Hernandez
Douglas W Bryant
DR Zerbino
J Butler
J Dohm
J Kent
MJ Chaisson
NG de Bruijn
R Cronn
R Warren
Todd C Mockler
W Jeck
Weng-Keen Wong
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: New rapid high-throughput sequencing technologies have sparked the creation of a new class of assembler. Since all high-throughput sequencing platforms incorporate errors in their output, short-read assemblers must be designed to account for this error while utilizing all available data. Results: We have designed and implemented an assembler, Quality-value guided Short Read Assembler (QSRA), created to take advantage of quality-value scores as a further method of dealing with error. Compared to previously published algorithms, our assembler shows significant improvements not only in speed but also in output quality. Conclusion: QSRA generally produced the highest genomic coverage, while being faster than VCAKE. QSRA is extremely competitive in its longest contig and N50/N80 contig lengths, producing results of similar quality to those of EDENA and VELVET. QSRA provides a step closer to the goal of de novo assembly of complex genomes, improving upon the original VCAKE algorithm by not only drastically reducing runtimes but also increasing the viability of the assembly algorithm through further error handling capabilities.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Evaluation of next-generation sequencing software in mapping and assembly

Author: A Bashir
A Bateman
AC McHardy
AD Smith
B Langmead
BinBin Wang
C Trapnell
CA Tilford
D Campagna
D Hernandez
D Weese
DR Bentley
DR Zerbino
DS Horner
DW Bryant Jr
ER Mardis
ER Mardis
ES Lander
EW Myers
F Sanger
H Jiang
H Li
H Li
H Li
H Lin
HL Eaves
J Butler
JC Dohm
JC Venter
JO Korbel
JR Miller
JR Miller
JT Simpson
JT Simpson
K Chen
KE Holt
L Engstrand
L Noe
M Margulies
M Pop
M Pop
MC Schatz
MJ Chaisson
ML Metzker
MS Hossain
N Homer
N Malhis
NL Clement
O Morozova
O Morozova
P Flicek
P Flicek
P Medvedev
PA Pevzner
PJ Campbell
PJ Hurd
R Staden
RF Service
RL Warren
RQ Li
RQ Li
Rui Jiang
SC Schuster
SM Rumble
Suying Bao
WingKeung Kwan
WJ Ansorge
WR Jeck
Xu Ma
Y Chen
YJ Kim
You-Qiang Song
Z Ning
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2011

Next-generation high-throughput DNA sequencing technologies have advanced progressively in sequence-based genomic research and novel biological applications with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies feature several advantages when compared with traditional sequencing methods in terms of higher sequencing speed, lower per run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, thereby restricting the applications of NGS platforms in genome assembly and annotation. We presented an overview of the challenges that these novel technologies meet and particularly illustrated various bioinformatics attempts on mapping and assembly for problem solving. We then compared the performance of several programs in these two fields, and further provided advice on selecting suitable tools for specific biological applications.

Crossref

HKU Scholars Hub

Analysis of quality raw data of second generation sequencers with Quality Assessment Software

Author: A Smith
Adriana R Carneiro
Artur Silva
B Ewing
B Ewing
D Gordon
D Hernandez
DR Zerbino
DW Bryant
E Lande
H Li
J Butler
J Dohm
Jan Baumbach
M Chaisson
Maria PC Schneider
Rommel TJ Ramos
S Bentley
SC Schuster
V Pandey
Vasco Azevedo
W Jeck
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Second generation technologies have advantages over Sanger; however, they have resulted in new challenges for the genome construction process, especially because of the small size of the reads, despite the high degree of coverage. Independent of the program chosen for the construction process, DNA sequences are superimposed, based on identity, to extend the reads, generating contigs; mismatches indicate a lack of homology and are not included. This process improves our confidence in the sequences that are generated. Findings: We developed Quality Assessment Software, with which one can review graphs showing the distribution of quality values from the sequencing reads. This software allows us to adopt more stringent quality standards for sequence data, based on quality-graph analysis and estimated coverage after applying the quality filter, providing acceptable sequence coverage for genome construction from short reads. Conclusions: Quality filtering is a fundamental step in the process of constructing genomes, as it reduces the frequency of incorrect alignments that are caused by measuring errors, which can occur during the construction process due to the size of the reads, provoking misassemblies. Application of quality filters to sequence data, using the software Quality Assessment, along with graphing analyses, provided greater precision in the definition of cutoff parameters, which increased the accuracy of genome construction.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

MPG.PuRe

Comparing De Novo Genome Assembly: The Long and Short of It

Author: A Phillippy
B Mishra
B Schmidt
Bud Mishra
C Alkan
C Aston
D Bryant
D Hernandez
D Schwartz
D Sommer
DR Zerbino
DR Zerbino
EW Myers
F Sanger
FR Blattner
G Narzisi
GG Sutton
Giuseppe Narzisi
IT Paulsen
J Butler
J Tarhio
JC Dohm
JC Mullikin
JM Kidd
JR Miller
JT Simpson
M Antoniotti
M Eppinger
M Hossain
M Wu
MJ Chaisson
P Green
P Medvedev
PA Pevzner
PN Ariyaratne
R Li
RL Warren
RW Hung
S Batzoglou
S Boisvert
S Gnerre
S Kim
S Kurtz
SL Salzberg
SR Gill
SS Hall
Stein Aerts
T Anantharaman
T Baba
TS Anantharaman
TS Anantharaman
WR Jeck
X Huang
X Huang
Publication venue: Public Library of Science
Publication date: 29/04/2011

Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing the contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) as well as a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-offs between contigs' quality and their sizes. For this purpose, most of the publicly available major sequence assemblers – both for low-coverage long (Sanger) and high-coverage short (Illumina) reads technologies – are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read-lengths, coverages, accuracies, and with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing “next-generation” assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Arapan-S: a fast and highly accurate whole-genome assembly software for viruses and small genomes

Author: B Chevreux
DD Sommer
DR Zerbino
DR Zerbino
DW Bryant
EW Myers
GG Sutton
I Maccallum
J Butler
JT Simpson
MJ Chaisson
Mohammed Sahli
P Medvedev
PA Pevzner
R Li
RL Warren
Tetsuo Shibuya
X Huang
X Huang
Publication venue: Springer Science and Business Media LLC

Crossref

PASQUAL: Parallel Techniques for Next Generation Genome Sequence Assembly

Author: David A. Bader
Henning Meyerhenke
Pushkar R. Pande
Xing Liu
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)

Crossref

Paired is better: local assembly algorithms for NGS paired reads and applications to RNA-Seq

Author: Nadalin Francesca
Publication venue: Università degli Studi di Udine
Publication date: 12/05/2014

The analysis of biological sequences is one of the main research areas of Bioinformatics. Sequencing data are the input for almost all types of studies concerning genomic as well as transcriptomic sequences, and sequencing experiments should be conceived specifically for each type of application. The challenges posed by fundamental biological questions are usually addressed by first aligning or assembling the reads produced by new sequencing technologies. Assembly is the first step when a reference sequence is not available. Alignment of genomic reads to a known genome is fundamental, e.g., to find the differences among organisms of related species, and to detect mutations characteristic of the so-called "diseases of the genome". Alignment of transcriptomic reads against a reference genome makes it possible to detect the expressed genes as well as to annotate and quantify alternative transcripts. In this thesis we overview the approaches proposed in the literature for solving the above-mentioned problems. In particular, we analyze the sequence assembly problem in depth, with particular emphasis on genome reconstruction, both from a more theoretical point of view and in light of the characteristics of sequencing data produced by state-of-the-art technologies. We also review the main steps in a pipeline for the analysis of the transcriptome, that is, alignment, assembly, and transcript quantification, with particular emphasis on the opportunities given by RNA-Seq technologies in enhancing precision. The thesis is divided into two parts, the first one devoted to the study of local assembly methods for Next Generation Sequencing data, the second one concerning the development of tools for alignment of RNA-Seq reads and transcript quantification. The recurring theme is the use of paired reads in all fields of application discussed in this thesis. In particular, we emphasize the benefits of assembling inserts from paired reads in a wide range of applications, from de novo assembly to the analysis of RNA. The main contribution of this thesis lies in the introduction of innovative tools, based on well-studied heuristics fine-tuned on the data. Software is always tested to specifically assess the correctness of prediction. The aim is to produce robust methods that, having a low false-positive rate, produce a certified output characterized by high specificity.

Archivio istituzionale della ricerca - Università degli Studi di Udine

A base composition analysis of natural patterns for the preprocessing of metagenome sequences

Author: Ali Hesham
Bastola Dhundy Raj
Bonham-Carter Oliver
Publication venue: DigitalCommons@UNO
Publication date: 07/10/2012

Background: On the premise that sequence reads and contigs often exhibit the same kinds of base usage that is also observed in the sequences from which they are derived, we offer a base composition analysis tool. Our tool uses these natural patterns to determine relatedness across sequence data. We introduce spectrum sets (sets of motifs) which are permutations of bacterial restriction sites and the base composition analysis framework to measure their proportional content in sequence data. We suggest that this framework will increase the efficiency during the pre-processing stages of metagenome sequencing and assembly projects. Results: Our method is able to differentiate organisms and their reads or contigs. The framework shows how to successfully determine the relatedness between these reads or contigs by comparison of base composition. In particular, we show that two types of organismal-sequence data are fundamentally different by analyzing their spectrum set motif proportions (coverage). By the application of one of the four possible spectrum sets, encompassing all known restriction sites, we provide the evidence to claim that each set has a different ability to differentiate sequence data. Furthermore, we show that the spectrum set selection having relevance to one organism, but not to the others of the data set, will greatly improve performance of sequence differentiation even if the fragment size of the read, contig or sequence is not lengthy. Conclusions: We show the proof of concept of our method by its application to ten trials of two or three freshly selected sequence fragments (reads and contigs) for each experiment across the six organisms of our set. Here we describe a novel and computationally effective pre-processing step for metagenome sequencing and assembly tasks. Furthermore, our base composition method has applications in phylogeny where it can be used to infer evolutionary distances between organisms based on the notion that related organisms often have much conserved code.

Springer - Publisher Connector

PubMed Central

The University of Nebraska, Omaha

A base composition analysis of natural patterns for the preprocessing of metagenome sequences

Author: D Richter
D Zerbino
Dhundy Bastola
DR Zerbino
DW Bryant Jr
EMP Lamprea-Burgunder
F Hildebrand
Hesham Ali
J Lightfield
J Schmutz
JT Simpson
K Paszkiewicz
M Gelfand
M Hossain
O Bonham-Carter
Oliver Bonham-Carter
Q Peng
R Development Core Team
R Garg
R Hershberg
R Kolde
R Li
R Li
R Roberts
RC Edgar
S Schuster
SF Altschul
SW Kembel
V Kunin
W Zhang
WJ Kent
Y Ji
Y Wang
YW Wu
Publication venue: Springer Science and Business Media LLC

Crossref

De novo assembly of Euphorbia fischeriana root transcriptome identifies prostratin pathway related genes

Author: A DeLong
Brett Chapman
CJ Mau
D Hernandez
D Natarajan
Deyou Qiu
DR Zerbino
DW Bryant Jr
E Gerber
G Brooks
Gabriel Keeble-Gagnère
GW Qin
H Kasahara
H Li
H Li
HJ Brown
J Kirby
J Kulkosky
K Lagesen
KR Gustafson
L Li
M Hezareh
Matthew I Bellgard
MJ Chaisson
MR Wildung
N Marquez
Nan Zhang
PA Cox
PA Cox
Paula Moolhuijzen
Qi Tang
R Li
R Schmid
RJ Gulakowski
Roberto A Barrero
S Bocklandt
S Rozen
TM Lowe
W Li
Yanfang Yang
YB Wang
Z Szallasi
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Euphorbia fischeriana is an important medicinal plant found in Northeast China. The plant roots contain many medicinal compounds including 12-deoxyphorbol-13-acetate, commonly known as prostratin, which is a phorbol ester from the tigliane diterpene series. Prostratin is a protein kinase C activator and is effective in the treatment of Human Immunodeficiency Virus (HIV) by acting as a latent HIV activator. Latent HIV is currently the biggest limitation for viral eradication. The aim of this study was to sequence, assemble and annotate the E. fischeriana transcriptome to better understand the potential biochemical pathways leading to the synthesis of prostratin and other related diterpene compounds. Results: In this study we conducted a high throughput RNA-seq approach to sequence the root transcriptome of E. fischeriana. We assembled 18,180 transcripts, of which the majority encoded protein-coding genes and only 17 transcripts corresponded to known RNA genes. Interestingly, we identified 5,956 protein-coding transcripts with high similarity (>=75%) to Ricinus communis, a close relative of E. fischeriana. We also evaluated the conservation of E. fischeriana genes against EST datasets from the Euphorbiaceae family, which included R. communis, Hevea brasiliensis and Euphorbia esula. We identified a core set of 1,145 gene clusters conserved in all four species and 1,487 E. fischeriana paralogous genes. Furthermore, we screened E. fischeriana transcripts against an in-house reference database for genes implicated in the biosynthesis of upstream precursors to prostratin. This identified 24 and 9 candidate transcripts involved in the terpenoid and diterpenoid biosynthesis pathways, respectively. The majority of the candidate genes in these pathways presented relatively low expression levels except for 1-hydroxy-2-methyl-2-(E)-butenyl 4-diphosphate synthase (HDS) and isopentenyl diphosphate/dimethylallyl diphosphate synthase (IDS), which are required for multiple downstream pathways including synthesis of casbene, a proposed precursor to prostratin. Conclusion: The resources generated in this study provide new insights into the upstream pathways to the synthesis of prostratin and will likely facilitate functional studies aiming to produce larger quantities of this compound for HIV research and/or treatment of patients.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Queensland University of Technology ePrints Archive

Research Repository

espace@Curtin