BFAST: An Alignment Tool for Large Scale Genome Resequencing

Author: A Cox
B Langmead
B Ma
Barry Merriman
CA Hutchison 3rd
Chad Creighton
DR Bentley
DR Smith
F Sanger
H Li
H Li
L Ilie
M Margulies
N Homer
Nils Homer
R Li
RA Holt
SF Altschul
SM Rumble
SM Rumble
Stanley F. Nelson
TF Smith
WJ Kent
Y Sun
Z Ning
Publication venue: Public Library of Science
Publication date: 01/11/2009

BACKGROUND: The new generation of massively parallel DNA sequencers, combined with the challenge of whole human genome resequencing, results in the need for rapid and accurate alignment of billions of short DNA sequence reads to a large reference genome. Speed is obviously of great importance, but equally important is maintaining alignment accuracy of short reads, in the 25-100 base range, in the presence of errors and true biological variation. METHODOLOGY: We introduce a new algorithm specifically optimized for this task, as well as a freely available implementation, BFAST, which can align data produced by any of the current sequencing platforms, allows for user-customizable levels of speed and accuracy, supports paired end data, and provides for efficient parallel and multi-threaded computation on a computer cluster. The new method is based on creating flexible, efficient whole genome indexes to rapidly map reads to candidate alignment locations, with arbitrary multiple independent indexes allowed to achieve robustness against read errors and sequence variants. The final local alignment uses a Smith-Waterman method, with gaps to support the detection of small indels. CONCLUSIONS: We compare BFAST to a selection of large-scale alignment tools (BLAT, MAQ, SHRiMP, and SOAP) in terms of both speed and accuracy, using simulated and real-world datasets. We show BFAST can achieve substantially greater sensitivity of alignment in the context of errors and true variants, especially insertions and deletions, and minimize false mappings, while maintaining adequate speed compared to other current methods. We show BFAST can align the amount of data needed to fully resequence a human genome, one billion reads, with high sensitivity and accuracy, on a modest computer cluster in less than 24 hours. BFAST is available at http://bfast.sourceforge.net
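The indexing idea in this abstract (look up parts of a read in a genome index to get candidate alignment locations, then verify them with local alignment) can be illustrated with a minimal sketch. It assumes a plain hash table of contiguous k-mers, which is a simplification and not BFAST's actual index layout; the function names `build_index` and `candidate_locations` and the toy sequences are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_locations(read, index, k):
    """Return reference positions where the read could start,
    as implied by exact k-mer hits anywhere in the read."""
    candidates = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            start = pos - offset
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)


# Toy data (hypothetical): the read is reference[8:18] with one mismatch.
reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_index(reference, 4)
hits = candidate_locations("TAGCCGTTAG", index, 4)
print(hits)
```

The read still yields its true origin (position 8) as a candidate despite the mismatch, because its first three k-mers are unaffected. It also yields a spurious candidate, which is why a final local alignment step is needed to score each location.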

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

ParMap, an algorithm for the identification of small genomic insertions and deletions in nextgen sequencing data

Author: A Gnirke
Adolfo A Ferrando
AV Dalca
Hossein Khiabanian
J Shendure
JD McPherson
KJ McKernan
N Homer
P Medvedev
P Van Vlierberghe
Pieter Van Vlierberghe
Raul Rabadan
RM Kuhn
SM Rumble
Teresa Palomero
Publication venue: BioMed Central
Publication date: 01/05/2010

Background: Next-generation sequencing produces high-throughput data, albeit with greater error and shorter reads than traditional Sanger sequencing methods. This complicates the detection of genomic variations, especially small insertions and deletions. Findings: Here we describe ParMap, a statistical algorithm for the identification of complex genetic variants, such as small insertions and deletions, using partially mapped reads in nextgen sequencing data. Conclusions: We report ParMap's successful application to the mutation analysis of chromosome X exome-captured leukemia DNA samples.

Crossref

Directory of Open Access Journals

PubMed Central

Local alignment of generalized k-base encoded DNA sequence

Author: ABI
ABI
Barry Merriman
D Smith
DJ Lipman
H Li
MJ Clark
N Homer
N Homer
Nils Homer
O Gotoh
R Hamming
S Needleman
SM Rumble
Stanley F Nelson
T Smith
W Kent
WR Pearson
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: DNA sequence comparison is a well-studied problem, in which two DNA sequences are compared using a weighted edit distance. Recent DNA sequencing technologies, however, observe an encoded form of the sequence, rather than each DNA base individually. The encoded DNA sequence may contain technical errors, and therefore encoded sequencing errors must be incorporated when comparing an encoded DNA sequence to a reference DNA sequence. Results: Although two-base encoding is currently used in practice, many other encoding schemes are possible, whereby two or more bases are encoded at a time. A generalized k-base encoding scheme is presented, whereby feasible higher order encodings are better able to differentiate errors in the encoded sequence from true DNA sequence variants. A generalized version of the previous two-base encoding DNA sequence comparison algorithm is used to compare a k-base encoded sequence to a DNA reference sequence. Finally, simulations are performed to evaluate the power, the false positive and false negative SNP discovery rates, and the performance time of k-base encoding compared to previous methods as well as to the standard DNA sequence comparison algorithm. Conclusions: The novel generalized k-base encoding scheme and resulting local alignment algorithm permit the development of higher fidelity ligation-based next generation sequencing technology. This bioinformatic solution affords greater robustness to errors, as well as lower false SNP discovery rates, only at the cost of computational time.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Low-Bandwidth and Non-Compute Intensive Remote Identification of Microbes from Raw Sequencing Reads

Author: AJ Cox
B Langmead
B Rost
D Devos
H Li
H Li
J Shendure
Laurent Gautier
MS Lindner
N Rusk
Ole Lund
S Hoffimann
SF Altschul
SM Rumble
T Smith
Tim J. Hubbard
WJ Kent
Z Ning
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2013

Cheap high-throughput DNA sequencing may soon become routine not only for human genomes but also for practically anything requiring the identification of living organisms from their DNA: tracking of infectious agents, control of food products, bioreactors, or environmental samples. We propose a novel general approach to the analysis of sequencing data in which the reference genome does not have to be specified. Using a distributed architecture we are able to query a remote server for hints about what the reference might be, transferring a relatively small amount of data, and the hints can be used for more computationally demanding work. Our system consists of a server with known reference DNA indexed, and a client with raw sequencing reads. The client sends a sample of unidentified reads, and in return receives a list of matching references known to the server. Sequences for the references can be retrieved and used for exhaustive computation on the reads, such as alignment. To demonstrate this approach we have implemented a web server, indexing tens of thousands of publicly available genomes and genomic regions from various organisms and returning lists of matching hits from query sequencing reads. We have also implemented two clients, one of them running in a web browser, in order to demonstrate that gigabytes of raw sequencing reads of unknown origin could be identified without the need to transfer a very large volume of data, and on modestly powered computing devices. Web access is available at http://tapir.cbs.dtu.dk. The source code for a Python command-line client, a server, and supplementary data is available at http://bit.ly/1aURxkc

arXiv.org e-Print Archive

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Online Research Database In Technology

Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation

Author: A Szalkowski
A Wirawan
A Wozniak
Intel Corporation
ITS Li
JR Miller
M Farrar
O Gotoh
S Henikoff
SF Altschul
SF Altschul
SM Rumble
T Rognes
TF Smith
Torbjørn Rognes
UniProt Consortium
W Rudnicki
Y Liu
Y Liu
Ł Ligowski
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation. Results: A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License. Conclusions: Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance.
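For orientation, the "cell updates" counted in this abstract are evaluations of the Smith-Waterman recurrence, one per pair of query and database residues. A minimal scalar sketch of that recurrence follows. It is not SWIPE's SIMD implementation, and it assumes a simple match/mismatch score with a linear gap penalty rather than a BLOSUM matrix with affine gaps; the function name and the default scores are hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Only two rows of the dynamic programming matrix are kept, since
    each cell depends on its left, upper, and upper-left neighbours.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # One "cell update": clamp at 0 so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GGACGTCC", "TTACGTAA"))
```

The inner loop performs len(a) × len(b) cell updates, which is the quantity that SIMD approaches compute many at a time.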

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

NORA - Norwegian Open Research Archives

Local alignment of two-base encoded DNA sequence

Author: A Izmailov
A Izmailov
B Ewing
B Ewing
B Ma
Barry Merriman
DR Powell
DR Smith
DS Hirschberg
EW Myers
H Li
N Jones
Nils Homer
O Gotoh
R Hamming
R Li
S Levy
SB Needleman
SF Altschul
SM Rumble
ST Sherry
Stanley F Nelson
TF Smith
VI Levenshtein
W Ewans
WJ Kent
X Huang
Z Ning
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: DNA sequence comparison is based on optimal local alignment of two sequences using a similarity score. However, some new DNA sequencing technologies do not directly measure the base sequence, but rather an encoded form, such as the two-base encoding considered here. In order to compare such data to a reference sequence, the data must be decoded into sequence. The decoding is deterministic, but the possibility of measurement errors requires searching among all possible error modes and resulting alignments to achieve an optimal balance of fewer errors versus greater sequence similarity. Results: We present an extension of the standard dynamic programming method for local alignment, which simultaneously decodes the data and performs the alignment, maximizing a similarity score based on a weighted combination of errors and edits, and allowing an affine gap penalty. We also present simulations that demonstrate the performance characteristics of our two-base encoded alignment method and contrast those with standard DNA sequence alignment under the same conditions. Conclusion: The new local alignment algorithm for two-base encoded data has substantial power to properly detect and correct measurement errors while identifying underlying sequence variants, and facilitating genome re-sequencing efforts based on this form of sequence data.
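To make the decoding problem concrete, here is a minimal sketch of two-base encoding. It assumes the commonly described SOLiD colour code, in which bases are numbered A=0, C=1, G=2, T=3 and each colour is the XOR of two adjacent bases. The abstract does not spell out the mapping, so treat it as an assumption; the function names are hypothetical.

```python
BASES = "ACGT"  # assumed numbering: A=0, C=1, G=2, T=3


def encode(seq):
    """Encode a DNA string as colours, one per adjacent base pair."""
    codes = [BASES.index(base) for base in seq]
    return [x ^ y for x, y in zip(codes, codes[1:])]


def decode(first_base, colours):
    """Deterministically decode colours, given the known first base."""
    code = BASES.index(first_base)
    decoded = [first_base]
    for colour in colours:
        code ^= colour
        decoded.append(BASES[code])
    return "".join(decoded)


colours = encode("ACGT")           # [1, 3, 1]
print(decode("A", colours))        # round-trips to ACGT
print(encode("ACTT"))              # a SNP changes two adjacent colours
print(decode("A", [1, 0, 1]))      # one colour error corrupts all later bases
```

Under this code a true single-base variant changes two adjacent colours, while a single measurement error changes one colour and corrupts every base decoded after it. This is why decoding first and aligning second is fragile, and why the abstract describes decoding and aligning simultaneously.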

Crossref

Springer

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

Author: B Langmead
Christian Otto
Cynthia M. Sharma
David B. Searls
G Myers
H Li
H Li
H Lin
JC Dohm
JM Rothberg
Jörg Hackermüller
Jörg Vogel
K Prüfer
M Crochemore
MI Abouelhoda
P Ferragina
Peter F. Stadler
Philipp Khaitovich
R Li
S Bennett
S Huse
S Karlin
SM Rumble
Stefan Kurtz
Steve Hoffmann
W Chang
Publication venue: Public Library of Science
Publication date: 01/01/2009

With few exceptions, current methods for short read mapping make use of simple seed heuristics to speed up the search. Most of the underlying matching models neglect the necessity to allow not only mismatches, but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods. While the most frequent error type in Illumina reads is the mismatch, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching, in particular of short reads with diverse errors, is therefore a pressing practical problem. We introduce a matching model for short reads that can, besides mismatches, also cope with indels. It addresses different error models. For example, it can handle the problem of leading and trailing contaminations caused by primers and poly-A tails in transcriptomics or the length-dependent increase of error rates. In these contexts, it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short read mapping, the presented approach shows significantly increased performance not only for 454 reads, but also for Illumina reads. Our approach is implemented in the software segemehl available at http://www.bioinf.uni-leipzig.de/Software/segemehl/

Public Library of Science (PLOS)

Crossref

Fraunhofer-ePrints

Directory of Open Access Journals

PubMed Central

LOCAS – A Low Coverage Assembly Tool for Resequencing Projects

Author: A Doring
AR Quinlan
B Langmead
C Nusbaum
D Hernandez
D Weigel
Daniel H. Huson
DC Richter
Detlef Weigel
DR Zerbino
EW Myers
H Li
H Li
I Birol
JD Kececioglu
JO Korbel
JT Simpson
Juliane D. Klein
K Schneeberger
K Schneeberger
Korbinian Schneeberger
LE Palmer
M Pop
M Pop
MC Wendl
MJ Chaisson
PA Pevzner
R Li
R Li
RM Durbin
S Ossowski
SL Salzberg
SM Rumble
SQ Le
Stephan Ossowski
T Rausch
Ying Xu
Publication venue: Public Library of Science
Publication date: 01/01/2011

Motivation: Next Generation Sequencing (NGS) is a frequently applied approach to detect sequence variations between highly related genomes. Recent large-scale re-sequencing studies such as the Human 1000 Genomes Project utilize NGS data of low coverage to afford sequencing of hundreds of individuals. Here, SNPs and micro-indels can be detected by applying an alignment-consensus approach. However, computational methods capable of discovering other variations such as novel insertions or highly diverged sequence from low coverage NGS data are still lacking. Results: We present LOCAS, a new NGS assembler particularly designed for low coverage assembly of eukaryotic genomes using a mismatch sensitive overlap-layout-consensus approach. LOCAS assembles homologous regions in a homology-guided manner while it performs de novo assemblies of insertions and highly polymorphic target regions subsequent to an alignment-consensus approach. LOCAS has been evaluated in homology-guided assembly scenarios with low sequence coverage of Arabidopsis thaliana strains sequenced as part of the Arabidopsis 1001 Genomes Project. While assembling the same amount of long insertions as state-of-the-art NGS assemblers, LOCAS showed best results regarding contig size, error rate and runtime. Conclusion: LOCAS produces excellent results for homology-guided assembly of eukaryotic genomes with short reads and low sequencing depth, and therefore appears to be the assembly tool of choice for the detection of novel sequenc

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

MPG.PuRe

ScholarBank@NUS

Optimizing a Massive Parallel Sequencing Workflow for Quantitative miRNA Expression Analysis

Author: A Califano
AD Jayaprakash
AK Emde
Anna Tramontano
B Pasaniuc
C Della Beffa
CE Metz
D Smedley
D Weese
Francesca Cordero
GK Smyth
H Willenbrock
JH Bullard
K Prufer
KR Rasmussen
L Wang
LJ Zhu
M Farrar
M Hackenberg
M Morgan
Maddalena Arigoni
Marco Beccuti
MD Robinson
MD Robinson
MR Friedlander
R Breitling
R Ronen
R Sanges
Raffaele A. Calogero
S Alon
S Anders
S Griffiths-Jones
S Moxon
SM Rumble
Susanna Donatelli
TJ Hardcastle
V Ambros
VG Tusher
W Zheng
WC Wang
Publication venue: Public Library of Science
Publication date: 01/01/2011

BACKGROUND: Massive Parallel Sequencing methods (MPS) can extend and improve the knowledge obtained by conventional microarray technology, both for mRNAs and short non-coding RNAs, e.g. miRNAs. The processing methods used to extract and interpret the information are an important aspect of dealing with the vast amounts of data generated from short read sequencing. Although the number of computational tools for MPS data analysis is constantly growing, their strengths and weaknesses as part of a complex analytical pipeline have not yet been well investigated. PRIMARY FINDINGS: A benchmark MPS miRNA dataset, resembling a situation in which miRNAs are spiked in biological replication experiments, was assembled by merging a publicly available MPS spike-in miRNA data set with MPS data derived from healthy donor peripheral blood mononuclear cells. Using this data set we observed that short read counts are strongly underestimated for duplicated miRNAs if the whole genome is used as reference. Furthermore, the sensitivity of miRNA detection is strongly dependent on the primary tool used in the analysis. Among the six aligners tested, specifically devoted to miRNA detection, SHRiMP and MicroRazerS show the highest sensitivity. Differential expression estimation is quite efficient. Among the five tools investigated, two (DESeq, baySeq) show very good specificity and sensitivity in the detection of differential expression. CONCLUSIONS: The results provided by our analysis allow the definition of a clear, simple and optimized analytical workflow for digital quantitative analysis of miRNAs.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Institutional Research Information System University of Turin

FigShare

Efficient Double Fragmentation ChIP-seq Provides Nucleotide Resolution Protein-DNA Binding Profiles

Author: A Siepel
A Valouev
Bähler Jürg
CL Wei
DS Johnson
ED Wederell
Edwin Cuppen
Ewart de Bruijn
FJ van Werven
G Robertson
H Ji
H Li
H Santos-Rosa
Hans Clevers
J Behrens
Jan Koster
JS Carroll
Jurian Schuijers
L Teytelman
M Molenaar
M van de Wetering
M van de Wetering
M van de Wetering
Marc van de Wetering
Michal Mokry
MJ Fullwood
P Hatzis
Pantelis Hatzis
PJ Park
PV Kharchenko
R Jothi
Rogier Versteeg
SE Halford
SM Rumble
Victor Guryev
Publication venue: Public Library of Science
Publication date: 01/01/2010

Immunoprecipitated crosslinked protein-DNA fragments typically range in size from several hundred to several thousand base pairs, with a significant part of chromatin being much longer than the optimal length for next-generation sequencing (NGS) procedures. Because these larger fragments may be non-random and represent relevant biology that may otherwise be missed, but also because they represent a significant fraction of the immunoprecipitated material, we designed a double-fragmentation ChIP-seq procedure. After conventional crosslinking and immunoprecipitation, chromatin is de-crosslinked and sheared a second time to concentrate fragments in the optimal size range for NGS. Besides the benefits of increased chromatin yields, the procedure also eliminates a laborious size-selection step. We show that the double-fragmentation ChIP-seq approach allows for the generation of biologically relevant genome-wide protein-DNA binding profiles from sub-nanogram amounts of TCF7L2/TCF4, TBP and H3K4me3 immunoprecipitated material. Although optimized for the AB/SOLiD platform, the same approach may be applied to other platforms

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

KNAW Repository

UvA-DARE

International Migration, Integration and Social Cohesion online publications