
    Using quality scores and longer reads improves accuracy of Solexa read mapping

    Background: Second-generation sequencing has the potential to revolutionize genomics and impact all areas of biomedical science. New technologies will make re-sequencing widely available for such applications as identifying genome variations or interrogating the oligonucleotide content of a large sample (e.g. ChIP-sequencing). The increase in speed, sensitivity and availability of sequencing technology brings demand for advances in computational technology to perform associated analysis tasks. The Solexa/Illumina 1G sequencer can produce tens of millions of reads, ranging in length from ~25–50 nt, in a single experiment. Accurately mapping the reads back to a reference genome is a critical task in almost all applications. Two sources of information that are often ignored when mapping reads from the Solexa technology are the 3' ends of longer reads, which contain a much higher frequency of sequencing errors, and the base-call quality scores. Results: To investigate whether these sources of information can be used to improve accuracy when mapping reads, we developed the RMAP tool, which can map reads having a wide range of lengths and allows base-call quality scores to determine which positions in each read are more important when mapping. We applied RMAP to analyze data re-sequenced from two human BAC regions for varying read lengths and varying criteria for use of quality scores. RMAP is freely available for download at http://rulai.cshl.edu/rmap/. Conclusion: Our results indicate that significant gains in Solexa read mapping performance can be achieved by considering the information in the 3' ends of longer reads and appropriately using the base-call quality scores. The RMAP tool we have developed will enable researchers to effectively exploit this information in targeted re-sequencing projects.
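
    As a rough illustration of the mapping idea described above, the Python sketch below counts only high-quality mismatches when comparing a read against a genome position, so that error-prone bases (such as those near the 3' end) cannot veto an otherwise good hit. This is a minimal sketch under assumed parameters (the quality cutoff and mismatch limit are illustrative), not RMAP's actual algorithm.

        # Minimal sketch of quality-aware read mapping; the cutoff values are
        # illustrative assumptions, not RMAP's actual parameters.

        def quality_weighted_mismatches(read, quals, genome, pos, qual_cutoff=20):
            """Count mismatches against genome[pos:pos+len(read)], ignoring
            positions whose base-call quality falls below qual_cutoff."""
            mismatches = 0
            for i, base in enumerate(read):
                if quals[i] < qual_cutoff:
                    continue  # low-confidence base: do not count it against the hit
                if base != genome[pos + i]:
                    mismatches += 1
            return mismatches

        def map_read(read, quals, genome, max_mismatches=2):
            """Return all positions where the read maps under the
            quality-weighted criterion (naive O(n*m) scan for clarity)."""
            return [pos for pos in range(len(genome) - len(read) + 1)
                    if quality_weighted_mismatches(read, quals, genome, pos)
                    <= max_mismatches]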

    Probabilistic base calling of Solexa sequencing data

    BACKGROUND: Solexa/Illumina short-read ultra-high-throughput DNA sequencing technology produces millions of short tags (up to 36 bases) by parallel sequencing-by-synthesis of DNA colonies. The processing and statistical analysis of such high-throughput data pose new challenges; currently, a fair proportion of the tags are routinely discarded because they cannot be matched to a reference sequence, reducing the effective throughput of the technology. RESULTS: We propose a novel base-calling algorithm that uses model-based clustering and probability theory to identify ambiguous bases and code them with IUPAC symbols. We also select optimal sub-tags using a score based on information content to remove uncertain bases towards the ends of the reads. CONCLUSION: We show that the method improves genome coverage and the number of usable tags by an average of 15% compared with Solexa's data processing pipeline. An R package is provided which allows fast and accurate base calling of Solexa's fluorescence intensity files and the production of informative diagnostic plots.
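
    The core idea of coding ambiguous bases with IUPAC symbols can be sketched as follows (in Python rather than the paper's R). This is a minimal illustration assuming per-base probabilities from an upstream step and an arbitrary 0.9 confidence cutoff; it is not the authors' implementation.

        # Illustrative sketch: calling an IUPAC symbol from per-base probabilities.
        # The 0.9 confidence cutoff is an assumed parameter, not from the paper.

        IUPAC = {
            frozenset("A"): "A", frozenset("C"): "C",
            frozenset("G"): "G", frozenset("T"): "T",
            frozenset("AG"): "R", frozenset("CT"): "Y",
            frozenset("CG"): "S", frozenset("AT"): "W",
            frozenset("GT"): "K", frozenset("AC"): "M",
            frozenset("CGT"): "B", frozenset("AGT"): "D",
            frozenset("ACT"): "H", frozenset("ACG"): "V",
            frozenset("ACGT"): "N",
        }

        def call_base(probs, confidence=0.9):
            """Given probs, a dict base -> probability summing to 1, return the
            smallest set of bases whose cumulative probability reaches the
            confidence cutoff, encoded as an IUPAC symbol."""
            ranked = sorted(probs, key=probs.get, reverse=True)
            chosen, total = set(), 0.0
            for base in ranked:
                chosen.add(base)
                total += probs[base]
                if total >= confidence:
                    break
            return IUPAC[frozenset(chosen)]

        # Example: an ambiguous A/G position is coded R rather than forced to a call.
        print(call_base({"A": 0.55, "G": 0.40, "C": 0.03, "T": 0.02}))  # -> "R"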

    Evaluation of next-generation sequencing software in mapping and assembly

    Next-generation high-throughput DNA sequencing technologies have progressively advanced sequence-based genomic research and novel biological applications, with the promise of sequencing DNA at unprecedented speed. These new non-Sanger-based technologies offer several advantages over traditional sequencing methods, including higher sequencing speed, lower per-run cost and higher accuracy. However, reads from next-generation sequencing (NGS) platforms, such as 454/Roche, ABI/SOLiD and Illumina/Solexa, are usually short, restricting the applications of NGS platforms in genome assembly and annotation. We present an overview of the challenges these novel technologies face and illustrate various bioinformatics approaches to mapping and assembly. We then compare the performance of several programs in these two fields, and provide advice on selecting suitable tools for specific biological applications.

    BOAT: Basic Oligonucleotide Alignment Tool

    Background: Next-generation DNA sequencing technologies generate tens of millions of sequencing reads in one run. These technologies are now widely used in biology research, such as in genome-wide identification of polymorphisms, transcription factor binding sites, methylation states, and transcript expression profiles. Mapping the sequencing reads to reference genomes efficiently and effectively is one of the most critical analysis tasks. Although several tools have been developed, their performance suffers when multiple substitutions and insertions/deletions (indels) occur together. Results: We report a new algorithm, Basic Oligonucleotide Alignment Tool (BOAT), that can accurately and efficiently map sequencing reads back to the reference genome. BOAT can handle several substitutions and indels simultaneously, a useful feature for identifying SNPs and other genomic structural variations in functional genomic studies. For better handling of low-quality reads, BOAT supports a "3'-end Trimming Mode" that builds locally optimized alignments for sequencing reads, further improving sensitivity. BOAT calculates an E-value for each hit as a quality assessment and provides customizable post-mapping filters for further mapping quality control. Conclusion: Evaluations on both real and simulated datasets suggest that BOAT can map large volumes of short reads to reference sequences with better sensitivity and lower memory requirements than other existing algorithms. The source code and pre-compiled binary packages of BOAT are publicly available for download at http://boat.cbi.pku.edu.cn under the GNU Public License (GPL). BOAT can be a useful new tool for functional genomics studies.
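
    As a sketch of the general idea behind trimming low-quality 3' ends (a common heuristic in this spirit, not necessarily BOAT's exact procedure), the Python function below cuts a read at the point that maximizes a running quality deficit accumulated from the 3' end; the quality cutoff of 20 is an assumed parameter.

        # Illustrative sketch of quality-based 3'-end trimming; the cutoff is an
        # assumption and this is not BOAT's actual algorithm.

        def trim_3prime(read, quals, cutoff=20):
            """Scan from the 3' end, accumulating (cutoff - quality); cut at the
            position where the running sum peaks, so a low-quality tail is
            removed while high-quality bases are kept."""
            best_score, score, cut = 0, 0, len(read)
            for i in range(len(read) - 1, -1, -1):
                score += cutoff - quals[i]
                if score > best_score:
                    best_score, cut = score, i
            return read[:cut], quals[:cut]

        # Example: the noisy tail (qualities 8, 5, 3) is trimmed away.
        read = "ACGTACGTAC"
        quals = [35, 34, 33, 36, 32, 30, 28, 8, 5, 3]
        print(trim_3prime(read, quals))  # -> ('ACGTACG', [35, 34, 33, 36, 32, 30, 28])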

    High quality draft sequences for prokaryotic genomes using a mix of new sequencing technologies

    Background: Massively parallel DNA sequencing instruments are enabling the decoding of whole genomes at significantly lower cost and higher throughput than classical Sanger technology. Each of these technologies has been estimated to yield assemblies with more problematic features than the standard method, and these problems differ in nature depending on the technique used. An appropriate mix of technologies may therefore help resolve most difficulties and eventually provide assemblies of high quality without requiring any Sanger-based input. Results: We compared assemblies obtained using Sanger data with those built from different combinations of new sequencing technologies, systematically evaluating each against a reference finished sequence. We found that the 454 GSFLX can efficiently produce highly contiguous assemblies when used at high coverage. The potential to enhance contiguity by scaffolding was tested using 454 sequences from circularized genomic fragments. Finally, we explored the use of Solexa/Illumina short reads to polish the genome draft by implementing a technique to correct 454 consensus errors. Conclusion: High-quality drafts can be produced for small genomes without any Sanger data input. We found that 454 GSFLX and Solexa/Illumina show great complementarity, producing large contigs and supercontigs with a low error rate.
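
    The final polishing step, using short Illumina reads to correct 454 consensus errors, can be illustrated with a generic pileup majority vote. The depth and fraction thresholds below are assumptions, and real correction of 454 data must also handle insertions/deletions at homopolymers, which this sketch ignores; it is not the paper's exact technique.

        # Illustrative sketch: polishing a draft consensus with short-read pileups
        # via majority vote; thresholds are assumed, indels are not handled.

        from collections import Counter

        def polish_consensus(consensus, alignments, min_depth=5, min_fraction=0.8):
            """alignments: list of (start, read) tuples giving gap-free placements
            of short reads on the consensus. Replace a consensus base when a clear
            majority of covering reads (>= min_depth, >= min_fraction) disagree."""
            pileup = [Counter() for _ in consensus]
            for start, read in alignments:
                for offset, base in enumerate(read):
                    pileup[start + offset][base] += 1

            polished = list(consensus)
            for i, counts in enumerate(pileup):
                depth = sum(counts.values())
                if depth < min_depth:
                    continue  # too little evidence to overrule the draft
                base, votes = counts.most_common(1)[0]
                if base != polished[i] and votes / depth >= min_fraction:
                    polished[i] = base
            return "".join(polished)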

    A new strategy for better genome assembly from very short reads

    Background: With the rapid development of next-generation sequencing (NGS) technology, large quantities of genome sequencing data have been generated. Because of repetitive regions of genomes and other factors, assembly from very short reads remains a challenging problem. Results: We propose a novel strategy for improving genome assembly from very short reads. It increases the accuracy of assemblies by integrating de novo contigs, and produces comparative contigs by allowing multiple references without being limited to genomes of closely related strains. The comparative contigs are then used to scaffold the de novo contigs. Using simulated and real datasets, we show that our strategy effectively improves the quality of assemblies of isolated microbial genomes and metagenomes. Conclusions: As more reference genomes become available, our strategy will be useful for improving genome assemblies from very short reads. Scripts implementing the strategy are available at http://code.google.com/p/cd-hybrid/.
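
    The scaffolding step, using comparative contigs to order de novo contigs, might be sketched as follows. The fixed 100-N gap and the name-to-position inputs are illustrative assumptions, and real scaffolding must also handle contig orientation and overlaps; this is not the authors' pipeline.

        # Illustrative sketch: ordering de novo contigs along a comparative contig.
        # A minimal stand-in for scaffolding; gap size and inputs are assumed.

        def scaffold_by_reference(denovo_contigs, ref_positions, gap="N" * 100):
            """denovo_contigs: dict name -> sequence. ref_positions: dict name ->
            alignment start on the comparative contig. Concatenate contigs in
            reference order, separated by a fixed run of Ns as an assumed gap."""
            ordered = sorted(ref_positions, key=ref_positions.get)
            return gap.join(denovo_contigs[name] for name in ordered)

        # Example with three hypothetical contigs placed by comparative alignment.
        contigs = {"c1": "ATGCC", "c2": "GGATT", "c3": "TTACG"}
        positions = {"c2": 150, "c1": 10, "c3": 900}
        print(scaffold_by_reference(contigs, positions))  # c1, then c2, then c3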

    Fast Identification and Removal of Sequence Contamination from Genomic and Metagenomic Datasets

    High-throughput sequencing technologies have strongly impacted microbiology, providing a rapid and cost-effective way of generating draft genomes and exploring microbial diversity. However, sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Such sequence contamination is a serious concern for the quality of data used in downstream analysis, causing misassembly of sequence contigs and erroneous conclusions. The removal of sequence contaminants is therefore a necessary step for all sequencing projects. We developed DeconSeq, a robust framework for the rapid, automated identification and removal of sequence contamination in longer-read datasets (150 bp mean read length). DeconSeq is publicly available in standalone and web-based versions. The results can be exported for subsequent analysis, and the databases used for the web-based version are automatically updated on a regular basis. DeconSeq categorizes possible contamination sequences, eliminates redundant hits with higher similarity to non-contaminant genomes, and provides graphical visualizations of the alignment results and classifications. Using DeconSeq, we analyzed possible human DNA contamination in 202 previously published microbial and viral metagenomes and found possible contamination in 145 (72%) of them, with as much as 64% contaminating sequence. This new framework allows scientists to automatically detect and efficiently remove unwanted sequence contamination from their datasets while eliminating critical limitations of current methods. DeconSeq's web interface is simple and user-friendly. The standalone version allows offline analysis and integration into existing data processing pipelines. DeconSeq's results reveal whether the sequencing experiment has succeeded, whether the correct sample was sequenced, and whether the sample contains any sequence contamination from DNA preparation or the host. In addition, the analysis of 202 metagenomes demonstrated significant contamination of non-human-associated metagenomes, suggesting that this method is appropriate for screening all metagenomes. DeconSeq is available at http://deconseq.sourceforge.net/.
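
    The decision rule described above, removing reads that hit a contaminant database unless a redundant hit shows higher similarity to a non-contaminant genome, might look like the Python sketch below. The Hit type and the identity/coverage thresholds are illustrative assumptions, and the alignment step itself is abstracted away; this is not DeconSeq's actual implementation.

        # Illustrative sketch of a contamination decision rule in the spirit of
        # DeconSeq; the Hit type and thresholds are assumptions.

        from dataclasses import dataclass

        @dataclass
        class Hit:
            identity: float  # percent identity of the best alignment
            coverage: float  # percent of the read covered by the alignment

        def is_contaminant(contam_hit, retain_hit,
                           min_identity=94.0, min_coverage=90.0):
            """Flag a read as contamination if it aligns well to the contaminant
            database, unless it aligns even better to a genome we want to keep."""
            if contam_hit is None:
                return False
            good = (contam_hit.identity >= min_identity
                    and contam_hit.coverage >= min_coverage)
            if not good:
                return False
            if retain_hit is not None and retain_hit.identity > contam_hit.identity:
                return False  # redundant hit with higher similarity to a retain genome
            return True

        # Example: a read matching the contaminant at 99% identity, no better
        # hit in the retain database.
        print(is_contaminant(Hit(99.0, 100.0), None))  # -> True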

    Deep Sequencing Data Analysis: Challenges and Solutions
