821 research outputs found

BARCRAWL and BARTAB: software tools for the design and implementation of barcoded primers for highly multiplexed DNA sequencing

Author: DA Peterson
Daniel N Frank
DJ Lane
DN Frank
ER Mardis
M Hamady
M Margulies
M Meyer
P Parameswaran
R Team DC
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Advances in automated DNA sequencing technology have greatly increased the scale of genomic and metagenomic studies. An increasingly popular means of increasing project throughput is to multiplex samples during the sequencing phase. This can be achieved by covalently linking short, unique "barcode" DNA segments to genomic DNA samples, for instance through incorporation of barcode sequences in PCR primers. Although several strategies have been described to ensure that barcode sequences are unique and robust to sequencing errors, these have not been integrated into the overall primer design process, thus potentially introducing bias into PCR amplification and/or sequencing steps. Results: Barcrawl is a software program that facilitates the design of barcoded primers for multiplexed high-throughput sequencing. The program bartab can be used to deconvolute DNA sequence datasets produced by the use of multiple barcoded primers. This paper describes the functions implemented by barcrawl and bartab and presents a proof-of-concept case study of both programs in which barcoded rRNA primers were designed and validated by high-throughput sequencing. Conclusion: Barcrawl and bartab can benefit researchers who are engaged in metagenomic projects that employ multiplexed specimen processing. The source code is released under the GNU General Public License and can be accessed at http://www.phyloware.com.
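The deconvolution step mentioned in this abstract can be illustrated with a minimal sketch. It is not bartab's actual algorithm or interface: the function names, the fixed-length 5' barcode layout, and the mismatch tolerance are all assumptions made for illustration.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def demultiplex(reads, barcodes, max_mismatches=0):
    """Bin reads by the barcode found at their 5' end (hypothetical layout).

    reads:     iterable of sequence strings
    barcodes:  dict mapping barcode sequence -> sample name; all barcodes
               are assumed to have the same length
    Returns a dict mapping sample name -> list of reads with the barcode
    removed. Reads with no match, or with an ambiguous match, are kept
    unmodified under the key None.
    """
    lengths = {len(b) for b in barcodes}
    if len(lengths) != 1:
        raise ValueError("all barcodes must have the same length")
    n = lengths.pop()

    bins = defaultdict(list)
    for read in reads:
        prefix = read[:n]
        # Collect every sample whose barcode is within the tolerance.
        hits = [
            sample
            for barcode, sample in barcodes.items()
            if len(prefix) == n and hamming(prefix, barcode) <= max_mismatches
        ]
        if len(hits) == 1:
            bins[hits[0]].append(read[n:])
        else:
            bins[None].append(read)
    return dict(bins)
```

Allowing one mismatch only makes sense when every pair of barcodes differs in at least three positions; otherwise a single sequencing error could move a read into the wrong bin or make the assignment ambiguous.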

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Investigation into the annotation of protocol sequencing steps in the sequence read archive

Author: A Brazma
A Seguin-Orlando
ER Mardis
F Meacham
I Kozarewa
J Housby
J Orlowski
JA Sikorsky
JC Dohm
JH Eastberg
JR Miller
KD Hansen
M Allhoff
MA Quail
MG Ross
ML Metzker
MS Cheung
N Kamps-Hughes
P Keohavong
R Edgar
R Leinonen
S Spitaleri
SG Acinas
SL Schwartz
T Nakazato
X Jiao
YC Chen
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

BACKGROUND: The workflow for the production of high-throughput sequencing data from nucleic acid samples is complex. A series of protocol steps must be followed in the preparation of samples for next-generation sequencing. The quantification of bias in a number of protocol steps, namely DNA fractionation, blunting, phosphorylation, adapter ligation and library enrichment, remains to be determined. RESULTS: We examined the experimental metadata of the public repository Sequence Read Archive (SRA) in order to ascertain the level of annotation of important sequencing steps in submissions to the database. Using SQL relational database queries (on the SRAdb SQLite database generated by the Bioconductor consortium) to search for keywords commonly occurring in key preparatory protocol steps partitioned over studies, we found that 7.10%, 5.84% and 7.57% of all records (fragmentation, ligation and enrichment, respectively) had at least one keyword corresponding to one of the three protocol steps. Only 4.06% of all records, partitioned over studies, had keywords for all three steps in the protocol (5.58% of all SRA records). CONCLUSIONS: The current level of annotation in the SRA inhibits systematic studies of bias due to these protocol steps. Downstream from this, meta-analyses and comparative studies based on these data will have a source of bias that cannot be quantified at present.
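The keyword search described in this abstract can be sketched as follows. This is an illustration only: the `records` table, its `protocol` column, and the keyword stems are hypothetical stand-ins, not the real SRAdb schema or the keyword list used in the study.

```python
import sqlite3

# Hypothetical keyword stems for each protocol step (assumed for illustration).
KEYWORDS = {
    "fragmentation": ["fragment", "shear", "sonicat", "nebuliz"],
    "ligation": ["ligat"],
    "enrichment": ["enrich", "pcr", "amplif"],
}


def like_clause(words):
    """Build an OR-ed, case-insensitive LIKE clause and its parameters."""
    clause = " OR ".join("LOWER(protocol) LIKE ?" for _ in words)
    return f"({clause})", [f"%{w}%" for w in words]


def annotation_rates(conn):
    """Return the percentage of records mentioning each step, and all three."""
    total = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
    if total == 0:
        return {}
    rates, clauses, all_params = {}, [], []
    for step, words in KEYWORDS.items():
        clause, params = like_clause(words)
        hits = conn.execute(
            f"SELECT COUNT(*) FROM records WHERE {clause}", params
        ).fetchone()[0]
        rates[step] = 100.0 * hits / total
        clauses.append(clause)
        all_params.extend(params)
    # Records annotated for every step: AND the per-step clauses together.
    joint = " AND ".join(clauses)
    hits = conn.execute(
        f"SELECT COUNT(*) FROM records WHERE {joint}", all_params
    ).fetchone()[0]
    rates["all_three"] = 100.0 * hits / total
    return rates


def demo_connection():
    """Create a tiny in-memory table with made-up protocol descriptions."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, protocol TEXT)")
    conn.executemany(
        "INSERT INTO records (protocol) VALUES (?)",
        [
            ("DNA was sheared by sonication; adapters were ligated; "
             "library was PCR enriched",),
            ("Fragmented with a nebulizer",),
            (None,),  # record with no protocol annotation at all
            ("standard protocol",),
        ],
    )
    return conn
```

Records whose protocol field is NULL never match a LIKE clause, so they count toward the total but not toward any step, which mirrors the idea of measuring how many records carry no usable annotation.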

Crossref

Springer - Publisher Connector

Royal Holloway - Pure

PubMed Central

Spiral - Imperial College Digital Repository

Next generation sequencing has lower sequence coverage and poorer SNP-detection capability in the regulatory regions

Author: A McKenna
A von Bubnoff
AD Smith
AR Quinlan
B Langmead
D Weese
DC Koboldt
ER Mardis
ER Martin
F Antequera
F Sanger
G Basti
GT Marth
H Jiang
H Li
H Lin
HL Eaves
JW Wang
L Bonetta
M David
N Homer
N Malhis
O Harismendy
P Flicek
PJA Cock
R Goya
R McLendon
RQ Li
S Graf
SC Schuster
SF Altschul
SM Rumble
SP Shah
V Bansal
WJ Kent
YF Shen
Publication venue: Nature Publishing Group
Publication date: 01/01/2011

The rapid development of next generation sequencing (NGS) technology provides a new opportunity to extend the scale and resolution of genomic research. How to efficiently map millions of short reads to the reference genome and how to make accurate SNP calls are two major challenges in taking full advantage of NGS. In this article, we reviewed the current software tools for mapping and SNP calling, and evaluated their performance on samples from The Cancer Genome Atlas (TCGA) project. We found that BWA and Bowtie are better than the other alignment tools in comprehensive performance for the Illumina platform, while NovoalignCS showed the best overall performance for SOLiD. Furthermore, we showed that next-generation sequencing platforms have significantly lower coverage and poorer SNP-calling performance in the CpG islands, promoter and 5′-UTR regions of the genome. NGS experiments targeting these regions should have higher sequencing depth than those targeting normal genomic regions.

Crossref

PubMed Central

HKU Scholars Hub

NGS QC Toolkit: A Toolkit for Quality Control of Next Generation Sequencing Data

Author: A Martinez-Alcantara
D Blankenberg
ER Mardis
M Margulies
M Morgan
MP Cox
Mukesh Jain
PJA Cock
R Garg
R Schmieder
Ravi K. Patel
RV Pandey
T Lassmann
Z Wang
Zhanjiang Liu
Publication venue: Public Library of Science
Publication date: 01/02/2012

Next generation sequencing (NGS) technologies provide a high-throughput means to generate large amounts of sequence data. However, quality control (QC) of sequence data generated from these technologies is extremely important for meaningful downstream analysis. Further, highly efficient and fast processing tools are required to handle the large volume of datasets. Here, we have developed an application, NGS QC Toolkit, for quality check and filtering of high-quality data. This toolkit is a standalone and open source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html. All the tools in the application have been implemented in the Perl programming language. The toolkit comprises user-friendly tools for QC of sequencing data generated using Roche 454 and Illumina platforms, and additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools). A variety of options have been provided to facilitate QC with user-defined parameters. The toolkit is expected to be very useful for the QC of NGS data to facilitate better downstream analysis.
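The kind of read filtering this abstract refers to can be sketched in a few lines. The toolkit itself is written in Perl; the Python below is not its code, and the function names, the Phred+33 encoding, and the cutoff values are assumptions chosen for illustration.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def is_high_quality(quality, min_score=20, min_fraction=0.7, offset=33):
    """True if at least min_fraction of the bases reach min_score.

    The default cutoffs are illustrative assumptions, not values taken
    from the abstract.
    """
    scores = phred_scores(quality, offset)
    if not scores:
        return False
    good = sum(score >= min_score for score in scores)
    return good / len(scores) >= min_fraction


def filter_fastq(lines, **cutoffs):
    """Yield (header, sequence, quality) for reads that pass the filter.

    lines: iterable of FASTQ lines without trailing newlines, assumed to
    be well formed with exactly four lines per record.
    """
    iterator = iter(lines)
    # zip over the same iterator four times groups lines into records.
    for header, sequence, _plus, quality in zip(*[iterator] * 4):
        if is_high_quality(quality, **cutoffs):
            yield header, sequence, quality
```

A read is kept or dropped as a whole here; trimming low-quality ends instead of discarding the read is a separate design choice that this sketch does not cover.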

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

Phylogenetic comparative assembly

Author: A Tauch
D Gordon
DA Benson
DC Richter
DL Wheeler
ER Gansner
ER Mardis
F Sanger
F Zhao
J Blom
J Fredslund
Jens Stoye
JL Bentley
KR Rasmussen
M Pop
N Saitou
Peter Husemann
R Staden
S Altschul
S Anderson
S Kurtz
SAFT van Hijum
WJ Kent
WR Pearson
Publication venue: BioMed Central
Publication date: 01/01/2009

Husemann P, Stoye J. Phylogenetic Comparative Assembly. Algorithms for Molecular Biology. 2010;5(1):3. BACKGROUND: Recent high throughput sequencing technologies are capable of generating a huge amount of data for bacterial genome sequencing projects. Although current sequence assemblers successfully merge the overlapping reads, often several contigs remain which cannot be assembled any further. It is still costly and time consuming to close all the gaps in order to acquire the whole genomic sequence. RESULTS: Here we propose an algorithm that takes several related genomes and their phylogenetic relationships into account to create a graph that contains the likelihood for each pair of contigs to be adjacent. Subsequently, this graph can be used to compute a layout graph that shows the most promising contig adjacencies in order to aid biologists in finishing the complete genomic sequence. The layout graph shows unique contig orderings where possible, and the best alternatives where necessary. CONCLUSIONS: Our new algorithm for contig ordering uses sequence similarity as well as phylogenetic information to estimate adjacencies of contigs. An evaluation of our implementation shows that it performs better than recent approaches while being much faster at the same time.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Publications at Bielefeld University

PheMaDB: A solution for storage, retrieval, and analysis of high throughput phenotype data

Author: BE Turk
Brandon W Higgs
Carol E Chapman
ER Mardis
Keri Sarver
Kimberly A Bishop-Lilly
M Mols
Nichole ME Nolan
R Development Core Team
S Sozhamannan
Shanmuga Sozhamannan
Timothy D Read
Wenling E Chang
Z Beharry
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: OmniLog™ phenotype microarrays (PMs) have the capability to measure and compare the growth responses of biological samples upon exposure to hundreds of growth conditions such as different metabolites and antibiotics over a time course of hours to days. In order to manage the large amount of data produced from the OmniLog™ instrument, PheMaDB (Phenotype Microarray DataBase), a web-based relational database, was designed. PheMaDB enables efficient storage, retrieval and rapid analysis of the OmniLog™ PM data. Description: PheMaDB allows the user to quickly identify records of interest for data analysis by filtering with a hierarchical ordering of Project, Strain, Phenotype, Replicate, and Temperature. PheMaDB then provides various statistical analysis options to identify specific growth pattern characteristics of the experimental strains, such as: outlier analysis, negative controls analysis (signal/background calibration), bar plots, Pearson's correlation matrix, growth curve profile search, k-means clustering, and a heat map plot. This web-based database management system allows for both easy data sharing among multiple users and robust tools to phenotype organisms of interest. Conclusions: PheMaDB is an open source system standardized for OmniLog™ PM data. PheMaDB could facilitate the banking and sharing of phenotype data. The source code is available for download at http://phemadb.sourceforge.net.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

PanGEA: Identification of allele specific gene expression using the 454 technology

Author: AP Weber
Christian Schlötterer
ER Mardis
M Margulies
M Pop
O Gotoh
Robert Kofler
SC Schuster
SF Altschul
SM Huse
Tamas Lelley
Tatiana Teixeira Torres
TD Harris
TF Smith
TT Torres
W Brockman
WR Pearson
Z Ning
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Next generation sequencing technologies hold great potential for many biological questions. While mainly used for genomic sequencing, they are also very promising for gene expression profiling. Sequencing of cDNA not only provides an estimate of the absolute expression level, it can also be used for the identification of allele specific gene expression. Results: We developed PanGEA, a tool which enables a fast and user-friendly analysis of allele specific gene expression using the 454 technology. PanGEA allows mapping of 454-ESTs to genes or whole genomes, displaying gene expression profiles, identification of SNPs and the quantification of allele specific gene expression. The intuitive GUI of PanGEA facilitates a flexible and interactive analysis of the data. PanGEA additionally implements a modification of the Smith-Waterman algorithm which deals with incorrect estimates of homopolymer length as occurring in the 454 technology. Conclusion: To our knowledge, PanGEA is the first tool which facilitates the identification of allele specific gene expression. PanGEA is distributed under the Mozilla Public License and available at: http://www.kofler.or.at/bioinformatics/PanGEA

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Publikationsserver der Universitätsbibliothek Bodenkultur Wien

Comprehensive Survey of SNPs in the Affymetrix Exon Array Using the 1000 Genomes Dataset

Author: BE Stranger
CM Hartford
D Benovoy
E Gamazon
E Sliwerska
E Tantoso
ER Mardis
Eric R. Gamazon
F Takeuchi
H Auer
JA Taylor
Janet Kelso
KA Frazer
KD Pruitt
M Morley
M Welsh
M. Eileen Dolan
Nancy J. Cox
O Harismendy
R Alberts
RA Irizarry
RS Huang
RS Spielman
S Duan
SA Tishkoff
T Kwan
W Zhang
Wei Zhang
Publication venue: Public Library of Science
Publication date: 01/01/2010

Microarray gene expression data has been used in genome-wide association studies to allow researchers to study gene regulation as well as other complex phenotypes including disease risks and drug response. To reach scientifically sound conclusions from these studies, however, it is necessary to get reliable summarization of gene expression intensities. Among various factors that could affect expression profiling using a microarray platform, single nucleotide polymorphisms (SNPs) in target mRNA may lead to reduced signal intensity measurements and produce spurious results. The recently released 1000 Genomes Project dataset provides an opportunity to evaluate the distribution of both known and novel SNPs in the International HapMap Project lymphoblastoid cell lines (LCLs). We mapped the 1000 Genomes Project genotypic data to the Affymetrix GeneChip Human Exon 1.0ST array (exon array), which had been used in our previous studies and for which gene expression data had been made publicly available. We also evaluated the potential impact of these SNPs on the differentially spliced probesets we had identified previously. Though the 1000 Genomes Project data allowed a comprehensive survey of the SNPs in this particular array, the same approach can certainly be applied to other microarray platforms. Furthermore, we present a detailed catalogue of SNP-containing probesets (exon-level) and transcript clusters (gene-level), which can be considered in evaluating findings using the exon array as well as benefit the design of follow-up experiments and data re-analysis.

CiteSeerX

Public Library of Science (PLOS)

Crossref

PubMed Central

Compression of Structured High-Throughput Sequencing Data

Author: ER Mardis
Fabien Campagne
Frederique Lisacek
H Li
James T. Robinson
Jill P. Mesirov
JK Pickrell
JR Shearstone
JT Robinson
Kevin C. Dorff
L Skrabanek
M Hsi-Yang Fritz
M Mangone
N Agrawal
N Popitsch
Nyasha Chambwe
SM Kielbasa
TD Wu
Publication venue: Public Library of Science (PLoS)
Publication date: 28/11/2012

Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to quickly adapt to the requirements of new sequencing or analysis methods (because they do not support schema evolution), or fail to provide state of the art compression of the datasets. We have devised new approaches to store HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. Building on these new approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational and network burden of collecting, analyzing, and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% of the size of a BAM file with perfect data fidelity. Compared to the previous compression state of the art, these methods reduce dataset size by more than 40% when storing exome, gene expression or DNA methylation datasets. The approaches have been integrated in a comprehensive suite of software tools (http://goby.campagnelab.org) that support common analyses for a range of high-throughput sequencing assays. National Center for Research Resources (U.S.) (Grant UL1 RR024996); Leukemia & Lymphoma Society of America (Translational Research Program Grant LLS 6304-11); National Institute of Mental Health (U.S.) (R01 MH086883)

arXiv.org e-Print Archive

CiteSeerX

Public Library of Science (PLOS)

DSpace@MIT

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

FigShare

Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment

Author: A Löytynoja
B Sipos
BG Hall
BP Blackburne
C Chothia
C Dessimoz
C Kemena
C Notredame
CB Do
CL Strope
DA Dalquen
DA Morrison
DH Mathews
ER Mardis
G Blackshields
G Jordan
G Landan
GP Raghava
I Walle Van
J Kim
J Stoye
JD Thompson
JH Havgaard
JP Huelsenbeck
K Mizuguchi
LA Stebbings
M Anisimova
M Pop
MR Aniba
P Gardner
RA Cartwright
RB Russell
RC Edgar
SA Berger
SF Altschul
T Golubchik
T Koestler
T Lassmann
W Fletcher
Publication venue: Springer Science and Business Media LLC
Publication date: 09/11/2012

Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique in bioinformatics used to infer related residues among biological sequences. Thus alignment accuracy is crucial to a vast range of analyses, often in ways difficult to assess in those analyses. To compare the performance of different aligners and help detect systematic errors in alignments, a number of benchmarking strategies have been pursued. Here we present an overview of the main strategies (based on simulation, consistency, protein structure, and phylogeny) and discuss their different advantages and associated risks. We outline a set of desirable characteristics for effective benchmarking, and evaluate each strategy in light of them. We conclude that there is currently no universally applicable means of benchmarking MSA, and that developers and users of alignment tools should base their choice of benchmark on the context of application, with a keen awareness of the assumptions underlying each benchmarking strategy.

arXiv.org e-Print Archive

Crossref

UCL Discovery