Identification and correction of systematic error in high-throughput sequence data
A feature common to all DNA sequencing technologies is the presence of base-call errors in the sequenced reads. The implications of such errors are application specific, ranging from minor informatics nuisances to major problems affecting biological inferences. Recently developed "next-gen" sequencing technologies have greatly reduced the cost of sequencing, but have been shown to be more error prone than previous technologies. Both position-specific (depending on the location in the read) and sequence-specific (depending on the sequence in the read) errors have been identified in Illumina and Life Technologies sequencing platforms. We describe a new type of _systematic_ error that manifests as statistically unlikely accumulations of errors at specific genome (or transcriptome) locations. We characterize and describe systematic errors using overlapping paired reads from high-coverage data. We show that such errors occur in approximately 1 in 1000 base pairs, and that quality scores at systematic error sites do not account for the extent of errors. We identify motifs that are frequent at systematic error sites, and describe a classifier that distinguishes heterozygous sites from systematic errors. Our classifier is designed to accommodate data from experiments in which the allele frequencies at heterozygous sites are not necessarily 0.5 (as in the case of RNA-Seq). Systematic errors can easily be mistaken for heterozygous sites in individuals, or for SNPs in population analyses. Systematic errors are particularly problematic in low-coverage experiments, or in estimates of allele-specific expression from RNA-Seq data. Our characterization of systematic error has allowed us to develop a program, called SysCall, for identifying and correcting such errors. We conclude that correction of systematic errors is important to consider in the design and interpretation of high-throughput sequencing experiments.
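The core detection idea, flagging sites whose error counts are statistically unlikely under the per-base error rate, can be illustrated with a binomial tail test. This is a simplified sketch, not the SysCall classifier itself; the error rate and significance cutoff below are illustrative assumptions.

```python
from math import comb

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): probability of seeing at
    least k mismatches in n reads if errors were independent."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def flag_systematic(mismatches, coverage, per_base_error=0.001, alpha=1e-6):
    """Flag a genome position whose mismatch count is statistically
    unlikely given the per-base error rate implied by quality scores.
    (per_base_error and alpha are illustrative, not SysCall defaults.)"""
    return binom_sf(mismatches, coverage, per_base_error) < alpha
```

For example, 10 mismatches in 100 reads at an error rate of 0.001 (expected count 0.1) is flagged, whereas a single mismatch is not.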
Bioinformatics tools for analysing viral genomic data
The field of viral genomics and bioinformatics is experiencing a strong resurgence due to high-throughput sequencing (HTS) technology, which enables the rapid and cost-effective sequencing and subsequent assembly of large numbers of viral genomes. In addition, the unprecedented power of HTS technologies has enabled the analysis of intra-host viral diversity and quasispecies dynamics in relation to important biological questions on viral transmission, vaccine resistance and host jumping. HTS also enables the rapid identification of both known and potentially new viruses from field and clinical samples, thus adding new tools to the fields of viral discovery and metagenomics. Bioinformatics has been central to the rise of HTS applications because new algorithms and software tools are continually needed to process and analyse the large, complex datasets generated in this rapidly evolving area. In this paper, the authors give a brief overview of the main bioinformatics tools available for viral genomic research, with a particular emphasis on HTS technologies and their main applications. They summarise the major steps in various HTS analyses, starting with quality control of raw reads and encompassing activities ranging from consensus and de novo genome assembly to variant calling and metagenomics, as well as RNA sequencing
PREMIER - PRobabilistic Error-correction using Markov Inference in Errored Reads
In this work we present a flexible, probabilistic and reference-free method
of error correction for high throughput DNA sequencing data. The key is to
exploit the high coverage of sequencing data and model short sequence outputs
as independent realizations of a Hidden Markov Model (HMM). We pose the problem
of error correction of reads as one of maximum likelihood sequence detection
over this HMM. While time and memory considerations rule out an implementation
of the optimal Baum-Welch algorithm (for parameter estimation) and the optimal
Viterbi algorithm (for error correction), we propose low-complexity approximate
versions of both. Specifically, we propose an approximate Viterbi and a
sequential decoding based algorithm for the error correction. Our results show
that when compared with Reptile, a state-of-the-art error-correction method,
our methods consistently achieve superior performance on both simulated and
real data sets. Comment: Submitted to ISIT 201
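The decoding step described above can be illustrated with standard Viterbi decoding over a toy two-state HMM, where hidden states are true bases and observations are (possibly erroneous) base calls. This is a generic sketch, not PREMIER's approximate Viterbi or sequential decoder, and the toy transition/emission probabilities below are illustrative assumptions.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Viterbi decoding in log space: the maximum-likelihood hidden
    state sequence (true bases) given the observed base calls."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max(
                (V[-2][p] + math.log(trans_p[p][s]) + math.log(emit_p[s][o]), p)
                for p in states
            )
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

# Toy model: two true bases, sticky transitions, 1% miscall rate.
states = ["A", "T"]
start = {"A": 0.5, "T": 0.5}
trans = {"A": {"A": 0.95, "T": 0.05}, "T": {"A": 0.05, "T": 0.95}}
emit = {"A": {"A": 0.99, "T": 0.01}, "T": {"A": 0.01, "T": 0.99}}
```

With this model, an isolated T inside a run of A calls is decoded back to A, i.e. the lone mismatch is treated as a sequencing error.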
Simultaneous mapping of multiple gene loci with pooled segregants
The analysis of polygenic, phenotypic characteristics such as quantitative traits or inheritable diseases remains an important challenge. It requires reliable scoring of many genetic markers covering the entire genome. The advent of high-throughput sequencing technologies provides a new way to evaluate large numbers of single nucleotide polymorphisms (SNPs) as genetic markers. Combining these technologies with pooling of segregants, as performed in bulked segregant analysis (BSA), should, in principle, allow the simultaneous mapping of multiple genetic loci present throughout the genome. The gene mapping process applied here consists of three steps: first, a controlled crossing of parents with and without the trait; second, selection based on phenotypic screening of the offspring, followed by the mapping of short offspring sequences against the parental reference; finally, detection of genetic markers such as SNPs, insertions and deletions with next-generation sequencing (NGS). Markers in close proximity to genomic loci associated with the trait have a higher probability of being inherited together. Hence, these markers are very useful for discovering the loci and the genetic mechanism underlying the characteristic of interest. Within this context, NGS produces binomial counts along the genome, i.e., the number of sequenced reads that matches the SNP of the parental reference strain, which is a proxy for the number of individuals in the offspring that share the SNP with the parent. Genomic loci associated with the trait can thus be discovered by analyzing trends in the counts along the genome. We exploit the link between smoothing splines and generalized mixed models for estimating the underlying structure present in the SNP scatterplots.
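The trend-detection idea can be sketched with a simple moving average over per-SNP allele frequencies: loci linked to the selected trait show frequencies that deviate from the neutral 1:1 segregation ratio. The moving average here is a crude stand-in for the smoothing-spline mixed model described above, and the function names and thresholds are illustrative assumptions.

```python
def smooth_ratios(ref_counts, coverages, window=5):
    """Smooth per-SNP reference-allele frequencies along a chromosome
    with a moving average (a stand-in for the smoothing-spline model)."""
    ratios = [k / n for k, n in zip(ref_counts, coverages)]
    half = window // 2
    out = []
    for i in range(len(ratios)):
        lo, hi = max(0, i - half), min(len(ratios), i + half + 1)
        out.append(sum(ratios[lo:hi]) / (hi - lo))
    return out

def candidate_loci(smoothed, neutral=0.5, threshold=0.2):
    """Indices where the smoothed frequency deviates from the neutral
    1:1 ratio, suggesting linkage to the selected trait."""
    return [i for i, r in enumerate(smoothed) if abs(r - neutral) > threshold]
```

A run of SNPs where, say, 9 of 10 reads carry the parental allele stands out against a background near 0.5 and is reported as a candidate region.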
DUDE-Seq: Fast, Flexible, and Robust Denoising for Targeted Amplicon Sequencing
We consider the correction of errors from nucleotide sequences produced by
next-generation targeted amplicon sequencing. The next-generation sequencing
(NGS) platforms can provide a great deal of sequencing data thanks to their
high throughput, but the associated error rates tend to be high.
Denoising in high-throughput sequencing has thus become a crucial process for
boosting the reliability of downstream analyses. Our methodology, named
DUDE-Seq, is derived from a general setting of reconstructing finite-valued
source data corrupted by a discrete memoryless channel and effectively corrects
substitution and homopolymer indel errors, the two major types of sequencing
errors in most high-throughput targeted amplicon sequencing platforms. Our
experimental studies with real and simulated datasets suggest that the proposed
DUDE-Seq not only outperforms existing alternatives in terms of
error-correction capability and time efficiency, but also boosts the
reliability of downstream analyses. Further, the flexibility of DUDE-Seq
enables its robust application to different sequencing platforms and analysis
pipelines by simple updates of the noise model. DUDE-Seq is available at
http://data.snu.ac.kr/pub/dude-seq
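The "discrete memoryless channel" setting above can be illustrated with a two-pass, context-based denoiser in the spirit of DUDE: pass 1 counts which symbols occur in each two-sided context, pass 2 replaces each symbol using a per-context rule. The real DUDE rule also weighs the channel matrix and a loss function; the majority vote below is an illustrative simplification, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def dude_like_denoise(seq, k=2):
    """Two-pass context denoising (simplified DUDE-style rule):
    pass 1 tallies the symbol seen at the center of each length-2k
    two-sided context; pass 2 replaces each symbol by its context's
    majority symbol."""
    n = len(seq)
    counts = defaultdict(Counter)
    for i in range(k, n - k):
        ctx = (seq[i - k:i], seq[i + 1:i + k + 1])
        counts[ctx][seq[i]] += 1
    out = list(seq)
    for i in range(k, n - k):
        ctx = (seq[i - k:i], seq[i + 1:i + k + 1])
        out[i] = counts[ctx].most_common(1)[0][0]
    return "".join(out)
```

On a repetitive amplicon-like sequence, a single substitution sits in a context that occurs many times with a consistent center symbol, so the majority vote restores it.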
Distinguishing low frequency mutations from RT-PCR and sequence errors in viral deep sequencing data
There is a high prevalence of coronary artery disease (CAD) in patients with left bundle branch block (LBBB); however there are many other causes for this electrocardiographic abnormality. Non-invasive assessment of these patients remains difficult, and all commonly used modalities exhibit several drawbacks. This often leads to these patients undergoing invasive coronary angiography which may not have been necessary. In this review, we examine the uses and limitations of commonly performed non-invasive tests for diagnosis of CAD in patients with LBBB
HiTRACE-Web: an online tool for robust analysis of high-throughput capillary electrophoresis
To facilitate the analysis of large-scale high-throughput capillary
electrophoresis data, we previously proposed a suite of efficient analysis
software named HiTRACE (High Throughput Robust Analysis of Capillary
Electrophoresis). HiTRACE has been used extensively for quantitating data from
RNA and DNA structure mapping experiments, including mutate-and-map contact
inference, chromatin footprinting, the EteRNA RNA design project and other
high-throughput applications. However, HiTRACE is based on a suite of
command-line MATLAB scripts that requires nontrivial efforts to learn, use, and
extend. Here we present HiTRACE-Web, an online version of HiTRACE that includes
standard features previously available in the command-line version as well as
additional features such as automated band annotation and flexible adjustment
of annotations, all via a user-friendly environment. By making use of
parallelization, the online workflow is also faster than software
implementations available to most users on their local computers. Free access:
http://hitrace.or
The White Dwarf Distance to the Globular Cluster 47 Tucanae and its Age
We present a new determination of the distance (and age) of the Galactic
globular cluster 47 Tucanae (NGC 104), based on fitting its white dwarf (WD)
cooling sequence to the empirical fiducial sequence of local WDs with known
trigonometric parallax, following the method described in Renzini et al.
(1996). Both the cluster and the local WDs were imaged with HST+WFPC2 using the
same instrument setup. We obtained an apparent distance modulus of
consistent with previous ground-based determinations and
shorter than that found using HIPPARCOS subdwarfs. Coupling our distance
determination with a new measure of the apparent magnitude of the main sequence
turnoff, based on our HST data, we derive an age of Gyr. Comment: Accepted for publication in the Astrophysical Journa
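The fit described above hinges on the standard distance-modulus relation, reproduced here for reference (a textbook formula, not a value from this work):

```latex
% Apparent distance modulus: difference between apparent magnitude m
% and absolute magnitude M, for a distance d in parsecs
(m - M) = 5\log_{10} d - 5
% Matching the cluster WD cooling sequence to the local,
% parallax-calibrated fiducial yields the magnitude shift,
% i.e. the cluster's apparent distance modulus.
```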