32,523 research outputs found
Jabba: hybrid error correction for long sequencing reads using maximal exact matches
Third generation sequencing platforms produce longer reads with higher error rates than second generation sequencing technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is that this mapping is constructed with a seed and extend methodology, using maximal exact matches as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of maximal exact matches in the context of third generation reads are presented
Error correction and diversity analysis of population mixtures determined by NGS
The impetus for this work was the need to analyse nucleotide diversity in a viral mix taken from honeybees. The paper has two findings. First, a method for correction of next generation sequencing error in the distribution of nucleotides at a site is developed. Second, a package of methods for assessment of nucleotide diversity is assembled. The error correction method is statistically based and works at the level of the nucleotide distribution rather than the level of individual nucleotides. The method relies on an error model and a sample of known viral genotypes that is used for model calibration. A compendium of existing and new diversity analysis tools is also presented, allowing hypotheses about diversity and mean diversity to be tested and associated confidence intervals to be calculated. The methods are illustrated using honeybee viral samples. Software in both Excel and Matlab and a guide are available at http://www2.warwick.ac.uk/fac/sci/systemsbiology/research/software/,the Warwick University Systems Biology Centre software download site.Publisher PDFPeer reviewe
Jabba: hybrid error correction for long sequencing reads
Background: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned.
Results: In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented.
Conclusion: Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long highly erroneous sequences on a de Bruijn graph
DUDE-Seq: Fast, Flexible, and Robust Denoising for Targeted Amplicon Sequencing
We consider the correction of errors from nucleotide sequences produced by
next-generation targeted amplicon sequencing. The next-generation sequencing
(NGS) platforms can provide a great deal of sequencing data thanks to their
high throughput, but the associated error rates often tend to be high.
Denoising in high-throughput sequencing has thus become a crucial process for
boosting the reliability of downstream analyses. Our methodology, named
DUDE-Seq, is derived from a general setting of reconstructing finite-valued
source data corrupted by a discrete memoryless channel and effectively corrects
substitution and homopolymer indel errors, the two major types of sequencing
errors in most high-throughput targeted amplicon sequencing platforms. Our
experimental studies with real and simulated datasets suggest that the proposed
DUDE-Seq not only outperforms existing alternatives in terms of
error-correction capability and time efficiency, but also boosts the
reliability of downstream analyses. Further, the flexibility of DUDE-Seq
enables its robust application to different sequencing platforms and analysis
pipelines by simple updates of the noise model. DUDE-Seq is available at
http://data.snu.ac.kr/pub/dude-seq
- …