29 research outputs found
Estimation of Sequencing Error Rates in Short Reads
Background: Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of this data can be affected by several factors and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments. Results: We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the difference between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html. Conclusions: The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.
Next-generation sequencing of dsRNA is greatly improved by treatment with the inexpensive denaturing reagent DMSO.
dsRNA is the genetic material of important viruses and a key component of RNA interference-based immunity in eukaryotes. Previous studies have noted difficulties in determining the sequence of dsRNA molecules that have affected studies of immune function and estimates of viral diversity in nature. DMSO has been used to denature dsRNA prior to the reverse-transcription stage to improve reverse transcriptase PCR and Sanger sequencing. We systematically tested the utility of DMSO to improve the sequencing yield of a dsRNA virus (Φ6) in a short-read next-generation sequencing platform. DMSO treatment improved sequencing read recovery by over two orders of magnitude, even when RNA and cDNA concentrations were below the limit of detection. We also tested the effects of DMSO on a mock eukaryotic viral community and found that dsRNA virus reads increased with DMSO treatment. Furthermore, we provide evidence that DMSO treatment does not adversely affect recovery of reads from a ssRNA viral genome (influenza A/California/07/2009). We suggest that up to 50% DMSO treatment be used prior to cDNA synthesis when samples of interest are composed of or may contain dsRNA.
DRISEE overestimates errors in metagenomic sequencing data
© The Author(s), 2013. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in Briefings in Bioinformatics 15 (2014): 783-787, doi:10.1093/bib/bbt010. The extremely high error rates reported by Keegan et al. in “A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE” (PLoS Comput Biol 2012;8:e1002541) for many next-generation sequencing datasets prompted us to re-examine their results. Our analysis reveals that the presence of conserved artificial sequences, e.g. Illumina adapters, and other naturally occurring sequence motifs accounts for most of the reported errors. We conclude that DRISEE reports inflated levels of sequencing error, particularly for Illumina data. Tools offered for evaluating large datasets need scrupulous review before they are implemented. National Institutes of Health [1UH2DK083993 to M.L.S.]; National Science Foundation [BDI-096026 to S.M.H.]
Next-generation sequencing: an eye-opener for the surveillance of antiviral resistance in influenza
Next-generation sequencing (NGS) can enable a more effective response to a wide range of communicable disease threats, such as influenza, which is one of the leading causes of human morbidity and mortality worldwide. After vaccination, antivirals are the second line of defense against influenza. The use of currently available antivirals can lead to antiviral resistance mutations across the entire influenza genome. Therefore, methods to detect these mutations should be developed and implemented. In this Opinion, we assess how NGS could be implemented to detect drug resistance mutations in clinical influenza virus isolates.
Maximize Resolution or Minimize Error? Using Genotyping-By-Sequencing to Investigate the Recent Diversification of Helianthemum (Cistaceae)
A robust phylogenetic framework, in terms of extensive geographical and taxonomic sampling, well-resolved species relationships and high certainty of tree topologies and branch length estimations, is critical in the study of macroevolutionary patterns. Whereas Sanger sequencing-based methods usually recover insufficient phylogenetic signal, especially in recently diversified lineages, reduced-representation sequencing methods tend to provide well-supported phylogenetic relationships, but usually entail remarkable bioinformatic challenges due to the inherent trade-off between the number of SNPs and the magnitude of associated error rates. The genus Helianthemum (Cistaceae) is a species-rich and taxonomically complex Palearctic group of plants that diversified mainly since the Upper Miocene. It is a challenging case study since previous attempts using Sanger sequencing were unable to resolve the intrageneric phylogenetic relationships. Aiming to obtain a robust phylogenetic reconstruction based on genotyping-by-sequencing (GBS), we established a rigorous methodological workflow in which we i) explored how variable settings during dataset assembly have an impact on error rates and on the degree of resolution under concatenation and coalescent approaches, ii) assessed the effect of two extreme parameter configurations (minimizing error rates vs. maximizing phylogenetic resolution) on tree topology and branch lengths, and iii) evaluated the effects of these two configurations on estimates of divergence times and diversification rates. Our analyses produced highly supported topologically congruent phylogenetic trees for both configurations. However, minimizing error rates did produce more reliable branch lengths, critically affecting the accuracy of downstream analyses (i.e. divergence times and diversification rates). 
In addition to recommending a revision of intrageneric systematics, our results enabled us to identify three highly diversified lineages in Helianthemum in contrasting geographical areas and ecological conditions, which started radiating in the Upper Miocene. España, MINECO grants CGL2014-52459-P and CGL2017-82465-P; España, Ministerio de Economía, Industria y Competitividad, reference IJCI-2015-2345
Clinical Genetic Testing in Children with Kidney Disease
Chronic kidney disease, the presence of structural and functional abnormalities in the kidneys, is associated with a lower quality of life and increased morbidity and mortality in children. Genetic etiologies account for a substantial proportion of pediatric chronic kidney disease. With recent advances in genetic testing techniques, an increasing number of genetic causes of kidney disease continue to be found. Genetic testing is recommended in children with steroid-resistant nephrotic syndrome, congenital malformations of the kidney and urinary tract, cystic disease, or kidney disease with extrarenal manifestations. Diagnostic yields differ according to the category of clinical diagnosis and the choice of test. Here, we review the characteristics of genetic testing modalities and the implications of genetic testing in clinical genetic diagnostics.
The detection of high-qualified indels in exomes and their effect on cognition
Genetic insertions/deletions (indels) have been linked to many neurodevelopmental
disorders (NDDs) such as autism spectrum disorder (ASD) and intellectual disability (ID).
However, although they are the second most common type of genetic variant, they remain to this
day difficult to identify and verify, presenting a high number of false positives. We sought to find
a method that would appropriately identify high-quality indels that are likely to be true positives.
We built an indel “truth set” using indels from two diagnosis-based family cohorts that
were filtered according to a set of threshold values and called by several variant calling tools in
order to train three machine learning models to identify the highest quality indels. The two best
performing models were then used to identify high quality indels in a general population cohort
that was called using only one variant calling technology.
The machine learning models were able to identify higher quality indels that showed an
association with IQ, although the effect size was small. The indels predicted by the models also
affected a much smaller number of genes per individual than those predicted using
minimum thresholds alone. The models tend to show an overall improvement in the quality of the
indels, but further work is required to determine whether they can predict indels that have a noticeable and significant effect on IQ.