29 research outputs found

    Estimation of Sequencing Error Rates in Short Reads

    Background: Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of this data can be affected by several factors and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments. Results: We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the difference between the sample of interest and a reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html. Conclusions: The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.
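
The idea in the abstract above (regress the number of near-identical "shadow" reads on the copy number of the read they resemble, then transform the slope) can be illustrated with a minimal Python sketch. It is not the authors' R package. The function names, the least-squares fit through the origin, and the transform `slope / (1 + slope)` are all assumptions made for illustration; the transform holds only if every erroneous read is counted as a shadow of exactly one true read.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def shadow_points(reads, top_k=2, max_mismatch=2):
    """For the top_k most frequent reads, return (copy_number, shadow_count).

    A shadow is any other read of the same length that differs from the
    read of interest by 1..max_mismatch bases (hypothetical definition
    following the abstract's "one or two bases").
    """
    counts = Counter(reads)
    points = []
    for seq, n in counts.most_common(top_k):
        shadows = sum(
            c
            for other, c in counts.items()
            if other != seq
            and len(other) == len(seq)
            and hamming(seq, other) <= max_mismatch
        )
        points.append((n, shadows))
    return points


def slope_through_origin(points):
    """Least-squares slope of shadow_count on copy_number with no intercept
    (an assumed fitting choice, not taken from the source)."""
    num = sum(n * s for n, s in points)
    den = sum(n * n for n, _ in points)
    return num / den


def per_read_error_rate(slope):
    """Assumed transform: erroneous reads / (correct + erroneous reads)."""
    return slope / (1.0 + slope)


if __name__ == "__main__":
    # Toy data: two true sequences, each with a few single-base shadows.
    reads = ["AAAA"] * 8 + ["AAAT"] * 2 + ["CCCC"] * 4 + ["CCCG"] * 1
    pts = shadow_points(reads)
    print(pts, per_read_error_rate(slope_through_origin(pts)))
```

In the toy data, 3 of the 15 reads are erroneous, so a per-read rate near 0.2 is what the sketch should recover. Real data would need far more reads, and the pairwise Hamming comparison here is quadratic and only suitable for small examples.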

    DRISEE overestimates errors in metagenomic sequencing data

    © The Author(s), 2013. This article is distributed under the terms of the Creative Commons Attribution License. The definitive version was published in Briefings in Bioinformatics 15 (2014): 783-787, doi:10.1093/bib/bbt010. The extremely high error rates reported by Keegan et al. in ‘A platform-independent method for detecting errors in metagenomic sequencing data: DRISEE’ (PLoS Comput Biol 2012;8:e1002541) for many next-generation sequencing datasets prompted us to re-examine their results. Our analysis reveals that the presence of conserved artificial sequences, e.g. Illumina adapters, and other naturally occurring sequence motifs accounts for most of the reported errors. We conclude that DRISEE reports inflated levels of sequencing error, particularly for Illumina data. Tools offered for evaluating large datasets need scrupulous review before they are implemented. National Institutes of Health [1UH2DK083993 to M.L.S.]; National Science Foundation [BDI- 096026 to S.M.H.]

    Next-generation sequencing: an eye-opener for the surveillance of antiviral resistance in influenza

    Next-generation sequencing (NGS) can enable a more effective response to a wide range of communicable disease threats, such as influenza, which is one of the leading causes of human morbidity and mortality worldwide. After vaccination, antivirals are the second line of defense against influenza. The use of currently available antivirals can lead to antiviral resistance mutations across the entire influenza genome. Therefore, methods to detect these mutations should be developed and implemented. In this Opinion, we assess how NGS could be implemented to detect drug resistance mutations in clinical influenza virus isolates.

    Maximize Resolution or Minimize Error? Using Genotyping-By-Sequencing to Investigate the Recent Diversification of Helianthemum (Cistaceae)

    A robust phylogenetic framework, in terms of extensive geographical and taxonomic sampling, well-resolved species relationships and high certainty of tree topologies and branch length estimations, is critical in the study of macroevolutionary patterns. Whereas Sanger sequencing-based methods usually recover insufficient phylogenetic signal, especially in recently diversified lineages, reduced-representation sequencing methods tend to provide well-supported phylogenetic relationships, but usually entail remarkable bioinformatic challenges due to the inherent trade-off between the number of SNPs and the magnitude of associated error rates. The genus Helianthemum (Cistaceae) is a species-rich and taxonomically complex Palearctic group of plants that diversified mainly since the Upper Miocene. It is a challenging case study since previous attempts using Sanger sequencing were unable to resolve the intrageneric phylogenetic relationships. Aiming to obtain a robust phylogenetic reconstruction based on genotyping-by-sequencing (GBS), we established a rigorous methodological workflow in which we i) explored how variable settings during dataset assembly have an impact on error rates and on the degree of resolution under concatenation and coalescent approaches, ii) assessed the effect of two extreme parameter configurations (minimizing error rates vs. maximizing phylogenetic resolution) on tree topology and branch lengths, and iii) evaluated the effects of these two configurations on estimates of divergence times and diversification rates. Our analyses produced highly supported topologically congruent phylogenetic trees for both configurations. However, minimizing error rates did produce more reliable branch lengths, critically affecting the accuracy of downstream analyses (i.e. divergence times and diversification rates). In addition to recommending a revision of intrageneric systematics, our results enabled us to identify three highly diversified lineages in Helianthemum in contrasting geographical areas and ecological conditions, which started radiating in the Upper Miocene. Spain, MINECO grants CGL2014-52459-P and CGL2017-82465-P. Spain, Ministerio de Economía, Industria y Competitividad, reference IJCI-2015-2345.

    Clinical Genetic Testing in Children with Kidney Disease

    Chronic kidney disease, the presence of structural and functional abnormalities in the kidneys, is associated with a lower quality of life and increased morbidity and mortality in children. Genetic etiologies account for a substantial proportion of pediatric chronic kidney disease. With recent advances in genetic testing techniques, an increasing number of genetic causes of kidney disease continue to be found. Genetic testing is recommended in children with steroid-resistant nephrotic syndrome, congenital malformations of the kidney and urinary tract, cystic disease, or kidney disease with extrarenal manifestations. Diagnostic yields differ according to the category of clinical diagnosis and the choice of test. Here, we review the characteristics of genetic testing modalities and the implications of genetic testing in clinical genetic diagnostics.

    Population-Sequencing as a Biomarker for Sample Characterization

    The detection of high-qualified indels in exomes and their effect on cognition

    Genetic insertions/deletions (indels) have been linked to many neurodevelopmental disorders (NDDs) such as autism spectrum disorder (ASD) and intellectual disability (ID). However, although they are the second most common type of genetic variant, they remain to this day difficult to identify and verify, presenting a high number of false positives. We sought to find a method that would appropriately identify high-quality indels that are likely to be true positives. We built an indel “truth set” using indels from two diagnosis-based family cohorts that were filtered according to a set of threshold values and called by several variant calling tools in order to train three machine learning models to identify the highest quality indels. The two best performing models were then used to identify high-quality indels in a general population cohort that was called using only one variant calling technology. The machine learning models were able to identify higher quality indels that showed an association with IQ, although the effect size was small. The indels predicted by the models also affected a much smaller number of genes per individual than those predicted using minimum thresholds alone. The models tend to show an overall improvement in the quality of the indels, but further work is required to determine whether they can predict indels that have a noticeable and significant effect on IQ.