2,642 research outputs found

    Variant detection sensitivity and biases in whole genome and exome sequencing

    BACKGROUND: Less than two percent of the human genome is protein coding, yet that small fraction harbours the majority of known disease causing mutations. Despite rapidly falling whole genome sequencing (WGS) costs, much research and increasingly the clinical use of sequence data is likely to remain focused on the protein coding exome. We set out to quantify and understand how WGS compares with the targeted capture and sequencing of the exome (exome-seq), for the specific purpose of identifying single nucleotide polymorphisms (SNPs) in exome targeted regions. RESULTS: We have compared polymorphism detection sensitivity and systematic biases using a set of tissue samples that have been subject to both deep exome and whole genome sequencing. The scoring of detection sensitivity was based on sequence down sampling and reference to a set of gold-standard SNP calls for each sample. Despite evidence of incremental improvements in exome capture technology over time, whole genome sequencing has greater uniformity of sequence read coverage and reduced biases in the detection of non-reference alleles than exome-seq. Exome-seq achieves 95% SNP detection sensitivity at a mean on-target depth of 40 reads, whereas WGS only requires a mean of 14 reads. Known disease causing mutations are not biased towards easy or hard to sequence areas of the genome for either exome-seq or WGS. CONCLUSIONS: From an economic perspective, WGS is at parity with exome-seq for variant detection in the targeted coding regions. WGS offers benefits in uniformity of read coverage and more balanced allele ratio calls, both of which can in most cases be offset by deeper exome-seq, with the caveat that some exome-seq targets will never achieve sufficient mapped read depth for variant detection due to technical difficulties or probe failures. 
As WGS is intrinsically richer data that can provide insight into polymorphisms outside coding regions and reveal genomic rearrangements, it is likely to progressively replace exome-seq for many applications. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/1471-2105-15-247) contains supplementary material, which is available to authorized users.

    Quantifying single nucleotide variant detection sensitivity in exome sequencing

    BACKGROUND: The targeted capture and sequencing of genomic regions has rapidly demonstrated its utility in genetic studies. Inherent in this technology is considerable heterogeneity of target coverage and this is expected to systematically impact our sensitivity to detect genuine polymorphisms. To fully interpret the polymorphisms identified in a genetic study it is often essential to both detect polymorphisms and to understand where and with what probability real polymorphisms may have been missed. RESULTS: Using down-sampling of 30 deeply sequenced exomes and a set of gold-standard single nucleotide variant (SNV) genotype calls for each sample, we developed an empirical model relating the read depth at a polymorphic site to the probability of calling the correct genotype at that site. We find that measured sensitivity in SNV detection is substantially worse than that predicted from the naive expectation of sampling from a binomial. This calibrated model allows us to produce single nucleotide resolution SNV sensitivity estimates which can be merged to give summary sensitivity measures for any arbitrary partition of the target sequences (nucleotide, exon, gene, pathway, exome). These metrics are directly comparable between platforms and can be combined between samples to give “power estimates” for an entire study. We estimate a local read depth of 13X is required to detect the alleles and genotype of a heterozygous SNV 95% of the time, but only 3X for a homozygous SNV. At a mean on-target read depth of 20X, commonly used for rare disease exome sequencing studies, we predict 5–15% of heterozygous and 1–4% of homozygous SNVs in the targeted regions will be missed. CONCLUSIONS: Non-reference alleles in the heterozygote state have a high chance of being missed when commonly applied read coverage thresholds are used despite the widely held assumption that there is good polymorphism detection at these coverage levels. 
Such alleles are likely to be of functional importance in population based studies of rare diseases, somatic mutations in cancer and explaining the “missing heritability” of quantitative traits.
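
The "naive expectation of sampling from a binomial" mentioned in this abstract can be illustrated with a short sketch. It is not the authors' calibrated empirical model. It assumes that each read at a heterozygous site carries the non-reference allele with probability 0.5, and that a caller needs at least two supporting reads to report the allele. The function names, the `min_alt_reads` threshold and the `alt_fraction` parameter are hypothetical choices made for illustration.

```python
from math import comb


def naive_het_sensitivity(depth: int, min_alt_reads: int = 2,
                          alt_fraction: float = 0.5) -> float:
    """Naive binomial probability of seeing the alternate allele.

    Assumption (not from the abstract): the allele counts as detected when
    at least `min_alt_reads` of `depth` reads carry it, and each read
    carries it independently with probability `alt_fraction`.
    """
    if depth < min_alt_reads:
        return 0.0
    # Probability of observing fewer than min_alt_reads alternate reads.
    p_too_few = sum(
        comb(depth, k) * alt_fraction ** k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_too_few


def min_depth_for_sensitivity(target: float, min_alt_reads: int = 2,
                              alt_fraction: float = 0.5) -> int:
    """Smallest local read depth whose naive sensitivity reaches `target`."""
    depth = min_alt_reads
    while naive_het_sensitivity(depth, min_alt_reads, alt_fraction) < target:
        depth += 1
    return depth


def summary_sensitivity(depths: list[int]) -> float:
    """Average per-site sensitivity over a partition (exon, gene, exome)."""
    return sum(naive_het_sensitivity(d) for d in depths) / len(depths)


if __name__ == "__main__":
    print(min_depth_for_sensitivity(0.95))
    print(round(summary_sensitivity([5, 10, 20, 40]), 4))
```

Under these assumed parameters the naive model reaches 95% sensitivity at a local depth of 8 reads, lower than the 13X the abstract reports empirically for heterozygous SNVs. That gap is consistent with the statement that measured sensitivity is worse than the binomial expectation.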

    Analysis of Archived Residual Newborn Screening Blood Spots After Whole Genome Amplification

    Deidentified newborn screening bloodspot samples (NBS) represent a valuable potential resource for genomic research if impediments to whole exome sequencing of NBS deoxyribonucleic acid (DNA), including the small amount of genomic DNA in NBS material, can be overcome. For instance, genomic analysis of NBS could be used to define allele frequencies of disease-associated variants in local populations, or to conduct prospective or retrospective studies relating genomic variation to disease emergence in pediatric populations over time. In this study, we compared the recovery of variant calls from exome sequences of amplified NBS genomic DNA to variant calls from exome sequencing of non-amplified NBS DNA from the same individuals. Results: Using a standard alignment-based Genome Analysis Toolkit (GATK), we find 62,000-76,000 additional variants in amplified samples. After application of a unique kmer enumeration and variant detection method (RUFUS), only 38,000-47,000 additional variants are observed in amplified gDNA. This result suggests that roughly half of the amplification-introduced variants identified using GATK may be the result of mapping errors and read misalignment. Conclusions: Our results show that it is possible to obtain informative, high-quality data from exome analysis of whole genome amplified NBS with the important caveat that different data generation and analysis methods can affect variant detection accuracy, and the concordance of variant calls in whole-genome amplified and non-amplified exomes. National Institute of Health P01HD067244, NS076465, R01ES021006. Nutritional Science

    Clinical exome performance for reporting secondary genetic findings.

    BACKGROUND: Reporting clinically actionable incidental genetic findings in the course of clinical exome testing is recommended by the American College of Medical Genetics and Genomics (ACMG). However, the performance of clinical exome methods for reporting small subsets of genes has not been previously reported. METHODS: In this study, 57 exome data sets performed as clinical (n = 12) or research (n = 45) tests were retrospectively analyzed. Exome sequencing data was examined for adequacy in the detection of potentially pathogenic variant locations in the 56 genes described in the ACMG incidental findings recommendation. All exons of the 56 genes were examined for adequacy of sequencing coverage. In addition, nucleotide positions annotated in HGMD (Human Gene Mutation Database) were examined. RESULTS: The 56 ACMG genes have 18336 nucleotide variants annotated in HGMD. None of the 57 exome data sets possessed a HGMD variant. The clinical exome test had inadequate coverage for >50% of HGMD variant locations in 7 genes. Six exons from 6 different genes had consistent failure across all 3 test methods; these exons had high GC content (76%–84%). CONCLUSIONS: The use of clinical exome sequencing for the interpretation and reporting of subsets of genes requires recognition of the substantial possibility of inadequate depth and breadth of sequencing coverage at clinically relevant locations. Inadequate depth of coverage may contribute to false-negative clinical exome results.

    A systematic evaluation of hybridization-based mouse exome capture system

    BACKGROUND: Exome sequencing is increasingly used to search for phenotypically-relevant sequence variants in the mouse genome. All of the current hybridization-based mouse exome capture systems are designed based on the genome reference sequences of the C57BL/6J strain. Given that substantial sequence divergence exists between C57BL/6J and other distantly-related strains, the impact of sequence divergence on the efficiency of such capture systems needs to be systematically evaluated before they can be widely applied to the study of those strains. RESULTS: Using the Agilent SureSelect mouse exome capture system, we performed exome sequencing on F1 generation hybrid mice that were derived by crossing two divergent strains, C57BL/6J and SPRET/EiJ. Our results showed that the C57BL/6J-based probes captured the sequences derived from C57BL/6J alleles more efficiently and that the bias was higher for the target regions with greater sequence divergence. At low sequencing depths, the bias also affected the efficiency of variant detection. However, the effects became negligible when sufficient sequencing depth was achieved. CONCLUSION: Sufficient sequence depth needs to be planned to match the sequence divergence between C57BL/6J and the strain to be studied when the C57BL/6J-based Agilent SureSelect exome capture system is to be used.

    Empowering precision medicine through high performance computing clusters

    The role of High Performance Computing (HPC) in medicine has greatly increased in recent years, moving from basic research to the clinic. With the advent of Next Generation Sequencing (NGS) technologies, diverse areas of human health have been investigated through different omics techniques. The extensive use of these NGS platforms to profile human health issues at high throughput and in a cost-efficient manner is generating huge amounts of sequencing data, pushing bioinformatic research into the big-data field. The speed, accuracy and reproducibility of massive sequencing analysis have allowed molecular biology knowledge to be transferred into precision medicine. Furthermore, Molecular Dynamics (MD) has gained great importance in aiding genome research. Sequencing studies of cancer have made it possible to detect and characterize mutated genes that drive tumorigenesis. As a complementary approach, from a biophysical perspective, MD simulations executed on HPC architectures have made it possible to investigate the role played by pathological mutations in the molecular mechanism of activation.

    Diagnostic applications of next generation sequencing: working towards quality standards

    Over the past 6 years, next generation sequencing (NGS) has been established as a valuable high-throughput method for research in molecular genetics and has successfully been employed in the identification of rare and common genetic variations. All major NGS technology companies providing commercially available instruments (Roche 454, Illumina, Life Technologies) have recently marketed bench top sequencing instruments with lower throughput and shorter run times, thereby broadening the applications of NGS and opening the technology to the potential use for clinical diagnostics. Although the high expectations regarding the discovery of new diagnostic targets and an overall reduction of cost have been achieved, technological challenges in instrument handling, robustness of the chemistry and data analysis need to be overcome. To facilitate the implementation of NGS as a routine method in molecular diagnostics, consistent quality standards need to be developed. Here the authors give an overview of the current standards in protocols and workflows and discuss possible approaches to define quality criteria for NGS in molecular genetic diagnostics

    Towards Better Understanding of Artifacts in Variant Calling from High-Coverage Samples

    Motivation: Whole-genome high-coverage sequencing has been widely used for personal and cancer genomics as well as in various research areas. However, in the absence of an unbiased whole-genome truth set, the global error rate of variant calls and the leading causal artifacts still remain unclear, even given the great efforts in the evaluation of variant calling methods. Results: We made ten SNP and INDEL call sets with two read mappers and five variant callers, both on a haploid human genome and a diploid genome at a similar coverage. By investigating false heterozygous calls in the haploid genome, we identified the erroneous realignment in low-complexity regions and the incomplete reference genome with respect to the sample as the two major sources of errors, which press for continued improvements in these two areas. We estimated that the error rate of raw genotype calls is as high as 1 in 10-15kb, but the error rate of post-filtered calls is reduced to 1 in 100-200kb without significant compromise on the sensitivity. Availability: BWA-MEM alignment: http://bit.ly/1g8XqRt; Scripts: https://github.com/lh3/varcmp; Additional data: https://figshare.com/articles/Towards_better_understanding_of_artifacts_in_variating_calling_from_high_coverage_samples/981073 Comment: Published versio