    A Bioinformatics Approach for Determining Sample Identity from Different Lanes of High-Throughput Sequencing Data

    The ability to generate whole genome data is rapidly becoming commoditized. For example, a mammalian-sized genome (∼3Gb) can now be sequenced using approximately ten lanes on an Illumina HiSeq 2000. Since lanes from different runs are often combined, verifying that each lane in a genome's build is from the same sample is an important quality control. We sought to address this issue in a post hoc bioinformatic manner, instead of using upstream sample or “barcode” modifications. We rely on the inherent small differences between any two individuals to show that genotype concordance rates can be effectively used to test if any two lanes of HiSeq 2000 data are from the same sample. As proof of principle, we use recent data from three different human samples generated on this platform. We show that the distributions of concordance rates are non-overlapping when comparing lanes from the same sample versus lanes from different samples. Our method proves to be robust even when different numbers of reads are analyzed. Finally, we provide a straightforward method for determining the gender of any given sample. Our results suggest that examining the concordance of detected genotypes from lanes purported to be from the same sample is a relatively simple approach for confirming that combined lanes of data are of the same identity and quality.
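
The concordance idea in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `concordance_rate`, the dictionary representation of genotype calls, and the toy data are all assumptions made for illustration. The rate is computed only over positions genotyped in both lanes, and allele order within a genotype is ignored.

```python
from typing import Dict, Tuple

# Hypothetical representation: (chromosome, position) -> pair of called alleles.
Genotype = Tuple[str, str]
Calls = Dict[Tuple[str, int], Genotype]


def concordance_rate(lane_a: Calls, lane_b: Calls) -> float:
    """Fraction of positions genotyped in both lanes whose genotypes agree."""
    shared = lane_a.keys() & lane_b.keys()
    if not shared:
        raise ValueError("no positions were genotyped in both lanes")
    # Sort alleles so that ("A", "G") and ("G", "A") count as the same genotype.
    matches = sum(
        1 for site in shared if sorted(lane_a[site]) == sorted(lane_b[site])
    )
    return matches / len(shared)


# Toy data (invented): two sites are shared, and one of them agrees.
lane_1 = {("chr1", 100): ("A", "G"), ("chr1", 200): ("C", "C"), ("chr2", 50): ("T", "T")}
lane_2 = {("chr1", 100): ("G", "A"), ("chr1", 200): ("C", "T"), ("chr3", 10): ("A", "A")}
rate = concordance_rate(lane_1, lane_2)
print(rate)
```

In practice the rates for many lane pairs would be compared as distributions; a single pair like this only shows how one rate is obtained.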

    Replication and exploratory analysis of 24 candidate risk polymorphisms for neural tube defects.

    Background: Neural tube defects (NTDs), which are among the most common congenital malformations, are influenced by environmental and genetic factors. Low maternal folate is the strongest known contributing factor, making variants in genes in the folate metabolic pathway attractive candidates for NTD risk. Multiple studies have identified nominally significant allelic associations with NTDs. We tested whether associations detected in a large Irish cohort could be replicated in an independent population. Methods: Replication tests of 24 nominally significant NTD associations were performed in racially/ethnically matched populations. Family-based tests of fifteen nominally significant single nucleotide polymorphisms (SNPs) were repeated in a cohort of NTD trios (530 cases and their parents) from the United Kingdom, and case-control tests of nine nominally significant SNPs were repeated in a cohort (190 cases, 941 controls) from New York State (NYS). Secondary hypotheses involved evaluating the latter set of nine SNPs for NTD association using alternate case-control models and NTD groupings in white, African American and Hispanic cohorts from NYS. Results: Of the 24 SNPs tested for replication, ADA rs452159 and MTR rs10925260 were significantly associated with isolated NTDs. Of the secondary tests performed, ARID1A rs11247593 was associated with NTDs in whites, and ALDH1A2 rs7169289 was associated with isolated NTDs in African Americans. Conclusions: We report a number of associations between SNP genotypes and neural tube defects. These associations were nominally significant before correction for multiple hypothesis testing. These corrections are highly conservative for association studies of untested hypotheses, and may be too conservative for replication studies. We therefore believe the true effect of these four nominally significant SNPs on NTD risk will be more definitively determined by further study in other populations, and eventual meta-analysis.

    Accurate and comprehensive sequencing of personal genomes

    As whole-genome sequencing becomes commoditized and we begin to sequence and analyze personal genomes for clinical and diagnostic purposes, it is necessary to understand what constitutes a complete sequencing experiment for determining genotypes and detecting single-nucleotide variants. Here, we show that the current recommendation of ∼30× coverage is not adequate to produce genotype calls across a large fraction of the genome with acceptably low error rates. Our results are based on analyses of a clinical sample sequenced on two related Illumina platforms, GAIIx and HiSeq 2000, to a very high depth (126×). We used these data to establish genotype-calling filters that dramatically increase accuracy. We also empirically determined how the callable portion of the genome varies as a function of the amount of sequence data used. These results help provide a “sequencing guide” for future whole-genome sequencing decisions and metrics by which coverage statistics should be reported.

    Summary of data used in these analyses.

    Number of reads reflects the number of aligning reads after removing duplicate read pairs and filtering for low quality alignments (see Methods). Gender was determined by looking at coverage of reads in specific representative regions of the X and Y chromosomes (see Methods). Number of genotypes called is from the autosomes only, which is what was used for downstream comparisons.
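
The coverage-based determination of sex (the caption's "gender") might be sketched as follows. The source does not give the regions, the statistic, or any cutoff, so the function name `infer_sex`, the use of a Y-to-X mean depth ratio, and the 0.1 threshold are all assumptions chosen only to make the idea concrete.

```python
from statistics import mean
from typing import Sequence


def infer_sex(
    x_depths: Sequence[float],
    y_depths: Sequence[float],
    y_to_x_threshold: float = 0.1,  # assumed cutoff, not taken from the source
) -> str:
    """Classify a sample from read depth in chosen X and Y regions."""
    x_mean = mean(x_depths)
    if x_mean == 0:
        raise ValueError("no coverage in the X regions; cannot compute a ratio")
    # A sample with a Y chromosome should show clear coverage in
    # Y-specific regions relative to its X coverage.
    ratio = mean(y_depths) / x_mean
    return "male" if ratio > y_to_x_threshold else "female"


# Toy depths (invented): the first sample has Y coverage, the second has almost none.
print(infer_sex([15, 14, 16], [14, 15, 13]))
print(infer_sex([30, 29, 31], [0, 1, 0]))
```

A ratio is used here rather than raw Y depth so that the call does not depend on how much data a lane happens to contain; any real cutoff would need to be calibrated on samples of known sex.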

    Effect of data quantity on concordance rates.

    The total number of reads used in the analysis affects different-sample comparisons, but not same-sample comparisons. In (A), lane 7 of sample ID B was kept constant at 140 million reads (B7), and the amount of data for the other sample [either lane 8 of B (B8) or lane 7 of C (C7)] was varied between 40 million and 140 million reads (x-axis) in 20 million read increments. The y-axis represents the concordance rate between variant (nonreference) genotypes called between the two different datasets. Note that for the same-sample comparison (red line), varying the number of reads used in the analysis does not substantially alter the concordance rate. However, this is not the case for different-sample comparisons (blue line), where the concordance rate becomes more different as more reads are used. In (B), a similar trend is observed when the reads in both samples are incremented simultaneously. Solid lines represent a LOESS smoothed fit to the data points.
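
The read-count series in panel (A) can be mimicked on a toy scale. The caption does not say how reads were selected at each increment, so the random, nested subsampling below is an assumption, as are the function name `nested_subsamples` and the scaled-down sizes (40 to 140 reads standing in for 40 to 140 million).

```python
import random
from typing import Dict, List, Sequence


def nested_subsamples(
    read_ids: Sequence[str], sizes: Sequence[int], seed: int = 0
) -> Dict[int, List[str]]:
    """Return one subset of reads per requested size, each containing the smaller ones."""
    if max(sizes) > len(read_ids):
        raise ValueError("requested more reads than are available")
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    shuffled = list(read_ids)
    rng.shuffle(shuffled)
    # Taking prefixes of one shuffled list makes the subsets nested.
    return {n: shuffled[:n] for n in sizes}


# Scaled-down analogue of 40M to 140M reads in 20M increments.
sizes = list(range(40, 141, 20))
toy_reads = [f"read{i}" for i in range(140)]
subsets = nested_subsamples(toy_reads, sizes)
print(sizes)
```

Each subset would then be genotyped and compared against the fixed lane to produce one point on the curve.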

    Concordance between lanes.

    Distributions of genotype concordance rates from same- and different-sample comparisons are non-overlapping. The box plot in (A) shows the distributions of concordance rates when using all callable positions for all combinations of pairs of the three samples being analyzed. The x-axis denotes each pair being compared (A, B, and C, refer to the sample IDs in Table 1), and the y-axis represents the distribution of concordance rates for all pair-wise combinations of lanes representing the specific pair of samples on the x-axis. It is likely that the detected differences from same-sample comparisons (B–B, C–C, and A–A) arise solely from sequencing and genotyping error. The box plot in (B) is similar to (A), except that here only variant (nonreference) positions are considered. The symmetrical heat map in (C) summarizes the data from panel (A); the blue boxes represent low concordance rates and correspond to different-sample comparisons, while the yellow boxes along the diagonal represent high concordance rates and correspond to same-sample comparisons. Note that comparisons between samples B and C (gray boxes) are slightly more similar to each other than the other different-sample comparisons, but still sufficiently distinct from same-sample comparisons. This is expected given the known partial consanguinity between these individuals.

    Overview of approach.

    Several lanes of HiSeq 2000 data are typically combined together for a comprehensive genome analysis, giving a high depth of coverage (A), and the ability to accurately call genotypes in the majority of the genome. In (B), two individual lanes of HiSeq 2000 data are depicted, with a lower average depth of coverage. By chance, some regions of the genome have enough data to be genotyped in both lanes (shaded gray).