
    Data collection strategies for studies with prior array data.

    <p>For the (<b>a</b>) Affy 6 and (<b>b</b>) Ilmn 1 M arrays, we produced joint calls after addition of each sequence coverage or array; joint calls with multiple arrays include combined data from both arrays. The y-axis shows , while the x-axis shows of the additional data collected. is a measure of the genotyping investment intrinsic to a technology that serves as a proxy for cost. The blue point (None) shows if no additional data is collected; the other points are labeled with the additional data collected. Labels are defined in 2.</p>

    A novel next-generation sequencing error mode.

    <p>(<b>a</b>) We identified a novel error mode based on visual examination of disputed SNPs. As shown in the cluster plot, one of the samples is called homozygous reference (Hom-ref) based on analysis of array data but homozygous non-reference (Hom-var) based on analysis of sequence data (shown by the sample outlined in green within the red cluster). This unusual error mode contrasts with the more common error mode, due to low sequence coverage, of samples called heterozygous (Het) based on array data but homozygous reference or non-reference based on sequence data (shown by samples outlined in pink or green within the blue cluster). (<b>b</b>) Inspection of the sequence reads in the Integrative Genomics Viewer (IGV) <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002604#pcbi.1002604-Robinson1" target="_blank">[54]</a> shows that the sample in question has only two reads that cover this SNP, and these reads are pairs sequenced from the same underlying DNA fragment. (<b>c</b>) This error mode is introduced in the shearing and library preparation stage of next-generation sequencing and as a result is reflected in both reads from the same DNA fragment. Depending on protocol details, the error rate is around 1/10,000. During genotype calling, independent treatment of reads (read-based) results in much more confident (here 100×) non-reference genotype calls than analysis at the fragment level (fragment-based). (<b>d</b>) To account for these effects, which can be large for low coverage sequencing projects like the 1000G Project, we implemented a fragment-based genotyping algorithm in the Unified Genotyper of the Genome Analysis Toolkit (GATK). 
Use of this new caller has a significant impact on SNP call quality, shown by a smaller number of novel SNP calls and a higher Transition∶Transversion ratio (proxies for accuracy <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002604#pcbi.1002604-DePristo1" target="_blank">[27]</a>). The effect is pronounced for populations such as MXL and ASW, which have a higher fraction of newer Illumina sequencing data with longer reads (e.g., ASW data is reads, while YRI has less than ), which results in a greatly increased rate of overlapping reads and associated errors. Abbreviations are as defined in the 1000G Project.</p>
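The gap between read-based and fragment-based confidence described in panel (c) can be illustrated with a toy likelihood calculation. This is a minimal sketch, not the GATK implementation: the function name and the per-base error rate of 0.01 are assumptions chosen for illustration, and the model only considers the case where every observation shows the non-reference allele.

```python
def homvar_vs_homref_ratio(n_obs: int, error_rate: float) -> float:
    """Likelihood ratio P(data | hom-var) / P(data | hom-ref) when all
    n_obs observations, treated as independent, show the non-reference allele."""
    if n_obs < 0 or not 0.0 < error_rate < 1.0:
        raise ValueError("n_obs must be >= 0 and error_rate must be in (0, 1)")
    return ((1.0 - error_rate) / error_rate) ** n_obs


# Hypothetical per-base error rate (roughly Q20); not a value from the caption.
ERROR_RATE = 0.01

# Read-based: the two overlapping mates are counted as two independent observations.
read_based = homvar_vs_homref_ratio(2, ERROR_RATE)

# Fragment-based: both mates derive from one DNA fragment, so they count once.
fragment_based = homvar_vs_homref_ratio(1, ERROR_RATE)

# How much more confident the read-based call is than the fragment-based call.
overconfidence = read_based / fragment_based

print(f"read-based likelihood ratio:     {read_based:.0f}")
print(f"fragment-based likelihood ratio: {fragment_based:.0f}")
print(f"overconfidence factor:           {overconfidence:.0f}x")
```

Under this assumed error rate the read-based call is about two orders of magnitude more confident, the same order as the 100× quoted in the caption; a different error rate would give a different factor.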

    Reduction in errors from joint genotype calls.

    <p>(<b>a</b>) To assess the improvement in imputation quality afforded by joint genotype calls with a SNP array (relative to calls based on sequence data alone), we measured sensitivity and specificity at sites absent from the array; errors at these sites can be reduced only through improved imputation. The Metabochip is absent from this plot, as it is not a genome-wide array. Plotted are and , the sum of which equals the number of sites where (1) the gold-standard or called genotype is non-reference and (2) the gold-standard and called genotypes disagree. Normalized values (defined in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002604#s4" target="_blank">Materials and Methods</a>) are plotted to show visual trends; actual values are given in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002604#pcbi.1002604.s016" target="_blank">Figure S16</a>. (<b>b</b>) To assess the reduction in erroneous genotype cluster locations afforded by joint genotype calls with sequence data (relative to calls based on array data alone), we measured sensitivity and specificity at sites on the array. Red bars correspond to and , measured from calls without haplotype phasing; blue bars correspond to and , measured from joint calls. As described in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002604#s4" target="_blank">Materials and Methods</a>, these experiments used 82 additional unrelated samples, absent from our other experiments, to inform cluster locations.</p>
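The site count described in panel (a), sites where at least one of the two genotypes is non-reference and the two genotypes disagree, can be sketched as follows. The genotype encoding (number of non-reference alleles: 0, 1, or 2) and the function name are assumptions for illustration; how the paper splits this total into the two plotted quantities is not recoverable from the caption, so only the sum is computed.

```python
def discordant_nonref_sites(gold, called):
    """Count sites where (1) the gold-standard or called genotype is
    non-reference and (2) the two genotypes disagree.

    Genotypes are assumed to be encoded as non-reference allele counts (0, 1, 2).
    """
    if len(gold) != len(called):
        raise ValueError("gold and called must have the same number of sites")
    return sum(
        1
        for g, c in zip(gold, called)
        if (g != 0 or c != 0) and g != c
    )


# Hypothetical genotypes at five sites for one test sample.
gold_genotypes = [0, 1, 2, 0, 1]
called_genotypes = [0, 1, 1, 1, 0]

n_discordant = discordant_nonref_sites(gold_genotypes, called_genotypes)
print(n_discordant)
```

Note that condition (1) is implied by condition (2) under this encoding, since two genotypes that disagree cannot both be 0; it is kept explicit to mirror the wording of the caption.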

    Sensitivity and specificity of data collection strategies.

    <p>For different combinations of array and sequence data, we produced joint genotype calls on chromosome 20 for 382 European samples from the 1000G project. For a single test sample, we obtained “gold-standard” genotypes from high coverage multi-technology sequencing published by the 1000G project. We then measured non-reference site sensitivity and specificity with imputation (, ) and without (, ). (<b>a</b>) (left) and (right) of calls from five array densities and four sequence coverages. The first row of each table contains results for strategies with only sequence data, and the first column contains results for strategies with only array data. A common color scheme is used across all tables, with white corresponding to 100%, red corresponding to , and yellow corresponding to 80%. (<b>b</b>) of calls; is given in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002604#pcbi.1002604.s009" target="_blank">Figure S9</a>. (<b>c</b>) for three variant frequency ranges, with frequency estimated from the non-test samples. Private variants have frequency 0% in the non-test samples. (<b>d</b>) for four sequence coverages, with separate lines that correspond to joint calls made with each SNP array. (<b>e</b>) for four array densities, with separate lines that correspond to joint calls made with each sequence coverage. No Array: from sequence data alone; 0×: from array data alone; 0.5×-4×: mean number of sequence reads per genomic position; array abbreviations are defined in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002604#s4" target="_blank">Materials and Methods</a>.</p>

    Power of best-performing gene-based rare variant method as compared to single variant association.

    <p>Power is measured across one hundred simulations of phenotypic effects at each of 24 human gene loci in N = 3K samples (as in <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1005165#pgen.1005165.g002" target="_blank">Fig 2</a>). Under each architecture (AR1, AR2, AR3), the power of the best-performing gene-based test at alpha = 2.5e-06 (SKAT-O) is compared to single variant association (Fisher’s exact) at alpha = 5e-08 (panels A, C, E). No MAF threshold was applied to the single variant association tests; gene-based tests included only variants with MAF<1%. Blue boxplot shows range of power for single variant association across genes simulated; pink shows power of the gene-based test alone; green shows the fraction of loci detected only by gene-based test (and not single variant association); yellow shows the combined power of both gene-based and single variant association. Next to each boxplot (panels B, D, F) are scatterplots on which each simulated locus (under AR1, AR2, and AR3, respectively) is represented as a point based on the minor allele frequency (x-axis) and association p-value (y-axis) of the single most-associated variant (the top individual signal) across the locus. Single variant association detects loci plotted above the upper dotted line (at 5e-08), while gene-based association identifies a distinct subset of loci (those highlighted in pink, where the SKAT-O p-value is <2.5e-06). This latter group of loci are those where the top single variant is preferentially rare (and no common variant association signal exists); right-most scatterplots zoom into this portion of the x-axis (MAF<1%). Similar plots for AR4, AR5, and AR6 are shown in <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1005165#pgen.1005165.s011" target="_blank">S10 Fig</a>.</p>
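The four boxplot quantities (single variant power, gene-based power, gene-based only, and combined) can be derived from paired p-values per simulation. The sketch below is illustrative only: the two thresholds are taken from the caption, while the function name, the dictionary keys, and the example p-values are hypothetical.

```python
SINGLE_ALPHA = 5e-08   # single variant threshold stated in the caption
GENE_ALPHA = 2.5e-06   # gene-based threshold stated in the caption


def power_summary(single_p, gene_p):
    """Fraction of simulations detected by each test, by the gene-based test
    only, and by either test. Inputs are paired p-values, one per simulation."""
    if len(single_p) != len(gene_p) or not single_p:
        raise ValueError("need two equally sized, non-empty p-value lists")
    n = len(single_p)
    single_hit = [p < SINGLE_ALPHA for p in single_p]
    gene_hit = [p < GENE_ALPHA for p in gene_p]
    return {
        "single": sum(single_hit) / n,
        "gene": sum(gene_hit) / n,
        "gene_only": sum(g and not s for s, g in zip(single_hit, gene_hit)) / n,
        "combined": sum(g or s for s, g in zip(single_hit, gene_hit)) / n,
    }


# Hypothetical p-values from four simulations at one locus.
single_variant_p = [1e-9, 1e-3, 1e-4, 0.2]
gene_based_p = [1e-7, 1e-7, 0.5, 0.3]

summary = power_summary(single_variant_p, gene_based_p)
print(summary)
```

In the figure the same calculation would be run over one hundred simulations per locus, giving one value per gene for each boxplot.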

    Power of different gene-based rare variant association methods at simulated disease loci.

    <p>At each gene locus, one hundred independent simulations of phenotypic effects were generated in a sample size of 3K individuals (1.5K cases / 1.5K controls). Variant effects were drawn from varied models of genetic architecture (<b>A-F</b>), hypothesizing different degrees of purifying selection against disease alleles (see <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1005165#sec007" target="_blank">Methods</a>). Under models with strong selection, there is a strong inverse correlation between variant frequency and effect size; under weak selection rare variant effects are less skewed. At all loci, genetic variants together contribute 1% of the phenotypic variance underlying a trait with common prevalence (8%; modeled as type 2 diabetes). Power is measured as the fraction out of 100 simulations of each gene in which a gene-based test reported a p-value lower than the significance threshold. In (<b>A-C</b>), causal variants span the full frequency spectrum (including common alleles), and thus rare alleles account for only a fraction of the locus heritability; in (<b>D-E</b>), all causal variants are rare (MAF<1%). In (<b>F</b>), causal variants have bi-directional effects (some increase risk of disease, while others reduce risk).</p>
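One way to make a locus explain a fixed share of phenotypic variance, such as the 1% stated above, is to rescale the sampled effect sizes. The sketch below is an assumption, not the procedure from the paper's Methods: it uses the textbook additive approximation in which a variant with allele frequency p and effect b (in phenotypic standard deviation units) explains 2p(1-p)b² of the variance, and it ignores linkage disequilibrium between variants. All names and example values are hypothetical.

```python
import math


def locus_variance_explained(freqs, betas):
    """Variance explained by a locus under an additive model with independent
    variants in Hardy-Weinberg equilibrium: sum of 2p(1-p)b^2 over variants."""
    if len(freqs) != len(betas):
        raise ValueError("freqs and betas must have the same length")
    return sum(2.0 * p * (1.0 - p) * b * b for p, b in zip(freqs, betas))


def rescale_effects(freqs, betas, target):
    """Scale all effect sizes by one common factor so the locus explains
    exactly `target` of the phenotypic variance under the model above."""
    current = locus_variance_explained(freqs, betas)
    if current <= 0.0:
        raise ValueError("locus must explain a positive amount of variance")
    scale = math.sqrt(target / current)
    return [b * scale for b in betas]


# Hypothetical causal variants: allele frequencies and raw effect sizes.
allele_freqs = [0.002, 0.005, 0.2]
raw_betas = [1.0, 0.5, 0.05]

scaled_betas = rescale_effects(allele_freqs, raw_betas, target=0.01)
explained = locus_variance_explained(allele_freqs, scaled_betas)
print(f"variance explained after rescaling: {explained:.4f}")
```

A common scale factor preserves the relative relationship between frequency and effect size, so the shape imposed by the architecture model is unchanged while the locus total is fixed.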

    Properties of loci at which gene-based methods report discordant results.

    <p>Characteristics of causal loci at which KBAC (the method with highest mean power at nominal levels of significance) produces discordant results as compared to another gene-based method. Results are shown above for the simulated architecture AR2 in 3K samples. KBAC is compared to the <b>(A)</b> C-ALPHA, <b>(B)</b> BURDEN, and <b>(C)</b> UNIQ gene-based methods. In each comparison, loci are identified at which KBAC (but not the other method) reports a p-value < 0.01, or at which the other method (but not KBAC) reports a p-value < 0.01. For each group of loci, leftmost vioplot shows the distribution of aggregate case:control counts (number of minor alleles observed in cases divided by number of minor alleles observed in controls, for variants with MAF<1%). Middle vioplot shows distribution of case-unique counts (number of observations of alleles that are only present in cases and absent from controls). Rightmost vioplot shows distribution of the top single variant p-value observed for an exonic variant at the locus (log10 scale). Line plots at right show the distribution of variants (MAF < 1%) at representative simulated loci where the methods are discordant. Each line represents a variant; height above line measures the variant’s case counts, while height below measures control counts. Red lines highlight variants which drive the difference in test performance.</p>
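The two count-based locus summaries defined in the caption can be computed from per-variant minor allele counts. This is a minimal sketch: the function names and example counts are hypothetical, the inputs are assumed to be already restricted to variants with MAF<1%, and returning infinity when controls carry no minor alleles is a choice made here, not something the caption specifies.

```python
def aggregate_case_control_ratio(case_counts, control_counts):
    """Minor alleles observed in cases divided by minor alleles observed in
    controls, summed over all variants at the locus."""
    total_case = sum(case_counts)
    total_control = sum(control_counts)
    if total_control == 0:
        # Assumption: report infinity rather than raising when controls have none.
        return float("inf")
    return total_case / total_control


def case_unique_count(case_counts, control_counts):
    """Number of observations of alleles present in cases and absent from controls."""
    if len(case_counts) != len(control_counts):
        raise ValueError("count lists must have the same length")
    return sum(
        case
        for case, control in zip(case_counts, control_counts)
        if case > 0 and control == 0
    )


# Hypothetical minor allele counts for four rare variants at one locus.
cases = [3, 1, 0, 2]
controls = [1, 0, 2, 0]

ratio = aggregate_case_control_ratio(cases, controls)
unique = case_unique_count(cases, controls)
print(f"aggregate case:control ratio: {ratio}")
print(f"case-unique observations:     {unique}")
```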

    Power of gene-based methods as a function of sample size, locus effect size, and neutral variation.

    <p>Power was measured across one hundred simulations at each of 24 gene loci (as in Figs <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1005165#pgen.1005165.g002" target="_blank">2</a> and <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1005165#pgen.1005165.g003" target="_blank">3</a>). Across all panels above, variant effects were drawn from the architecture model AR2 (assuming moderate selection against causal variants, and thus modest inverse correlation between variant frequency and effect size). In <b>(A)</b>, variant effects were sampled at each locus such that the total fraction of phenotypic variance explained by the locus was ~0.5%, 1% (as in Figs <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1005165#pgen.1005165.g002" target="_blank">2</a> and <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1005165#pgen.1005165.g003" target="_blank">3</a>) or 2%. In <b>(B)</b>, loci were simulated to explain 1% of phenotypic variance in sample sizes of 1.5K cases/1.5K controls (as in Figs <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1005165#pgen.1005165.g002" target="_blank">2</a> and <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1005165#pgen.1005165.g003" target="_blank">3</a>) and 5K cases/5K controls. In both <b>(A)</b> and <b>(B)</b>, all exonic variants with MAF < 1% were included in the burden test (both causal and non-causal variants, resulting in fewer than 50% of all tested variants being causal). In <b>(C)</b>, non-causal (neutral) variants were selectively removed such that the ratio of causal variants to total variants tested ranged from 0.25 to 1 (only causal variants tested). The gene-based methods each have varied performance under different locus effect sizes, sample sizes, and causal variant filtering scenarios.</p>

    Generation of simulated genotype data at human gene loci in large sample sizes with HAPGEN2.

    <p>Haplotypes were simulated at ‘average’ human protein-coding genes drawn from the center of the distribution of RefSeq gene total exon length <b>(A)</b>. Vertical dotted lines in red and green indicate the median and mean values of exon length, respectively. Black bar represents the 24 genes selected for simulation. <b>(B,C)</b> Site frequency spectrum of simulated data, as compared to observed human data. Data were simulated via staged expansion of 1000 Genomes Project haplotypes using the HAPGEN2 software; the mutation parameter was fit to match the site frequency spectrum of protein-coding variation observed in exome sequencing studies, e.g., as reported by Nelson et al. 2012. Raw simulated data from HAPGEN2 in large sample sizes produced an excess of rare sites; these were down-sampled to match observed data. The grey area in <b>(B)</b> represents the [5%, 95%] interval across all simulated genes, obtained using bootstrapping. The site frequency spectrum of simulated data in a smaller sample size (N = 2.7K) also matched an independent set of observed exome sequencing data from the GoT2D consortium <b>(C)</b>. Haplotype structure, as measured by linkage disequilibrium between variants, was also preserved in the simulated data after sample expansion <b>(D)</b>. The inset shows a representative example of simulations at the GATA3 gene locus.</p>