
    Simulating autosomal genotypes with realistic linkage disequilibrium and a spiked-in genetic effect

    Background: To evaluate statistical methods for genome-wide genetic analyses, one needs to be able to simulate realistic genotypes. We here describe a method, applicable to a broad range of association study designs, that can simulate autosome-wide single-nucleotide polymorphism (SNP) data with realistic linkage disequilibrium and with spiked-in, user-specified, single- or multi-SNP causal effects. Results: Our construction uses existing genome-wide association data from unrelated case-parent triads, augmented by including a hypothetical complement triad for each triad (same parents but with a hypothetical offspring who carries the non-transmitted parental alleles). We assign offspring qualitative or quantitative traits probabilistically through a specified risk model and show that our approach destroys the risk signals from the original data. Our method can simulate genetically homogeneous or stratified populations and can simulate case-parents studies, case-control studies, case-only studies, or studies of quantitative traits. We show that allele frequencies and linkage disequilibrium structure in the original genome-wide association sample are preserved in the simulated data. We have implemented our method in an R package (TriadSim), which is freely available at the Comprehensive R Archive Network (CRAN). Conclusion: We have proposed a method for simulating genome-wide SNP data with realistic linkage disequilibrium. Our method will be useful for developing statistical methods for studying genetic associations, including higher-order effects like epistasis and gene-by-environment interactions.
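    The complement-triad construction in the abstract can be illustrated with a minimal Python sketch. It is not the TriadSim implementation (which is an R package); the function name and the 0/1/2 minor-allele-count coding are assumptions made here for illustration. Because each parent transmits one allele and withholds the other, the complement offspring's genotype at a biallelic SNP is the parental total minus the observed child's genotype.

```python
def complement_genotype(mother: int, father: int, child: int) -> int:
    """Genotype (minor-allele count) of the hypothetical complement offspring.

    Genotypes are coded 0, 1, or 2 (an assumed coding, not taken from the source).
    The complement carries the two parental alleles that the observed child did
    not receive, so its allele count is mother + father - child.
    """
    # Alleles each parent can transmit, given the parent's genotype.
    transmittable = {0: {0}, 1: {0, 1}, 2: {1}}
    for g in (mother, father, child):
        if g not in transmittable:
            raise ValueError(f"genotype must be 0, 1, or 2, got {g!r}")

    # Reject triads that violate Mendelian inheritance.
    possible_children = {
        m + f for m in transmittable[mother] for f in transmittable[father]
    }
    if child not in possible_children:
        raise ValueError("Mendelian-inconsistent triad")

    return mother + father - child


if __name__ == "__main__":
    # Two heterozygous parents, child received no minor alleles:
    # the complement receives both.
    print(complement_genotype(1, 1, 0))
```

    Applying this SNP by SNP to every triad yields the augmented set of real and complement triads the abstract describes.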

    Additional file 1: Fig. S1. of Simulating autosomal genotypes with realistic linkage disequilibrium and a spiked-in genetic effect

    Fig. S1. Genotype correlation (R) between rare SNP pairs within 200 kb of each other in the original data plotted against the corresponding R in a single simulated data set. Red triangles represent the SNP pairs with an observed R that differs from that based on the original data by at least 0.1 (LD discrepant pairs). a) 0% discrepant among 16 pairs of SNPs both with 0.04 < MAF ≤ 0.05 in the original data; b) 0% discrepant among 26 pairs of SNPs both with 0.03 < MAF ≤ 0.04; c) 2.6% discrepant among 38 pairs of SNPs both with 0.02 < MAF ≤ 0.03; d) 8.6% discrepant among 35 pairs of SNPs both with 0.01 < MAF ≤ 0.02; e) 31% discrepant among 13 pairs of SNPs both with 0.005 < MAF ≤ 0.01; f) 14.2% discrepant among 296 pairs of SNPs both with MAF ≤ 0.005.

    Fig. S2. Average squared genotype correlations (R²) between loci plotted against the distance between them. This figure is similar to Fig. 2 in the text but shows the LD decay for SNPs up to 200 kb apart (to facilitate comparison to Additional file 1: Fig. S3). The black line shows the curve based on the original data while the red line shows the corresponding averaged value based on 1000 simulated data sets. The two lines coincide and only the red line is visible.

    Fig. S3. Average squared genotype correlations (R²) between loci plotted against the distance between them for rare SNPs. The black line shows the curve based on the original data while the red line shows the corresponding averaged value based on 1000 simulated data sets. When the two lines coincide only the red line is visible. a) 1782 pairs of SNPs both with MAF ≤ 0.05; b) 1495 pairs of SNPs both with MAF ≤ 0.04; c) 1147 pairs of SNPs both with MAF ≤ 0.03; d) 848 pairs of SNPs both with MAF ≤ 0.02; e) 593 pairs of SNPs both with MAF ≤ 0.01; f) 446 pairs of SNPs both with MAF ≤ 0.005.

    Fig. S4. Comparison of minor allele frequencies (MAFs) in the original data versus those in a single simulated data set for rare SNPs (MAF ≤ 0.05). The crosses represent the SNPs with MAF in the simulated data that fall outside 95% binomial prediction intervals calculated using the MAF in the original data as the true MAF (these MAF discrepant SNPs should make up about 5% of SNPs by definition). The colors denote SNPs in different MAF ranges in the original data: orange, 2.8% discrepant among 178 SNPs with 0.04 < MAF ≤ 0.05; blue, 5.6% discrepant among 214 SNPs with 0.03 < MAF ≤ 0.04; green, 4.8% discrepant among 228 SNPs with 0.02 < MAF ≤ 0.03; purple, 5.2% discrepant among 248 SNPs with 0.01 < MAF ≤ 0.02; red, 7.9% discrepant among 151 SNPs with 0.005 < MAF ≤ 0.01; black, 4.7% discrepant among 852 SNPs with MAF ≤ 0.005. Overall, 4.97% of 1871 SNPs with MAF ≤ 0.05 lay outside their corresponding 95% prediction interval.

    Fig. S5. Empirical coverage of nominal 95% binomial prediction intervals for rare SNPs (MAF ≤ 0.05) plotted against the SNP's minor allele frequency (MAF) in the original data. Prediction intervals are calculated for each SNP in each simulated data set using the SNP's MAF in the original data as its true MAF. Empirical coverage for a SNP is calculated as the proportion of 1000 simulated data sets in which the SNP's observed MAF was within its prediction interval. Each point represents empirical coverage for one of 1871 SNPs with MAF ≤ 0.05 in the simulations, based on 1000 simulated data sets. The horizontal reference lines correspond to mean and median coverage across all 10,279 SNPs in the simulations (both 95%, matching the nominal coverage) and to the 2.5th and 97.5th percentiles (93% and 97%, respectively). (DOCX 1187 kb)
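    The binomial prediction intervals used for Fig. S4 and Fig. S5 can be sketched as follows. The captions do not say how the intervals were constructed, so this example assumes an equal-tailed interval taken from binomial quantiles, with the number of sampled alleles supplied by the caller. All function names are hypothetical.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability mass, computed on the log scale to avoid overflow."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def maf_prediction_interval(n_alleles: int, maf: float, level: float = 0.95):
    """Equal-tailed prediction interval for a simulated MAF.

    Treats `maf` (from the original data) as the true frequency and the
    minor-allele count among `n_alleles` sampled alleles as binomial.
    Returns (lower, upper) on the frequency scale.
    """
    alpha = 1.0 - level
    cumulative = 0.0
    lower = None
    upper = n_alleles
    for k in range(n_alleles + 1):
        cumulative += binom_pmf(k, n_alleles, maf)
        if lower is None and cumulative >= alpha / 2:
            lower = k  # smallest count whose CDF reaches alpha/2
        if cumulative >= 1.0 - alpha / 2:
            upper = k  # smallest count whose CDF reaches 1 - alpha/2
            break
    if lower is None:
        lower = 0
    return lower / n_alleles, upper / n_alleles


def is_maf_discrepant(simulated_maf: float, n_alleles: int, original_maf: float) -> bool:
    """True when the simulated MAF falls outside the prediction interval."""
    low, high = maf_prediction_interval(n_alleles, original_maf)
    return not (low <= simulated_maf <= high)
```

    Under these assumptions, empirical coverage for a SNP is the fraction of simulated data sets for which `is_maf_discrepant` returns `False`.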

    METHODS FOR DETECTING HIGHER-ORDER GENETIC INTERACTIONS IN NUCLEAR-FAMILY-BASED STUDIES

    Many diseases are believed to be complex in etiology, with risk influenced by multiple genetic variants, potentially jointly with environmental exposures. Genetic association studies have primarily focused on identifying disease-associated single nucleotide polymorphisms (SNPs) one-by-one, without considering possible synergistic effects on risk, in part due to methodological limitations. Sophisticated search algorithms are required to sift through the combinations of potentially interacting SNPs, but few have been developed for nuclear-family-based studies. This dissertation seeks to develop improved algorithms to mine for genetic interactions in nuclear-family-based data. In Chapter 2, we propose a genetic algorithm, called GADGETS (Genetic Algorithm for Detecting Genetic Epistasis using Triads or Siblings), to detect higher-order SNP-by-SNP interactions. We also develop permutation-based inferential procedures and a graphical approach for visualizing results. Through simulation, we demonstrate that GADGETS can often recover multiple interacting SNP-sets embedded among 10,000 candidates and that it outperforms existing methods. We further demonstrate its real-world use on publicly available data from a genetic association study of the birth defect orofacial clefting. In Chapter 3, we extend GADGETS to develop E-GADGETS, which can search for higher-order SNP-by-exposure interactions. We show through simulation that E-GADGETS can often recover multiple SNP-sets whose joint relationship to risk varies with exposure, regardless of whether the exposure is continuous or categorical, and even if we require searching 50,000 candidate SNPs. We further demonstrate E-GADGETS outperforms existing competitors. When applied to a case-parents dataset of children with cleft-palate-only birth defects from dbGaP, E-GADGETS detected evidence for risk-associated genetic interactions with prenatal maternal exposure to environmental tobacco smoke. 
In Chapter 4, we further extend GADGETS to flexibly search for higher-order maternal-fetal or maternally-mediated genetic interactions. In simulations based on 10,000 candidate SNPs, we show that GADGETS usually recovers SNP-sets that exhibit either type of effect, or, given two risk-related SNP-sets, can recover both. We also show through simulations that GADGETS outperforms competing methods. With real orofacial cleft data, GADGETS nominated potentially risk-related maternal-fetal interactions when applied separately in Asian and in European ancestry groups. Doctor of Philosophy