444 research outputs found

    Self-Contained Gene-Set Analysis of Expression Data: An Evaluation of Existing and Novel Methods

    Gene set methods aim to assess the overall evidence of association of a set of genes with a phenotype, such as disease or a quantitative trait. Multiple approaches for gene set analysis of expression data have been proposed. They can be divided into two types: competitive and self-contained. Self-contained methods can be used for genome-wide, candidate gene, or pathway studies, and have been reported to be more powerful than competitive methods. We therefore investigated ten self-contained methods that can be used for continuous, discrete and time-to-event phenotypes. To assess the power and type I error rate of the previously proposed and novel approaches, an extensive simulation study was completed in which the scenarios varied according to: number of genes in a gene set, number of genes associated with the phenotype, effect sizes, correlation between expression of genes within a gene set, and sample size. In addition to the simulated data, the methods were applied to a pharmacogenomic study of the drug gemcitabine. Simulation results demonstrated that, overall, Fisher's method and the global model with random effects have the highest power for a wide range of scenarios, while the analysis based on the first principal component and the Kolmogorov-Smirnov test tended to have the lowest power. The methods investigated here are likely to play an important role in identifying pathways that contribute to complex traits.
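As a rough illustration of the Fisher's method mentioned in the abstract above, the sketch below combines per-gene p-values into a single gene-set p-value. It assumes the per-gene tests are independent, which expression data within a gene set usually violate, so a study of this kind would more plausibly calibrate the statistic by permutation. The function name `fisher_combined_pvalue` is hypothetical and not taken from the paper.

```python
import math


def fisher_combined_pvalue(pvalues):
    """Combine p-values with Fisher's method, assuming independent tests.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the upper tail has a closed form:
        P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    """
    if not pvalues:
        raise ValueError("at least one p-value is required")
    for p in pvalues:
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")

    k = len(pvalues)
    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2

    # Accumulate the series sum_{j=0}^{k-1} (X/2)^j / j! term by term.
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term

    return min(1.0, math.exp(-half_stat) * total)


if __name__ == "__main__":
    # Hypothetical per-gene p-values for one gene set.
    print(fisher_combined_pvalue([0.04, 0.20, 0.01, 0.35]))
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the closed-form tail.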

    Comparison of variable and model selection methods for genetic association studies using the GAW15 simulated data

    We compared and evaluated several variable and model selection methods using Bayesian and non-Bayesian approaches for three replicates of the Genetic Analysis Workshop 15 (GAW15) simulated data. Two phenotypes were used: rheumatoid arthritis (RA) affection status as a binary trait and IgM as a continuous measure. The analyses were adjusted for sex, age, and smoking status. For both outcomes, all the methods were comparable in finding the single-nucleotide polymorphisms (SNPs) generated to have a genetic signal. We successfully identified the susceptibility SNPs for RA in the HLA region (chromosome 6) and on chromosome 18, and the susceptibility SNP for IgM on chromosome 11; however, many of the methods produced false-positive results.

    Utilizing Genotype Imputation for the Augmentation of Sequence Data

    In recent years, capabilities for genotyping large sets of single nucleotide polymorphisms (SNPs) have increased considerably, with the ability to genotype over 1 million SNP markers across the genome. This advancement in technology has led to an increase in the number of genome-wide association studies (GWAS) for various complex traits. These GWAS have implicated over 1500 SNPs as associated with disease traits. However, the SNPs identified from these GWAS are not necessarily the functional variants. Therefore, the next phase in GWAS will involve the refining of these putative loci. A next step for GWAS would be to catalog all variants, especially rarer variants, within the detected loci, followed by the association analysis of the detected variants with the disease trait. However, sequencing a locus in a large number of subjects is still relatively expensive. A more cost-effective approach would be to sequence a portion of the individuals, followed by the application of genotype imputation methods for imputing markers in the remaining individuals. A potentially attractive alternative would be to impute based on the 1000 Genomes Project; however, this has the drawback of using a reference population that does not necessarily match the disease status and LD pattern of the study population. We explored a variety of approaches for carrying out the imputation using a reference panel consisting of sequence data for a fraction of the study participants, using data from both a candidate gene sequencing study and the 1000 Genomes Project. Imputation of genetic variation based on a proportion of sequenced samples is feasible. Our results indicate the following sequencing study design guidelines, which take advantage of the recent advances in genotype imputation methodology: select the largest and most diverse reference panel for sequencing, and genotype as many "anchor" markers as possible.

    A Latent Model for Prioritization of SNPs for Functional Studies

    One difficult question facing researchers is how to prioritize SNPs detected from genetic association studies for functional studies. Often a list of the top M SNPs is determined based solely on the p-value from an association analysis, where M is determined by financial/time constraints. For many studies of complex diseases, multiple analyses have been completed, and integrating these multiple sets of results may be difficult. One may also wish to incorporate biological knowledge, such as whether the SNP is in the exon of a gene or a regulatory region, into the selection of markers to follow up. In this manuscript, we propose a Bayesian latent variable model (BLVM) for incorporating “features” about a SNP to estimate a latent “quality score”, with SNPs prioritized based on the posterior probability distribution of the rankings of these quality scores. We illustrate the method using data from an ovarian cancer genome-wide association study (GWAS). In addition to the application of the BLVM to the ovarian GWAS, we applied the BLVM to simulated data that mimic the setting involving the prioritization of markers across multiple GWAS for related diseases/traits. The top-ranked SNP by BLVM for the ovarian GWAS ranked 2nd and 7th based on p-values from analyses of all invasive and invasive serous cases. The top SNP based on serous case analysis p-value (which ranked 197th for invasive case analysis) was ranked 8th based on the posterior probability of being in the top 5 markers (0.13). In summary, the application of the BLVM allows for the systematic integration of multiple SNP “features” for the prioritization of loci for fine-mapping or functional studies, taking into account the uncertainty in ranking.
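The abstract above ranks SNPs by the posterior probability of being among the top M markers. The sketch below shows one way such a probability could be estimated from posterior draws of the latent quality scores. It does not implement the BLVM itself: the draws are assumed to come from some already-fitted model, and the function name `top_m_probability` and the input layout are hypothetical.

```python
def top_m_probability(draws, m):
    """Estimate P(SNP is ranked in the top m) from posterior draws.

    draws: list of dicts, one per posterior draw, each mapping a SNP
           identifier to its sampled quality score (higher is better).
    m:     size of the top list.
    Returns a dict mapping each SNP to the fraction of draws in which
    it was ranked among the m highest scores.
    """
    if not draws:
        raise ValueError("at least one posterior draw is required")
    if m < 1:
        raise ValueError("m must be a positive integer")

    counts = {snp: 0 for snp in draws[0]}
    for draw in draws:
        # Rank SNPs within this draw by sampled quality score.
        ranked = sorted(draw, key=draw.get, reverse=True)
        for snp in ranked[:m]:
            counts[snp] += 1

    return {snp: count / len(draws) for snp, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical draws for three SNPs; real input would come from MCMC output.
    example_draws = [
        {"rsA": 3.0, "rsB": 2.0, "rsC": 1.0},
        {"rsA": 1.0, "rsB": 3.0, "rsC": 2.0},
        {"rsA": 3.0, "rsB": 1.0, "rsC": 2.0},
        {"rsA": 2.0, "rsB": 3.0, "rsC": 1.0},
    ]
    print(top_m_probability(example_draws, m=1))
```

Summarizing over draws in this way is what lets the ranking reflect uncertainty: a SNP with a high average score but unstable rank receives a modest top-M probability.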

    Integrative Gene Set Analysis: Application to Platinum Pharmacogenomics

    Integrative genomics has the potential to uncover relevant loci, as clinical outcome and response to chemotherapies are most likely not due to a single gene (or data type) but rather a complex relationship involving genetic variation, mRNA, DNA methylation, and copy number variation. In addition to this complexity, many complex phenotypes are thought to be controlled by the interplay of multiple genes within the same molecular pathway or gene set (GS). To address these two challenges, we propose an integrative gene set analysis approach and apply this strategy to a cisplatin (CDDP) pharmacogenomics study involving lymphoblastoid cell lines for which genome-wide SNP and mRNA expression data were collected. Application of the integrative GS analysis implicated the RNA binding and cytoskeletal part GSs. The genes LMNB1 and CENPF, within the cytoskeletal part GS, were functionally validated with siRNA knockdown experiments, where the knockdown of LMNB1 and CENPF resulted in CDDP resistance in multiple cancer cell lines. This study demonstrates the utility of an integrative GS analysis strategy for detecting novel genes associated with response to cancer therapies, moving closer to tailored therapy decisions for cancer patients.
    National Institutes of Health (U.S.) (NIH/NCI GM61388); National Institutes of Health (U.S.) (NIH/NCI CA140879); National Institutes of Health (U.S.) (NIH/NCI GM86689); National Institutes of Health (U.S.) (NIH/NCI CA130828); National Institutes of Health (U.S.) (NIH/NCI CA138461); National Institutes of Health (U.S.) (NIH/NCI CA102701); Mayo Foundation for Medical Education and Researc

    Comparison of tagging single-nucleotide polymorphism methods in association analyses

    Several methods to identify tagging single-nucleotide polymorphisms (SNPs) are in common use for genetic epidemiologic studies; however, there may be loss of information when using only a subset of SNPs. We sought to compare the ability of commonly used pairwise, multimarker, and haplotype-based tagging SNP selection methods to detect known associations with quantitative expression phenotypes. Using data from HapMap release 21 on unrelated Utah residents with ancestors from northern and western Europe (CEPH-Utah, CEU), we selected tagging SNPs in five chromosomal regions using ldSelect, Tagger, and TagSNPs. We found that SNP subsets did not substantially overlap, and that the use of trio data did not greatly affect SNP selection. We then tested associations between HapMap genotypes and expression phenotypes on 28 CEU individuals as part of Genetic Analysis Workshop 15. Relative to the use of all SNPs (n = 210 SNPs across all regions), most subset methods were able to detect single-SNP and haplotype associations. Generally, pairwise selection approaches worked extremely well relative to the use of all SNPs, with marked reductions in the number of SNPs required. Haplotype-based approaches, which had identified smaller SNP subsets, missed associations in some regions. We conclude that the optimal tagging SNP method depends on the true model of the genetic association (i.e., whether a SNP or haplotype is responsible); unfortunately, this is often unknown at the time of SNP selection. Additional evaluations using empirical and simulated data are needed.
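To make the idea of pairwise tagging in the abstract above concrete, the sketch below greedily picks tag SNPs so that every SNP is either a tag or correlated with a tag above a threshold. It is a simplified illustration, not the algorithm of ldSelect, Tagger, or TagSNPs. It also approximates the LD measure r² by the squared Pearson correlation of genotype codes (0, 1, 2), which is an assumption made here; the function names and the 0.8 default threshold are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (coded 0/1/2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) * (a - mean_x) for a in x)
    syy = sum((b - mean_y) * (b - mean_y) for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def greedy_tag_snps(genotypes, threshold=0.8):
    """Greedy pairwise tag selection.

    genotypes: dict mapping SNP name -> list of genotype codes per individual.
    Repeatedly picks the SNP that covers the most still-untagged SNPs,
    where a SNP covers itself and every SNP with r^2 >= threshold.
    """
    names = list(genotypes)
    covers = {
        a: {b for b in names
            if a == b or r_squared(genotypes[a], genotypes[b]) >= threshold}
        for a in names
    }
    untagged = set(names)
    tags = []
    while untagged:
        best = max(names, key=lambda s: len(covers[s] & untagged))
        tags.append(best)
        untagged -= covers[best]
    return tags


if __name__ == "__main__":
    # Hypothetical genotypes for five individuals at three SNPs.
    example = {
        "snp1": [0, 1, 2, 1, 0],
        "snp2": [0, 1, 2, 1, 0],  # perfectly correlated with snp1
        "snp3": [2, 0, 1, 0, 2],
    }
    print(greedy_tag_snps(example))
```

In this toy input, `snp1` and `snp2` are redundant, so only one of them is needed as a tag, while `snp3` must be kept separately.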

    Methylation of Leukocyte DNA and Ovarian Cancer: Relationships with Disease Status and Outcome

    Genome-wide interrogation of DNA methylation (DNAm) in blood-derived leukocytes has become feasible with the advent of CpG genotyping arrays. In epithelial ovarian cancer (EOC), one report found substantial DNAm differences between cases and controls; however, many of these disease-associated CpGs were attributed to differences in white blood cell type distributions. We examined blood-based DNAm in 336 EOC cases and 398 controls; we included only high-quality CpG loci that did not show evidence of association with white blood cell type distributions to evaluate association with case status and overall survival

    Assessment of genotype imputation methods

    Several methods have been proposed to impute genotypes at untyped markers using observed genotypes and genetic data from a reference panel. We used the Genetic Analysis Workshop 16 rheumatoid arthritis case-control dataset to compare the performance of four of these imputation methods: IMPUTE, MACH, PLINK, and fastPHASE. We compared the methods' imputation error rates and the performance of association tests using the imputed data, in the context of imputing completely untyped markers as well as imputing missing genotypes to combine two datasets genotyped at different sets of markers. As expected, all methods performed better for single-nucleotide polymorphisms (SNPs) in high linkage disequilibrium with genotyped SNPs. However, MACH and IMPUTE generated lower imputation error rates than fastPHASE and PLINK. Association tests based on allele "dosage" from MACH and tests based on the posterior probabilities from IMPUTE provided results closest to those based on complete data. However, in both situations, none of the imputation-based tests provided the same level of evidence of association as the complete data at SNPs strongly associated with disease.
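The quantities compared in the abstract above can be illustrated with a small sketch. It assumes each imputed genotype is given as three posterior probabilities for 0, 1, and 2 copies of the counted allele, computes the expected allele count ("dosage"), and scores the best-guess genotypes against known truth. The function names are hypothetical, and the code does not read or reproduce the output format of IMPUTE, MACH, PLINK, or fastPHASE.

```python
def allele_dosage(probs):
    """Expected allele count from posterior probabilities (p0, p1, p2)."""
    p0, p1, p2 = probs
    if abs(p0 + p1 + p2 - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return p1 + 2.0 * p2


def best_guess_genotype(probs):
    """Genotype (0, 1, or 2) with the highest posterior probability."""
    return max(range(3), key=lambda g: probs[g])


def imputation_error_rate(imputed_probs, true_genotypes):
    """Fraction of best-guess genotypes that differ from the true genotypes."""
    if len(imputed_probs) != len(true_genotypes) or not true_genotypes:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = sum(
        1 for probs, truth in zip(imputed_probs, true_genotypes)
        if best_guess_genotype(probs) != truth
    )
    return errors / len(true_genotypes)


if __name__ == "__main__":
    # Hypothetical posterior probabilities for four individuals at one SNP.
    imputed = [
        (0.90, 0.09, 0.01),
        (0.10, 0.20, 0.70),
        (0.30, 0.60, 0.10),
        (0.55, 0.40, 0.05),
    ]
    truth = [0, 2, 1, 1]
    print([allele_dosage(p) for p in imputed])
    print(imputation_error_rate(imputed, truth))
```

Dosage keeps the uncertainty that a best-guess call discards: in the last hypothetical individual, the best guess is 0 and counts as an error, while the dosage of 0.5 still leans toward the true heterozygous genotype.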