24 research outputs found

    A fast algorithm for detecting gene-gene interactions in genome-wide association studies

    With the recent advent of high-throughput genotyping techniques, genetic data for genome-wide association studies (GWAS) have become increasingly available, which entails the development of efficient and effective statistical approaches. Although many such approaches have been developed and used to identify single-nucleotide polymorphisms (SNPs) that are associated with complex traits or diseases, few are able to detect gene-gene interactions among different SNPs. Genetic interactions, also known as epistasis, have been recognized to play a pivotal role in contributing to the genetic variation of phenotypic traits. However, because of an extremely large number of SNP-SNP combinations in GWAS, the model dimensionality can quickly become so overwhelming that no prevailing variable selection methods are capable of handling this problem. In this paper, we present a statistical framework for characterizing main genetic effects and epistatic interactions in a GWAS study. Specifically, we first propose a two-stage sure independence screening (TS-SIS) procedure and generate a pool of candidate SNPs and interactions, which serve as predictors to explain and predict the phenotypes of a complex trait. We also propose a rates adjusted thresholding estimation (RATE) approach to determine the size of the reduced model selected by an independence screening. Regularization regression methods, such as LASSO or SCAD, are then applied to further identify important genetic effects. Simulation studies show that the TS-SIS procedure is computationally efficient and has an outstanding finite sample performance in selecting potential SNPs as well as gene-gene interactions. We apply the proposed framework to analyze an ultrahigh-dimensional GWAS data set from the Framingham Heart Study, and select 23 active SNPs and 24 active epistatic interactions for the body mass index variation. 
It shows the capability of our procedure to resolve the complexity of genetic control. Comment: Published at http://dx.doi.org/10.1214/14-AOAS771 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
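The screening idea in this abstract can be illustrated with a minimal sketch. This is not the authors' TS-SIS procedure: it assumes plain absolute Pearson correlation as the marginal score, uses fixed cutoffs (`n_main`, `n_pairs`, both hypothetical) in place of the RATE thresholding rule, and omits the final LASSO or SCAD step. The function names and the simulated data are illustrative only.

```python
import itertools
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def top_k(scores, k):
    """Keys of the k largest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def two_stage_screen(columns, y, n_main, n_pairs):
    # Stage 1: rank single SNPs by marginal correlation with the phenotype.
    main_scores = {j: abs_corr(col, y) for j, col in enumerate(columns)}
    main = sorted(top_k(main_scores, n_main))
    # Stage 2: rank pairwise products, but only among the retained SNPs,
    # so the number of candidate interactions stays small.
    pair_scores = {}
    for i, j in itertools.combinations(main, 2):
        product = [a * b for a, b in zip(columns[i], columns[j])]
        pair_scores[(i, j)] = abs_corr(product, y)
    return main, top_k(pair_scores, n_pairs)


# Simulated toy data (an assumption, not from the paper): 60 SNPs coded
# 0/1/2, with main effects at SNPs 3 and 7 plus their interaction.
rng = random.Random(0)
n, p = 300, 60
columns = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n)]
           for _ in range(p)]
y = [2 * columns[3][i] + 2 * columns[7][i]
     + 1.5 * columns[3][i] * columns[7][i] + rng.gauss(0, 1)
     for i in range(n)]

main_idx, pair_idx = two_stage_screen(columns, y, n_main=8, n_pairs=3)
```

Restricting the second stage to pairs of retained SNPs reduces the search from p(p-1)/2 candidate products to n_main(n_main-1)/2. The cost of that choice is that an interaction whose SNPs have no marginal signal is missed.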

    Analyzing genome-wide association studies with an FDR controlling modification of the Bayesian information criterion

    The prevailing method of analyzing GWAS data is still to test each marker individually, although from a statistical point of view it is quite obvious that in the case of complex traits such single-marker tests are not ideal. Recently several model selection approaches for GWAS have been suggested, most of them based on LASSO-type procedures. Here we discuss an alternative model selection approach based on a modification of the Bayesian Information Criterion (mBIC2), which was previously shown to have certain asymptotic optimality properties in terms of minimizing the misclassification error. Heuristic search strategies are introduced which attempt to find the model that minimizes mBIC2, and which are efficient enough to allow the analysis of GWAS data. Our approach is implemented in a software package called MOSGWA. Its performance in case-control GWAS is compared with the two algorithms HLASSO and GWASelect, as well as with single-marker tests, in a simulation study based on real SNP data from the POPRES sample. Our results show that MOSGWA performs slightly better than HLASSO, whereas according to our simulations GWASelect does not control the type I error when used to automatically determine the number of important SNPs. We also reanalyze the GWAS data from the Wellcome Trust Case-Control Consortium (WTCCC) and compare the findings of the different procedures.

    General Framework for Meta-Analysis of Haplotype Association Tests

    For complex traits, most associated single nucleotide variants (SNVs) discovered to date have a small effect, and detection of association is only possible with large sample sizes. Because of patient confidentiality concerns, it is often not possible to pool genetic data from multiple cohorts, and meta-analysis has emerged as the method of choice to combine results from multiple studies. Many meta-analysis methods are available for single SNV analyses. As new approaches allow the capture of low frequency and rare genetic variation, it is of interest to jointly consider multiple variants to improve power. However, for the analysis of haplotypes formed by multiple SNVs, meta-analysis remains a challenge, because different haplotypes may be observed across studies. We propose a two-stage meta-analysis approach to combine haplotype analysis results. In the first stage, each cohort estimates haplotype effect sizes in a regression framework, accounting for relatedness among observations if appropriate. For the second stage, we use a multivariate generalized least squares meta-analysis approach to combine haplotype effect estimates from multiple cohorts. Haplotype-specific association tests and a global test of independence between haplotypes and traits are obtained within our framework. We demonstrate through simulation studies that we control the type-I error rate, and our approach is more powerful than inverse variance weighted meta-analysis of single SNV analysis when haplotype effects are present. We replicate a published haplotype association between fasting glucose-associated locus (G6PC2) and fasting glucose in seven studies from the Cohorts for Heart and Aging Research in Genomic Epidemiology Consortium and we provide more precise haplotype effect estimates.
Generation Scotland: Generation Scotland received core funding from the Chief Scientist Office of the Scottish Government Health Directorate CZD/16/6 and the Scottish Funding Council HR03006. 
Genotyping of the GS:SFHS samples was carried out by the Genetics Core Laboratory at the Wellcome Trust Clinical Research Facility, Edinburgh, Scotland, and was funded by the UK's Medical Research Council. Ethics approval for the study was given by the NHS Tayside committee on research ethics (reference 05/S1401/89). We are grateful to all the families who took part, the general practitioners and the Scottish School of Primary Care for their help in recruiting them, and the whole Generation Scotland team, which includes interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists, healthcare assistants and nurses. FamHS: Family Heart Study was supported by NIH grants RO1-HL-087700 and RO1-HL-088215 (M.A.P., PI) from NHLBI, and RO1-DK-8925601 and RO1-DK-075681 (I.B.B., PI) from NIDDK. MESA: MESA and the MESA SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts N01-HC-95159, N01-HC-95160, N01-HC-95161, N01-HC-95162, N01-HC-95163, N01-HC-95164, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-001079, and UL1-TR-000040. Funding for SHARe genotyping was provided by NHLBI contract N02-HL-64278. Funding for MESA Family was provided by grants R01-HL-071051, R01-HL-071205, R01-HL-071250, R01-HL-071251, R01-HL-071252, R01-HL-071258, R01-HL-071259, and UL1-RR-025005. The provision of genotyping data was supported in part by the National Center for Advancing Translational Sciences, CTSI grant UL1TR000124, and the National Institute of Diabetes and Digestive and Kidney Disease Diabetes Research Center (DRC) grant DK063491 to the Southern California Diabetes Endocrinology Research Center. 
FHS (Framingham Heart Study): Genotyping, quality control, and calling of the Illumina HumanExome BeadChip in the Framingham Heart Study was supported by funding from the National Heart, Lung and Blood Institute, Division of Intramural Research (Daniel Levy and Christopher J. O'Donnell, Principal Investigators). A portion of this research was conducted using the Linux Clusters for Genetic Analysis (LinGA) computing resources at Boston University Medical Campus. Also supported by National Institute for Diabetes and Digestive and Kidney Diseases (NIDDK) R01 DK078616, NIDDK K24 DK080140, and American Diabetes Association Mentor-Based Postdoctoral Fellowship Award #7-09-MN-32, all to Dr. Meigs. FENLAND: The Fenland Study is funded by the Medical Research Council (MC_U106179471) and Wellcome Trust. We are grateful to all the volunteers for their time and help, and to the General Practitioners and practice staff for assistance with recruitment. We thank the Fenland Study Investigators, Fenland Study Co-ordination team, and the Epidemiology Field, Data and Laboratory teams. EPIC-Potsdam: We thank all EPIC-Potsdam participants for their invaluable contribution to the study. The study was supported in part by a grant from the German Federal Ministry of Education and Research (BMBF) to the German Center for Diabetes Research (DZD e.V.). The recruitment phase of the EPIC-Potsdam study was supported by the Federal Ministry of Science, Germany (01 EA 9401) and the European Union (SOC 95201408 05 F02). The follow-up of the EPIC-Potsdam study was supported by German Cancer Aid (70-2488-Ha I) and the European Community (SOC 98200769 05 F02). Furthermore, we thank Dr. Manuela Bergmann who was responsible for the methodological and organizational work of data collections of exposures and outcomes and Wolfgang Fleischhauer for his medical expertise that was employed in case ascertainment and contacts with the physicians and Ellen Kohlsdorf for data management. 
CHS: This CHS research was supported by NHLBI contracts HHSN268201200036C, HHSN268200800007C, N01HC55222, N01HC85079, N01HC85080, N01HC85081, N01HC85082, N01HC85083, N01HC85086; and NHLBI grants HL080295, HL087652, HL103612, HL068986 with additional contribution from the National Institute of Neurological Disorders and Stroke (NINDS). Additional support was provided through AG023629 from the National Institute on Aging (NIA). A full list of CHS investigators and institutions can be found at http://www.chs-nhlbi.org/pi.htm. The provision of genotyping data was supported in part by the National Center for Advancing Translational Sciences, CTSI grant UL1TR000124, and the National Institute of Diabetes and Digestive and Kidney Disease Diabetes Research Center (DRC) grant DK063491 to the Southern California Diabetes Endocrinology Research Center. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. This is the final version of the article. It first appeared from Wiley via http://dx.doi.org/10.1002/gepi.2195
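The abstract above compares its haplotype method against inverse variance weighted meta-analysis of single SNV results. As a point of reference, that comparator can be sketched in a few lines. This is the standard fixed-effect formula, not the multivariate generalized least squares approach the paper proposes; the function name `ivw_meta` and the per-cohort numbers are hypothetical.

```python
import math


def ivw_meta(estimates, std_errors):
    """Fixed-effect inverse-variance weighted pooled estimate and its standard error."""
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


# Hypothetical per-cohort effect estimates and standard errors for one SNV.
pooled_beta, pooled_se = ivw_meta([0.10, 0.14, 0.08], [0.05, 0.10, 0.04])
```

Only summary statistics cross cohort boundaries here, which is why this style of analysis is compatible with the confidentiality constraint mentioned in the abstract.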

    Boosting and lassoing new prostate cancer SNP risk factors and their connection to selenium

    We begin by arguing that the often-used algorithm for the discovery and use of disease risk factors, stepwise logistic regression, is unstable. We then argue that there are other algorithms available that are much more stable and reliable (e.g. the lasso and gradient boosting). We then propose a protocol for the discovery and use of risk factors using lasso or boosting variable selection. We then illustrate the use of the protocol with a set of prostate cancer data and show that it recovers known risk factors. Finally, we use the protocol to identify new and important SNP-based risk factors for prostate cancer and further seek evidence for or against the hypothesis of an anticancer function for selenium in prostate cancer. We find that the anticancer effect may depend on the SNP-SNP interaction and, in particular, which alleles are present.

    Discovering candidate SNPs for resilience breeding of red clover

    Red clover is a highly valuable crop for the ruminant industry in temperate regions worldwide. It also provides multiple environmental services, such as contribution to increased soil fertility and reduced soil erosion. This study used 661 single nucleotide polymorphism (SNP) markers via targeted sequencing using seqSNP, to describe genetic diversity and population structure in 382 red clover accessions. The accessions were selected from NordGen representing red clover germplasm from Norway, Sweden, Finland and Denmark as well as from Lantmännen, a Swedish seed company. Each accession was represented by 10 individuals, which were sequenced as a pool. The mean Nei’s standard genetic distance between the accessions and genetic variation within accessions were 0.032 and 0.18, respectively. The majority of the accessions had negative Tajima’s D, suggesting that they contain significant proportions of rare alleles. A pairwise FST revealed high genetic similarity between the different cultivated types, while the wild populations were divergent. Unlike wild populations, which exhibited genetic differentiation, there was no clear differentiation among all cultivated types. A principal coordinate analysis revealed that the first principal coordinate distinguished most of the wild populations from the cultivated types, in agreement with the results obtained using a discriminant analysis of principal components and cluster analysis. Accessions of wild populations and landraces collected from southern and central Scandinavia showed a higher genetic similarity to Lantmännen accessions. It is therefore possible to link the diversity of the environments where wild populations were collected to the genetic diversity of the cultivated and wild gene pools. Additionally, least absolute shrinkage and selection operator (LASSO) models revealed associations between variation in temperature and precipitation and SNPs within genes controlling stomatal opening. 
Temperature was also related to kinase proteins, which are known to regulate plant response to temperature stress. Furthermore, the variation between wild populations and cultivars was correlated with SNPs within genes regulating root development. Overall, this study comprehensively investigated Nordic European red clover germplasm, and the results provide forage breeders with valuable information for further selection and development of red clover cultivars.
