A fast algorithm for detecting gene-gene interactions in genome-wide association studies
With the recent advent of high-throughput genotyping techniques, genetic data
for genome-wide association studies (GWAS) have become increasingly available,
which entails the development of efficient and effective statistical
approaches. Although many such approaches have been developed and used to
identify single-nucleotide polymorphisms (SNPs) that are associated with
complex traits or diseases, few are able to detect gene-gene interactions among
different SNPs. Genetic interactions, also known as epistasis, have been
recognized to play a pivotal role in contributing to the genetic variation of
phenotypic traits. However, because of an extremely large number of SNP-SNP
combinations in GWAS, the model dimensionality can quickly become so
overwhelming that no prevailing variable selection methods are capable of
handling this problem. In this paper, we present a statistical framework for
characterizing main genetic effects and epistatic interactions in a GWAS.
Specifically, we first propose a two-stage sure independence screening (TS-SIS)
procedure and generate a pool of candidate SNPs and interactions, which serve
as predictors to explain and predict the phenotypes of a complex trait. We also
propose a rates adjusted thresholding estimation (RATE) approach to determine
the size of the reduced model selected by an independence screening.
Regularization regression methods, such as LASSO or SCAD, are then applied to
further identify important genetic effects. Simulation studies show that the
TS-SIS procedure is computationally efficient and has an outstanding finite
sample performance in selecting potential SNPs as well as gene-gene
interactions. We apply the proposed framework to analyze an
ultrahigh-dimensional GWAS data set from the Framingham Heart Study, and select
23 active SNPs and 24 active epistatic interactions for the body mass index
variation. It shows the capability of our procedure to resolve the complexity
of genetic control.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS771 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
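The screening step described in the abstract above can be illustrated with a minimal sketch. It is not the authors' TS-SIS procedure: it assumes that stage one ranks SNPs by absolute marginal correlation with the phenotype, and that stage two ranks pairwise products of the retained SNPs in the same way. The names `abs_corr`, `two_stage_screen`, `d_main` and `d_inter` are hypothetical, and the fixed cut-offs stand in for the RATE thresholding, which the abstract does not specify.

```python
from itertools import combinations
from math import sqrt


def abs_corr(x, y):
    """Absolute Pearson correlation; returns 0.0 for a constant column."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)


def two_stage_screen(snps, y, d_main, d_inter):
    """Hypothetical two-stage screen (an assumption, not the paper's method).

    snps: dict mapping SNP name -> list of genotype codes (0/1/2).
    Stage 1 keeps the d_main SNPs with the largest marginal correlation.
    Stage 2 keeps the d_inter pairwise products of retained SNPs with the
    largest marginal correlation.
    """
    main_scores = {name: abs_corr(col, y) for name, col in snps.items()}
    kept = sorted(main_scores, key=main_scores.get, reverse=True)[:d_main]

    inter_scores = {}
    for a, b in combinations(kept, 2):
        product = [u * v for u, v in zip(snps[a], snps[b])]
        inter_scores[(a, b)] = abs_corr(product, y)
    kept_pairs = sorted(inter_scores, key=inter_scores.get, reverse=True)[:d_inter]
    return kept, kept_pairs


# Toy data: the phenotype depends on s1 and on the s1*s2 interaction.
snps = {
    "s1": [0, 1, 2, 0, 1, 2, 0, 1],
    "s2": [0, 0, 1, 1, 2, 2, 0, 1],
    "s3": [1, 1, 1, 1, 1, 1, 1, 1],
    "s4": [2, 0, 1, 2, 0, 1, 0, 2],
}
y = [a + 2 * a * b for a, b in zip(snps["s1"], snps["s2"])]
kept, kept_pairs = two_stage_screen(snps, y, d_main=2, d_inter=1)
```

In the framework the abstract describes, the retained main effects and interactions would then be passed to a regularized regression such as LASSO or SCAD; that step is omitted here.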
Analyzing genome-wide association studies with an FDR controlling modification of the Bayesian information criterion
The prevailing method of analyzing GWAS data is still to test each marker
individually, although from a statistical point of view it is quite obvious
that in the case of complex traits such single marker tests are not ideal. Recently
several model selection approaches for GWAS have been suggested, most of them
based on LASSO-type procedures. Here we will discuss an alternative model
selection approach which is based on a modification of the Bayesian Information
Criterion (mBIC2) which was previously shown to have certain asymptotic
optimality properties in terms of minimizing the misclassification error.
Heuristic search strategies are introduced which attempt to find the model
which minimizes mBIC2, and which are efficient enough to allow the analysis of
GWAS data.
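As a rough illustration of the criterion being minimized, the sketch below evaluates mBIC2 for a candidate model. The abstract does not state the formula; the form used here, -2 log L + k log n + 2k log(p/4) - 2 log(k!), is the one commonly given in the mBIC2 literature and should be treated as an assumption, as should the function name `mbic2` and the default constant of 4.

```python
from math import lgamma, log


def mbic2(log_lik, k, n, p, const=4.0):
    """mBIC2 value for a model with k selected markers (assumed form).

    log_lik: maximized log-likelihood of the candidate model
    k: number of markers in the model
    n: sample size
    p: total number of markers considered
    const: constant in the log(p / const) penalty (assumed to be 4)
    """
    bic_part = -2.0 * log_lik + k * log(n)
    multiplicity_penalty = 2.0 * k * log(p / const)
    # lgamma(k + 1) equals log(k!); this term relaxes the penalty as k
    # grows, which is what distinguishes mBIC2 from mBIC.
    relaxation = 2.0 * lgamma(k + 1)
    return bic_part + multiplicity_penalty - relaxation
```

A search strategy would compare `mbic2` across candidate models and prefer the smallest value.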
Our approach is implemented in a software package called MOSGWA. Its
performance in case-control GWAS is compared with the two algorithms HLASSO and
GWASelect, as well as with single marker tests, where we performed a simulation
study based on real SNP data from the POPRES sample. Our results show that
MOSGWA performs slightly better than HLASSO, whereas according to our
simulations GWASelect does not control the type I error when used to
automatically determine the number of important SNPs. We also reanalyze the
GWAS data from the Wellcome Trust Case-Control Consortium (WTCCC) and compare
the findings of the different procedures.
General Framework for Meta-Analysis of Haplotype Association Tests
For complex traits, most associated single nucleotide variants (SNV) discovered to date have a small effect, and detection of association is only possible with large sample sizes. Because of patient confidentiality concerns, it is often not possible to pool genetic data from multiple cohorts, and meta-analysis has emerged as the method of choice to combine results from multiple studies. Many meta-analysis methods are available for single SNV analyses. As new approaches allow the capture of low frequency and rare genetic variation, it is of interest to jointly consider multiple variants to improve power. However, for the analysis of haplotypes formed by multiple SNVs, meta-analysis remains a challenge, because different haplotypes may be observed across studies. We propose a two-stage meta-analysis approach to combine haplotype analysis results. In the first stage, each cohort estimates haplotype effect sizes in a regression framework, accounting for relatedness among observations if appropriate. For the second stage, we use a multivariate generalized least squares meta-analysis approach to combine haplotype effect estimates from multiple cohorts. Haplotype-specific association tests and a global test of independence between haplotypes and traits are obtained within our framework. We demonstrate through simulation studies that we control the type-I error rate, and our approach is more powerful than inverse variance weighted meta-analysis of single SNV analysis when haplotype effects are present. We replicate a published haplotype association between the fasting glucose-associated locus (G6PC2) and fasting glucose in seven studies from the Cohorts for Heart and Aging Research in Genomic Epidemiology Consortium and we provide more precise haplotype effect estimates.
Generation Scotland: Generation Scotland received core funding from the Chief Scientist Office of the Scottish Government Health Directorate CZD/16/6 and the Scottish Funding Council HR03006.
Genotyping of the GS:SFHS samples was carried out by the Genetics Core Laboratory at the Wellcome Trust Clinical Research Facility, Edinburgh, Scotland, and was funded by the UK's Medical Research Council. Ethics approval for the study was given by the NHS Tayside committee on research ethics (reference 05/S1401/89). We are grateful to all the families who took part, the general practitioners and the Scottish School of Primary Care for their help in recruiting them, and the whole Generation Scotland team, which includes interviewers, computer and laboratory technicians, clerical workers, research scientists, volunteers, managers, receptionists, healthcare assistants and nurses.
FamHS: Family Heart Study was supported by NIH grants RO1-HL-087700 and RO1-HL-088215 (M.A.P., PI) from NHLBI, and RO1-DK-8925601 and RO1-DK-075681 (I.B.B., PI) from NIDDK.
MESA: MESA and the MESA SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with MESA investigators. Support for MESA is provided by contracts N01-HC-95159, N01-HC-95160, N01-HC-95161, N01-HC-95162, N01-HC-95163, N01-HC-95164, N01-HC-95165, N01-HC-95166, N01-HC-95167, N01-HC-95168, N01-HC-95169, UL1-TR-001079, and UL1-TR-000040. Funding for SHARe genotyping was provided by NHLBI contract N02-HL-64278. Funding for MESA Family was provided by grants R01-HL-071051, R01-HL-071205, R01-HL-071250, R01-HL-071251, R01-HL-071252, R01-HL-071258, R01-HL-071259, and UL1-RR-025005. The provision of genotyping data was supported in part by the National Center for Advancing Translational Sciences, CTSI grant UL1TR000124, and the National Institute of Diabetes and Digestive and Kidney Disease Diabetes Research Center (DRC) grant DK063491 to the Southern California Diabetes Endocrinology Research Center.
FHS: Framingham Heart Study. Genotyping, quality control, and calling of the Illumina HumanExome BeadChip in the Framingham Heart Study was supported by funding from the National Heart, Lung and Blood Institute, Division of Intramural Research (Daniel Levy and Christopher J. O'Donnell, Principal Investigators). A portion of this research was conducted using the Linux Clusters for Genetic Analysis (LinGA) computing resources at Boston University Medical Campus. Also supported by National Institute for Diabetes and Digestive and Kidney Diseases (NIDDK) R01 DK078616, NIDDK K24 DK080140, and American Diabetes Association Mentor-Based Postdoctoral Fellowship Award #7-09-MN-32, all to Dr. Meigs.
FENLAND: The Fenland Study is funded by the Medical Research Council (MC_U106179471) and Wellcome Trust. We are grateful to all the volunteers for their time and help, and to the General Practitioners and practice staff for assistance with recruitment. We thank the Fenland Study Investigators, Fenland Study Co-ordination team, and the Epidemiology Field, Data and Laboratory teams.
EPIC-Potsdam: We thank all EPIC-Potsdam participants for their invaluable contribution to the study. The study was supported in part by a grant from the German Federal Ministry of Education and Research (BMBF) to the German Center for Diabetes Research (DZD e.V.). The recruitment phase of the EPIC-Potsdam study was supported by the Federal Ministry of Science, Germany (01 EA 9401) and the European Union (SOC 95201408 05 F02). The follow-up of the EPIC-Potsdam study was supported by German Cancer Aid (70-2488-Ha I) and the European Community (SOC 98200769 05 F02). Furthermore, we thank Dr. Manuela Bergmann who was responsible for the methodological and organizational work of data collections of exposures and outcomes and Wolfgang Fleischhauer for his medical expertise that was employed in case ascertainment and contacts with the physicians and Ellen Kohlsdorf for data management.
CHS: This CHS research was supported by NHLBI contracts HHSN268201200036C, HHSN268200800007C, N01HC55222, N01HC85079, N01HC85080, N01HC85081, N01HC85082, N01HC85083, N01HC85086; and NHLBI grants HL080295, HL087652, HL103612, HL068986 with additional contribution from the National Institute of Neurological Disorders and Stroke (NINDS). Additional support was provided through AG023629 from the National Institute on Aging (NIA). A full list of CHS investigators and institutions can be found at http://www.chs-nhlbi.org/pi.htm. The provision of genotyping data was supported in part by the National Center for Advancing Translational Sciences, CTSI grant UL1TR000124, and the National Institute of Diabetes and Digestive and Kidney Disease Diabetes Research Center (DRC) grant DK063491 to the Southern California Diabetes Endocrinology Research Center. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
This is the final version of the article. It first appeared from Wiley via http://dx.doi.org/10.1002/gepi.2195
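The second-stage combination described in the haplotype meta-analysis abstract above can be sketched as a generalized least squares pooling of cohort-level estimates. This is a simplified illustration, not the authors' implementation: it assumes every cohort reports effects relative to the same reference haplotype, that each cohort supplies its own estimate vector and covariance matrix, and that a cohort may observe only a subset of haplotypes. The names `invert` and `gls_meta` are hypothetical.

```python
def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    # Augment each row with the matching row of the identity matrix.
    m = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[pivot][c] == 0.0:
            raise ValueError("singular matrix")
        m[c], m[pivot] = m[pivot], m[c]
        pv = m[c][c]
        m[c] = [v / pv for v in m[c]]
        for r in range(n):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def gls_meta(studies, haplotypes):
    """Pool haplotype effects: beta = (sum X'WX)^-1 (sum X'W b).

    studies: list of (names, estimates, covariance) per cohort, where
             names lists the haplotypes that cohort observed.
    haplotypes: every haplotype to report; each must be observed by at
                least one cohort, otherwise the information matrix is
                singular.
    Returns the pooled estimates and their covariance matrix.
    """
    idx = {h: j for j, h in enumerate(haplotypes)}
    size = len(haplotypes)
    info = [[0.0] * size for _ in range(size)]
    score = [0.0] * size
    for names, est, cov in studies:
        w = invert(cov)  # within-cohort precision matrix
        for a, ha in enumerate(names):
            for b, hb in enumerate(names):
                info[idx[ha]][idx[hb]] += w[a][b]
                score[idx[ha]] += w[a][b] * est[b]
    cov_beta = invert(info)
    beta = [sum(cov_beta[i][j] * score[j] for j in range(size))
            for i in range(size)]
    return beta, cov_beta


# Toy input: cohort 1 observes h1 and h2, cohort 2 observes only h1.
studies = [
    (["h1", "h2"], [1.0, 2.0], [[1.0, 0.0], [0.0, 4.0]]),
    (["h1"], [3.0], [[1.0]]),
]
beta, cov_beta = gls_meta(studies, ["h1", "h2"])
```

With diagonal covariances, as in the toy input, the result reduces to inverse variance weighting per haplotype; the multivariate form matters when within-cohort estimates are correlated.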
Boosting and lassoing new prostate cancer SNP risk factors and their connection to selenium
We begin by arguing that the often-used algorithm for the discovery and use of disease risk factors, stepwise logistic regression, is unstable. We then argue that there are other algorithms available that are much more stable and reliable (e.g. the lasso and gradient boosting). We then propose a protocol for the discovery and use of risk factors using lasso or boosting variable selection. We then illustrate the use of the protocol with a set of prostate cancer data and show that it recovers known risk factors. Finally, we use the protocol to identify new and important SNP-based risk factors for prostate cancer and further seek evidence for or against the hypothesis of an anticancer function for selenium in prostate cancer. We find that the anticancer effect may depend on the SNP-SNP interaction and, in particular, which alleles are present.
Discovering candidate SNPs for resilience breeding of red clover
Red clover is a highly valuable crop for the ruminant industry in temperate regions worldwide. It also provides multiple environmental services, such as contribution to increased soil fertility and reduced soil erosion. This study used 661 single nucleotide polymorphism (SNP) markers via targeted sequencing using seqSNP, to describe genetic diversity and population structure in 382 red clover accessions. The accessions were selected from NordGen representing red clover germplasm from Norway, Sweden, Finland and Denmark as well as from Lantmännen, a Swedish seed company. Each accession was represented by 10 individuals, which were sequenced as a pool. The mean Nei’s standard genetic distance between the accessions and genetic variation within accessions were 0.032 and 0.18, respectively. The majority of the accessions had negative Tajima’s D, suggesting that they contain significant proportions of rare alleles. A pairwise FST revealed high genetic similarity between the different cultivated types, while the wild populations were divergent. Unlike wild populations, which exhibited genetic differentiation, there was no clear differentiation among all cultivated types. A principal coordinate analysis revealed that the first principal coordinate distinguished most of the wild populations from the cultivated types, in agreement with the results obtained using a discriminant analysis of principal components and cluster analysis. Accessions of wild populations and landraces collected from southern and central Scandinavia showed a higher genetic similarity to Lantmännen accessions. It is therefore possible to link the diversity of the environments where wild populations were collected to the genetic diversity of the cultivated and wild gene pools. Additionally, least absolute shrinkage and selection operator (LASSO) models revealed associations between variation in temperature and precipitation and SNPs within genes controlling stomatal opening.
Temperature was also related to kinase proteins, which are known to regulate plant response to temperature stress. Furthermore, the variation between wild populations and cultivars was correlated with SNPs within genes regulating root development. Overall, this study comprehensively investigated Nordic European red clover germplasm, and the results provide forage breeders with valuable information for further selection and development of red clover cultivars.