Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing
We consider the problems of hypothesis testing and model comparison under a
flexible Bayesian linear regression model whose formulation is closely
connected with the linear mixed effect model and the parametric models for SNP
set analysis in genetic association studies. We derive a class of analytic
approximate Bayes factors and illustrate their connections with a variety of
frequentist test statistics, including the Wald statistic and the variance
component score statistic. Taking advantage of Bayesian model averaging and
hierarchical modeling, we demonstrate some distinct advantages and
flexibilities in the approaches utilizing the derived Bayes factors in the
context of genetic association studies. We demonstrate our proposed methods
using real or simulated numerical examples in applications of single SNP
association testing, multi-locus fine-mapping and SNP set association testing.
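As an illustration of how an analytic approximate Bayes factor can relate to the Wald statistic, here is a minimal sketch of the widely used Wakefield-style approximation for a single SNP. It is not the class of Bayes factors derived in the paper above; the function name, the normal prior on the effect size, and the parameter `prior_sd` are assumptions made for this example.

```python
import math


def approximate_bayes_factor(beta_hat, se, prior_sd):
    """Wakefield-style approximate Bayes factor (alternative vs. null).

    Assumes beta_hat ~ N(beta, se^2) and a N(0, prior_sd^2) prior on beta
    under the alternative. All names here are illustrative.
    """
    v = se ** 2                    # sampling variance of the estimate
    w = prior_sd ** 2              # prior variance of the effect
    wald = (beta_hat / se) ** 2    # squared Wald z-statistic
    shrink = w / (v + w)           # shrinkage factor W / (V + W)
    # BF = sqrt(V / (V + W)) * exp(wald / 2 * W / (V + W))
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * wald * shrink)
```

For a fixed standard error and prior, the Bayes factor in this sketch is a monotone function of the squared Wald statistic, so the two rank SNPs identically in that setting.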
Effective Genetic Risk Prediction Using Mixed Models
To date, efforts to produce high-quality polygenic risk scores from
genome-wide studies of common disease have focused on estimating and
aggregating the effects of multiple SNPs. Here we propose a novel statistical
approach for genetic risk prediction, based on random and mixed effects models.
Our approach (termed GeRSI) circumvents the need to estimate the effect sizes
of numerous SNPs by treating these effects as random, producing predictions
which are consistently superior to current state of the art, as we demonstrate
in extensive simulation. When applying GeRSI to seven phenotypes from the WTCCC
study, we confirm that the use of random effects is most beneficial for
diseases that are known to be highly polygenic: hypertension (HT) and bipolar
disorder (BD). For HT, there are no significant associations in the WTCCC data.
The best existing model yields an AUC of 54%, while GeRSI improves it to 59%.
For BD, using GeRSI improves the AUC from 55% to 62%. For individuals ranked at
the top 10% of BD risk predictions, using GeRSI substantially increases the BD
relative risk from 1.4 to 2.5.
Comment: main text: 14 pages, 3 figures. Supplementary text: 16 pages, 21 figures
Replication in Genome-Wide Association Studies
Replication helps ensure that a genotype-phenotype association observed in a
genome-wide association (GWA) study represents a credible association and is
not a chance finding or an artifact due to uncontrolled biases. We discuss
prerequisites for exact replication, issues of heterogeneity, advantages and
disadvantages of different methods of data synthesis across multiple studies,
frequentist vs. Bayesian inferences for replication, and challenges that arise
from multi-team collaborations. While consistent replication can greatly
improve the credibility of a genotype-phenotype association, it may not
eliminate spurious associations due to biases shared by many studies.
Conversely, lack of replication in well-powered follow-up studies usually
invalidates the initially proposed association, although occasionally it may
point to differences in linkage disequilibrium or effect modifiers across
studies.
Comment: Published at http://dx.doi.org/10.1214/09-STS290 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Methodological Issues in Multistage Genome-Wide Association Studies
Because of the high cost of commercial genotyping chip technologies, many
investigations have used a two-stage design for genome-wide association
studies, using part of the sample for an initial discovery of "promising"
SNPs at a less stringent significance level and the remainder in a joint
analysis of just these SNPs using custom genotyping. Typical cost savings of
about 50% are possible with this design to obtain comparable levels of overall
type I error and power by using about half the sample for stage I and carrying
about 0.1% of SNPs forward to the second stage, the optimal design depending
primarily upon the ratio of costs per genotype for stages I and II. However,
with the rapidly declining costs of the commercial panels, the generally low
observed ORs of current studies, and many studies aiming to test multiple
hypotheses and multiple endpoints, many investigators are abandoning the
two-stage design in favor of simply genotyping all available subjects using a
standard high-density panel. Concern is sometimes raised about the absence of a
"replication" panel in this approach, as required by some high-profile
journals, but it must be appreciated that the two-stage design is not a
discovery/replication design but simply a more efficient design for discovery
using a joint analysis of the data from both stages. Once a subset of
highly-significant associations has been discovered, a truly independent
"exact replication" study is needed in a similar population of the same
promising SNPs using similar methods.
Comment: Published at http://dx.doi.org/10.1214/09-STS288 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
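The roughly 50% saving quoted in the abstract above can be checked with simple arithmetic. The sketch below is a simplified genotyping-cost model, not the authors' design calculation: it assumes cost is proportional to the number of genotypes and ignores power and type I error constraints. The function and parameter names are hypothetical.

```python
def two_stage_relative_cost(sample_frac_stage1, marker_frac_stage2, cost_ratio):
    """Genotyping cost of a two-stage design relative to a one-stage design.

    sample_frac_stage1: fraction of subjects genotyped on the full panel
    marker_frac_stage2: fraction of SNPs carried forward to stage II
    cost_ratio: per-genotype cost in stage II divided by that in stage I
    """
    # Stage I: a fraction of the sample is typed on every marker.
    stage1 = sample_frac_stage1
    # Stage II: the remaining subjects are typed on the selected markers only.
    stage2 = (1.0 - sample_frac_stage1) * marker_frac_stage2 * cost_ratio
    return stage1 + stage2


# Half the sample in stage I, 0.1% of SNPs carried forward, equal unit costs.
relative_cost = two_stage_relative_cost(0.5, 0.001, 1.0)
```

With these inputs the relative cost is 0.5005, that is, a saving of about 50%. Under this model the saving stays close to 50% even when custom genotyping costs several times more per genotype than the commercial panel.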
Accurate modeling of confounding variation in eQTL studies leads to a great increase in power to detect trans-regulatory effects
Expression quantitative trait loci (eQTL) studies are an integral tool to investigate the genetic component of gene expression variation. A major challenge in the analysis of such studies is hidden confounding factors, such as unobserved covariates or unknown environmental influences. These factors can induce a pronounced artifactual correlation structure in the expression profiles, which may create spurious associations or mask real genetic association signals.

Here, we report PANAMA (Probabilistic ANAlysis of genoMic dAta), a novel probabilistic model to account for confounding factors within an eQTL analysis. In contrast to previous methods, PANAMA learns hidden factors jointly with the effect of prominent genetic regulators. As a result, PANAMA can more accurately distinguish between true genetic association signals and confounding variation.

We applied our model and compared it to existing methods on a variety of datasets and biological systems. PANAMA consistently performs better than alternative methods and, in particular, finds substantially more trans regulators. Importantly, PANAMA not only identifies a greater number of associations, but also yields hits that are biologically more plausible and can be better reproduced between independent studies.
Refining the accuracy of validated target identification through coding variant fine-mapping in type 2 diabetes.
We aggregated coding variant data for 81,412 type 2 diabetes cases and 370,832 controls of diverse ancestry, identifying 40 coding variant association signals (P < 2.2 × 10⁻⁷); of these, 16 map outside known risk-associated loci. We make two important observations. First, only five of these signals are driven by low-frequency variants: even for these, effect sizes are modest (odds ratio ≤1.29). Second, when we used large-scale genome-wide association data to fine-map the associated variants in their regional context, accounting for the global enrichment of complex trait associations in coding sequence, compelling evidence for coding variant causality was obtained for only 16 signals. At 13 others, the associated coding variants clearly represent 'false leads' with potential to generate erroneous mechanistic inference. Coding variant associations offer a direct route to biological insight for complex diseases and identification of validated therapeutic targets; however, appropriate mechanistic inference requires careful specification of their causal contribution to disease predisposition.