Methodological Issues in Multistage Genome-Wide Association Studies
Because of the high cost of commercial genotyping chip technologies, many
investigations have used a two-stage design for genome-wide association
studies, using part of the sample for an initial discovery of ``promising''
SNPs at a less stringent significance level and the remainder in a joint
analysis of just these SNPs using custom genotyping. Typical cost savings of
about 50% are possible with this design to obtain comparable levels of overall
type I error and power by using about half the sample for stage I and carrying
about 0.1% of SNPs forward to the second stage, the optimal design depending
primarily upon the ratio of costs per genotype for stages I and II. However,
with the rapidly declining costs of the commercial panels, the generally low
observed ORs of current studies, and many studies aiming to test multiple
hypotheses and multiple endpoints, many investigators are abandoning the
two-stage design in favor of simply genotyping all available subjects using a
standard high-density panel. Concern is sometimes raised about the absence of a
``replication'' panel in this approach, as required by some high-profile
journals, but it must be appreciated that the two-stage design is not a
discovery/replication design but simply a more efficient design for discovery
using a joint analysis of the data from both stages. Once a subset of
highly-significant associations has been discovered, a truly independent
``exact replication'' study is needed in a similar population of the same
promising SNPs using similar methods.
Comment: Published at http://dx.doi.org/10.1214/09-STS288 in the Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
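The cost trade-off described above can be made concrete with a small sketch. All specific numbers below (sample size, panel size, per-genotype prices) are illustrative assumptions, not figures taken from the abstract; only the design fractions (half the sample in stage I, 0.1% of SNPs carried forward) follow the text.

```python
# Sketch of the cost comparison between a one-stage and a two-stage GWAS
# design. The per-genotype costs and study sizes are made-up inputs chosen
# only to illustrate the roughly 50% savings mentioned in the abstract.

def one_stage_cost(n, m, cost_per_genotype):
    """Genotype all n subjects on the full m-SNP commercial panel."""
    return n * m * cost_per_genotype

def two_stage_cost(n, m, frac_stage1, frac_snps_forward, cost1, cost2):
    """Stage I: a fraction of subjects on the full panel at cost1 per
    genotype; stage II: the remaining subjects on custom genotyping of only
    the carried-forward SNPs at cost2 per genotype (custom genotyping is
    typically more expensive per genotype than a commercial chip)."""
    n1 = frac_stage1 * n
    n2 = n - n1
    stage1 = n1 * m * cost1
    stage2 = n2 * (frac_snps_forward * m) * cost2
    return stage1 + stage2

n, m = 10_000, 500_000
full = one_stage_cost(n, m, cost_per_genotype=0.01)
# Half the sample in stage I, 0.1% of SNPs carried forward, as in the text.
split = two_stage_cost(n, m, frac_stage1=0.5, frac_snps_forward=0.001,
                       cost1=0.01, cost2=0.10)
print(f"one-stage: {full:,.0f}  two-stage: {split:,.0f}  "
      f"savings: {1 - split / full:.1%}")
```

With these assumed prices the second stage is almost free relative to stage I, so the savings approach the 50% figure quoted in the abstract; the optimum shifts as the stage I/stage II cost ratio changes.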
Genome-Wide Significance Levels and Weighted Hypothesis Testing
Genetic investigations often involve the testing of vast numbers of related
hypotheses simultaneously. To control the overall error rate, a substantial
penalty is required, making it difficult to detect signals of moderate
strength. To improve the power in this setting, a number of authors have
considered using weighted p-values, with the motivation often based upon the
scientific plausibility of the hypotheses. We review this literature, derive
optimal weights and show that the power is remarkably robust to
misspecification of these weights. We consider two methods for choosing weights
in practice. The first, external weighting, is based on prior information. The
second, estimated weighting, uses the data to choose weights.
Comment: Published at http://dx.doi.org/10.1214/09-STS289 in the Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
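A minimal sketch of the weighted-testing idea, using the weighted Bonferroni rule (reject hypothesis i when p_i <= w_i * alpha / m, with nonnegative weights averaging to 1 so overall error control is preserved). The p-values and "plausibility" weights below are invented for illustration and do not come from the article.

```python
# Weighted Bonferroni testing: each hypothesis gets its own threshold
# w_i * alpha / m. Weights averaging to 1 keep the family-wise error rate
# at alpha; up-weighting scientifically plausible hypotheses buys power
# there at the cost of power elsewhere.

def weighted_bonferroni(p_values, weights, alpha=0.05):
    m = len(p_values)
    mean_w = sum(weights) / m
    # Normalize so the weights average to 1, preserving error control.
    w = [wi / mean_w for wi in weights]
    return [p <= wi * alpha / m for p, wi in zip(p_values, w)]

p = [0.0004, 0.020, 0.0009, 0.5]
w = [4.0, 0.5, 0.5, 1.0]   # up-weight the first hypothesis a priori
print(weighted_bonferroni(p, w))   # → [True, False, True, False]
```

Note that with all weights equal to 1 the rule reduces to ordinary Bonferroni, which is one way to see the robustness to weight misspecification the abstract mentions.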
SLOPE - Adaptive variable selection via convex optimization
We introduce a new estimator for the vector of coefficients $\beta$ in the
linear model $y = X\beta + z$, where $X$ has dimensions $n \times p$ with $p$
possibly larger than $n$. SLOPE, short for Sorted L-One Penalized Estimation,
is the solution to
$$\min_{b \in \mathbb{R}^p} \; \tfrac{1}{2}\|y - Xb\|_{\ell_2}^2
  + \lambda_1 |b|_{(1)} + \lambda_2 |b|_{(2)} + \cdots + \lambda_p |b|_{(p)},$$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p \ge 0$ and
$|b|_{(1)} \ge |b|_{(2)} \ge \cdots \ge |b|_{(p)}$ are the
decreasing absolute values of the entries of $b$. This is a convex program and
we demonstrate a solution algorithm whose computational complexity is roughly
comparable to that of classical $\ell_1$ procedures such as the Lasso. Here,
the regularizer is a sorted $\ell_1$ norm, which penalizes the regression
coefficients according to their rank: the higher the rank - that is, stronger
the signal - the larger the penalty. This is similar to the Benjamini and
Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289-300] procedure (BH) which
compares more significant p-values with more stringent thresholds. One
notable choice of the sequence $\{\lambda_i\}$ is given by the BH critical
values $\lambda_{\mathrm{BH}}(i) = z(1 - i \cdot q/2p)$, where $q \in (0,1)$ and
$z(\alpha)$ is the $\alpha$ quantile of a standard normal distribution. SLOPE aims to
provide finite sample guarantees on the selected model; of special interest is
the false discovery rate (FDR), defined as the expected proportion of
irrelevant regressors among all selected predictors. Under orthogonal designs,
SLOPE with $\lambda_{\mathrm{BH}}$ provably controls FDR at level $q$.
Moreover, it also appears to have appreciable inferential properties under more
general designs while having substantial power, as demonstrated in a series
of experiments running on both simulated and real data.
Comment: Published at http://dx.doi.org/10.1214/15-AOAS842 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
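The BH-inspired $\lambda$ sequence and the sorted $\ell_1$ penalty can be sketched with the standard library alone; a real fit would of course use a dedicated SLOPE solver. The coefficient vector and the choices $p = 5$, $q = 0.10$ below are made-up inputs for illustration.

```python
# Sketch of the BH critical values lambda_BH(i) = z(1 - i*q/(2p)) and of the
# sorted-l1 penalty that defines SLOPE: the largest |b| is paired with the
# largest lambda, so stronger signals receive larger penalties.
from statistics import NormalDist

def bh_lambdas(p, q):
    """BH critical values for i = 1..p; a decreasing sequence."""
    z = NormalDist().inv_cdf          # quantile of the standard normal
    return [z(1 - i * q / (2 * p)) for i in range(1, p + 1)]

def sorted_l1_penalty(b, lambdas):
    """Sum of lambda_i * |b|_(i) over the decreasing absolute values of b."""
    mags = sorted((abs(x) for x in b), reverse=True)
    return sum(lam * mag for lam, mag in zip(lambdas, mags))

lam = bh_lambdas(p=5, q=0.10)
print([round(l, 3) for l in lam])     # strictly decreasing, as required
print(round(sorted_l1_penalty([0.2, -3.0, 1.1, 0.0, 0.4], lam), 3))
```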
False discovery rates in somatic mutation studies of cancer
The purpose of cancer genome sequencing studies is to determine the nature
and types of alterations present in a typical cancer and to discover genes
mutated at high frequencies. In this article we discuss statistical methods for
the analysis of somatic mutation frequency data generated in these studies. We
place special emphasis on a two-stage study design introduced by Sj\"{o}blom et
al. [Science 314 (2006) 268--274]. In this context, we describe and compare
statistical methods for constructing scores that can be used to prioritize
candidate genes for further investigation and to assess the statistical
significance of the candidates thus identified. Controversy has surrounded the
reliability of the false discovery rate estimates provided by the
approximations used in early cancer genome studies. To address these concerns, we
develop a semiparametric Bayesian model that provides an accurate fit to the
data. We use this model to generate a large collection of realistic scenarios,
and evaluate alternative approaches on this collection. Our assessment is
impartial, in that the model used for generating the data is not used by any of
the approaches compared, and objective, in that the scenarios are generated by
a model that fits the data. Our results quantify the conservative control of
the false discovery rate with the Benjamini and Hochberg method compared to the
empirical Bayes approach and the multiple testing method proposed in Storey [J.
R. Stat. Soc. Ser. B Stat. Methodol. 64 (2002) 479--498]. Simulation results
also show a negligible departure from the target false discovery rate for the
methodology used in Sj\"{o}blom et al. [Science 314 (2006) 268--274].
Comment: Published at http://dx.doi.org/10.1214/10-AOAS438 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
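Since the comparison above centers on the Benjamini-Hochberg method, a minimal sketch of the step-up procedure may help: sort the p-values, find the largest rank k with p_(k) <= k*q/m, and reject every hypothesis at or below that rank. The p-values below are invented for illustration.

```python
# Benjamini-Hochberg step-up procedure: controls the expected proportion of
# false discoveries among all rejections at level q (under independence,
# conservatively otherwise, as the simulation study above discusses).

def benjamini_hochberg(p_values, q=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k = rank                   # largest rank passing its threshold
    rejected = set(order[:k])          # step-up: reject everything below k
    return [i in rejected for i in range(m)]

p = [0.001, 0.008, 0.039, 0.041, 0.60]
print(benjamini_hochberg(p, q=0.05))   # → [True, True, False, False, False]
```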