Rodeo: Sparse Nonparametric Regression in High Dimensions
We present a greedy method for simultaneously performing local bandwidth
selection and variable selection in nonparametric regression. The method starts
with a local linear estimator with large bandwidths, and incrementally
decreases the bandwidth of variables for which the gradient of the estimator
with respect to bandwidth is large. The method--called rodeo (regularization of
derivative expectation operator)--conducts a sequence of hypothesis tests to
threshold derivatives, and is easy to implement. Under certain assumptions on
the regression function and sampling density, it is shown that the rodeo
applied to local linear smoothing avoids the curse of dimensionality, achieving
near-optimal minimax rates of convergence in the number of relevant variables,
as if these variables were isolated in advance.
A stochastic process approach to false discovery control
This paper extends the theory of false discovery rates (FDR) pioneered by
Benjamini and Hochberg [J. Roy. Statist. Soc. Ser. B 57 (1995) 289-300].
We develop a framework in which the False Discovery Proportion (FDP)--the
number of false rejections divided by the number of rejections--is treated as a
stochastic process. After obtaining the limiting distribution of the process,
we demonstrate the validity of a class of procedures for controlling the False
Discovery Rate (the expected FDP). We construct a confidence envelope for the
whole FDP process. From these envelopes we derive confidence thresholds, for
controlling the quantiles of the distribution of the FDP as well as controlling
the number of false discoveries. We also investigate methods for estimating the
p-value distribution.
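As a point of reference for the abstract above, the following is a minimal sketch of the classical Benjamini-Hochberg step-up procedure and of the FDP as defined in the abstract (false rejections divided by rejections). It is not the stochastic-process method of the paper; the function names, the example p-values, and the choice of which hypotheses are true nulls are all illustrative assumptions.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    # Sort hypothesis indices by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls under the BH line alpha * rank / m
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            k = rank
    # Reject the k smallest p-values (none if k == 0).
    return sorted(order[:k])


def false_discovery_proportion(rejected, true_nulls):
    """False rejections divided by rejections; defined as 0 with no rejections."""
    if not rejected:
        return 0.0
    false_rejections = sum(1 for i in rejected if i in true_nulls)
    return false_rejections / len(rejected)


# Hypothetical p-values; in practice the set of true nulls is unknown.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p, alpha=0.05)
fdp = false_discovery_proportion(rejected, true_nulls={1})
```

The FDP can only be computed here because the example pretends to know which nulls are true; the paper's confidence envelopes address the realistic case where it is unobserved.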
Generalized density clustering
We study generalized density-based clustering in which sharply defined
clusters such as clusters on lower-dimensional manifolds are allowed. We show
that accurate clustering is possible even in high dimensions. We propose two
data-based methods for choosing the bandwidth and we study the stability
properties of density clusters. We show that a simple graph-based algorithm
successfully approximates the high density clusters.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/)
by the Institute of Mathematical Statistics (http://www.imstat.org) at
http://dx.doi.org/10.1214/10-AOS797
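The idea of a graph-based approximation to high density clusters can be sketched as follows: estimate the density at each data point, keep the points whose estimate exceeds a level, link kept points that lie within a fixed radius, and read off the connected components. This is a simplified illustration, not necessarily the algorithm analysed in the paper; the Gaussian kernel, the parameters `h`, `level`, and `radius`, and the toy data are assumptions.

```python
import math


def kde(points, x, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    n, d = len(points), len(x)
    norm = n * (h ** d) * (2.0 * math.pi) ** (d / 2.0)
    total = sum(math.exp(-math.dist(x, p) ** 2 / (2.0 * h * h)) for p in points)
    return total / norm


def density_clusters(points, h, level, radius):
    """Map each high-density point index to a cluster label."""
    density = [kde(points, p, h) for p in points]
    keep = [i for i, value in enumerate(density) if value >= level]
    labels = {}
    cluster = 0
    for start in keep:
        if start in labels:
            continue
        # Depth-first search over the graph linking kept points within `radius`.
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in keep:
                if j not in labels and math.dist(points[i], points[j]) <= radius:
                    labels[j] = cluster
                    stack.append(j)
        cluster += 1
    return labels


# Toy data: two tight groups and one isolated low-density point.
data = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1), (20, 20)]
labels = density_clusters(data, h=0.5, level=0.15, radius=0.5)
```

Points below the density level receive no label, which is how the isolated point is excluded from every cluster.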
Genome-Wide Significance Levels and Weighted Hypothesis Testing
Genetic investigations often involve the testing of vast numbers of related
hypotheses simultaneously. To control the overall error rate, a substantial
penalty is required, making it difficult to detect signals of moderate
strength. To improve the power in this setting, a number of authors have
considered using weighted p-values, with the motivation often based upon the
scientific plausibility of the hypotheses. We review this literature, derive
optimal weights and show that the power is remarkably robust to
misspecification of these weights. We consider two methods for choosing weights
in practice. The first, external weighting, is based on prior information. The
second, estimated weighting, uses the data to choose weights.
Comment: Published in Statistical Science (http://www.imstat.org/sts/) by the
Institute of Mathematical Statistics (http://www.imstat.org) at
http://dx.doi.org/10.1214/09-STS289
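A common form of p-value weighting is the weighted Bonferroni rule: with non-negative weights that average to one, hypothesis i is rejected when p_i / w_i <= alpha / m. The sketch below illustrates that rule only; it does not implement the optimal or estimated weights discussed in the abstract, and the function name, weights, and p-values are hypothetical.

```python
def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Indices rejected when p_i <= alpha * w_i / m, with weights rescaled to mean 1."""
    m = len(p_values)
    if len(weights) != m:
        raise ValueError("need one weight per p-value")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    mean_weight = sum(weights) / m
    if mean_weight == 0:
        raise ValueError("at least one weight must be positive")
    # Rescaling to mean 1 keeps the total error budget at alpha.
    scaled = [w / mean_weight for w in weights]
    return [
        i
        for i, (p, w) in enumerate(zip(p_values, scaled))
        if w > 0 and p <= alpha * w / m
    ]


# Hypothetical example: up-weighting the first hypothesis lets it pass.
p = [0.02, 0.004, 0.3, 0.5]
unweighted = weighted_bonferroni(p, [1.0, 1.0, 1.0, 1.0])
weighted = weighted_bonferroni(p, [2.0, 1.0, 0.5, 0.5])
```

With equal weights only the second hypothesis clears the threshold 0.05 / 4; doubling the weight on the first hypothesis raises its threshold to 0.025, so it is rejected as well, at the cost of stricter thresholds elsewhere.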
High-dimensional variable selection
This paper explores the following question: what kind of statistical
guarantees can be given when doing variable selection in high-dimensional
models? In particular, we look at the error rates and power of some multi-stage
regression methods. In the first stage we fit a set of candidate models. In the
second stage we select one model by cross-validation. In the third stage we use
hypothesis testing to eliminate some variables. We refer to the first two
stages as "screening" and the last stage as "cleaning." We consider three
screening methods: the lasso, marginal regression, and forward stepwise
regression. Our method gives consistent variable selection under certain
conditions.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/)
by the Institute of Mathematical Statistics (http://www.imstat.org) at
http://dx.doi.org/10.1214/08-AOS646
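Of the three screening methods named above, marginal regression is the simplest to illustrate: each covariate is scored by its own association with the response, ignoring the others, and the top-scoring covariates form a candidate model. The sketch below covers only that screening step, under the assumption that a covariate's score is its absolute covariance with the response after standardising the column; the cross-validation and hypothesis-testing stages are not shown, and the function name and toy data are hypothetical.

```python
import math


def marginal_regression_screen(X, y, k):
    """Keep the k columns of X with the largest absolute marginal score."""
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    scores = []
    for j in range(p):
        column = [row[j] for row in X]
        c_mean = sum(column) / n
        sxy = sum((c - c_mean) * (v - y_mean) for c, v in zip(column, y))
        sxx = sum((c - c_mean) ** 2 for c in column)
        # Dividing by sqrt(sxx) standardises the column; constant columns score 0.
        scores.append(abs(sxy) / math.sqrt(sxx) if sxx > 0 else 0.0)
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data in which the response depends only on the first column.
X = [[1, 0, 1], [2, 1, 0], [3, 0, 1], [4, 1, 0], [5, 0, 0]]
y = [2, 4, 6, 8, 10]
top_one = marginal_regression_screen(X, y, k=1)
top_two = marginal_regression_screen(X, y, k=2)
```

In a multi-stage scheme of the kind the abstract describes, `k` would index the set of candidate models, with the choice among them left to the cross-validation stage.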