Exact calculations for false discovery proportion with application to least favorable configurations
In a context of multiple hypothesis testing, we provide several new exact
calculations related to the false discovery proportion (FDP) of step-up and
step-down procedures. For step-up procedures, we show that the number of
erroneous rejections conditionally on the rejection number is simply a binomial
variable, which leads to explicit computations of the c.d.f., the moments,
and the mean of the FDP, the latter corresponding to the false discovery
rate (FDR). For step-down procedures, we derive what is to our knowledge the
first explicit formula for the FDR valid for any alternative c.d.f. of the
p-values. We also derive explicit computations of the power for both step-up
and step-down procedures. These formulas are "explicit" in the sense that they
only involve the parameters of the model and the c.d.f. of the order statistics
of i.i.d. uniform variables. The p-values are assumed either independent or
coming from an equicorrelated multivariate normal model and an additional
mixture model for the true/false hypotheses is used. This new approach is used
to investigate new results which are of interest in their own right, related to
least/most favorable configurations for the FDR and the variance of the FDP.
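As a small illustrative sketch (the simulation parameters and function name are ours, not the paper's), the following Monte Carlo check recovers the classical exact identity FDR = pi0 * alpha for the Benjamini-Hochberg step-up procedure, in the independent-p-value mixture model described above:

```python
import numpy as np
from math import erfc, sqrt

def bh_stepup(pvals, alpha):
    """Step-up procedure: reject the k_hat smallest p-values, where
    k_hat = max{k : p_(k) <= alpha * k / m} (Benjamini-Hochberg)."""
    m = len(pvals)
    order = np.argsort(pvals)
    below = np.nonzero(pvals[order] <= alpha * np.arange(1, m + 1) / m)[0]
    k_hat = 0 if below.size == 0 else below[-1] + 1
    rejected = np.zeros(m, dtype=bool)
    rejected[order[:k_hat]] = True
    return rejected

rng = np.random.default_rng(0)
m, alpha, pi0, mu = 100, 0.2, 0.8, 3.0
fdps = []
for _ in range(2000):
    is_null = rng.random(m) < pi0                  # mixture model for true nulls
    z = rng.standard_normal(m) + mu * (~is_null)   # one-sided Gaussian shifts
    pvals = np.array([0.5 * erfc(v / sqrt(2.0)) for v in z])
    rej = bh_stepup(pvals, alpha)
    r = rej.sum()
    fdps.append((rej & is_null).sum() / max(r, 1))

mean_fdp = float(np.mean(fdps))  # Monte Carlo estimate of the FDR
# Under independence, the FDR of this step-up equals pi0 * alpha exactly.
```

The empirical mean of the FDP concentrates around pi0 * alpha = 0.16 here, matching the exact formula.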
On the false discovery proportion convergence under Gaussian equi-correlation
We study the convergence of the false discovery proportion (FDP) of the
Benjamini-Hochberg procedure in the Gaussian equi-correlated model, when the
correlation converges to zero as the number of hypotheses grows to
infinity. By contrast with the standard convergence rate holding
under independence, this study shows that the FDP converges to the false
discovery rate (FDR) at a rate governed by the decay of the correlation in this
equi-correlated model.
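The slower concentration of the FDP under equi-correlation can be observed in a small simulation (our own illustration, with a one-factor construction of the equi-correlated vector; the parameters are arbitrary):

```python
import numpy as np
from math import erfc, sqrt

def bh_fdp(z, is_null, alpha=0.2):
    """FDP of the Benjamini-Hochberg step-up applied to one-sided p-values."""
    p = np.array([0.5 * erfc(v / sqrt(2.0)) for v in z])
    m = len(p)
    order = np.argsort(p)
    below = np.nonzero(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    if below.size == 0:
        return 0.0
    k = below[-1] + 1
    return is_null[order[:k]].sum() / k

rng = np.random.default_rng(1)
m, reps, mu = 200, 500, 3.0
is_null = np.arange(m) >= m // 4      # 25% false null hypotheses

def fdp_sample(rho):
    # one-factor equi-correlated Gaussian vector: corr(z_i, z_j) = rho
    shared = sqrt(rho) * rng.standard_normal()
    z = shared + sqrt(1.0 - rho) * rng.standard_normal(m) + mu * (~is_null)
    return bh_fdp(z, is_null)

fdp_indep = np.array([fdp_sample(0.0) for _ in range(reps)])
fdp_equi = np.array([fdp_sample(0.5) for _ in range(reps)])
# The FDP fluctuates far more around its mean under equi-correlation,
# in line with the slower convergence toward the FDR.
```

With a fixed (non-vanishing) correlation as here, the extra variability of the FDP is plainly visible in the sample standard deviations.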
Two simple sufficient conditions for FDR control
We show that the control of the false discovery rate (FDR) for a multiple
testing procedure is implied by two coupled simple sufficient conditions. The
first one, which we call the "self-consistency condition", concerns the algorithm
itself, and the second, called the "dependency control condition", is related to
the dependency assumptions on the p-value family. Many standard multiple
testing procedures are self-consistent (e.g. step-up, step-down or step-up-down
procedures), and we prove that the dependency control condition can be
fulfilled when choosing correspondingly appropriate rejection functions, in
three classical types of dependency: independence, positive dependency (PRDS)
and unspecified dependency. As a consequence, we recover earlier results
through simple and unifying proofs, while extending their scope in several
respects: weighted FDR, p-value reweighting, a new family of step-up procedures
under unspecified p-value dependency, and adaptive step-up procedures. We give
additional examples of other possible applications. This framework also allows
for defining and studying FDR control for multiple testing procedures over a
continuous, uncountable space of hypotheses.
Published in the Electronic Journal of Statistics
(http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics
(http://www.imstat.org): http://dx.doi.org/10.1214/08-EJS180
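A minimal sketch (our own illustration, on a toy p-value vector) of a step-up family parametrized by a rejection function beta, recovering the Benjamini-Hochberg choice beta(r) = r under independence/PRDS and the more conservative Benjamini-Yekutieli choice beta(r) = r / (sum of 1/k) under unspecified dependency:

```python
import numpy as np

def step_up(pvals, alpha, beta):
    """Step-up with rejection function beta: reject the k_hat smallest
    p-values, where k_hat = max{k : p_(k) <= alpha * beta(k) / m}."""
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = alpha * np.array([beta(k) for k in range(1, m + 1)]) / m
    below = np.nonzero(pvals[order] <= thresholds)[0]
    k_hat = 0 if below.size == 0 else below[-1] + 1
    return set(order[:k_hat].tolist())

pvals = np.array([0.001, 0.008, 0.02, 0.04, 0.30, 0.55, 0.62, 0.70, 0.81, 0.94])
alpha, m = 0.1, len(pvals)
H_m = sum(1.0 / k for k in range(1, m + 1))          # harmonic number

bh = step_up(pvals, alpha, beta=lambda r: r)         # independence / PRDS
by = step_up(pvals, alpha, beta=lambda r: r / H_m)   # unspecified dependency
# The more conservative beta for unspecified dependency can only shrink
# the rejection set: by is a subset of bh.
```

On this toy vector the Benjamini-Hochberg step-up rejects the four smallest p-values, while the Benjamini-Yekutieli variant rejects only the smallest one.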
On false discovery rate thresholding for classification under sparsity
We study the properties of false discovery rate (FDR) thresholding, viewed as
a classification procedure. The "0"-class (null) is assumed to have a known
density while the "1"-class (alternative) is obtained from the "0"-class either
by translation or by scaling. Furthermore, the "1"-class is assumed to have a
small number of elements w.r.t. the "0"-class (sparsity). We focus on densities
of the Subbotin family, including Gaussian and Laplace models. Nonasymptotic
oracle inequalities are derived for the excess risk of FDR thresholding. These
inequalities lead to explicit rates of convergence of the excess risk to zero,
as the number m of items to be classified tends to infinity and in a regime
where the power of the Bayes rule is away from 0 and 1. Moreover, these
theoretical investigations suggest an explicit choice for the target level
of FDR thresholding, as a function of m. Our oracle inequalities
show theoretically that the resulting FDR thresholding adapts to the unknown
sparsity regime contained in the data. This property is illustrated with
numerical experiments.
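The adaptivity to the unknown sparsity can be illustrated with a small sketch (our own simulation under the translation model; parameters arbitrary): the same FDR-thresholding rule, with no knowledge of the sparsity level, labels more items as "1" when the "1"-class is denser, while keeping the misclassification proportion small.

```python
import numpy as np
from math import erfc, sqrt

def fdr_threshold_classify(z, alpha):
    """Label item i as '1' iff its p-value is rejected by the
    Benjamini-Hochberg step-up at level alpha."""
    p = np.array([0.5 * erfc(v / sqrt(2.0)) for v in z])
    m = len(p)
    order = np.argsort(p)
    below = np.nonzero(p[order] <= alpha * np.arange(1, m + 1) / m)[0]
    k = 0 if below.size == 0 else below[-1] + 1
    labels = np.zeros(m, dtype=bool)
    labels[order[:k]] = True
    return labels

rng = np.random.default_rng(2)
m, alpha, mu = 2000, 0.1, 4.0
results = {}
for frac_ones in (0.01, 0.10):                 # two sparsity regimes
    truth = rng.random(m) < frac_ones          # sparse '1'-class
    z = rng.standard_normal(m) + mu * truth    # translation model
    labels = fdr_threshold_classify(z, alpha)
    results[frac_ones] = (int(labels.sum()), float(np.mean(labels != truth)))
# The data-driven threshold adapts: more items are labeled '1' in the
# denser regime, at a small misclassification risk in both regimes.
```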
Some nonasymptotic results on resampling in high dimension, I: Confidence regions, II: Multiple tests
We study generalized bootstrap confidence regions for the mean of a random
vector whose coordinates have an unknown dependency structure. The random
vector is supposed to be either Gaussian or to have a symmetric and bounded
distribution. The dimensionality of the vector can possibly be much larger than
the number of observations and we focus on a nonasymptotic control of the
confidence level, following ideas inspired by recent results in learning
theory. We consider two approaches, the first based on a concentration
principle (valid for a large class of resampling weights) and the second on a
resampled quantile, specifically using Rademacher weights. Several intermediate
results established in the approach based on concentration principles are of
interest in their own right. We also discuss the question of accuracy when
using Monte Carlo approximations of the resampled quantities.
Published in the Annals of Statistics (http://www.imstat.org/aos/) by the
Institute of Mathematical Statistics (http://www.imstat.org):
http://dx.doi.org/10.1214/08-AOS667; http://dx.doi.org/10.1214/08-AOS668
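A rough sketch of the resampled-quantile idea with Rademacher weights (our own simplified illustration under Gaussian data with a sup-norm statistic; the remainder terms of the nonasymptotic analysis are ignored here):

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, B = 30, 50, 200   # few observations, higher dimension, resamples

def rademacher_sup_quantile(X, level):
    """Quantile of the sup-norm of Rademacher-symmetrized centered means,
    used as a data-driven critical value for the mean vector."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                      # center at the empirical mean
    stats = []
    for _ in range(B):
        eps = rng.choice([-1.0, 1.0], size=n)    # Rademacher weights
        stats.append(np.abs((eps[:, None] * Xc).mean(axis=0)).max())
    return float(np.quantile(stats, level))

reps, hits = 100, 0
for _ in range(reps):
    X = rng.standard_normal((n, K))   # true mean 0, dependence unknown a priori
    t = rademacher_sup_quantile(X, 0.90)
    hits += np.abs(X.mean(axis=0)).max() <= t
coverage = hits / reps
# The empirical coverage of the sup-norm confidence region should be
# close to the nominal 90% level, despite K being larger than n.
```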
Continuous testing for Poisson process intensities: A new perspective on scanning statistics
We propose a novel continuous testing framework to test the intensities of
Poisson Processes. This framework allows a rigorous definition of the complete
testing procedure, from an infinite number of hypotheses to joint error rates.
Our work extends traditional procedures based on scanning windows, by
controlling the family-wise error rate and the false discovery rate in a
non-asymptotic manner and in a continuous way. The decision rule is based on a
p-value process that can be estimated by a Monte Carlo procedure. We also
propose new test statistics based on kernels. Our method is applied in
neuroscience and genomics through the standard test of homogeneity and the
two-sample test.
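A toy sketch of the Monte Carlo p-value idea for the homogeneity test (our own simplified version on [0, 1], using a sliding-window scan statistic rather than the paper's kernel statistics; parameters arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

def scan_stat(points, window=0.1, grid=100):
    """Maximum number of points falling in a sliding window on [0, 1]."""
    starts = np.linspace(0.0, 1.0 - window, grid)
    return max(int(((points >= s) & (points < s + window)).sum()) for s in starts)

def homogeneity_pvalue(points, n_sim=300):
    """Monte Carlo p-value for H0: homogeneous Poisson process on [0, 1],
    conditioning on the observed number of points."""
    obs = scan_stat(points)
    n = len(points)
    count = sum(scan_stat(rng.random(n)) >= obs for _ in range(n_sim))
    return (1 + count) / (n_sim + 1)

p_null = homogeneity_pvalue(rng.random(80))        # homogeneous sample
bump = np.concatenate([rng.random(80), 0.4 + 0.1 * rng.random(30)])
p_bump = homogeneity_pvalue(bump)                  # localized intensity bump
# The bump concentrates 30 extra points in one window and should yield
# a very small Monte Carlo p-value, unlike the homogeneous sample.
```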
On empirical distribution function of high-dimensional Gaussian vector components with an application to multiple testing
This paper introduces a new framework to study the asymptotic behavior of the empirical distribution function (e.d.f.) of Gaussian vector components whose correlation matrix is dimension-dependent. Hence, by contrast with the existing literature, the vector is not assumed to be stationary. Rather, we make a "vanishing second order" assumption ensuring that the covariance matrix is not too far from the identity matrix, while the behavior of the e.d.f. is affected by the correlations only through a summary sequence, as the dimension grows to infinity. This result recovers some of the previous results for stationary long-range dependencies, while it also applies to various high-dimensional, non-stationary frameworks for which the most correlated variables are not necessarily next to each other. Finally, we present an application of this work to the multiple testing problem, which was the initial statistical motivation for developing such a methodology.
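A minimal numerical sketch of the phenomenon (our own illustration with a one-factor equi-correlated vector, a special case of weak dependence): when the correlation is small, the e.d.f. of the components stays uniformly close to the standard normal c.d.f.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(5)
m, rho_m = 5000, 0.01        # dimension and (weak) equi-correlation

# One-factor construction of an equi-correlated Gaussian vector.
shared = sqrt(rho_m) * rng.standard_normal()
z = shared + sqrt(1.0 - rho_m) * rng.standard_normal(m)

def Phi(t):
    # standard normal c.d.f.
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))

grid = np.linspace(-3.0, 3.0, 61)
sup_gap = max(abs(float(np.mean(z <= t)) - Phi(t)) for t in grid)
# With weak correlation, the e.d.f. of the components is uniformly
# close to the standard normal c.d.f. over the grid.
```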
Contributions to multiple testing theory for high-dimensional data
Rapporteurs: Yoav Benjamini, Stéphane Robin, Larry Wasserman, Michael Wolf.

This manuscript provides a mathematical study of the multiple testing problem in settings motivated by modern applications, for which the number of variables is much larger than the sample size. As we will see, this problem highly depends on the nature of the data and on the desired interpretation of the results.

Chapter 1 is a wide introduction to the multiple testing theme, intended to be accessible to a possibly non-specialist reader, and includes a presentation of some high-dimensional genomic data. The necessary probabilistic material is then introduced in Chapter 2, while Chapters 3, 4, 5 and 6 are guided by the findings [P2, P3, P4, P5, P6, P7, P8, P9, P10, P11, P12, P15, P16] listed on page 81. Nevertheless, compared to the original papers, I have tried to simplify and unify the studies as much as possible. An effort has also been made to present self-contained proofs when possible. Let us also mention that, as an upstream work, I have proposed a survey paper in [P13]. Nevertheless, the overlap between that work and the present manuscript turns out to be only minor.

The publications [P1, P2, P3, P6, P7, P14] correspond to research (essentially) carried out during my PhD period at Paris-Sud University and INRA. The papers [P15, P11] are related to my postdoctoral position at the VU University Amsterdam. I have elaborated the work [P4, P5, P8, P9, P10, P12, P13, P16] afterwards, as a maître de conférences at the Pierre et Marie Curie University in Paris.

Throughout this manuscript, we will see that while the multiple testing problem occurs in various practical and concrete situations, it relies on an astonishingly wide variety of theoretical concepts, such as combinatorics, resampling, empirical processes, concentration inequalities, and positive dependence, among others. This symbiosis between theory and practice explains the worldwide success of the multiple testing research field, which has become a prominent research area of contemporary statistics.