Asymptotic properties of false discovery rate controlling procedures under independence
We investigate the performance of a family of multiple comparison procedures
for strong control of the False Discovery Rate (FDR). The FDR
is the expected False Discovery Proportion (FDP),
that is, the expected fraction of false rejections among all rejected
hypotheses. A number of refinements to the original Benjamini-Hochberg
procedure [1] have been proposed, to increase power by estimating the
proportion of true null hypotheses, either implicitly, leading to one-stage
adaptive procedures [4, 7], or explicitly, leading to two-stage adaptive (or
plug-in) procedures [2, 21]. We use a variant of the stochastic process
approach proposed by Genovese and Wasserman [11] to study the fluctuations of
the FDP achieved with each of these procedures around its
expectation, for independent tested hypotheses. We introduce a framework for
the derivation of generic Central Limit Theorems for the FDP of
these procedures, characterizing the associated regularity conditions, and
comparing the asymptotic power of the various procedures. We interpret recently
proposed one-stage adaptive procedures [4, 7] as fixed points in the iteration
of well-known two-stage adaptive procedures [2, 21]. Published in the Electronic
Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of
Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/08-EJS207.
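As a concrete point of reference for the procedures compared above, here is a minimal sketch (not the authors' code) of the Benjamini-Hochberg step-up procedure and of a two-stage plug-in variant; the Storey-type estimator with lam = 0.5 is an illustrative assumption, not the paper's choice.

import numpy as np

def bh_rejections(pvalues, alpha):
    """Benjamini-Hochberg step-up procedure at target level alpha."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    # largest k such that p_(k) <= alpha * k / m (k = 0 if none)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimator of the proportion pi0 of true nulls
    (bias-corrected form, which also avoids a zero estimate)."""
    p = np.asarray(pvalues)
    return min(1.0, (1.0 + np.sum(p > lam)) / (len(p) * (1.0 - lam)))

def plugin_bh(pvalues, alpha):
    """Two-stage (plug-in) procedure: BH at the inflated level alpha / pi0_hat."""
    return bh_rejections(pvalues, alpha / storey_pi0(pvalues))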
On false discovery rate thresholding for classification under sparsity
We study the properties of false discovery rate (FDR) thresholding, viewed as
a classification procedure. The "0"-class (null) is assumed to have a known
density while the "1"-class (alternative) is obtained from the "0"-class either
by translation or by scaling. Furthermore, the "1"-class is assumed to have a
small number of elements w.r.t. the "0"-class (sparsity). We focus on densities
of the Subbotin family, including Gaussian and Laplace models. Nonasymptotic
oracle inequalities are derived for the excess risk of FDR thresholding. These
inequalities lead to explicit rates of convergence of the excess risk to zero,
as the number m of items to be classified tends to infinity and in a regime
where the power of the Bayes rule stays bounded away from 0 and 1. Moreover, these
theoretical investigations suggest an explicit choice for the target level
α_m of FDR thresholding, as a function of m. Our oracle inequalities
show theoretically that the resulting FDR thresholding adapts to the unknown
sparsity regime contained in the data. This property is illustrated with
numerical experiments.
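For reference, the Subbotin (generalized Gaussian) family mentioned above is, in one standard parameterization (the paper may use an equivalent one), the family of densities indexed by a shape parameter γ ≥ 1, with γ = 2 recovering the Gaussian model and γ = 1 the Laplace model:

$$ f_\gamma(x) \;=\; \frac{\exp\!\left(-|x|^{\gamma}/\gamma\right)}{2\,\gamma^{1/\gamma-1}\,\Gamma(1/\gamma)}, \qquad x \in \mathbb{R}. $$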
Performance evaluation of DNA copy number segmentation methods
A number of bioinformatic or biostatistical methods are available for
analyzing DNA copy number profiles measured from microarray or sequencing
technologies. In the absence of rich enough gold standard data sets, the
performance of these methods is generally assessed using unrealistic simulation
studies, or based on small real data analyses. We have designed and implemented
a framework to generate realistic DNA copy number profiles of cancer samples
with known truth. These profiles are generated by resampling real SNP
microarray data from genomic regions with known copy-number state. The original
real data have been extracted from dilution series of tumor cell lines with
matched blood samples at several concentrations. Therefore, the signal-to-noise
ratio of the generated profiles can be controlled through the (known)
percentage of tumor cells in the sample. In this paper, we describe this
framework and illustrate some of the benefits of the proposed data generation
approach on a practical use case: a comparison study between methods for
segmenting DNA copy number profiles from SNP microarrays. This study indicates
that no single method is uniformly better than all others. It also helps
identify the pros and cons of the compared methods as a function of
biologically informative parameters, such as the fraction of tumor cells in the
sample and the proportion of heterozygous markers. Availability: R package
jointSeg: http://r-forge.r-project.org/R/?group_id=156
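The data generation idea lends itself to a compact description in code. The sketch below is schematic and assumes a hypothetical table layout (columns 'state', 'tumor_fraction', 'signal'); it is not the jointSeg implementation.

import numpy as np
import pandas as pd

def generate_profile(annotated, segments, tumor_fraction, seed=0):
    """Concatenate probe-level signals resampled from real regions of known
    copy-number state, at a chosen (known) tumor-cell fraction.

    annotated: pd.DataFrame with columns 'state', 'tumor_fraction', 'signal'
    segments:  list of (state, length) pairs describing the target profile
    """
    rng = np.random.default_rng(seed)
    pieces = []
    for state, length in segments:
        pool = annotated.loc[
            (annotated["state"] == state)
            & (annotated["tumor_fraction"] == tumor_fraction),
            "signal",
        ].to_numpy()
        # resampling real data preserves the empirical noise distribution;
        # the tumor fraction controls the signal-to-noise ratio
        pieces.append(rng.choice(pool, size=length, replace=True))
    return np.concatenate(pieces)

# e.g. normal / gain / normal, at 70% tumor cells (hypothetical state labels):
# profile = generate_profile(annotated, [("(1,1)", 500), ("(1,2)", 200), ("(1,1)", 500)], 0.7)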
Gains in Power from Structured Two-Sample Tests of Means on Graphs
We consider multivariate two-sample tests of means, where the location shift
between the two populations is expected to be related to a known graph
structure. An important application of such tests is the detection of
differentially expressed genes between two patient populations, as shifts in
expression levels are expected to be coherent with the structure of graphs
reflecting gene properties such as biological process, molecular function,
regulation, or metabolism. For a fixed graph of interest, we demonstrate that
accounting for graph structure can yield more powerful tests under the
assumption of smooth distribution shift on the graph. We also investigate the
identification of non-homogeneous subgraphs of a given large graph, which poses
both computational and multiple testing problems. The relevance and benefits of
the proposed approach are illustrated on synthetic data and on breast cancer
gene expression data analyzed in the context of KEGG pathways.
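To make the smoothness idea concrete, here is a minimal sketch, under assumptions, of one way to exploit a smooth shift: project both samples onto the k smoothest eigenvectors of the graph Laplacian, where coherent shifts concentrate, then run a Hotelling T² test in the reduced space. The number of components k is a tuning parameter, and this illustrates the general idea rather than the paper's exact procedure.

import numpy as np
from scipy.linalg import eigh
from scipy.stats import f as f_dist

def graph_t2_test(X, Y, L, k):
    """X: (n1, p) and Y: (n2, p) samples; L: (p, p) graph Laplacian."""
    _, U = eigh(L, subset_by_index=[0, k - 1])  # k smoothest eigenvectors
    Xk, Yk = X @ U, Y @ U                       # filtered data in R^k
    n1, n2 = len(Xk), len(Yk)
    diff = Xk.mean(axis=0) - Yk.mean(axis=0)
    # pooled covariance in the k-dimensional filtered space
    S = ((n1 - 1) * np.cov(Xk, rowvar=False)
         + (n2 - 1) * np.cov(Yk, rowvar=False)) / (n1 + n2 - 2)
    t2 = (n1 * n2 / (n1 + n2)) * diff @ np.linalg.solve(S, diff)
    # under Gaussian assumptions, a rescaled T^2 follows an F distribution
    fstat = t2 * (n1 + n2 - k - 1) / (k * (n1 + n2 - 2))
    return fstat, f_dist.sf(fstat, k, n1 + n2 - k - 1)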
Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators
The False Discovery Rate (FDR) is a commonly used type I error rate in
multiple testing problems. It is defined as the expected False Discovery
Proportion (FDP), that is, the expected fraction of false positives among
rejected hypotheses. When the hypotheses are independent, the
Benjamini-Hochberg procedure achieves FDR control at any pre-specified level.
By construction, FDR control offers no guarantee in terms of power, or type II
error. A number of alternative procedures have been developed, including
plug-in procedures that aim at gaining power by incorporating an estimate of
the proportion of true null hypotheses. In this paper, we study the asymptotic
behavior of a class of plug-in procedures based on kernel estimators of the
density of the p-values, as the number m of tested hypotheses grows to
infinity. In a setting where the hypotheses tested are independent, we prove
that these procedures are asymptotically more powerful in two respects: (i) a
tighter asymptotic FDR control for any target FDR level and (ii) a broader
range of target levels yielding positive asymptotic power. We also show that
this increased asymptotic power comes at the price of slower, non-parametric
convergence rates for the FDP. These rates are of the form m^{-δ},
where δ is determined by the regularity of the density of the p-value
distribution, or, equivalently, of the test statistics distribution. These
results are applied to one- and two-sided test statistics for Gaussian and
Laplace location models, and for the Student model.
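As an illustration of the plug-in idea studied here, the sketch below estimates π0 by a kernel density estimate of the p-value density at 1 (reflecting the sample about 1 to limit boundary bias) and reuses bh_rejections from the first sketch above; the Gaussian kernel and the reflection device are illustrative assumptions, not the estimator analyzed in the paper.

import numpy as np
from scipy.stats import gaussian_kde

def pi0_kernel(pvalues):
    """Estimate pi0 by the p-value density at 1: true nulls have a uniform
    p-value density, so the density near 1 is close to pi0 for standard
    one-sided models."""
    p = np.asarray(pvalues)
    # reflect about 1: the combined sample has density (f(x) + f(2 - x)) / 2,
    # which equals f(1) at the boundary point x = 1
    reflected = np.concatenate([p, 2.0 - p])
    return min(1.0, float(gaussian_kde(reflected)(1.0)[0]))

def kernel_plugin_bh(pvalues, alpha):
    """Plug-in BH procedure at the inflated level alpha / pi0_hat."""
    return bh_rejections(pvalues, alpha / pi0_kernel(pvalues))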
On agnostic post hoc approaches to false positive control
This document is a book chapter giving a partial survey of post hoc approaches to false positive control.
Selective inference after convex clustering with ℓ1 penalization
Classical inference methods notoriously fail when applied to data-driven test
hypotheses or inference targets. Instead, dedicated methodologies are required
to obtain statistical guarantees for these selective inference problems.
Selective inference is particularly relevant post-clustering, typically when
testing a difference in mean between two clusters. In this paper, we address
convex clustering with ℓ1 penalization, by leveraging related selective
inference tools for regression, based on Gaussian vectors conditioned to
polyhedral sets. In the one-dimensional case, we prove a polyhedral
characterization of the event of obtaining given clusters, which enables us to propose a test
procedure with statistical guarantees. This characterization also allows us to
provide a computationally efficient regularization path algorithm. Then, we
extend the above test procedure and guarantees to multi-dimensional clustering
with ℓ1 penalization, and also to more general multi-dimensional
clusterings that aggregate one-dimensional ones. With various numerical
experiments, we validate our statistical guarantees and we demonstrate the
power of our methods to detect differences in mean between clusters. Our
methods are implemented in the R package poclin.
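The key tool referred to above, conditioning a Gaussian vector on a polyhedral selection event, admits a compact generic form in the style of Lee et al.: conditionally on {Ay ≤ b}, the statistic η'y is a Gaussian truncated to an explicitly computable interval. The sketch below illustrates this generic tool under an isotropic covariance assumption; it is not the poclin implementation.

import numpy as np
from scipy.stats import norm

def selective_pvalue(y, A, b, eta, sigma2=1.0):
    """Two-sided selective p-value for H0: eta'mu = 0 given {A y <= b},
    assuming y ~ N(mu, sigma2 * I) (isotropic covariance: an assumption)."""
    c = eta / (eta @ eta)          # direction carrying eta'y (Sigma = sigma2*I)
    z = y - c * (eta @ y)          # component of y independent of eta'y
    Ac, Az = A @ c, A @ z
    # A(z + c*t) <= b constrains t = eta'y to the interval [vminus, vplus]
    lower = (b - Az)[Ac < 0] / Ac[Ac < 0]
    upper = (b - Az)[Ac > 0] / Ac[Ac > 0]
    vminus = lower.max() if lower.size else -np.inf
    vplus = upper.min() if upper.size else np.inf
    sd = np.sqrt(sigma2 * (eta @ eta))
    cdf = lambda t: norm.cdf(t / sd)
    # CDF of the truncated Gaussian, evaluated at the observed eta'y
    u = (cdf(eta @ y) - cdf(vminus)) / (cdf(vplus) - cdf(vminus))
    return 2.0 * min(u, 1.0 - u)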
FDP control in multivariate linear models using the bootstrap
In this article we develop a method for performing post hoc inference of the
False Discovery Proportion (FDP) over multiple contrasts of interest in the
multivariate linear model. To do so we use the bootstrap to simulate from the
distribution of the null contrasts. We combine the bootstrap with the post hoc
inference bounds of Blanchard (2020) and prove that doing so provides
simultaneous asymptotic control of the FDP over all subsets of hypotheses. This
requires us to demonstrate consistency of the multivariate bootstrap in the
linear model, which we do via the Lindeberg Central Limit Theorem, providing a
simpler proof of this result than that of Eck (2018). We demonstrate, via
simulations, that our approach provides simultaneous control of the FDP over
all subsets and is typically more powerful than existing state-of-the-art
parametric methods. We illustrate our approach on functional Magnetic Resonance
Imaging data from the Human Connectome project and on a transcriptomic dataset
of chronic obstructive pulmonary disease.
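Schematically, the recipe combines two ingredients: bootstrap samples of null p-values calibrate a Simes-type threshold family, and an interpolation bound then controls the number of false positives in any selected subset. The sketch below illustrates this general recipe under simplifying assumptions; it is not the paper's exact algorithm.

import numpy as np

def calibrate_lambda(null_pvalues_boot, alpha):
    """null_pvalues_boot: (B, m) array of p-values simulated under the null
    (e.g. by bootstrapping null contrasts). Calibrates thresholds t_k = lam*k/m
    so that P(exists k: k-th smallest null p-value < t_k) <= alpha."""
    B, m = null_pvalues_boot.shape
    ranks = np.arange(1, m + 1)
    # per bootstrap sample: the largest lam keeping p_(k) >= lam*k/m for all k
    stats = np.min(np.sort(null_pvalues_boot, axis=1) * m / ranks, axis=1)
    return np.quantile(stats, alpha)

def fdp_bound(p_selected, lam, m):
    """Post hoc upper bound on the FDP of an arbitrary selected set S,
    via the interpolation bound min_k (#{i in S: p_i >= t_k} + k - 1)."""
    p = np.asarray(p_selected)
    k = np.arange(1, m + 1)
    thresholds = lam * k / m
    v = np.min(np.sum(p[None, :] >= thresholds[:, None], axis=1) + k - 1)
    return min(v, len(p)) / max(len(p), 1)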
Post-clustering Inference under Dependency
Recent work by Gao et al. has laid the foundations for post-clustering
inference. For the first time, the authors established a theoretical framework
that allows testing for differences between the means of estimated clusters.
Additionally, they studied the estimation of unknown parameters while
controlling the selective type I error. However, their theory was developed for
independent observations identically distributed as p-dimensional Gaussian
variables with a spherical covariance matrix. Here, we aim to extend this
framework to a scenario better suited to practical applications, where
arbitrary dependence structures between observations and features are allowed.
We show that a p-value for post-clustering inference under general dependency
can be defined, and we assess the theoretical conditions allowing the
compatible estimation of a covariance matrix. The theory is developed for
hierarchical agglomerative clustering algorithms with several types of
linkages, and for the k-means algorithm. We illustrate our method with
synthetic data and real data of protein structures.
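Schematically, in this line of work the selective p-value for comparing the means of two estimated clusters C1 and C2 conditions on the event that the clustering algorithm C actually returned them, i.e. (a deliberately simplified statement; the paper's conditioning set and test statistic differ in their details):

$$ p \;=\; \mathbb{P}_{H_0}\!\left( \lVert \bar{X}_{\mathcal{C}_1} - \bar{X}_{\mathcal{C}_2} \rVert \,\ge\, \lVert \bar{x}_{\mathcal{C}_1} - \bar{x}_{\mathcal{C}_2} \rVert \;\middle|\; \mathcal{C}_1, \mathcal{C}_2 \in \mathcal{C}(X) \right). $$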