    Asymptotic properties of false discovery rate controlling procedures under independence

    We investigate the performance of a family of multiple comparison procedures for strong control of the False Discovery Rate (FDR). The FDR is the expected False Discovery Proportion (FDP), that is, the expected fraction of false rejections among all rejected hypotheses. A number of refinements to the original Benjamini-Hochberg procedure [1] have been proposed to increase power by estimating the proportion of true null hypotheses, either implicitly, leading to one-stage adaptive procedures [4, 7], or explicitly, leading to two-stage adaptive (or plug-in) procedures [2, 21]. We use a variant of the stochastic process approach proposed by Genovese and Wasserman [11] to study the fluctuations of the FDP achieved with each of these procedures around its expectation, for independent tested hypotheses. We introduce a framework for the derivation of generic Central Limit Theorems for the FDP of these procedures, characterizing the associated regularity conditions and comparing the asymptotic power of the various procedures. We interpret recently proposed one-stage adaptive procedures [4, 7] as fixed points in the iteration of well-known two-stage adaptive procedures [2, 21]. Comment: Published at http://dx.doi.org/10.1214/08-EJS207 in the Electronic Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of Mathematical Statistics (http://www.imstat.org).
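
    As a concrete anchor for the terminology above, here is a minimal sketch of the Benjamini-Hochberg step-up procedure together with a Storey-type two-stage (plug-in) variant. The function names and the choice lambda = 1/2 are illustrative assumptions, not code from the paper.

        import numpy as np

        def bh_rejections(pvals, alpha):
            """Indices rejected by the Benjamini-Hochberg step-up procedure at level alpha."""
            pvals = np.asarray(pvals)
            m = len(pvals)
            order = np.argsort(pvals)
            # largest k such that p_(k) <= alpha * k / m
            below = pvals[order] <= alpha * np.arange(1, m + 1) / m
            if not below.any():
                return np.array([], dtype=int)
            k = np.max(np.nonzero(below)[0]) + 1
            return order[:k]

        def plugin_rejections(pvals, alpha, lam=0.5):
            """Two-stage variant: estimate the proportion pi0 of true nulls, then run BH at alpha / pi0_hat."""
            pvals = np.asarray(pvals)
            m = len(pvals)
            pi0_hat = min(1.0, (1 + np.sum(pvals > lam)) / (m * (1 - lam)))
            return bh_rejections(pvals, alpha / pi0_hat)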

    On false discovery rate thresholding for classification under sparsity

    We study the properties of false discovery rate (FDR) thresholding, viewed as a classification procedure. The "0"-class (null) is assumed to have a known density, while the "1"-class (alternative) is obtained from the "0"-class either by translation or by scaling. Furthermore, the "1"-class is assumed to have a small number of elements w.r.t. the "0"-class (sparsity). We focus on densities of the Subbotin family, including Gaussian and Laplace models. Nonasymptotic oracle inequalities are derived for the excess risk of FDR thresholding. These inequalities lead to explicit rates of convergence of the excess risk to zero, as the number m of items to be classified tends to infinity and in a regime where the power of the Bayes rule stays away from 0 and 1. Moreover, these theoretical investigations suggest an explicit choice for the target level α_m of FDR thresholding, as a function of m. Our oracle inequalities show theoretically that the resulting FDR thresholding adapts to the unknown sparsity regime contained in the data. This property is illustrated with numerical experiments.
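
    For reference, the Subbotin family mentioned above consists (in a common parametrization; the paper's may differ) of the densities

        f_\zeta(x) = \frac{\exp(-|x|^\zeta / \zeta)}{2\,\Gamma(1/\zeta)\,\zeta^{1/\zeta - 1}}, \qquad \zeta \ge 1,

    which recover the Laplace model at \zeta = 1 and the Gaussian model at \zeta = 2.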

    Performance evaluation of DNA copy number segmentation methods

    A number of bioinformatic or biostatistical methods are available for analyzing DNA copy number profiles measured with microarray or sequencing technologies. In the absence of rich enough gold-standard data sets, the performance of these methods is generally assessed using unrealistic simulation studies, or based on small real-data analyses. We have designed and implemented a framework to generate realistic DNA copy number profiles of cancer samples with known truth. These profiles are generated by resampling real SNP microarray data from genomic regions with known copy-number state. The original real data have been extracted from dilution series of tumor cell lines with matched blood samples at several concentrations. Therefore, the signal-to-noise ratio of the generated profiles can be controlled through the (known) percentage of tumor cells in the sample. In this paper, we describe this framework and illustrate some of the benefits of the proposed data generation approach on a practical use case: a comparison study between methods for segmenting DNA copy number profiles from SNP microarrays. This study indicates that no single method is uniformly better than all the others. It also helps identify pros and cons of the compared methods as a function of biologically informative parameters, such as the fraction of tumor cells in the sample and the proportion of heterozygous markers. Availability: R package jointSeg: http://r-forge.r-project.org/R/?group_id=156
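
    A minimal sketch of the resampling idea described above, under assumed data structures (a pool of real probe-level signals per known copy-number state); the authors' actual implementation lives in the jointSeg package, and the tumor-cell fraction enters through which dilution the pools are drawn from.

        import numpy as np

        def simulate_profile(pools, states, seg_len, rng=None):
            """Concatenate contiguous stretches of real probe signals, one per requested state.

            pools: dict mapping copy-number state -> 1-d array of real probe signals
            states: sequence of state labels defining the simulated segments
            """
            if rng is None:
                rng = np.random.default_rng()
            segments = []
            for state in states:
                pool = pools[state]
                start = rng.integers(0, len(pool) - seg_len)
                segments.append(pool[start:start + seg_len])  # keeps the real local noise structure
            return np.concatenate(segments)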

    Gains in Power from Structured Two-Sample Tests of Means on Graphs

    We consider multivariate two-sample tests of means, where the location shift between the two populations is expected to be related to a known graph structure. An important application of such tests is the detection of differentially expressed genes between two patient populations, as shifts in expression levels are expected to be coherent with the structure of graphs reflecting gene properties such as biological process, molecular function, regulation, or metabolism. For a fixed graph of interest, we demonstrate that accounting for graph structure can yield more powerful tests under the assumption of a smooth distribution shift on the graph. We also investigate the identification of non-homogeneous subgraphs of a given large graph, which poses both computational and multiple testing problems. The relevance and benefits of the proposed approach are illustrated on synthetic data and on breast cancer gene expression data analyzed in the context of KEGG pathways.
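
    One plausible instantiation of the idea above (a hedged sketch loosely following the abstract, not necessarily the paper's exact test): project both samples onto the k smoothest eigenvectors of the known graph's Laplacian, then run a standard Hotelling T^2 test in that low-frequency subspace, where a smooth shift concentrates.

        import numpy as np

        def graph_structured_t2(X, Y, L, k):
            """Hotelling T^2 after projecting the samples X (n_x, p) and Y (n_y, p)
            onto the k smoothest eigenvectors of the graph Laplacian L (p, p)."""
            _, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order
            U = eigvecs[:, :k]                # low-frequency (smooth) basis
            Xk, Yk = X @ U, Y @ U
            nx, ny = len(Xk), len(Yk)
            diff = Xk.mean(axis=0) - Yk.mean(axis=0)
            S = ((nx - 1) * np.cov(Xk, rowvar=False) +
                 (ny - 1) * np.cov(Yk, rowvar=False)) / (nx + ny - 2)
            return (nx * ny) / (nx + ny) * diff @ np.linalg.solve(S, diff)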

    Asymptotic Results on Adaptive False Discovery Rate Controlling Procedures Based on Kernel Estimators

    The False Discovery Rate (FDR) is a commonly used type I error rate in multiple testing problems. It is defined as the expected False Discovery Proportion (FDP), that is, the expected fraction of false positives among rejected hypotheses. When the hypotheses are independent, the Benjamini-Hochberg procedure achieves FDR control at any pre-specified level. By construction, FDR control offers no guarantee in terms of power, or type II error. A number of alternative procedures have been developed, including plug-in procedures that aim at gaining power by incorporating an estimate of the proportion of true null hypotheses. In this paper, we study the asymptotic behavior of a class of plug-in procedures based on kernel estimators of the density of the p-values, as the number m of tested hypotheses grows to infinity. In a setting where the hypotheses tested are independent, we prove that these procedures are asymptotically more powerful in two respects: (i) a tighter asymptotic FDR control for any target FDR level and (ii) a broader range of target levels yielding positive asymptotic power. We also show that this increased asymptotic power comes at the price of slower, non-parametric convergence rates for the FDP. These rates are of the form m^{-k/(2k+1)}, where k is determined by the regularity of the density of the p-value distribution, or, equivalently, of the test statistics distribution. These results are applied to one- and two-sided test statistics for Gaussian and Laplace location models, and for the Student model.
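
    In the same spirit (though not necessarily the authors' estimator), a kernel plug-in procedure can estimate pi0 as the p-value density at 1 and feed the corrected level into BH. The Gaussian kernel, reflection around the boundary at 1, and bandwidth h below are illustrative assumptions.

        import numpy as np

        def kernel_pi0(pvals, h=0.1):
            """Estimate pi0 as a reflected-Gaussian-KDE estimate of the p-value density at 1."""
            pvals = np.asarray(pvals)
            augmented = np.concatenate([pvals, 2 - pvals])  # reflect around 1
            z = (1 - augmented) / h
            kde_at_1 = 2 * np.mean(np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)) / h
            return min(1.0, kde_at_1)

        # Usage: run the BH step-up procedure at the corrected level alpha / kernel_pi0(pvals).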

    On agnostic post hoc approaches to false positive control

    This document is a book chapter giving a partial survey of post hoc approaches to false positive control.

    Selective inference after convex clustering with ℓ1 penalization

    Classical inference methods notoriously fail when applied to data-driven test hypotheses or inference targets. Instead, dedicated methodologies are required to obtain statistical guarantees for these selective inference problems. Selective inference is particularly relevant post-clustering, typically when testing a difference in mean between two clusters. In this paper, we address convex clustering with ℓ1 penalization, by leveraging related selective inference tools for regression, based on Gaussian vectors conditioned to polyhedral sets. In the one-dimensional case, we prove a polyhedral characterization of obtaining given clusters, which enables us to suggest a test procedure with statistical guarantees. This characterization also allows us to provide a computationally efficient regularization path algorithm. Then, we extend the above test procedure and guarantees to multi-dimensional clustering with ℓ1 penalization, and also to more general multi-dimensional clusterings that aggregate one-dimensional ones. With various numerical experiments, we validate our statistical guarantees and we demonstrate the power of our methods to detect differences in mean between clusters. Our methods are implemented in the R package poclin. Comment: 40 pages, 8 figures.
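
    The generic engine behind such polyhedral selective tests is the truncated-Gaussian p-value: when selection is equivalent to a Gaussian statistic falling in an interval, the selective p-value is a truncated-normal tail probability. The sketch below shows only that generic ingredient, with an assumed interface; the authors' full procedure is in the poclin package.

        from scipy.stats import norm

        def truncated_gaussian_pvalue(stat, sigma, lo, hi):
            """Right-tailed selective p-value of stat ~ N(0, sigma^2), given that
            selection forced the statistic into the interval [lo, hi]."""
            num = norm.cdf(hi / sigma) - norm.cdf(stat / sigma)
            den = norm.cdf(hi / sigma) - norm.cdf(lo / sigma)
            return num / den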

    FDP control in multivariate linear models using the bootstrap

    In this article we develop a method for performing post hoc inference on the False Discovery Proportion (FDP) over multiple contrasts of interest in the multivariate linear model. To do so, we use the bootstrap to simulate from the distribution of the null contrasts. We combine the bootstrap with the post hoc inference bounds of Blanchard (2020) and prove that doing so provides simultaneous asymptotic control of the FDP over all subsets of hypotheses. This requires us to demonstrate consistency of the multivariate bootstrap in the linear model, which we do via the Lindeberg Central Limit Theorem, providing a simpler proof of this result than that of Eck (2018). We demonstrate, via simulations, that our approach provides simultaneous control of the FDP over all subsets and is typically more powerful than existing state-of-the-art parametric methods. We illustrate our approach on functional Magnetic Resonance Imaging data from the Human Connectome Project and on a transcriptomic dataset of chronic obstructive pulmonary disease.
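
    A minimal sketch of the residual-bootstrap ingredient described above: resample residual rows, refit, and center, so that the bootstrapped contrasts mimic the joint null distribution of the contrast estimates. All names are illustrative assumptions; combining the draws with post hoc FDP bounds is the second step of the paper's method and is not sketched here.

        import numpy as np

        def bootstrap_null_contrasts(X, Y, C, B=1000, rng=None):
            """X: (n, q) design, Y: (n, p) responses, C: (L, q) contrast matrix.
            Returns (B, L, p) draws approximating the null distribution of C @ beta_hat."""
            if rng is None:
                rng = np.random.default_rng()
            n = len(Y)
            beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
            resid = Y - X @ beta_hat
            pinv = np.linalg.pinv(X)
            draws = np.empty((B, C.shape[0], Y.shape[1]))
            for b in range(B):
                Yb = X @ beta_hat + resid[rng.integers(0, n, size=n)]  # resample residual rows
                draws[b] = C @ (pinv @ Yb - beta_hat)                  # center: null contrasts
            return draws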

    Post-clustering Inference under Dependency

    Recent work by Gao et al. has laid the foundations for post-clustering inference. For the first time, the authors established a theoretical framework allowing one to test for differences between the means of estimated clusters. Additionally, they studied the estimation of unknown parameters while controlling the selective type I error. However, their theory was developed for independent observations identically distributed as p-dimensional Gaussian variables with a spherical covariance matrix. Here, we aim at extending this framework to a more convenient scenario for practical applications, where arbitrary dependence structures between observations and features are allowed. We show that a p-value for post-clustering inference under general dependence can be defined, and we assess the theoretical conditions allowing for compatible estimation of a covariance matrix. The theory is developed for hierarchical agglomerative clustering algorithms with several types of linkage, and for the k-means algorithm. We illustrate our method on synthetic data and on real protein structure data.
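
    To see why such corrections are needed at all, a tiny simulation (ours, purely a motivating demo, not the paper's corrected test) shows that clustering first and then naively testing the difference in cluster means breaks type I error control even under a global null.

        import numpy as np
        from scipy import stats
        from scipy.cluster.vq import kmeans2

        rng = np.random.default_rng(0)
        trials, rejections = 0, 0
        for _ in range(500):
            X = rng.normal(size=(40, 2))            # a single Gaussian: no true clusters
            _, labels = kmeans2(X, 2, minit='++', seed=1)
            g0, g1 = X[labels == 0, 0], X[labels == 1, 0]
            if len(g0) < 2 or len(g1) < 2:          # skip degenerate clusterings
                continue
            trials += 1
            rejections += stats.ttest_ind(g0, g1).pvalue < 0.05
        print(rejections / trials)                  # far above the nominal 0.05 level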