Effects of dependence in high-dimensional multiple testing problems
Background: We consider the effects of dependence among variables of high-dimensional data in multiple hypothesis testing problems, in particular on False Discovery Rate (FDR) control procedures. Recent simulation studies consider only simple correlation structures among variables, which hardly reflect features of real data. Our aim is to systematically study the effects of several network features, such as sparsity and correlation strength, by imposing dependence structures among variables using random correlation matrices.
Results: We study the robustness against dependence of several FDR procedures that are popular in microarray studies, such as Benjamini-Hochberg FDR, Storey's q-value, SAM and resampling-based FDR procedures. False Non-discovery Rates and estimates of the number of null hypotheses are computed from those methods and compared. Our simulation study shows that methods such as SAM and the q-value do not adequately control the FDR at the claimed level under dependence conditions. On the other hand, the adaptive Benjamini-Hochberg procedure seems to be the most robust while remaining conservative. Finally, the estimates of the number of true null hypotheses under various dependence conditions are variable.
Conclusion: We discuss a new method for efficient guided simulation of dependent data that satisfy imposed network constraints as conditional independence structures. Our simulation set-up allows for a structural study of the effect of dependencies on multiple testing criteria and is useful for testing a potentially new method for π0 or FDR estimation in a dependency context.
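For readers unfamiliar with the Benjamini-Hochberg procedure named in this abstract, the following is a minimal sketch of the standard (non-adaptive) step-up rule, not the authors' simulation code. The function name `benjamini_hochberg` and the `alpha` parameter are illustrative assumptions. The rule sorts the m p-values, finds the largest rank k with p_(k) <= k * alpha / m, and rejects the k hypotheses with the smallest p-values.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Standard Benjamini-Hochberg step-up rule (illustrative sketch).

    Returns a list of booleans, True where the hypothesis is rejected.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * alpha / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max, even if some of
    # them individually miss their own threshold (the "step-up" part).
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals, alpha=0.05))
```

This rule is known to control the FDR under independence and certain forms of positive dependence; the abstract's point is that behaviour under more general dependence has to be checked by simulation.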
Covariate-assisted ranking and screening for large-scale two-sample inference
Two-sample multiple testing has a wide range of applications. The conventional practice first reduces the original observations to a vector of p-values and then chooses a cutoff to adjust for multiplicity. However, this data reduction step could cause significant loss of information and thus lead to suboptimal testing procedures. We introduce a new framework for two-sample multiple testing by incorporating a carefully constructed auxiliary variable in inference to improve the power. A data-driven multiple-testing procedure is developed by employing a covariate-assisted ranking and screening (CARS) approach that optimally combines the information from both the primary and the auxiliary variables. The proposed CARS procedure is shown to be asymptotically valid and optimal for false discovery rate control. The procedure is implemented in the R package CARS. Numerical results confirm the effectiveness of CARS in false discovery rate control and show that it achieves substantial power gain over existing methods. CARS is also illustrated through an application to the analysis of a satellite imaging data set for supernova detection.
Cramér-type moderate deviations for Studentized two-sample U-statistics with applications
Two-sample U-statistics are widely used in a broad range of applications,
including those in the fields of biostatistics and econometrics. In this paper,
we establish sharp Cramér-type moderate deviation theorems for Studentized
two-sample U-statistics in a general framework, including the two-sample
t-statistic and Studentized Mann-Whitney test statistic as prototypical
examples. In particular, a refined moderate deviation theorem with second-order
accuracy is established for the two-sample t-statistic. These results extend
the applicability of the existing statistical methodologies from the one-sample
t-statistic to more general nonlinear statistics. Applications to two-sample
large-scale multiple testing problems with false discovery rate control and the
regularized bootstrap method are also discussed.
Comment: Published at http://dx.doi.org/10.1214/15-AOS1375 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Simulation-Based Hypothesis Testing of High Dimensional Means Under Covariance Heterogeneity
In this paper, we study the problem of testing the mean vectors of high
dimensional data in both one-sample and two-sample cases. The proposed testing
procedures employ maximum-type statistics and the parametric bootstrap
techniques to compute the critical values. Different from the existing tests
that heavily rely on the structural conditions on the unknown covariance
matrices, the proposed tests allow general covariance structures of the data
and therefore enjoy a wide scope of applicability in practice. To enhance the
power of the tests against sparse alternatives, we further propose two-step
procedures with a preliminary feature screening step. Theoretical properties of
the proposed tests are investigated. Through extensive numerical experiments on
synthetic datasets and a human acute lymphoblastic leukemia gene expression
dataset, we illustrate the performance of the new tests and how they may
assist in detecting disease-associated gene-sets. The proposed
methods have been implemented in an R package, HDtest, and are available on CRAN.
Comment: 34 pages, 10 figures; Accepted for biometric
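The idea of a maximum-type statistic with a parametric bootstrap critical value can be sketched as follows for the one-sample null hypothesis that the mean vector is zero. This is a hedged illustration only, not the HDtest implementation: the function name `max_type_test`, the non-Studentized statistic, and all parameters are assumptions, and the screening step is omitted. Draws from N(0, Σ̂) are generated as standard-normal-weighted sums of the centered rows, which conditional on the data have exactly covariance Σ̂ (with 1/n normalization) and avoid a matrix factorization.

```python
import math
import random


def max_type_test(data, alpha=0.05, n_boot=500, seed=0):
    """One-sample max-type test of H0: mean vector = 0 (illustrative sketch).

    data: list of n rows, each a list of p floats.
    Returns (statistic, critical_value, reject).
    """
    rng = random.Random(seed)
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]

    # Test statistic: sqrt(n) times the largest absolute coordinate mean.
    stat = math.sqrt(n) * max(abs(m) for m in means)

    # Parametric bootstrap: each z is a draw from N(0, Sigma_hat), built as
    # sum_i e_i * centered_i / sqrt(n) with e_i i.i.d. standard normal.
    boot = []
    for _ in range(n_boot):
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = [
            sum(e[i] * centered[i][j] for i in range(n)) / math.sqrt(n)
            for j in range(p)
        ]
        boot.append(max(abs(v) for v in z))

    # Critical value: empirical (1 - alpha) quantile of the bootstrap maxima.
    boot.sort()
    k = min(n_boot - 1, math.ceil((1 - alpha) * n_boot) - 1)
    crit = boot[k]
    return stat, crit, stat > crit


if __name__ == "__main__":
    gen = random.Random(1)
    # Synthetic data: 50 samples, 20 coordinates, mean shift in coordinate 0.
    sample = [[gen.gauss(2.0 if j == 0 else 0.0, 1.0) for j in range(20)]
              for _ in range(50)]
    print(max_type_test(sample))
```

Because the bootstrap draws use the full sample covariance, no sparsity or other structural condition on the covariance matrix is needed, which matches the motivation stated in the abstract.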
Multi-Entity Dependence Learning with Rich Context via Conditional Variational Auto-encoder
Multi-Entity Dependence Learning (MEDL) explores conditional correlations
among multiple entities. The availability of rich contextual information
requires a nimble learning scheme that tightly integrates with deep neural
networks and has the ability to capture correlation structures among
exponentially many outcomes. We propose MEDL_CVAE, which encodes a conditional
multivariate distribution as a generating process. As a result, the variational
lower bound of the joint likelihood can be optimized via a conditional
variational auto-encoder and trained end-to-end on GPUs. Our MEDL_CVAE was
motivated by two real-world applications in computational sustainability: one
studies the spatial correlation among multiple bird species using the eBird
data and the other models multi-dimensional landscape composition and human
footprint in the Amazon rainforest with satellite images. We show that
MEDL_CVAE captures rich dependency structures, scales better than previous
methods, and further improves on the joint likelihood taking advantage of very
large datasets that are beyond the capacity of previous methods.
Comment: The first two authors contribute equall