Comparing large covariance matrices under weak conditions on the dependence structure and its application to gene clustering
Comparing large covariance matrices has important applications in modern
genomics, where scientists are often interested in understanding whether
relationships (e.g., dependencies or co-regulations) among a large number of
genes vary between different biological states. We propose a computationally
fast procedure for testing the equality of two large covariance matrices when
the dimensions of the covariance matrices are much larger than the sample
sizes. A distinguishing feature of the new procedure is that it imposes no
structural assumptions on the unknown covariance matrices. Hence the test is
robust with respect to various complex dependence structures that frequently
arise in genomics. We prove that the proposed procedure is asymptotically valid
under weak moment conditions. As an interesting application, we derive a new
gene clustering algorithm which shares the same nice property of avoiding
restrictive structural assumptions for high-dimensional genomics data. Using an
asthma gene expression dataset, we illustrate how the new test helps compare
the covariance matrices of the genes across different gene sets/pathways
between the disease group and the control group, and how the gene clustering
algorithm provides new insights on the way gene clustering patterns differ
between the two groups. The proposed methods have been implemented in the
R package HDtest and are available on CRAN.
Comment: The original title, dating back to May 2015, is "Bootstrap Tests on High
Dimensional Covariance Matrices with Applications to Understanding Gene
Clustering"
Simulation-Based Hypothesis Testing of High Dimensional Means Under Covariance Heterogeneity
In this paper, we study the problem of testing the mean vectors of high
dimensional data in both one-sample and two-sample cases. The proposed testing
procedures employ maximum-type statistics and the parametric bootstrap
techniques to compute the critical values. Unlike existing tests
that rely heavily on structural conditions on the unknown covariance
matrices, the proposed tests allow general covariance structures of the data
and therefore enjoy a wide scope of applicability in practice. To enhance the power
of the tests against sparse alternatives, we further propose two-step
procedures with a preliminary feature screening step. Theoretical properties of
the proposed tests are investigated. Through extensive numerical experiments on
synthetic datasets and a human acute lymphoblastic leukemia gene expression
dataset, we illustrate the performance of the new tests and how they may
help detect disease-associated gene sets. The proposed
methods have been implemented in the R package HDtest and are available on CRAN.
Comment: 34 pages, 10 figures; accepted for Biometrics
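To make the idea of a maximum-type statistic with simulated critical values concrete, here is a minimal Python sketch for the one-sample case. It is not the HDtest implementation and omits the screening step. The function name `max_type_mean_test`, the studentization, and the use of Gaussian multipliers to simulate from a normal law with the estimated correlation structure are all assumptions made for illustration.

```python
import numpy as np

def max_type_mean_test(X, n_boot=500, seed=0):
    """Test H0: mean vector = 0 with a studentized max statistic (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mean = X.mean(axis=0)
    sd = X.std(axis=0, ddof=1)
    # Observed statistic: largest absolute coordinate-wise t-type ratio.
    stat = np.sqrt(n) * np.max(np.abs(mean) / sd)

    # Centered, standardized data; Z.T @ Z / n approximates the correlation matrix.
    Z = (X - mean) / sd
    # Conditional on the data, each row of G @ Z / sqrt(n) is N(0, Z.T @ Z / n),
    # so no p-by-p matrix has to be formed or factorized.
    G = rng.standard_normal((n_boot, n))
    boot = np.max(np.abs(G @ Z), axis=1) / np.sqrt(n)

    p_value = (1 + np.sum(boot >= stat)) / (n_boot + 1)
    return stat, p_value

# Synthetic check: p = 500 coordinates, n = 100 observations.
rng = np.random.default_rng(1)
X_null = rng.standard_normal((100, 500))
X_alt = X_null.copy()
X_alt[:, :5] += 1.0  # sparse alternative: only five coordinates are shifted
stat_null, p_null = max_type_mean_test(X_null)
stat_alt, p_alt = max_type_mean_test(X_alt)
```

The simulated maxima depend on the data only through the estimated correlation structure, which is why no sparsity or banding assumption on the covariance matrix is needed in this sketch.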
Matrix Completion via Max-Norm Constrained Optimization
Matrix completion has been well studied under the uniform sampling model and
the trace-norm regularized methods perform well both theoretically and
numerically in such a setting. However, the uniform sampling model is
unrealistic for a range of applications and the standard trace-norm relaxation
can behave very poorly when the underlying sampling scheme is non-uniform.
In this paper we propose and analyze a max-norm constrained empirical risk
minimization method for noisy matrix completion under a general sampling model.
The optimal rate of convergence is established under the Frobenius norm loss in
the context of approximately low-rank matrix reconstruction. It is shown that
the max-norm constrained method is minimax rate-optimal and yields a unified
and robust approximate recovery guarantee, with respect to the sampling
distributions. The computational effectiveness of this method is also
discussed, based on first-order algorithms for solving convex optimizations
involving max-norm regularization.
Comment: 33 pages
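As a rough illustration of a first-order method for a max-norm constraint, the sketch below uses the standard fact that a matrix has max-norm at most R when it factors as U V^T with every row of U and V having Euclidean norm at most sqrt(R). It runs projected gradient descent on the factors, which is a nonconvex heuristic and not necessarily the algorithm discussed in the paper. The names `project_rows` and `max_norm_complete`, the factor width `k`, the step size, and the squared-error loss are assumptions chosen for illustration.

```python
import numpy as np

def project_rows(A, radius):
    """Scale any row whose Euclidean norm exceeds `radius` back onto the ball."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A * np.minimum(1.0, radius / np.maximum(norms, 1e-12))

def max_norm_complete(Y, mask, R=2.0, k=10, step=0.02, n_iter=1000, seed=0):
    """Fit U @ V.T to the observed entries of Y with row norms of U, V <= sqrt(R)."""
    rng = np.random.default_rng(seed)
    d1, d2 = Y.shape
    radius = np.sqrt(R)
    U = project_rows(0.1 * rng.standard_normal((d1, k)), radius)
    V = project_rows(0.1 * rng.standard_normal((d2, k)), radius)
    for _ in range(n_iter):
        resid = mask * (U @ V.T - Y)  # residual on observed entries only
        # Gradient step on 0.5 * ||mask * (U V^T - Y)||_F^2, then row projection.
        U_new = project_rows(U - step * resid @ V, radius)
        V_new = project_rows(V - step * resid.T @ U, radius)
        U, V = U_new, V_new
    return U @ V.T, U, V

# Synthetic rank-2 matrix with entries in [-1, 1], about half of them observed with noise.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 2))
A /= np.linalg.norm(A, axis=1, keepdims=True)
B = rng.standard_normal((30, 2))
B /= np.linalg.norm(B, axis=1, keepdims=True)
M = A @ B.T
Y = M + 0.1 * rng.standard_normal(M.shape)
mask = (rng.random(M.shape) < 0.5).astype(float)

M_hat, U, V = max_norm_complete(Y, mask)
rmse_fit = np.sqrt(np.sum(mask * (M_hat - Y) ** 2) / mask.sum())
rmse_zero = np.sqrt(np.sum(mask * Y ** 2) / mask.sum())  # baseline: predict zero everywhere
```

Because the constraint acts row by row, it does not depend on how often each row or column is sampled, which is one informal way to see why such a constraint may be less sensitive to non-uniform sampling than a trace-norm penalty.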
A Max-Norm Constrained Minimization Approach to 1-Bit Matrix Completion
We consider in this paper the problem of noisy 1-bit matrix completion under
a general non-uniform sampling distribution using the max-norm as a convex
relaxation for the rank. A max-norm constrained maximum likelihood estimate is
introduced and studied. The rate of convergence for the estimate is obtained.
Information-theoretical methods are used to establish a minimax lower bound
under the general sampling model. The minimax upper and lower bounds together
yield the optimal rate of convergence for the Frobenius norm loss.
Computational algorithms and numerical performance are also discussed.
Comment: 33 pages, 3 figures
Cramér type moderate deviation theorems for self-normalized processes
Cramér type moderate deviation theorems quantify the accuracy of the
relative error of the normal approximation and provide theoretical
justifications for many commonly used methods in statistics. In this paper, we
develop a new randomized concentration inequality and establish a Cramér type
moderate deviation theorem for general self-normalized processes which include
many well-known Studentized nonlinear statistics. In particular, a sharp
moderate deviation theorem under optimal moment conditions is established for
Studentized U-statistics.
Comment: Published at http://dx.doi.org/10.3150/15-BEJ719 in the Bernoulli
(http://isi.cbs.nl/bernoulli/) by the International Statistical
Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)
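For orientation, a Cramér type moderate deviation result typically has the generic shape below. The symbols $T_n$ (the statistic), $\Phi$ (the standard normal distribution function), and $a_n$ (the range of validity) are notation introduced here for illustration and are not taken from the abstract.

```latex
\frac{\mathbb{P}(T_n \ge x)}{1 - \Phi(x)} \longrightarrow 1
\quad \text{uniformly in } 0 \le x \le a_n,
\qquad \text{where } a_n \to \infty \text{ as } n \to \infty .
```

In the classical case of self-normalized sums of i.i.d. variables with finite third moments, the ratio is known to converge for $x = o(n^{1/6})$. The range that applies to a given Studentized statistic depends on the moment conditions of the specific theorem.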
Nonparametric covariate-adjusted regression
We consider nonparametric estimation of a regression curve when the data are
observed with multiplicative distortion which depends on an observed
confounding variable. We suggest several estimators, ranging from a relatively
simple one that relies on restrictive assumptions usually made in the
literature, to a sophisticated piecewise approach that involves reconstructing
a smooth curve from an estimator of a constant multiple of its absolute value,
and which can be applied in much more general scenarios. We show that, although
our nonparametric estimators are constructed from predictors of the unobserved
undistorted data, they have the same first order asymptotic properties as the
standard estimators that could be computed if the undistorted data were
available. We illustrate the good numerical performance of our methods on both
simulated and real datasets.
Comment: 32 pages, 4 figures
On Gaussian Comparison Inequality and Its Application to Spectral Analysis of Large Random Matrices
Recently, Chernozhukov, Chetverikov, and Kato [Ann. Statist. 42 (2014)
1564--1597] developed a new Gaussian comparison inequality for approximating
the suprema of empirical processes. This paper exploits this technique to
devise sharp inference on spectra of large random matrices. In particular, we
show that two long-standing problems in random matrix theory can be solved: (i)
simple bootstrap inference on sample eigenvalues when true eigenvalues are
tied; (ii) conducting two-sample Roy's covariance test in high dimensions. To
establish the asymptotic results, a generalized ε-net argument
regarding the matrix rescaled spectral norm and several new empirical process
bounds are developed and are of independent interest.
Comment: to appear in Bernoulli
Are Discoveries Spurious? Distributions of Maximum Spurious Correlations and Their Applications
Over the last two decades, many exciting variable selection methods have been
developed for finding a small group of covariates that are associated with the
response from a large pool. Can the discoveries from these data mining
approaches be spurious due to high dimensionality and limited sample size? Can
our fundamental assumptions about the exogeneity of the covariates needed for
such variable selection be validated with the data? To answer these questions,
we need to derive the distributions of the maximum spurious correlations given
a certain number of predictors, namely, the distribution of the correlation of
a response variable Y with the best s linear combinations of p covariates
X, even when X and Y are independent. When the
covariance matrix of X possesses the restricted eigenvalue property,
we derive such distributions for both a finite s and a diverging s, using
Gaussian approximation and empirical process techniques. However, such a
distribution depends on the unknown covariance matrix of X. Hence,
we use the multiplier bootstrap procedure to approximate the unknown
distributions and establish the consistency of such a simple bootstrap
approach. The results are further extended to the situation where the residuals
are from regularized fits. Our approach is then used to construct the upper
confidence limit for the maximum spurious correlation and to test the
exogeneity of the covariates. The former provides a baseline for guarding
against false discoveries and the latter tests whether our fundamental
assumptions for high-dimensional model selection are statistically valid. Our
techniques and results are illustrated with both numerical examples and real
data analysis.
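The phenomenon of spurious correlation can be illustrated with a short Monte Carlo sketch. It covers only the simplest case of a single selected covariate (s = 1) with independent standard normal data, and it simulates the distribution directly instead of using the multiplier bootstrap from the abstract. The function name `max_spurious_corr` and all parameter values are assumptions for illustration.

```python
import numpy as np

def max_spurious_corr(n, p, n_rep=200, seed=0):
    """Simulate max_j |corr(X_j, y)| when y is independent of all p covariates."""
    rng = np.random.default_rng(seed)
    out = np.empty(n_rep)
    for r in range(n_rep):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # generated independently of X
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        # Sample correlation of y with each column of X.
        corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
        out[r] = np.max(np.abs(corr))
    return out

# Same sample size, very different numbers of candidate covariates.
small = max_spurious_corr(n=50, p=10)
large = max_spurious_corr(n=50, p=1000)
upper_limit = np.quantile(large, 0.95)  # simulated 95% upper limit for p = 1000
```

A correlation found by variable selection that falls below such an upper limit is not distinguishable from what pure noise would produce at the same n and p.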