A Flexible Framework for Hypothesis Testing in High-dimensions
Hypothesis testing in the linear regression model is a fundamental
statistical problem. We consider linear regression in the high-dimensional
regime where the number of parameters $p$ exceeds the number of samples $n$ ($p > n$).
In order to make informative inference, we assume that the model is
approximately sparse, that is, the effect of covariates on the response can be
well approximated by conditioning on a relatively small number of covariates
whose identities are unknown. We develop a framework for testing very general
hypotheses regarding the model parameters. Our framework encompasses testing
whether the parameter lies in a convex cone, testing the signal strength, and
testing arbitrary functionals of the parameter. We show that the proposed
procedure controls the type I error, and also analyze the power of the
procedure. Our numerical experiments confirm our theoretical findings and
demonstrate that we control the false positive rate (type I error) near the
nominal level and have high power. By the duality between hypothesis testing and
confidence intervals, the proposed framework can be used to obtain valid
confidence intervals for various functionals of the model parameters. For
linear functionals, the length of confidence intervals is shown to be minimax
rate optimal.
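As a side note on the duality invoked above (a standard fact, not a result specific to this paper): a family of level-$\alpha$ tests for $H_0\colon f(\theta) = c$ can be inverted into a confidence set for the functional $f(\theta)$.

```latex
% Test inversion: collect every value c that the level-alpha test
% T_alpha(c) fails to reject; the resulting set covers f(theta)
% with probability at least 1 - alpha.
\[
  C_{1-\alpha} = \bigl\{ c : T_\alpha(c) \text{ does not reject } H_0\colon f(\theta) = c \bigr\},
  \qquad
  \mathbb{P}\bigl( f(\theta) \in C_{1-\alpha} \bigr) \ge 1 - \alpha .
\]
```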
Optimal Sparsity Testing in Linear Regression Model
We consider the problem of sparsity testing in the high-dimensional linear
regression model. The problem is to test whether the number of non-zero
components (aka the sparsity) of the regression parameter $\theta$ is less
than or equal to $k_0$. We pinpoint the minimax separation distances for this
problem, which amounts to quantifying how far a $k_1$-sparse vector $\theta$
has to be from the set of $k_0$-sparse vectors so that a test is able to reject
the null hypothesis with high probability. Two scenarios are considered. In the
independent scenario, the covariates are i.i.d. normally distributed and the
noise level is known. In the general scenario, both the covariance matrix of
the covariates and the noise level are unknown. Although the minimax separation
distances differ in these two scenarios, both of them actually depend on $k_0$
and $k_1$, illustrating that for this composite-composite testing problem both
the size of the null and of the alternative hypotheses play a key role.
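In symbols, with $k_0$ and $k_1$ as above (a schematic formulation; the paper's exact normalizations are not reproduced here):

```latex
% Null: theta is k0-sparse. Alternative: theta is k1-sparse and at
% least rho away, in Euclidean distance, from every k0-sparse vector.
\[
  H_0\colon \|\theta\|_0 \le k_0
  \quad \text{vs.} \quad
  H_1\colon \|\theta\|_0 \le k_1
  \ \text{and}\
  \inf_{\|u\|_0 \le k_0} \|\theta - u\|_2 \ge \rho ,
\]
% and the minimax separation distance rho^*(k0, k1) is the smallest
% rho at which some test keeps both error probabilities small.
```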
Constrained High Dimensional Statistical Inference
In typical high dimensional statistical inference problems, confidence
intervals and hypothesis tests are performed for a low dimensional subset of
model parameters under the assumption that the parameters of interest are
unconstrained. However, in many problems, there are natural constraints on
model parameters and one is interested in whether the parameters are on the
boundary of the constraint or not, e.g., non-negativity constraints for
transmission rates in network diffusion. In this paper, we provide algorithms
to solve this problem of hypothesis testing in high-dimensional statistical
models under a constrained parameter space. We show that our testing
procedure attains the designed Type I error asymptotically under the null.
Numerical experiments demonstrate that our algorithm has greater power than the
standard algorithms where the constraints are ignored. We demonstrate the
effectiveness of our algorithms on two real datasets where we have
intrinsic constraints on the parameters.
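As a rough illustration only (not the paper's algorithm, which exploits the constraint geometry): given a debiased coordinate estimate and its standard error, a one-sided test against a non-negativity boundary could be assembled as below; the names theta_debiased_j and se_j are hypothetical.

```python
# Hypothetical sketch: one-sided z-test of H0: theta_j = 0 (the boundary
# of a non-negativity constraint) against H1: theta_j > 0.
from scipy.stats import norm

def boundary_test(theta_debiased_j: float, se_j: float, alpha: float = 0.05):
    """Return the one-sided p-value and the reject decision."""
    z = theta_debiased_j / se_j
    p_value = 1.0 - norm.cdf(z)      # reject for large positive z
    return p_value, p_value < alpha

# Example: debiased estimate 0.12 with standard error 0.04 (z = 3.0).
p, reject = boundary_test(0.12, 0.04)
print(f"p-value = {p:.4f}, reject H0: {reject}")
```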
Goodness-of-fit testing in high-dimensional generalized linear models
We propose a family of tests to assess the goodness-of-fit of a
high-dimensional generalized linear model. Our framework is flexible and may be
used to construct an omnibus test, to target specific non-linearities
and interaction effects, or to test the significance of
groups of variables. The methodology is based on extracting left-over signal in
the residuals from an initial fit of a generalized linear model. This can be
achieved by predicting this signal from the residuals using modern flexible
regression or machine learning methods such as random forests or boosted trees.
Under the null hypothesis that the generalized linear model is correct, no
signal is left in the residuals and our test statistic has a Gaussian limiting
distribution, translating to asymptotic control of type I error. Under a local
alternative, we establish a guarantee on the power of the test. We illustrate
the effectiveness of the methodology on simulated and real data examples by
testing goodness-of-fit in logistic regression models. Software implementing
the methodology is available in the R package `GRPtests'.
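A minimal sketch of the residual-prediction idea, written in Python with scikit-learn rather than the authors' R package, and omitting the sample splitting and calibration that make the actual test valid:

```python
# Hypothetical sketch: fit a penalized logistic regression, then ask a
# flexible learner (here a random forest) whether any signal is left in
# the residuals. A real test needs sample splitting and a studentized
# statistic; this only illustrates the mechanics.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
logits = X[:, 0] + 0.5 * X[:, 1] ** 2        # quadratic term the GLM misses
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

glm = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
resid = y - glm.predict_proba(X)[:, 1]       # raw residuals

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, resid)
leftover = rf.predict(X)

# Informal check: if the GLM were correct, the predicted leftover signal
# should be (nearly) uncorrelated with the residuals.
print(f"corr(leftover, resid) = {np.corrcoef(leftover, resid)[0, 1]:.3f}")
```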
Semi-supervised Inference for Explained Variance in High-dimensional Linear Regression and Its Applications
We consider statistical inference for the explained variance
$\beta^{\top}\Sigma\beta$ under the high-dimensional linear model
$y = X\beta + \epsilon$ in the semi-supervised setting, where $\beta$ is the
regression vector and $\Sigma$ is the design covariance matrix. A calibrated
estimator, which efficiently integrates both labelled and unlabelled data, is
proposed. It is shown that the estimator achieves the minimax optimal rate of
convergence in the general semi-supervised framework. The optimality result
characterizes how the unlabelled data affects the minimax optimal rate.
Moreover, the limiting distribution for the proposed estimator is established
and data-driven confidence intervals for the explained variance are
constructed. We further develop a randomized calibration technique for
statistical inference in the presence of weak signals and apply the obtained
inference results to a range of important statistical problems, including
signal detection and global testing, prediction accuracy evaluation, and
confidence ball construction. The numerical performance of the proposed
methodology is demonstrated in simulation studies and an analysis of estimating
heritability for a yeast segregant data set with multiple traits.
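For reference, the standard identities behind "explained variance" in this notation (assuming covariates with covariance $\Sigma$ independent of the noise; these are textbook definitions, not additional results from the paper):

```latex
% Explained variance of y = X beta + eps with noise variance sigma^2,
% and the heritability-type ratio estimated in the genetics application.
\[
  Q = \beta^{\top}\Sigma\beta,
  \qquad
  \operatorname{Var}(y_i) = \beta^{\top}\Sigma\beta + \sigma^2,
  \qquad
  h^2 = \frac{\beta^{\top}\Sigma\beta}{\beta^{\top}\Sigma\beta + \sigma^2}.
\]
```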
Online Debiasing for Adaptively Collected High-dimensional Data with Applications to Time Series Analysis
Adaptive collection of data is commonplace in applications throughout science
and engineering. From the point of view of statistical inference, however,
adaptive data collection induces memory and correlation in the samples, and
poses significant challenges. We consider high-dimensional linear
regression where the samples are collected adaptively and the sample size
$n$ can be smaller than $p$, the number of covariates. In this setting, there are
two distinct sources of bias: the first due to regularization imposed for
consistent estimation, e.g. using the LASSO, and the second due to adaptivity
in collecting the samples. We propose "online debiasing", a general procedure
for estimators such as the LASSO, which addresses both sources of bias. In two
concrete contexts, time series analysis and batched data
collection, we demonstrate that online debiasing optimally debiases the LASSO
estimate when the underlying parameter has sparsity of order
$o(\sqrt{n}/\log p)$. In this regime, the debiased estimator can be used to
compute $p$-values and confidence intervals of optimal size.
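For contrast with the online procedure described above, here is a hypothetical sketch of the standard offline debiased-LASSO construction on i.i.d. (non-adaptive) data; the crude ridge-regularized choice of the decorrelating matrix M stands in for the convex programs used in practice, and none of this reflects the paper's online modification:

```python
# Hypothetical sketch of *offline* debiased LASSO on i.i.d. data.
# Online debiasing replaces M with a construction adapted to the
# data-collection filtration; that is not shown here.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 200, 300, 5
X = rng.normal(size=(n, p))
theta = np.zeros(p)
theta[:s] = 1.0                              # s-sparse ground truth
y = X @ theta + rng.normal(size=n)

theta_hat = Lasso(alpha=0.1).fit(X, y).coef_

Sigma_hat = X.T @ X / n
# Crude decorrelating matrix: ridge-regularized inverse of Sigma_hat.
M = np.linalg.inv(Sigma_hat + 0.5 * np.eye(p))

# One-step correction built from the residuals removes the
# regularization-induced bias up to higher-order terms.
theta_debiased = theta_hat + M @ X.T @ (y - X @ theta_hat) / n

print("lasso error on support:   ", np.round(theta_hat[:s] - 1.0, 3))
print("debiased error on support:", np.round(theta_debiased[:s] - 1.0, 3))
```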