6 research outputs found

    A Flexible Framework for Hypothesis Testing in High-dimensions

    Hypothesis testing in the linear regression model is a fundamental statistical problem. We consider linear regression in the high-dimensional regime where the number of parameters exceeds the number of samples ($p > n$). In order to make informative inference, we assume that the model is approximately sparse, that is, the effect of covariates on the response can be well approximated by conditioning on a relatively small number of covariates whose identities are unknown. We develop a framework for testing very general hypotheses regarding the model parameters, encompassing tests of whether the parameter lies in a convex cone, tests of the signal strength, and tests of arbitrary functionals of the parameter. We show that the proposed procedure controls the type I error, and we also analyze its power. Our numerical experiments confirm the theoretical findings and demonstrate that the procedure controls the false positive rate (type I error) near the nominal level and has high power. By the duality between hypothesis testing and confidence intervals, the proposed framework can be used to obtain valid confidence intervals for various functionals of the model parameters. For linear functionals, the length of the confidence intervals is shown to be minimax rate optimal. Comment: 45 pages.
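
    By the duality noted in the abstract, a confidence interval for a linear functional $a^{\intercal}\theta$ follows from inverting the corresponding tests. The sketch below, in Python, illustrates the flavour of such a debiased-lasso construction; it is a minimal illustration under strong simplifying assumptions (a pseudo-inverse of the empirical covariance stands in for a proper precision-matrix estimate), not the paper's procedure, and the function name is hypothetical.

        import numpy as np
        from scipy.stats import norm
        from sklearn.linear_model import LassoCV

        def debiased_functional_ci(X, y, a, level=0.95):
            """Sketch: CI for a^T theta via a one-step debiased lasso.
            Assumes a roughly well-conditioned, centered design."""
            n = X.shape[0]
            model = LassoCV(cv=5).fit(X, y)
            theta_hat = model.coef_
            resid = y - model.predict(X)
            Sigma_hat = X.T @ X / n
            M = np.linalg.pinv(Sigma_hat)    # crude precision-matrix stand-in
            theta_d = theta_hat + M @ X.T @ resid / n   # one-step bias correction
            df = max(n - np.count_nonzero(theta_hat), 1)
            sigma_hat = np.sqrt(resid @ resid / df)     # noise-level estimate
            se = sigma_hat * np.sqrt(a @ M @ Sigma_hat @ M @ a / n)
            z = norm.ppf(0.5 + level / 2)
            est = float(a @ theta_d)
            return est - z * se, est + z * se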

    Optimal Sparsity Testing in Linear Regression Model

    We consider the problem of sparsity testing in the high-dimensional linear regression model. The problem is to test whether the number of non-zero components (the sparsity) of the regression parameter $\theta^*$ is less than or equal to $k_0$. We pinpoint the minimax separation distances for this problem, which amounts to quantifying how far a $k_1$-sparse vector $\theta^*$ has to be from the set of $k_0$-sparse vectors so that a test is able to reject the null hypothesis with high probability. Two scenarios are considered. In the independent scenario, the covariates are i.i.d. normally distributed and the noise level is known. In the general scenario, both the covariance matrix of the covariates and the noise level are unknown. Although the minimax separation distances differ in these two scenarios, both of them depend on $k_0$ and $k_1$, illustrating that for this composite-composite testing problem the sizes of both the null and the alternative hypotheses play a key role. Comment: 50 pages.
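
    One standard way to formalize the separation quantity described above (consistent with the abstract, though not necessarily the paper's exact notation) is as the $\ell_2$ distance from $\theta^*$ to the null parameter set:

        \[
          d\bigl(\theta^*,\,\mathbb{B}_0[k_0]\bigr)
            \;=\; \min_{\theta:\ \|\theta\|_0 \le k_0} \|\theta^* - \theta\|_2,
          \qquad
          \mathbb{B}_0[k_0] \;=\; \{\theta \in \mathbb{R}^p : \|\theta\|_0 \le k_0\}.
        \]

    For a $k_1$-sparse $\theta^*$ with $k_1 > k_0$, this distance is simply the $\ell_2$ norm of its $k_1 - k_0$ smallest-magnitude non-zero entries; the minimax separation distance is then the smallest radius such that some test rejects with high probability whenever $d(\theta^*, \mathbb{B}_0[k_0])$ exceeds it.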

    Constrained High Dimensional Statistical Inference

    In typical high-dimensional statistical inference problems, confidence intervals and hypothesis tests are performed for a low-dimensional subset of model parameters under the assumption that the parameters of interest are unconstrained. In many problems, however, there are natural constraints on the model parameters, and one is interested in whether a parameter lies on the boundary of the constraint set, e.g., non-negativity constraints on transmission rates in network diffusion. In this paper, we provide algorithms for hypothesis testing in high-dimensional statistical models under a constrained parameter space. We show that our testing procedure asymptotically attains the designed Type I error under the null. Numerical experiments demonstrate that our algorithm has greater power than standard algorithms that ignore the constraints. We demonstrate the effectiveness of our algorithms on two real datasets with intrinsic constraints on the parameters.
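
    As a toy illustration of why respecting a constraint increases power (a textbook one-sided test, not the paper's algorithm; the function name and inputs are hypothetical): under a non-negativity constraint, a test of $H_0$: $\theta_j = 0$ only needs to guard against positive deviations, which halves the p-value relative to the unconstrained two-sided test.

        from scipy.stats import norm

        def boundary_test(theta_j_hat, se_j, alpha=0.05):
            """One-sided z-test of H0: theta_j = 0 vs H1: theta_j > 0,
            as appropriate when theta_j is constrained to be non-negative."""
            z = theta_j_hat / se_j
            p_one_sided = norm.sf(z)             # upper tail only
            p_two_sided = 2 * norm.sf(abs(z))    # constraint-blind comparison
            return p_one_sided, p_two_sided, p_one_sided < alpha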

    Goodness-of-fit testing in high-dimensional generalized linear models

    We propose a family of tests to assess the goodness-of-fit of a high-dimensional generalized linear model. Our framework is flexible: it may be used to construct an omnibus test, directed tests against specific non-linearities and interaction effects, or tests of the significance of groups of variables. The methodology is based on extracting left-over signal from the residuals of an initial generalized linear model fit. This can be achieved by predicting this signal from the residuals using modern flexible regression or machine learning methods such as random forests or boosted trees. Under the null hypothesis that the generalized linear model is correct, no signal is left in the residuals, and our test statistic has a Gaussian limiting distribution, translating to asymptotic control of type I error. Under a local alternative, we establish a guarantee on the power of the test. We illustrate the effectiveness of the methodology on simulated and real data examples by testing goodness-of-fit in logistic regression models. Software implementing the methodology is available in the R package 'GRPtests'. Comment: 40 pages, 4 figures.
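
    The residual-prediction idea can be sketched in a few lines of Python (a simplified illustration, not the GRPtests statistic, which is studentized and has a formal Gaussian limit; the function name and the out-of-bag $R^2$ heuristic are assumptions made here for concreteness):

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.linear_model import LogisticRegressionCV
        from sklearn.model_selection import train_test_split

        def residual_prediction_check(X, y, seed=0):
            """Fit an l1-regularized logistic model on one half of the data,
            then see whether a random forest can predict the residuals on
            the other half. Positive OOB R^2 hints at misspecification."""
            X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.5,
                                                  random_state=seed)
            glm = LogisticRegressionCV(penalty="l1",
                                       solver="liblinear").fit(X_a, y_a)
            resid = y_b - glm.predict_proba(X_b)[:, 1]   # held-out residuals
            rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                                       random_state=seed).fit(X_b, resid)
            return rf.oob_score_    # near or below 0 if the GLM fits well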

    Semi-supervised Inference for Explained Variance in High-dimensional Linear Regression and Its Applications

    We consider statistical inference for the explained variance $\beta^{\intercal}\Sigma\beta$ under the high-dimensional linear model $Y = X\beta + \epsilon$ in the semi-supervised setting, where $\beta$ is the regression vector and $\Sigma$ is the design covariance matrix. A calibrated estimator, which efficiently integrates both labelled and unlabelled data, is proposed. It is shown that the estimator achieves the minimax optimal rate of convergence in the general semi-supervised framework. The optimality result characterizes how the unlabelled data affect the minimax optimal rate. Moreover, the limiting distribution of the proposed estimator is established, and data-driven confidence intervals for the explained variance are constructed. We further develop a randomized calibration technique for statistical inference in the presence of weak signals and apply the obtained inference results to a range of important statistical problems, including signal detection and global testing, prediction accuracy evaluation, and confidence ball construction. The numerical performance of the proposed methodology is demonstrated in simulation studies and in an analysis estimating heritability for a yeast segregant data set with multiple traits.
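
    A naive plug-in sketch in Python shows how the unlabelled data enter (this is not the paper's calibrated estimator, which corrects the plug-in bias; the function name and the pooled covariance estimate are illustrative assumptions): $\beta$ is estimated from the labelled pairs only, while $\Sigma$ is estimated from all available covariates.

        import numpy as np
        from sklearn.linear_model import LassoCV

        def plugin_explained_variance(X_lab, y_lab, X_unlab):
            """Plug-in estimate of beta^T Sigma beta using labelled data
            for beta and labelled + unlabelled covariates for Sigma."""
            beta_hat = LassoCV(cv=5).fit(X_lab, y_lab).coef_
            X_all = np.vstack([X_lab, X_unlab])
            X_all = X_all - X_all.mean(axis=0)          # center pooled design
            Sigma_hat = X_all.T @ X_all / X_all.shape[0]
            return float(beta_hat @ Sigma_hat @ beta_hat)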

    Online Debiasing for Adaptively Collected High-dimensional Data with Applications to Time Series Analysis

    Adaptive collection of data is commonplace in applications throughout science and engineering. From the point of view of statistical inference, however, adaptive data collection induces memory and correlation in the samples and poses significant challenges. We consider high-dimensional linear regression, where the samples are collected adaptively and the sample size $n$ can be smaller than $p$, the number of covariates. In this setting, there are two distinct sources of bias: the first due to the regularization imposed for consistent estimation, e.g. using the LASSO, and the second due to the adaptivity in collecting the samples. We propose "online debiasing", a general procedure for estimators such as the LASSO, which addresses both sources of bias. In two concrete contexts, (i) time series analysis and (ii) batched data collection, we demonstrate that online debiasing optimally debiases the LASSO estimate when the underlying parameter $\theta_0$ has sparsity of order $o(\sqrt{n}/\log p)$. In this regime, the debiased estimator can be used to compute $p$-values and confidence intervals of optimal size. Comment: 66 pages, 2 tables, 11 figures; updated with minor fixes and reorganization.
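
    A schematic two-batch sketch in Python conveys the "online" idea (an illustration under simplifying assumptions, not the paper's estimator; the function name and the pseudo-inverse decorrelator are choices made here): the decorrelating matrix applied to batch 2 is built from batch 1 alone, so it is independent of batch 2's noise even though batch 2's design was chosen adaptively.

        import numpy as np
        from sklearn.linear_model import Lasso

        def online_debias_two_batches(X1, y1, X2, y2, lam=0.1):
            """Debias a lasso fit using a decorrelating matrix computed
            only from data collected before the batch being corrected.
            Assumes centered data (no intercept)."""
            theta_hat = Lasso(alpha=lam, fit_intercept=False).fit(
                np.vstack([X1, X2]), np.concatenate([y1, y2])).coef_
            Sigma1 = X1.T @ X1 / X1.shape[0]
            M = np.linalg.pinv(Sigma1)          # predictable w.r.t. batch 2
            resid2 = y2 - X2 @ theta_hat
            return theta_hat + M @ X2.T @ resid2 / X2.shape[0]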