The Hardness of Conditional Independence Testing and the Generalised Covariance Measure
It is often said that testing for conditional independence, i.e., testing whether two random vectors X and Y are independent given Z, is a hard statistical problem if Z is a continuous random variable (or vector). In this paper, we prove that conditional independence is indeed a particularly difficult hypothesis to test for. Statistical tests are required to have a size that is smaller than a predefined significance level, and different tests usually have power against different classes of alternatives. We prove that a valid test for conditional independence does not have power against any alternative.
Given the non-existence of a uniformly valid conditional independence test, we argue that tests must be designed so their suitability for a particular problem setting may be judged easily. To address this need, we propose, in the case where X and Y are univariate, to nonlinearly regress X on Z and Y on Z, and then compute a test statistic based on the sample covariance between the residuals, which we call the generalised covariance measure (GCM). We prove that the validity of this form of test relies almost entirely on the weak requirement that the regression procedures are able to estimate the conditional means of X given Z and of Y given Z at a slow rate. We extend the methodology to handle settings where X and Y may be multivariate or even high-dimensional. While our general procedure can be tailored to the setting at hand by combining it with any regression technique, we develop the theoretical guarantees for kernel ridge regression. A simulation study shows that the test based on the GCM is competitive with state-of-the-art conditional independence tests. Code is available as the R package GeneralisedCovarianceMeasure on CRAN.
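The univariate GCM statistic is simple to compute once the two regressions are done: form the products of residuals, then normalise their mean. Below is a minimal numpy-only sketch, using a cubic polynomial fit as a hypothetical stand-in for the flexible regression method; the authors' own implementation is the R package GeneralisedCovarianceMeasure, and any sufficiently accurate regression could replace `poly_regress` here.

```python
import math
import numpy as np

def gcm_test(X, Y, Z, regress):
    """GCM test of X independent of Y given Z (univariate case).
    `regress(Z, target)` returns fitted values of E[target | Z];
    any regression procedure may be plugged in."""
    n = len(Z)
    rx = X - regress(Z, X)                  # residuals of X regressed on Z
    ry = Y - regress(Z, Y)                  # residuals of Y regressed on Z
    R = rx * ry                             # products of residuals
    T = math.sqrt(n) * R.mean() / R.std()   # approximately N(0,1) under H0
    p = math.erfc(abs(T) / math.sqrt(2))    # two-sided normal p-value
    return T, p

# Cubic polynomial regression as a simple placeholder regressor.
def poly_regress(Z, target, degree=3):
    return np.polyval(np.polyfit(Z, target, degree), Z)

rng = np.random.default_rng(1)
n = 2000
Z = rng.normal(size=n)
X = Z + 0.5 * rng.normal(size=n)
Y_null = Z**2 + 0.5 * rng.normal(size=n)        # X and Y independent given Z
Y_alt = Z**2 + X + 0.5 * rng.normal(size=n)     # X influences Y directly

T0, p0 = gcm_test(X, Y_null, Z, poly_regress)   # should not reject
T1, p1 = gcm_test(X, Y_alt, Z, poly_regress)    # should reject
```

The key point of the paper is that the Gaussian approximation for T holds as soon as both regressions estimate their conditional means at a slow rate, so the placeholder above can be swapped for any modern method.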
On b-bit min-wise hashing for large-scale regression and classification with sparse data
Large-scale regression problems, where both the number of variables, p, and the number of observations, n, may be large and in the order of millions or more, are becoming increasingly common. Typically the data are sparse: only a fraction of a percent of the entries in the design matrix are non-zero. Nevertheless, often the only computationally feasible approach is to perform dimension reduction to obtain a new design matrix with far fewer columns and then work with this compressed data.
b-bit min-wise hashing is a promising dimension reduction scheme for sparse matrices which produces a set of random features such that regression on the resulting design matrix approximates a kernel regression with the resemblance kernel. In this work, we derive bounds on the prediction error of such regressions. For both linear and logistic models, we show that the average prediction error vanishes asymptotically as long as q‖β‖₂²/n → 0, where q is the average number of non-zero entries in each row of the design matrix and β is the coefficient of the linear predictor.
We also show that ordinary least squares or ridge regression applied to the reduced data can in fact allow us to fit more flexible models. We obtain non-asymptotic prediction error bounds for interaction models and for models where an unknown row normalisation must be applied in order for the signal to be linear in the predictors. The first author was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and an EPSRC programme grant.
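The feature construction itself is short to sketch. The numpy code below maps each sparse binary row to K one-hot blocks of size 2^b: for each of K hash functions, hash the non-zero indices, take the minimum, and keep only its lowest b bits. Universal-style random hashes stand in for true random permutations here, and the function name and defaults are illustrative, not the paper's.

```python
import numpy as np

def bbit_minwise_features(rows, K=20, b=2, seed=0):
    """Map sparse binary rows (each a list of non-zero column indices)
    to a dense design matrix with K * 2^b columns via b-bit min-wise
    hashing, using random affine hashes in place of permutations."""
    rng = np.random.default_rng(seed)
    p = 2**31 - 1                              # Mersenne prime modulus
    a = rng.integers(1, p, size=K)             # hash slopes
    c = rng.integers(0, p, size=K)             # hash offsets
    out = np.zeros((len(rows), K * 2**b))
    for i, idx in enumerate(rows):
        idx = np.asarray(idx, dtype=np.int64)
        for k in range(K):
            h = (a[k] * idx + c[k]) % p        # hash each non-zero index
            low = int(h.min()) & (2**b - 1)    # lowest b bits of the minimum
            out[i, k * 2**b + low] = 1.0       # one-hot encode per hash
    return out

rows = [[0, 5, 17], [0, 5, 17], [3, 99, 200, 512]]
F = bbit_minwise_features(rows)                # shape (3, 20 * 2**2)
```

Regression is then performed on F. Identical rows map to identical features, and the probability that two rows agree in a block is governed by their resemblance, which is what links regression on F to the resemblance kernel.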
Average partial effect estimation using double machine learning
Single-parameter summaries of variable effects are desirable for ease of interpretation, but linear models, which would deliver these, may fit the data poorly. A modern approach is to estimate the average partial effect, the average slope of the regression function with respect to the predictor of interest, using a doubly robust semiparametric procedure. Most existing work has focused on specific forms of nuisance function estimators. We extend the scope to arbitrary plug-in nuisance function estimation, allowing for the use of modern machine learning methods which in particular may deliver non-differentiable regression function estimates. Our procedure involves resmoothing a user-chosen first-stage regression estimator to produce a differentiable version, and modelling the conditional distribution of the predictors through a location-scale model. We show that our proposals lead to a semiparametric efficient estimator under relatively weak assumptions. Our theory makes use of a new result on the sub-Gaussianity of Lipschitz score functions that may be of independent interest. We demonstrate the attractive numerical performance of our approach in a variety of settings, including ones with misspecification.
Comment: 61 pages, 4 figures
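The resmoothing step can be illustrated with Gaussian-kernel convolution: a non-differentiable first-stage fit f̂ is replaced by f̃(x) = E[f̂(x + hε)] with ε ~ N(0,1), whose derivative is available via the Gaussian integration-by-parts identity f̃'(x) = E[f̂(x + hε)ε]/h. The sketch below computes a naive plug-in average partial effect from a step-function fit; it shows the resmoothing idea only, not the paper's full doubly robust procedure, which adds a correction term built from the location-scale model for the predictors. All names and tuning values are illustrative.

```python
import numpy as np

def resmoothed_derivative(fhat, x, h=0.3, m=20000, seed=0):
    """Derivative of the Gaussian-resmoothed estimator
    f~(x) = E[fhat(x + h*eps)], eps ~ N(0,1), computed via the
    identity f~'(x) = E[fhat(x + h*eps) * eps] / h (Monte Carlo)."""
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=m)
    return np.mean(fhat(x + h * eps) * eps) / h

# A non-differentiable first-stage fit: a step-function approximation
# to the regression function f(x) = x (as a regression tree might return).
step_fit = lambda x: np.round(5 * x) / 5

rng = np.random.default_rng(1)
Xs = rng.uniform(-1, 1, size=100)
# Plug-in average partial effect: average the resmoothed slope over the sample.
ape = np.mean([resmoothed_derivative(step_fit, x) for x in Xs])
# ape should be close to the true average slope of 1.
```

The point of the construction is that f̃ is differentiable regardless of f̂, so modern machine learning fits with jumps can still feed a slope-based effect estimate.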
Debiased Inverse Propensity Score Weighting for Estimation of Average Treatment Effects with High-Dimensional Confounders
We consider estimation of average treatment effects given observational data with high-dimensional pretreatment variables. Existing methods for this problem typically assume some form of sparsity for the regression functions. In this work, we introduce a debiased inverse propensity score weighting (DIPW) scheme for average treatment effect estimation that delivers √n-consistent estimates of the average treatment effect when the propensity score follows a sparse logistic regression model; the regression functions are permitted to be arbitrarily complex. Our theoretical results quantify the price to pay for permitting the regression functions to be unestimable, which shows up as an inflation of the variance of the estimator compared to the semiparametric efficient variance by at most O(1) under mild conditions. Given the lack of assumptions on the regression functions, averages of transformed responses under each treatment may also be estimated at the √n rate, and so, for example, the variances of the potential outcomes may be estimated. We show how confidence intervals centred on our estimates may be constructed, and we also discuss an extension of the method to estimating projections of the heterogeneous treatment effect function.
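For orientation, the classical inverse propensity weighting estimator that DIPW modifies is the Horvitz-Thompson average below. The sketch uses the true propensity score for illustration; DIPW instead plugs in a debiased estimate from a sparse logistic regression fit, which is the step not reproduced here. The simulation design is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 10
X = rng.normal(size=(n, p))
# Sparse logistic propensity score: only the first coordinate matters.
pi = 1.0 / (1.0 + np.exp(-X[:, 0]))
A = rng.binomial(1, pi)                       # treatment assignment
tau = 2.0                                     # true average treatment effect
Y = tau * A + X[:, 0] + rng.normal(size=n)    # outcome; its form is arbitrary

# IPW estimate of the ATE: reweight treated and control responses by the
# inverse of their assignment probabilities, then average the difference.
ate_ipw = np.mean(A * Y / pi - (1 - A) * Y / (1 - pi))
```

No model for Y given X is used anywhere in the estimator, which is what allows the regression functions to be arbitrarily complex; the cost, as the abstract notes, is at most an O(1) inflation of the variance.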
Sandwich Boosting for Accurate Estimation in Partially Linear Models for Grouped Data
We study partially linear models in settings where observations are arranged in independent groups but may exhibit within-group dependence. Existing approaches estimate linear model parameters through weighted least squares, with optimal weights (given by the inverse covariance of the response, conditional on the covariates) typically estimated by maximising a (restricted) likelihood from random effects modelling or by using generalised estimating equations. We introduce a new 'sandwich loss' whose population minimiser coincides with the weights of these approaches when the parametric forms for the conditional covariance are well-specified, but can yield arbitrarily large improvements in linear parameter estimation accuracy when they are not. Under relatively mild conditions, our estimated coefficients are asymptotically Gaussian and enjoy minimal variance among estimators with weights restricted to a given class of functions, when user-chosen regression methods are used to estimate nuisance functions. We further expand the class of functional forms for the weights that may be fitted beyond parametric models by leveraging the flexibility of modern machine learning methods within a new gradient boosting scheme for minimising the sandwich loss. We demonstrate the effectiveness of both the sandwich loss and what we call 'sandwich boosting' in a variety of settings with simulated and real-world data.
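The building blocks are weighted least squares over groups and the sandwich variance whose minimisation motivates the sandwich loss. The numpy sketch below fits the linear parameter with a fixed group-level weight matrix (here the inverse of an assumed equicorrelated covariance) and returns the sandwich variance estimate; it is a stylised version of the setting, not the paper's boosting scheme, and all names are illustrative.

```python
import numpy as np

def grouped_wls(Xg, Yg, Wg):
    """Weighted least squares over independent groups:
    beta = (sum_i X_i' W_i X_i)^{-1} sum_i X_i' W_i Y_i,
    with the sandwich variance estimate A^{-1} B A^{-1}."""
    A = sum(X.T @ W @ X for X, W in zip(Xg, Wg))
    bvec = sum(X.T @ W @ Y for X, Y, W in zip(Xg, Yg, Wg))
    beta = np.linalg.solve(A, bvec)
    B = np.zeros_like(A)
    for X, Y, W in zip(Xg, Yg, Wg):
        r = Y - X @ beta                     # group residual vector
        s = X.T @ W @ r
        B += np.outer(s, s)                  # "meat" of the sandwich
    Ainv = np.linalg.inv(A)
    return beta, Ainv @ B @ Ainv             # estimate and its variance

# Simulate grouped data with within-group dependence via a shared effect.
rng = np.random.default_rng(0)
G, m, beta_true = 300, 5, np.array([1.0, -2.0])
Xg, Yg = [], []
for _ in range(G):
    X = rng.normal(size=(m, 2))
    u = rng.normal()                         # shared group effect
    Y = X @ beta_true + u + 0.5 * rng.normal(size=m)
    Xg.append(X); Yg.append(Y)

# Working weights: inverse of an (assumed) equicorrelated covariance.
W = np.linalg.inv(0.25 * np.eye(m) + np.ones((m, m)))
beta_hat, V = grouped_wls(Xg, Yg, [W] * G)
```

The sandwich loss targets the weights W that minimise exactly this sandwich variance, so it remains a sensible objective even when no parametric covariance form for the response is correct.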
The Projected Covariance Measure for assumption-lean variable significance testing
Testing the significance of a variable or group of variables X for predicting a response Y, given additional covariates Z, is a ubiquitous task in statistics. A simple but common approach is to specify a linear model and then test whether the regression coefficient for X is non-zero. However, when the model is misspecified, the test may have poor power, for example when X is involved in complex interactions, or lead to many false rejections. In this work we study the problem of testing the model-free null of conditional mean independence, i.e. that the conditional mean of Y given X and Z does not depend on X. We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests, to yield both robust error control and high power. The procedure involves using these methods to perform regressions, first to estimate a form of projection of Y on X and Z using one half of the data, and then to estimate the expected conditional covariance between this projection and Y on the remaining half of the data. While the approach is general, we show that a version of our procedure using spline regression achieves what we show is the minimax optimal rate in this nonparametric testing problem. Numerical experiments demonstrate the effectiveness of our approach, both in terms of maintaining Type I error control and in terms of power, compared to several existing approaches.
Comment: 89 pages, 5 figures
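The sample-splitting structure described above can be sketched as follows. On one half of the data, regress Y on (X, Z) and project that fit onto Z alone to isolate the part of the signal carried by X; on the other half, estimate the expected conditional covariance between this projection and Y, and normalise into a test statistic. This is a stylised sketch only: simple linear-in-features fits stand in for the flexible machine learning regressions, and the normalisation is a plain studentisation rather than the exact PCM construction.

```python
import math
import numpy as np

def fit_lin(F, y):
    """Least-squares fit on a feature matrix; returns a predict function."""
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    return lambda Fnew: Fnew @ coef

def features_xz(X, Z):   # simple basis standing in for a flexible regression
    return np.column_stack([np.ones_like(X), Z, Z**2, X, X * Z])

def features_z(Z):
    return np.column_stack([np.ones_like(Z), Z, Z**2])

def pcm_style_test(X, Y, Z):
    n = len(Y); A, B = slice(0, n // 2), slice(n // 2, n)
    # Half A: regress Y on (X, Z), then project that fit onto Z alone.
    f = fit_lin(features_xz(X[A], Z[A]), Y[A])
    h = fit_lin(features_z(Z[A]), f(features_xz(X[A], Z[A])))
    m = fit_lin(features_z(Z[A]), Y[A])          # E[Y | Z] estimate
    # Half B: expected conditional covariance between projection and Y.
    proj = f(features_xz(X[B], Z[B])) - h(features_z(Z[B]))
    L = proj * (Y[B] - m(features_z(Z[B])))
    T = math.sqrt(len(L)) * L.mean() / L.std()
    return T, 0.5 * math.erfc(T / math.sqrt(2))  # one-sided normal p-value

rng = np.random.default_rng(2)
n = 4000
Z = rng.normal(size=n); X = rng.normal(size=n)
Y0 = Z**2 + rng.normal(size=n)                # null: conditional mean free of X
Y1 = Z**2 + 0.5 * X + rng.normal(size=n)      # alternative: X matters

T0, p0 = pcm_style_test(X, Y0, Z)             # should not reject
T1, p1 = pcm_style_test(X, Y1, Z)             # should reject
```

The sample split is what makes the error control robust: the projection is learned on data independent of the half on which the covariance is estimated, so flexible, possibly overfitting regressions in the first stage do not invalidate the test.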