Distributed linear regression by averaging
Distributed statistical learning problems arise commonly when dealing with
large datasets. In this setup, datasets are partitioned over machines, which
compute locally and communicate short messages. Communication is often the
bottleneck. In this paper, we study one-step and iterative weighted parameter
averaging in statistical linear models under data parallelism. We fit a linear
regression on each machine, send the estimates to a central server, and take a
weighted average of the parameters. Optionally, we iterate: the server sends the
weighted average back, and each machine solves a local ridge regression centered
at it. How does this compare to linear regression on the full data? Here we study
the performance loss in estimation, test error, and confidence interval length
in high dimensions, where the number of parameters is comparable to the
training data size. We quantify the performance loss of one-step weighted
averaging and give results for iterative averaging. We find that different
problems are affected differently by the distributed framework: estimation
error and confidence interval length increase substantially, while
prediction error increases much less. We rely on recent results from random
matrix theory, where we develop a new calculus of deterministic equivalents as
a tool of broader interest.
Comment: V2 adds a new section on iterative averaging methods, adds
applications of the calculus of deterministic equivalents, and reorganizes
the paper.
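
The one-step and iterative schemes described above are straightforward to
prototype. The following is a minimal sketch in Python/NumPy under stated
assumptions: weights proportional to local sample sizes (one illustrative
choice; the paper analyzes weighted averaging more generally), and an
illustrative ridge penalty lam and iteration count for the refinement step.
The function names are ours, not the authors'.

    import numpy as np

    def local_ols(X, y):
        # Per-machine ordinary least squares: beta = argmin ||y - X b||^2.
        return np.linalg.lstsq(X, y, rcond=None)[0]

    def one_step_average(blocks, weights):
        # One-step distributed estimator: weighted average of local OLS fits.
        betas = [local_ols(X, y) for X, y in blocks]
        return sum(w * b for w, b in zip(weights, betas))

    def iterative_average(blocks, weights, lam=1.0, n_iter=5):
        # Iterative refinement: the server broadcasts the current average;
        # each machine solves a local ridge regression centered at it,
        #     min_b ||y - X b||^2 + lam * ||b - center||^2,
        # and the server re-averages the local solutions.
        center = one_step_average(blocks, weights)
        for _ in range(n_iter):
            local_fits = []
            for X, y in blocks:
                p = X.shape[1]
                A = X.T @ X + lam * np.eye(p)
                local_fits.append(np.linalg.solve(A, X.T @ y + lam * center))
            center = sum(w * b for w, b in zip(weights, local_fits))
        return center

    # Toy usage: partition one dataset over k = 4 machines.
    rng = np.random.default_rng(0)
    n, p, k = 800, 50, 4
    X = rng.standard_normal((n, p))
    beta = rng.standard_normal(p)
    y = X @ beta + rng.standard_normal(n)
    blocks = list(zip(np.array_split(X, k), np.array_split(y, k)))
    weights = [Xi.shape[0] / n for Xi, _ in blocks]
    print(np.linalg.norm(one_step_average(blocks, weights) - beta))
    print(np.linalg.norm(iterative_average(blocks, weights) - beta))

Note that each round communicates only p-dimensional parameter vectors, which
is the point of the scheme when communication is the bottleneck.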
Robust Inference Under Heteroskedasticity via the Hadamard Estimator
Drawing statistical inferences from large datasets in a model-robust way is
an important problem in statistics and data science. In this paper, we propose
methods that are robust to large and unequal noise in different observational
units (i.e., heteroskedasticity) for statistical inference in linear
regression. We leverage the Hadamard estimator, which is unbiased for the
variances of the ordinary least-squares coefficient estimates. This is in
contrast to the
popular White's sandwich estimator, which can be substantially biased in high
dimensions. We propose to estimate the signal strength, noise level,
signal-to-noise ratio, and mean squared error via the Hadamard estimator. We
develop a new degrees-of-freedom adjustment that gives more accurate confidence
intervals than variants of White's sandwich estimator. Moreover, we provide
conditions ensuring that the estimator is well-defined, by studying a new random
matrix ensemble in which the entries of a random orthogonal projection matrix
are squared. We also show approximate normality, using the second-order
Poincaré inequality. Our work provides improved statistical theory and methods
for linear regression in high dimensions.
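
To make the construction concrete, here is a small NumPy sketch of the Hadamard
estimator as we read it from the abstract; treat it as an illustrative
reconstruction rather than the authors' reference implementation. With
M = I - X(X'X)^{-1}X', the residuals e = My satisfy
E[e_i^2] = sum_j M_ij^2 sigma_j^2, so solving a linear system in the entrywise
(Hadamard) square of M yields unbiased noise-variance estimates, which can then
be plugged into the sandwich form for coefficient variances. The plug-in step
and the function name are our choices, and the paper's degrees-of-freedom
adjustment is not reproduced here.

    import numpy as np

    def hadamard_sandwich(X, y):
        # Illustrative reconstruction of the Hadamard estimator:
        # residuals satisfy E[e_i^2] = sum_j M_ij^2 sigma_j^2 with
        # M = I - X(X'X)^{-1}X', so inverting the entrywise square
        # M*M recovers unbiased estimates of the noise variances.
        n, p = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        M = np.eye(n) - X @ XtX_inv @ X.T      # annihilator matrix
        beta = XtX_inv @ (X.T @ y)             # OLS coefficients
        e = y - X @ beta                       # residuals, e = M y
        # Solve (M o M) sigma2 = e o e; this needs M o M invertible,
        # which is what the paper's random matrix analysis addresses.
        sigma2 = np.linalg.solve(M * M, e * e)
        # Plug into Var(beta) = (X'X)^{-1} X' diag(sigma2) X (X'X)^{-1}.
        cov_beta = XtX_inv @ (X.T @ (sigma2[:, None] * X)) @ XtX_inv
        return beta, sigma2, np.diag(cov_beta)

    # Toy usage with strongly unequal noise levels across units.
    rng = np.random.default_rng(1)
    n, p = 200, 20
    X = rng.standard_normal((n, p))
    sigma = rng.uniform(0.5, 3.0, size=n)      # heteroskedastic noise
    y = X @ rng.standard_normal(p) + sigma * rng.standard_normal(n)
    beta_hat, sigma2_hat, var_hat = hadamard_sandwich(X, y)

The per-observation estimates sigma2 also yield plug-in estimates of the noise
level and signal-to-noise ratio mentioned in the abstract.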
Regularity Properties for Sparse Regression
Statistical and machine learning theory has developed several conditions
ensuring that popular estimators such as the Lasso or the Dantzig selector
perform well in high-dimensional sparse regression, including the restricted
eigenvalue, compatibility, and sensitivity properties. However, some
of the central aspects of these conditions are not well understood. For
instance, it is unknown whether these conditions can be checked efficiently
for a given data set. This is problematic, because they are at the core of the
theory of sparse regression.
Here we provide a rigorous proof that these conditions are NP-hard to check.
This shows that, unless P = NP, the conditions are computationally infeasible
to verify, which raises questions about their practical application.
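
To see what verification involves, consider the restricted eigenvalue (RE)
constant: an infimum of ||Xv||^2 / (n ||v||^2) over a cone of approximately
sparse directions, across all supports of size s. Sampling feasible directions,
as in the sketch below, only certifies an upper bound on this infimum; a lower
bound is what the theory requires, and that is the verification problem shown
to be hard. The constant c = 3, the sampling scheme, and the function name are
illustrative choices, not taken from the paper.

    import numpy as np

    def re_upper_bound(X, s, c=3.0, n_trials=2000, seed=0):
        # Sampled upper bound on the restricted eigenvalue constant
        #   phi^2 = min over supports S with |S| <= s, and directions v
        #           with ||v_{S^c}||_1 <= c * ||v_S||_1,
        #           of ||X v||^2 / (n ||v||^2).
        # Every sampled v is feasible, so the minimum over samples can
        # only overestimate phi^2; certifying a lower bound is the part
        # that is NP-hard.
        n, p = X.shape
        rng = np.random.default_rng(seed)
        best = np.inf
        for _ in range(n_trials):
            S = rng.choice(p, size=s, replace=False)
            v = np.zeros(p)
            v[S] = rng.standard_normal(s)
            # Off-support mass, rescaled to lie inside the cone.
            off = np.setdiff1d(np.arange(p), S)
            u = rng.standard_normal(p - s)
            budget = c * np.abs(v[S]).sum() * rng.uniform()
            v[off] = u * budget / np.abs(u).sum()
            best = min(best, np.linalg.norm(X @ v) ** 2 / (n * (v @ v)))
        return best

    # Toy usage: correlated (AR(1)-type) Gaussian design.
    rng = np.random.default_rng(2)
    n, p, s = 100, 300, 5
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    print(re_upper_bound(X, s))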
However, by taking an average-case perspective instead of the worst-case view
of NP-hardness, we show that a particular condition, sensitivity, has
certain desirable properties. This condition is weaker and more general than
the others. We show that it holds with high probability in models where the
parent population is well behaved, and that it is robust to certain data
processing steps. These results are encouraging, as they provide guidance about
when the condition, and more generally the theory of sparse regression, may be
relevant in the analysis of high-dimensional correlated observational data.
Comment: Manuscript shortened and more motivation added. To appear in
Communications in Mathematics and Statistics.