Distributed linear regression by averaging
Distributed statistical learning problems arise commonly when dealing with
large datasets. In this setup, datasets are partitioned across machines, which
compute locally and communicate short messages. Communication is often the
bottleneck. In this paper, we study one-step and iterative weighted parameter
averaging in statistical linear models under data parallelism. We do linear
regression on each machine, send the results to a central server, and take a
weighted average of the parameters. Optionally, we iterate, sending back the
weighted average and doing local ridge regressions centered at it. How does
this compare to doing linear regression on the full data? Here we study
the performance loss in estimation, test error, and confidence interval length
in high dimensions, where the number of parameters is comparable to the
training data size. We quantify the performance loss of one-step weighted
averaging and also give results for iterative averaging. We also find that
different problems are affected differently by the distributed framework:
estimation error and confidence interval length increase substantially, while
prediction error increases much less. We rely on recent results from random
matrix theory, where we develop a new calculus of deterministic equivalents as
a tool of broader interest.

Comment: V2 adds a new section on iterative averaging methods, adds
applications of the calculus of deterministic equivalents, and reorganizes
the paper.
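As a rough illustration of the scheme described in the abstract, the sketch below fits ordinary least squares on each machine's shard, combines the estimates by weighted averaging, and optionally iterates via local ridge regressions centered at the current average. The sample-size weighting and the ridge penalty `lam` are illustrative assumptions, not the paper's optimal choices.

```python
import numpy as np

def local_ols(X, y):
    """Ordinary least squares fit on one machine's data shard."""
    # lstsq is more robust than an explicit inverse when p is close to n
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def one_step_average(shards, weights=None):
    """One-step weighted parameter averaging across machines.

    shards: list of (X_i, y_i) local datasets
    weights: per-machine weights; defaults to sample-size weighting
             (an illustrative choice; the paper derives better weights)
    """
    betas = [local_ols(X, y) for X, y in shards]
    if weights is None:
        n = np.array([X.shape[0] for X, _ in shards], dtype=float)
        weights = n / n.sum()
    return sum(w * b for w, b in zip(weights, betas))

def iterate_ridge(shards, beta_bar, lam=1.0, n_iter=5):
    """Optional iteration: local ridge regressions centered at the average.

    Each machine solves argmin_b ||y - X b||^2 + lam * ||b - beta_bar||^2,
    i.e. (X'X + lam I) b = X'y + lam * beta_bar; the server then averages
    (uniformly here, for simplicity) and sends the result back.
    """
    for _ in range(n_iter):
        betas = []
        for X, y in shards:
            p = X.shape[1]
            A = X.T @ X + lam * np.eye(p)
            b = X.T @ y + lam * beta_bar
            betas.append(np.linalg.solve(A, b))
        beta_bar = np.mean(betas, axis=0)
    return beta_bar
```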
Direction-Projection-Permutation for High Dimensional Hypothesis Tests
Motivated by the prevalence of high dimensional low sample size datasets in
modern statistical applications, we propose a general nonparametric framework,
Direction-Projection-Permutation (DiProPerm), for testing high dimensional
hypotheses. The method is aimed at rigorous testing of whether lower
dimensional visual differences are statistically significant. Theoretical
analysis under the non-classical asymptotic regime of dimension going to
infinity for fixed sample size reveals that certain natural variations of
DiProPerm can have very different behaviors. An empirical power study both
confirms the theoretical results and suggests DiProPerm is a powerful test in
many settings. Finally, DiProPerm is applied to a high dimensional gene
expression dataset.
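The direction-projection-permutation idea can be sketched concretely as follows. The mean-difference direction and the two-sample difference of projected means used here are illustrative choices; the paper considers several variants of both the direction (e.g., classifier-based directions such as DWD) and the univariate statistic.

```python
import numpy as np

def md_direction(X, labels):
    """Mean-difference direction: one simple binary linear classifier."""
    d = X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0)
    return d / np.linalg.norm(d)

def diproperm(X, labels, n_perm=1000, seed=0):
    """Direction-Projection-Permutation test (illustrative sketch).

    X: (n, p) data matrix, possibly with p >> n
    labels: binary array of length n
    Returns the observed statistic and a permutation p-value.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)

    def stat(lab):
        # Direction: fit a linear classifier to the labeled data
        w = md_direction(X, lab)
        # Projection: reduce each observation to a univariate score
        scores = X @ w
        # Univariate two-sample statistic on the projected scores
        return abs(scores[lab == 1].mean() - scores[lab == 0].mean())

    observed = stat(labels)
    # Permutation: refit the direction and recompute the statistic
    # under randomly relabeled data
    perm_stats = np.array([stat(rng.permutation(labels))
                           for _ in range(n_perm)])
    pval = (1 + np.sum(perm_stats >= observed)) / (1 + n_perm)
    return observed, pval
```

Note that the direction is refit on each permuted labeling, which is what keeps the test valid despite the direction being chosen to separate the two groups.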