A review of distributed statistical inference
The rapid emergence of massive datasets in various fields poses a serious
challenge to traditional statistical methods, while at the same time offering
researchers opportunities to develop novel algorithms. Inspired by the idea of
divide-and-conquer, various distributed frameworks for statistical estimation
and inference have been proposed to handle large-scale statistical
optimization problems. This paper provides a comprehensive review of the
related literature, covering parametric models, nonparametric models, and
other frequently used models, and summarizes their key ideas and theoretical
properties. The trade-off between communication cost and estimation precision,
together with other practical concerns, is also discussed.
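As a concrete illustration of the divide-and-conquer scheme such frameworks
build on, here is a minimal Python sketch for the simplest parametric case:
partition the sample, estimate locally on each partition, and average the
local estimates in one communication round. The OLS model and all names below
are illustrative assumptions of this sketch, not constructs from the review.

    import numpy as np

    def local_ols(X, y):
        # Ordinary least squares on a single partition (the local estimate).
        return np.linalg.lstsq(X, y, rcond=None)[0]

    def divide_and_conquer_ols(X, y, n_machines):
        # One-shot aggregation: split, estimate locally, average.
        parts = np.array_split(np.arange(len(y)), n_machines)
        return np.mean([local_ols(X[i], y[i]) for i in parts], axis=0)

    # Toy usage: the averaged estimator tracks the full-sample OLS fit.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((10_000, 5))
    y = X @ np.arange(1.0, 6.0) + rng.standard_normal(10_000)
    print(divide_and_conquer_ols(X, y, n_machines=20))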
Data-driven confidence bands for distributed nonparametric regression
Gaussian Process Regression and Kernel Ridge Regression are popular
nonparametric regression approaches. Unfortunately, they suffer from high
computational complexity, rendering them inapplicable to modern massive
datasets. To that end, a number of approximations have been suggested, some of
them allowing for a distributed implementation. One of them is the
divide-and-conquer approach: splitting the data into a number of partitions,
obtaining the local estimates, and finally averaging them. In this paper we
suggest a novel, computationally efficient, fully data-driven algorithm
quantifying the uncertainty of this method, yielding frequentist
$L_\infty$-confidence bands. We rigorously demonstrate the validity of the
algorithm. Another contribution of the paper is a minimax-optimal
high-probability bound for the averaged estimator, complementing and
generalizing the known risk bounds. (Comment: COLT 2020, to appear.)
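The averaged estimator whose uncertainty the paper quantifies can be sketched
in a few lines; the Gaussian kernel, bandwidth, and regularisation level below
are illustrative assumptions, and the paper's data-driven band construction is
not reproduced here.

    import numpy as np

    def krr_fit(X, y, bandwidth, lam):
        # Kernel ridge regression on one partition; returns a predictor.
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-sq / (2 * bandwidth ** 2))
        alpha = np.linalg.solve(K + lam * len(y) * np.eye(len(y)), y)
        def predict(X_new):
            sq_new = ((X_new[:, None, :] - X[None, :, :]) ** 2).sum(-1)
            return np.exp(-sq_new / (2 * bandwidth ** 2)) @ alpha
        return predict

    def averaged_krr(X, y, n_parts, bandwidth=0.5, lam=1e-3):
        # Divide and conquer: local KRR fits, averaged at prediction time.
        parts = np.array_split(np.arange(len(y)), n_parts)
        fits = [krr_fit(X[i], y[i], bandwidth, lam) for i in parts]
        return lambda X_new: np.mean([f(X_new) for f in fits], axis=0)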
On the optimality of misspecified spectral algorithms
In the misspecified spectral algorithms problem, researchers usually assume
that the underlying true function $f_\rho^*$ lies in $[\mathcal{H}]^s$, a
less-smooth interpolation space of a reproducing kernel Hilbert space (RKHS)
$\mathcal{H}$, for some $s \in (0,1)$. The existing minimax optimality results
require $s > \alpha_0$, where $\alpha_0$ is the embedding index, a constant
depending on $\mathcal{H}$. Whether the spectral algorithms are optimal for
all $s \in (0,1)$ has been an outstanding problem for years. In this paper, we
show that spectral algorithms are minimax optimal for any
$\alpha_0 - \frac{1}{\beta} < s < 1$, where $\beta$ is the eigenvalue decay
rate of $\mathcal{H}$. We also give several classes of RKHSs whose embedding
index satisfies $\alpha_0 = \frac{1}{\beta}$. Thus, the spectral algorithms
are minimax optimal for all $s \in (0,1)$ on these RKHSs. (Comment: 48 pages,
2 figures.)
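To make the family concrete: a spectral algorithm regularises kernel
regression by applying a filter function to the spectrum of the kernel
matrix. The sketch below implements one member of the family, the spectral
cut-off filter; the threshold and normalisation are illustrative assumptions,
and the paper's results concern the general filter class rather than this
particular code.

    import numpy as np

    def spectral_cutoff(K, y, threshold):
        # Keep eigen-directions of the normalised kernel matrix K/n whose
        # eigenvalue exceeds `threshold`; invert the operator there only.
        n = len(y)
        evals, evecs = np.linalg.eigh(K / n)
        keep = evals > threshold
        # Representer coefficients: f_hat(x) = sum_i alpha[i] * k(x, x_i).
        return (evecs[:, keep] / evals[keep]) @ (evecs[:, keep].T @ y) / n

Kernel ridge regression arises from the same template with the soft filter
$1/(\sigma + \lambda)$ in place of the hard cut-off.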
Optimal Statistical Rates for Decentralised Non-Parametric Regression with Linear Speed-Up
We analyse the learning performance of Distributed Gradient Descent in the
context of multi-agent decentralised non-parametric regression with the square
loss function when i.i.d. samples are assigned to agents. We show that if
agents hold sufficiently many samples with respect to the network size, then
Distributed Gradient Descent achieves optimal statistical rates with a number
of iterations that scales, up to a threshold, with the inverse of the spectral
gap of the gossip matrix divided by the number of samples owned by each agent
raised to a problem-dependent power. The threshold is statistical in nature:
it encodes the existence of a "big data" regime in which the number of
required iterations does not depend on the network topology. In this regime,
Distributed Gradient Descent achieves optimal statistical rates with the same
order of iterations as gradient descent run with all the samples in the
network. Provided the communication delay is sufficiently small, the
distributed protocol yields a linear speed-up in runtime compared to the
single-machine protocol. This is in contrast to decentralised optimisation
algorithms that do not exploit statistics and only yield a linear speed-up in
graphs where the spectral gap is bounded away from zero. Our results exploit
the statistical concentration of quantities held by agents and shed new light
on the interplay between statistics and communication in decentralised methods.
Bounds are given in the standard non-parametric setting under source and
capacity assumptions.
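A minimal sketch of the Distributed Gradient Descent protocol analysed here,
written for a plain linear least-squares model rather than the paper's
non-parametric setting; the doubly stochastic gossip matrix W, step size, and
iteration count are illustrative assumptions.

    import numpy as np

    def distributed_gradient_descent(Xs, ys, W, lr=0.1, n_iters=200):
        # Each agent i holds a local sample (Xs[i], ys[i]) and a parameter
        # row theta[i]. Per iteration: gossip-average parameters with the
        # neighbours via W, then take a local square-loss gradient step.
        theta = np.zeros((len(Xs), Xs[0].shape[1]))
        for _ in range(n_iters):
            grads = np.stack([X.T @ (X @ t - y) / len(y)
                              for X, y, t in zip(Xs, ys, theta)])
            theta = W @ theta - lr * grads
        return theta.mean(axis=0)

    # Illustrative gossip matrix for a 4-agent ring (symmetric and doubly
    # stochastic; its spectral gap governs the iteration count above).
    W = np.array([[0.50, 0.25, 0.00, 0.25],
                  [0.25, 0.50, 0.25, 0.00],
                  [0.00, 0.25, 0.50, 0.25],
                  [0.25, 0.00, 0.25, 0.50]])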