Improving Distributed Gradient Descent Using Reed-Solomon Codes
Today's massive datasets have made it necessary to perform
computations on them in a distributed manner. In principle, a computational
task is divided into subtasks which are distributed over a cluster operated by
a taskmaster. One issue faced in practice is the delay incurred due to the
presence of slow machines, known as \emph{stragglers}. Several schemes,
including those based on replication, have been proposed in the literature to
mitigate the effects of stragglers and more recently, those inspired by coding
theory have begun to gain traction. In this work, we consider a distributed
gradient descent setting suitable for a wide class of machine learning
problems. We adapt the framework of Tandon et al. (arXiv:1612.03301) and
present a deterministic scheme that, for a prescribed per-machine computational
effort, recovers the gradient from the least number of machines
theoretically permissible, via an efficient decoding algorithm. We also provide
a theoretical delay model which can be used to minimize the expected waiting
time per computation by optimally choosing the parameters of the scheme.
Finally, we supplement our theoretical findings with numerical results that
demonstrate the efficacy of the method and its advantages over competing
schemes.
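To make the setting concrete, here is a minimal sketch of the gradient coding framework of Tandon et al. that the paper builds on. It uses the simple fractional-repetition instance rather than the Reed-Solomon construction described above; the function name, sizes, and straggler model are illustrative assumptions.

```python
import numpy as np

def fractional_repetition_demo(n=6, s=2, dim=4, seed=0):
    """n workers tolerate s stragglers; requires n divisible by s + 1."""
    rng = np.random.default_rng(seed)
    assert n % (s + 1) == 0
    partial_grads = rng.normal(size=(n, dim))   # one partial gradient per data partition
    full_grad = partial_grads.sum(axis=0)       # what the master ultimately needs

    # Workers are split into s + 1 groups; within each group the n partitions
    # are shared out disjointly, s + 1 partitions per worker.
    group_size = n // (s + 1)
    assignments = []
    for g in range(s + 1):
        for w in range(group_size):
            assignments.append(list(range(w * (s + 1), (w + 1) * (s + 1))))

    # Each worker transmits a single coded message: the sum of its partial gradients.
    messages = [partial_grads[a].sum(axis=0) for a in assignments]

    # Any s workers may straggle; by pigeonhole at least one group is complete,
    # and the messages of a complete group add up to the full gradient.
    stragglers = set(rng.choice(n, size=s, replace=False).tolist())
    for g in range(s + 1):
        group = range(g * group_size, (g + 1) * group_size)
        if not any(w in stragglers for w in group):
            recovered = sum(messages[w] for w in group)
            break

    assert np.allclose(recovered, full_grad)
    return recovered
```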
Gradient Coding from Cyclic MDS Codes and Expander Graphs
Gradient coding is a technique for straggler mitigation in distributed
learning. In this paper, we first design novel gradient codes using tools from
classical coding theory, namely, cyclic MDS codes, which compare favorably with
existing solutions, both in the applicable range of parameters and in the
complexity of the involved algorithms. Second, we introduce an approximate
variant of the gradient coding problem, in which we settle for approximate
gradient computation instead of the exact one. This approach enables graceful
degradation, i.e., the error of the approximate gradient is a
decreasing function of the number of non-stragglers. Our main result is that
normalized adjacency matrices of expander graphs yield excellent approximate
gradient codes, which enable significantly less computation compared to exact
gradient coding, and guarantee faster convergence than trivial solutions under
standard assumptions. We experimentally test our approach on Amazon EC2, and
show that the generalization error of approximate gradient coding is very close
to that of exact gradient computation, while requiring significantly less
computation from the workers.
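As a rough illustration of the approximate variant, the sketch below encodes with the normalized adjacency matrix of a d-regular graph and lets the master simply average whatever arrives. A circulant graph stands in for the expander used in the paper, and all parameters are illustrative assumptions; with no stragglers the estimate is exact, and with a few stragglers it degrades gracefully.

```python
import numpy as np

def approx_gradient_coding_demo(n=20, d=4, s=3, dim=5, seed=1):
    """Each worker sends the average of the partial gradients of its d graph
    neighbours; the master averages whatever messages arrive."""
    rng = np.random.default_rng(seed)
    grads = rng.normal(size=(n, dim))
    target = grads.mean(axis=0)              # average gradient over all partitions

    # Circulant d-regular graph (a stand-in for an expander):
    # worker i is adjacent to i±1, ..., i±d/2 (mod n).
    offsets = [o for k in range(1, d // 2 + 1) for o in (k, -k)]
    neighbours = [[(i + o) % n for o in offsets] for i in range(n)]

    # Normalized-adjacency encoding: one coded message per worker.
    messages = np.stack([grads[nb].mean(axis=0) for nb in neighbours])

    # s stragglers never respond; the master averages the surviving messages.
    alive = np.setdiff1d(np.arange(n), rng.choice(n, size=s, replace=False))
    estimate = messages[alive].mean(axis=0)

    rel_err = np.linalg.norm(estimate - target) / np.linalg.norm(target)
    return rel_err
```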
Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy
We consider a scenario involving computations over a massive dataset stored
distributedly across multiple workers, which is at the core of distributed
learning algorithms. We propose Lagrange Coded Computing (LCC), a new framework
to simultaneously provide (1) resiliency against stragglers that may prolong
computations; (2) security against Byzantine (or malicious) workers that
deliberately modify the computation for their benefit; and (3)
(information-theoretic) privacy of the dataset amidst possible collusion of
workers. LCC, which leverages the well-known Lagrange polynomial to create
computation redundancy in a novel coded form across workers, can be applied to
any computation scenario in which the function of interest is an arbitrary
multivariate polynomial of the input dataset, hence covering many computations
of interest in machine learning. LCC significantly generalizes prior works to
go beyond linear computations. It also enables secure and private computing in
distributed settings, improving the computation and communication efficiency of
the state-of-the-art. Furthermore, we prove the optimality of LCC by showing
that it achieves the optimal tradeoff between resiliency, security, and
privacy, i.e., in terms of tolerating the maximum number of stragglers and
adversaries, and providing data privacy against the maximum number of colluding
workers. Finally, we show via experiments on Amazon EC2 that LCC speeds up the
conventional uncoded implementation of distributed least-squares linear
regression by up to , and also achieves a
- speedup over the state-of-the-art straggler
mitigation strategies.
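The core mechanism can be sketched in a few lines. The toy below works over the reals with an elementwise square as the degree-2 polynomial f, whereas the actual LCC scheme operates over a finite field and adds random evaluation points for privacy and security; chunk sizes, evaluation points, and function names here are illustrative assumptions.

```python
import numpy as np

def lagrange_eval(xs, ys, z):
    """Evaluate the polynomial interpolating (xs[j], ys[j]) at the point z.
    ys may be vector-valued (shape (len(xs), dim))."""
    total = np.zeros_like(ys[0], dtype=float)
    for j, xj in enumerate(xs):
        weight = np.prod([(z - xm) / (xj - xm) for m, xm in enumerate(xs) if m != j])
        total += weight * ys[j]
    return total

def lcc_demo(K=3, N=8, dim=4, seed=2):
    """f(X) = X**2 elementwise is a degree-2 polynomial, so f(u(.)) has degree
    2*(K-1) and any 2*(K-1)+1 responding workers suffice for recovery."""
    rng = np.random.default_rng(seed)
    chunks = rng.normal(size=(K, dim))            # the K data partitions X_1..X_K
    f = lambda x: x ** 2

    betas = np.arange(K, dtype=float)             # interpolation points carrying the data
    alphas = np.arange(K, K + N, dtype=float)     # distinct evaluation points, one per worker

    # Encoding: worker i receives the Lagrange encoding u(alpha_i) of the dataset,
    # where u is the degree-(K-1) polynomial with u(beta_k) = X_k.
    encoded = [lagrange_eval(betas, chunks, a) for a in alphas]

    # Each worker applies f to its encoded share.
    results = [f(e) for e in encoded]

    # Any deg(f)*(K-1)+1 responses let the master interpolate f(u(.)) and read
    # off f(X_k) at z = beta_k; the remaining workers may straggle.
    need = 2 * (K - 1) + 1
    alive = sorted(rng.choice(N, size=need, replace=False).tolist())
    xs = [alphas[i] for i in alive]
    ys = [results[i] for i in alive]
    recovered = np.stack([lagrange_eval(xs, ys, b) for b in betas])

    assert np.allclose(recovered, f(chunks), atol=1e-6)
    return recovered
```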
LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning
Gradient-based distributed learning in Parameter Server (PS) computing
architectures is subject to random delays due to straggling worker nodes, as
well as to possible communication bottlenecks between PS and workers. Solutions
have been recently proposed to separately address these impairments based on
the ideas of gradient coding, worker grouping, and adaptive worker selection.
This paper provides a unified analysis of these techniques in terms of
wall-clock time, communication, and computation complexity measures.
Furthermore, in order to combine the benefits of gradient coding and grouping
in terms of robustness to stragglers with the communication and computation
load gains of adaptive selection, novel strategies, named Lazily Aggregated
Gradient Coding (LAGC) and Grouped-LAG (G-LAG), are introduced. Analysis and
results show that G-LAG provides the best wall-clock time and communication
performance, while maintaining a low computational cost, for two representative
distributions of the computing times of the worker nodes.
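A minimal sketch of the lazy-aggregation ingredient (the LAG part) is given below: a worker uploads a fresh gradient only when it differs sufficiently from the copy the parameter server already holds, and the server otherwise reuses the stale copy. The threshold rule, names, and shapes are illustrative assumptions, not the paper's exact selection criterion.

```python
import numpy as np

def lag_round(worker_grads, last_sent, threshold=0.1):
    """One communication round of lazily aggregated gradients.

    worker_grads: list of fresh local gradients, one per worker.
    last_sent:    list of the copies the parameter server currently holds
                  (updated in place whenever a worker communicates).
    """
    aggregated = np.zeros_like(worker_grads[0])
    uploads = 0
    for i, g in enumerate(worker_grads):
        # Upload only if the fresh gradient has moved enough since the last upload.
        if np.linalg.norm(g - last_sent[i]) > threshold:
            last_sent[i] = g.copy()
            uploads += 1
        aggregated += last_sent[i]    # server sums fresh and reused (stale) copies
    return aggregated, uploads
```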