Age-Based Coded Computation for Bias Reduction in Distributed Learning
Coded computation can be used to speed up distributed learning in the
presence of straggling workers. Partial recovery of the gradient vector can
further reduce the computation time at each iteration; however, this can result
in biased estimators, which may slow down convergence, or even cause
divergence. Estimator bias will be particularly prevalent when the straggling
behavior is correlated over time, which results in the gradient estimators
being dominated by a few fast servers. To mitigate this bias, we design
a dynamic encoding framework for partial recovery that includes an
ordering operator that changes the codewords and computation orders at workers
over time. To regulate the recovery frequencies, we adopt an age metric in
the design of the dynamic encoding scheme. We show through numerical results
that the proposed dynamic encoding strategy increases the timeliness of the
recovered computations, which, as a result, reduces the bias in model updates
and accelerates convergence compared to conventional static partial recovery
schemes.
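As a rough illustration of the ordering idea, the following minimal Python sketch (invented names; it uses one shared schedule per iteration rather than the paper's per-worker codeword design) sorts subtasks so that the stalest partial computations are attempted first:

```python
import numpy as np

# Minimal sketch of age-based ordering for partial recovery.
# num_workers, num_partitions, and the straggling model are illustrative.
num_workers, num_partitions = 4, 8
rng = np.random.default_rng(0)

# age[i] = iterations since partition i's computation was last recovered
age = np.zeros(num_partitions, dtype=int)

for iteration in range(5):
    order = np.argsort(-age)      # stalest partitions are computed first
    recovered = set()
    for _ in range(num_workers):
        # Each worker finishes a random prefix of the schedule (straggling).
        done = rng.integers(1, num_partitions + 1)
        recovered.update(order[:done].tolist())
    age += 1                      # every partition ages by one iteration
    age[list(recovered)] = 0      # recovered partitions become fresh again
    print(iteration, age)
```

Because stale partitions move to the front of the schedule, no partition's estimate is starved by a few persistently fast servers.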
Coded Distributed Computing with Partial Recovery
Coded computation techniques provide robustness against straggling workers in
distributed computing. However, most of the existing schemes require exact
provisioning of the straggling behaviour and ignore the computations carried
out by straggling workers. Moreover, these schemes are typically designed to
recover the desired computation results accurately, while in many machine
learning and iterative optimization algorithms, faster approximate solutions
are known to result in an improvement in the overall convergence time. In this
paper, we first introduce a novel coded matrix-vector multiplication scheme,
called coded computation with partial recovery (CCPR), which benefits from the
advantages of both coded and uncoded computation schemes, and reduces both the
computation time and the decoding complexity by allowing a trade-off between
the accuracy and the speed of computation. We then extend this approach to
distributed implementation of more general computation tasks by proposing a
coded communication scheme with partial recovery, where the results of subtasks
computed by the workers are coded before being communicated. Numerical
simulations on a large linear regression task confirm the benefits of the
proposed distributed computation scheme with partial recovery in terms of the
trade-off between the computation accuracy and latency.
Comment: Accepted to IEEE Transactions on Information Theory. Presented in
part at the 2019 IEEE International Conference on Acoustics, Speech and
Signal Processing in Brighton, U.K.
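A toy sketch of the partial-recovery flavor (a simple systematic parity code with invented sizes, not the actual CCPR construction): uncoded results are usable immediately, parities fill in missing blocks when their peers have arrived, and the master stops once a target fraction of y = Ax is known.

```python
import numpy as np

# Toy partial recovery for a coded matrix-vector product y = A @ x.
rng = np.random.default_rng(1)
k = 4                              # data blocks, plus 2 parity blocks
A = rng.standard_normal((8, 5))    # 8 rows split into k blocks of 2 rows
x = rng.standard_normal(5)
blocks = np.split(A, k)

# Systematic encoding: blocks 0..3 are uncoded, blocks 4..5 are parities.
coded = blocks + [blocks[0] + blocks[1], blocks[2] + blocks[3]]

recovered = {}
for idx in rng.permutation(len(coded)):    # results arrive in random order
    y_part = coded[idx] @ x
    if idx < k:
        recovered[idx] = y_part            # uncoded: directly usable
    elif idx == k and 0 in recovered:      # parity 4 covers blocks 0, 1
        recovered[1] = y_part - recovered[0]
    elif idx == k + 1 and 2 in recovered:  # parity 5 covers blocks 2, 3
        recovered[3] = y_part - recovered[2]
    if len(recovered) >= 3:                # accept 3/4 of the product
        break                              # (a full decoder would revisit
                                           # pending parities; omitted here)

print("recovered blocks of y:", sorted(recovered))
```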
On Gradient Coding with Partial Recovery
We consider a generalization of the recently proposed gradient coding
framework where a large dataset is divided across workers and each worker
transmits to a master node one or more linear combinations of the gradients
over the data subsets assigned to it. Unlike the conventional framework which
requires the master node to recover the sum of the gradients over all the data
subsets in the presence of straggler workers, we relax the goal of the
master node to computing the sum of at least a specified fraction of the
gradients. The broad goal of our work is to study the optimal computation and
communication load per worker for this approximate gradient coding framework.
We begin by deriving a lower bound on the computation load of any feasible
scheme and also propose a strategy which achieves this lower bound, albeit at
the cost of high communication load and a number of data partitions which can
be polynomial in the number of workers. We then restrict attention to
schemes which utilize a number of data partitions equal to the number of
workers and propose schemes based on cyclic assignment which have a lower
communication load. When each worker transmits a single linear combination,
we also prove lower bounds on the computation load of any scheme using this
many data partitions.
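A minimal sketch of the cyclic assignment pattern (worker count and per-worker computation load are illustrative values, not from the paper):

```python
# Cyclic assignment: n workers, n data partitions, computation load d.
n, d = 6, 2
assignment = {w: [(w + j) % n for j in range(d)] for w in range(n)}
for w, parts in assignment.items():
    print(f"worker {w}: partitions {parts}")

# Each worker transmits one linear combination of the d partial gradients
# it computes; with stragglers, the master can still recover the gradient
# sum over at least a fraction of the partitions instead of all n.
```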
Efficient Replication for Straggler Mitigation in Distributed Computing
Master-worker distributed computing systems use task replication in order to
mitigate the effect of slow workers, known as stragglers. Tasks are grouped
into batches and assigned to one or more workers for execution. We first
consider the case when the batches do not overlap and, using the results from
majorization theory, show that, for a general class of workers' service time
distributions, a balanced assignment of batches to workers minimizes the
average job compute time. We next show that this balanced assignment of
non-overlapping batches achieves lower average job compute time compared to the
overlapping schemes proposed in the literature. Furthermore, we derive the
optimum redundancy level as a function of the service time distribution at
workers. We show that the redundancy level that minimizes average job compute
time is not necessarily the same as the redundancy level that maximizes the
predictability of job compute time, and thus there exists a trade-off between
optimizing the two metrics. Finally, by running experiments on Google cluster
traces, we observe that redundancy can reduce the compute time of the jobs in
Google clusters by an order of magnitude, and that the optimum level of
redundancy depends on the distribution of tasks' service times.
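A small Monte Carlo sketch of this trade-off, assuming shifted-exponential service times and balanced, non-overlapping batches (all parameters and the batch-size scaling model are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n_workers, n_trials = 12, 10_000

def job_time(r, shift=1.0, rate=1.0):
    """r replicas per batch -> n_workers // r non-overlapping batches;
    each batch is r times larger, so per-worker time scales by r."""
    n_batches = n_workers // r
    t = shift + rng.exponential(1 / rate, size=(n_trials, n_batches, r))
    # A batch finishes with its fastest replica; the job waits for the
    # slowest batch.
    return (r * t).min(axis=2).max(axis=1)

for r in (1, 2, 3, 4, 6):
    times = job_time(r)
    print(f"r={r}: mean={times.mean():.2f}  std={times.std():.2f}")
```

Sweeping the replication level r makes it easy to see that the r minimizing the mean need not be the r minimizing the variance, which is exactly the trade-off described above.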
Approximate Gradient Coding with Optimal Decoding
In distributed optimization problems, a technique called gradient coding,
which involves replicating data points, has been used to mitigate the effect of
straggling machines. Recent work has studied approximate gradient coding, which
concerns coding schemes where the replication factor of the data is too low to
recover the full gradient exactly. Our work is motivated by the challenge of
creating approximate gradient coding schemes that simultaneously work well in
both the adversarial and stochastic models. To that end, we introduce novel
approximate gradient codes based on expander graphs, in which each machine
receives exactly two blocks of data points. We analyze the decoding error both
in the random and adversarial straggler setting, when optimal decoding
coefficients are used. We show that in the random setting, our schemes achieve
an error to the gradient that decays exponentially in the replication factor.
In the adversarial setting, the error is nearly a factor of two smaller than
any existing code with similar performance in the random setting. We show
convergence bounds both in the random and adversarial setting for gradient
descent under standard assumptions using our codes. In the random setting, our
convergence rate improves upon black-box bounds. In the adversarial setting, we
show that gradient descent can converge down to a noise floor that scales
linearly with the adversarial error to the gradient. We demonstrate empirically
that our schemes achieve near-optimal error in the random setting and converge
faster than algorithms which do not use the optimal decoding coefficients.
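The optimal-decoding step reduces to a least-squares problem. In the sketch below, a random pairing of blocks to machines merely stands in for the expander-based assignment, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n_blocks, n_machines = 10, 10
# Each machine holds exactly two blocks and returns their summed gradient.
pairs = [tuple(rng.choice(n_blocks, size=2, replace=False))
         for _ in range(n_machines)]
survivors = [m for m in range(n_machines) if rng.random() > 0.3]

# Column m of B marks the blocks covered by survivor m's message.
B = np.zeros((n_blocks, len(survivors)))
for col, m in enumerate(survivors):
    B[pairs[m], col] = 1.0

# Optimal coefficients w minimize ||B w - 1||_2; the decoded gradient
# sum_m w_m * (g_i + g_j) then best approximates sum_b g_b.
w, *_ = np.linalg.lstsq(B, np.ones(n_blocks), rcond=None)
print("residual decoding error:", np.linalg.norm(B @ w - 1))
```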
Slow and Stale Gradients Can Win the Race
Distributed Stochastic Gradient Descent (SGD) when run in a synchronous
manner, suffers from delays in runtime as it waits for the slowest workers
(stragglers). Asynchronous methods can alleviate stragglers, but cause gradient
staleness that can adversely affect the convergence error. In this work, we
present a novel theoretical characterization of the speedup offered by
asynchronous methods by analyzing the trade-off between the error in the
trained model and the actual training runtime (wall-clock time). The main novelty
in our work is that our runtime analysis considers random straggling delays,
which helps us design and compare distributed SGD algorithms that strike a
balance between straggling and staleness. We also provide a new error
convergence analysis of asynchronous SGD variants without bounded or
exponential delay assumptions. Finally, based on our theoretical
characterization of the error-runtime trade-off, we propose a method of
gradually varying synchronicity in distributed SGD and demonstrate its
performance on the CIFAR-10 dataset.
Comment: Some of the results have appeared in AISTATS 2018. This is an
extended version with additional results, in particular, an adaptive
synchronicity strategy called AdaSync. arXiv admin note: substantial text
overlap with arXiv:1803.0111
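A toy sketch of the underlying k-sync idea on a quadratic objective; the schedule that gradually raises k is a hand-picked guess in the spirit of AdaSync, not the paper's actual rule:

```python
import numpy as np

rng = np.random.default_rng(4)
n, dim, steps = 8, 5, 300
w = np.zeros(dim)
w_star = np.ones(dim)            # minimize the toy loss ||w - w_star||^2
wallclock = 0.0

for t in range(steps):
    k = min(n, 2 + t // 100)     # start at k=2, grow toward full sync
    delays = rng.exponential(1.0, size=n)
    wallclock += np.sort(delays)[k - 1]   # wait only for k-th fastest
    grads = np.stack([2 * (w - w_star) + rng.standard_normal(dim)
                      for _ in range(k)]) # k noisy gradient returns
    w -= 0.05 * grads.mean(axis=0)

print(f"final error {np.linalg.norm(w - w_star):.3f}, "
      f"wallclock {wallclock:.1f}")
```

Small k keeps the wall-clock cost per step low early on; raising k later reduces gradient noise when fine convergence matters.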
A Survey of Coded Distributed Computing
Distributed computing has become a common approach for large-scale
computation of tasks due to benefits such as high reliability, scalability,
computation speed, and cost-effectiveness. However, distributed computing faces
critical issues related to communication load and straggler effects. In
particular, computing nodes need to exchange intermediate results with each
other in order to calculate the final result, and this significantly increases
communication overheads. Furthermore, a distributed computing network may
include straggling nodes that intermittently run slower than the rest. This results in a
longer overall time needed to execute the computation tasks, thereby limiting
the performance of distributed computing. To address these issues, coded
distributed computing (CDC), i.e., a combination of coding theoretic techniques
and distributed computing, has been recently proposed as a promising solution.
Coding theoretic techniques have proved effective in WiFi and cellular systems
to deal with channel noise. Building on such techniques, CDC can
significantly reduce the communication load, alleviate the effects of
stragglers, and provide fault tolerance, privacy, and security. In this
survey, we first introduce the
fundamentals of CDC, followed by basic CDC schemes. Then, we review and analyze
a number of CDC approaches proposed to reduce the communication costs, mitigate
the straggler effects, and guarantee privacy and security. Furthermore, we
present and discuss applications of CDC in modern computer networks. Finally,
we highlight important challenges and promising research directions related
to CDC.
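The canonical CDC example is an (n, k) MDS-coded matrix-vector multiply: any k of the n worker results recover A @ x, so up to n - k stragglers are tolerated. A self-contained sketch with a standard Vandermonde code (sizes illustrative, not tied to any one scheme in the survey):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 5, 3
A = rng.standard_normal((6, 4))    # 6 rows -> k blocks of 2 rows each
x = rng.standard_normal(4)
blocks = np.split(A, k)

G = np.vander(np.arange(1, n + 1), k, increasing=True)  # n x k generator
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

alive = [0, 2, 4]                  # suppose only these workers respond
results = np.stack([coded[i] @ x for i in alive])
decoded = np.linalg.inv(G[alive]) @ results   # invert the 3x3 subcode
print(np.allclose(np.concatenate(decoded), A @ x))  # True
```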
Communication-Efficient Edge AI: Algorithms and Systems
Artificial intelligence (AI) has achieved remarkable breakthroughs in a wide
range of fields, ranging from speech processing, image classification to drug
discovery. This is driven by the explosive growth of data, advances in machine
learning (especially deep learning), and easy access to vastly powerful
computing resources. In particular, the wide-scale deployment of edge devices
(e.g., IoT devices) generates an unprecedented scale of data, which provides
the opportunity to derive accurate models and develop various intelligent
applications at the network edge. However, such enormous data cannot all be
sent from end devices to the cloud for processing, due to the varying channel
quality, traffic congestion and/or privacy concerns. By pushing inference and
training processes of AI models to edge nodes, edge AI has emerged as a
promising alternative. AI at the edge requires close cooperation among edge
devices, such as smartphones and smart vehicles, and edge servers at
wireless access points and base stations, which, however, incurs heavy
communication overheads. In this paper, we present a comprehensive survey of
the recent developments in various techniques for overcoming these
communication challenges. Specifically, we first identify key communication
challenges in edge AI systems. We then introduce communication-efficient
techniques, from both algorithmic and system perspectives for training and
inference tasks at the network edge. Potential future research directions are
also highlighted.
Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.
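As one representative algorithmic technique from this literature (not a method proposed in the survey itself), the sketch below implements top-k gradient sparsification with error feedback; all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(6)
dim, k = 1000, 50                  # transmit only k of dim coordinates
residual = np.zeros(dim)           # error-feedback accumulator

def compress(grad):
    """Keep the k largest-magnitude entries; remember what was dropped."""
    global residual
    g = grad + residual                # re-inject previously dropped mass
    idx = np.argsort(np.abs(g))[-k:]   # indices of the largest entries
    residual = g.copy()
    residual[idx] = 0.0                # dropped part carries over
    return idx, g[idx]                 # ~k values instead of dim

idx, vals = compress(rng.standard_normal(dim))
print(f"sent {len(vals)} of {dim} coordinates")
```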