Efficient Straggler Replication in Large-scale Parallel Computing
In a cloud computing job with many parallel tasks, the tasks on the slowest
machines (straggling tasks) become the bottleneck in the job completion.
Computing frameworks such as MapReduce and Spark tackle this by replicating the
straggling tasks and waiting for any one copy to finish. Despite being adopted
in practice, there is little analysis of how replication affects the latency
and the cost of additional computing resources. In this paper we provide a
framework to analyze this latency-cost trade-off and find the best replication
strategy by answering design questions such as: 1) when to replicate straggling
tasks, 2) how many replicas to launch, and 3) whether to kill the original copy
or not. Our analysis reveals that for certain execution time distributions, a
small amount of task replication can drastically reduce both latency as well as
the cost of computing resources. We also propose an algorithm to estimate the
latency and cost based on the empirical distribution of task execution time.
Evaluations using samples in the Google Cluster Trace suggest further latency
and cost reduction compared to the existing replication strategy used in
MapReduce.
Comment: Submitted to ACM Transactions on Modeling and Performance Evaluation
of Computing Systems.
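As a rough illustration of the kind of latency-cost estimate described above, the sketch below simulates a simple single-fork policy over an empirical sample of task execution times: every task still running at a chosen forking time gets extra copies, and all copies of a task are cancelled once one finishes. The function name, the policy parameters, and the shifted-exponential sample are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def estimate_latency_cost(task_times, n_tasks=100, t_fork=1.0, n_replicas=1,
                          kill_original=False, n_trials=2000, seed=0):
    """Monte Carlo sketch: any task still running at t_fork gets n_replicas fresh
    copies, and all copies of a task are cancelled once one of them finishes.
    Returns (mean job latency, mean total machine-time cost)."""
    rng = np.random.default_rng(seed)
    latencies, costs = [], []
    for _ in range(n_trials):
        orig = rng.choice(task_times, size=n_tasks)                  # original copies
        reps = rng.choice(task_times, size=(n_tasks, n_replicas))    # potential replicas
        rep_finish = t_fork + reps.min(axis=1)                       # earliest replica finish
        straggler = orig > t_fork                                    # tasks that get replicated
        if kill_original:
            finish = np.where(straggler, rep_finish, orig)
            orig_cost = np.where(straggler, t_fork, orig)            # original killed at t_fork
        else:
            finish = np.where(straggler, np.minimum(orig, rep_finish), orig)
            orig_cost = np.where(straggler, np.minimum(orig, finish), orig)
        rep_cost = np.where(straggler, n_replicas * (finish - t_fork), 0.0)
        latencies.append(finish.max())
        costs.append((orig_cost + rep_cost).sum())
    return float(np.mean(latencies)), float(np.mean(costs))

# Example with a shifted-exponential stand-in for an empirical task-time sample.
samples = 0.5 + np.random.default_rng(1).exponential(1.0, size=10000)
print(estimate_latency_cost(samples, t_fork=1.5, n_replicas=1, kill_original=False))
```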
Slack Squeeze Coded Computing for Adaptive Straggler Mitigation
While performing distributed computations in today's cloud-based platforms,
execution speed variations among compute nodes can significantly reduce the
performance and create bottlenecks like stragglers. Coded computation
techniques leverage coding theory to inject computational redundancy and
mitigate stragglers in distributed computations. In this paper, we propose a
dynamic workload distribution strategy for coded computation called Slack
Squeeze Coded Computation ($S^2C^2$). $S^2C^2$ squeezes the compute slack
(i.e., overhead) that is built into the coded computing frameworks by
efficiently assigning work for all fast and slow nodes according to their
speeds and without needing to re-distribute data. We implement an LSTM-based
speed prediction algorithm to predict speeds of compute nodes. We evaluate
$S^2C^2$ on linear algebraic algorithms, gradient descent, graph ranking, and
graph filtering algorithms. We demonstrate 19% to 39% reduction in total
computation latency using $S^2C^2$ compared to job replication and coded
computation. We further show how $S^2C^2$ can be applied beyond matrix-vector
multiplication.
Comment: 13 pages, SC 2019.
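A minimal sketch of the slack-squeezing idea as the abstract describes it: given predicted per-node speeds, coded work is apportioned in proportion to speed rather than uniformly, so slow nodes keep contributing instead of being written off as stragglers. The function name and the rows-per-second reading of "speed" are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def assign_work(total_rows, predicted_speeds):
    """Illustrative slack-squeeze assignment: split `total_rows` of coded work among
    workers in proportion to their predicted speeds (rows per second), so fast nodes
    receive more rows and slow nodes still receive a useful, smaller share."""
    speeds = np.asarray(predicted_speeds, dtype=float)
    rows = np.floor(total_rows * speeds / speeds.sum()).astype(int)
    rows[np.argmax(speeds)] += total_rows - rows.sum()   # hand the rounding remainder to the fastest node
    return rows

# Example: 4 workers, one predicted (e.g. by a speed model) to run at half speed.
print(assign_work(1200, [1.0, 1.0, 1.0, 0.5]))           # -> [345 342 342 171]
```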
Speeding Up Distributed Machine Learning Using Codes
Codes are widely used in many engineering applications to offer robustness
against noise. In large-scale systems there are several types of noise that can
affect the performance of distributed machine learning algorithms -- straggler
nodes, system failures, or communication bottlenecks -- but there has been
little interaction cutting across codes, machine learning, and distributed
systems. In this work, we provide theoretical insights on how coded solutions
can achieve significant gains compared to uncoded ones. We focus on two of the
most basic building blocks of distributed learning algorithms: matrix
multiplication and data shuffling. For matrix multiplication, we use codes to
alleviate the effect of stragglers, and show that if the number of homogeneous
workers is $n$, and the runtime of each subtask has an exponential tail, coded
computation can speed up distributed matrix multiplication by a factor of
$\log n$. For data shuffling, we use codes to reduce communication bottlenecks,
exploiting the excess in storage. We show that when a constant fraction $\alpha$
of the data matrix can be cached at each worker, and $n$ is the number
of workers, \emph{coded shuffling} reduces the communication cost by a factor
of $(\alpha + \frac{1}{n})\gamma(n)$ compared to uncoded shuffling, where
$\gamma(n)$ is the ratio of the cost of unicasting $n$ messages to $n$ users to
multicasting a common message (of the same size) to $n$ users. For instance,
$\gamma(n) \simeq n$ if multicasting a message to $n$ users is as cheap as
unicasting a message to one user. We also provide experiment results,
corroborating our theoretical gains of the coded algorithms.
Comment: This work is published in IEEE Transactions on Information Theory and
presented in part at the NIPS 2015 Workshop on Machine Learning Systems and
the IEEE ISIT 2016.
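To make the matrix-multiplication side concrete, here is a hedged sketch of the coded-computation pattern: the matrix is split into $k$ row-blocks, $n$ encoded blocks are handed to $n$ workers, and the product is recovered from whichever $k$ workers respond first, so the slowest $n-k$ workers never need to be waited for. The random Gaussian encoding matrix is a stand-in for an MDS code and the function is illustrative, not the paper's construction.

```python
import numpy as np

def coded_matvec(A, x, n=6, k=4, seed=0):
    """Sketch of coded matrix-vector multiplication: split A into k row-blocks,
    give worker i the coded block sum_j G[i, j] * A_j, have it compute block @ x,
    and recover A @ x from the first k workers to respond (any k rows of a random
    G are invertible with high probability)."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(A, k, axis=0)
    sizes = [b.shape[0] for b in blocks]
    rows = max(sizes)
    padded = [np.vstack([b, np.zeros((rows - b.shape[0], A.shape[1]))]) for b in blocks]
    G = rng.standard_normal((n, k))                              # stand-in encoding matrix
    worker_out = [sum(G[i, j] * padded[j] for j in range(k)) @ x for i in range(n)]

    survivors = sorted(rng.choice(n, size=k, replace=False))     # pretend these finished first
    decoded = np.linalg.solve(G[survivors, :], np.stack([worker_out[i] for i in survivors]))
    return np.concatenate([decoded[j][:sizes[j]] for j in range(k)])

A, x = np.random.randn(20, 8), np.random.randn(8)
print(np.max(np.abs(coded_matvec(A, x) - A @ x)))                # exact up to round-off
```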
Gradient Coding Based on Block Designs for Mitigating Adversarial Stragglers
Distributed implementations of gradient-based methods, wherein a server
distributes gradient computations across worker machines, suffer from slow
running machines, called 'stragglers'. Gradient coding is a coding-theoretic
framework to mitigate stragglers by enabling the server to recover the gradient
sum in the presence of stragglers. 'Approximate gradient codes' are variants of
gradient codes that reduce computation and storage overhead per worker by
allowing the server to approximately reconstruct the gradient sum.
In this work, our goal is to construct approximate gradient codes that are
resilient to stragglers selected by a computationally unbounded adversary. Our
motivation for constructing codes to mitigate adversarial stragglers stems from
the challenge of tackling stragglers in massive-scale elastic and serverless
systems, wherein it is difficult to statistically model stragglers. Towards
this end, we propose a class of approximate gradient codes based on balanced
incomplete block designs (BIBDs). We show that the approximation error for
these codes depends only on the number of stragglers, and thus, adversarial
straggler selection has no advantage over random selection. In addition, the
proposed codes admit computationally efficient decoding at the server. Next, to
characterize fundamental limits of adversarial straggling, we consider the
notion of 'adversarial threshold' -- the smallest number of workers that an
adversary must straggle to inflict a certain approximation error. We compute a
lower bound on the adversarial threshold, and show that codes based on
symmetric BIBDs maximize this lower bound among a wide class of codes, making
them excellent candidates for mitigating adversarial stragglers.Comment: Shorter version accepted in 2019 IEEE International Symposium on
Information Theory (ISIT
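As a toy instance of the construction's flavor, the sketch below assigns data partitions to workers using the Fano plane, the symmetric (7, 3, 1)-BIBD, and approximates the gradient sum from whichever workers survive. The rescaling decoder is a deliberate simplification for illustration, not the paper's decoder, and the Fano plane is only one example of a symmetric BIBD.

```python
import numpy as np

# The Fano plane, a symmetric (7, 3, 1)-BIBD: block i lists the data partitions
# assigned to worker i (every pair of partitions shares exactly one worker).
FANO = [(0, 1, 3), (1, 2, 4), (2, 3, 5), (3, 4, 6), (4, 5, 0), (5, 6, 1), (6, 0, 2)]

def approx_gradient_sum(partition_grads, stragglers):
    """Worker i returns the sum of partial gradients over its BIBD block; the server
    rescales the sum of the surviving workers' outputs to approximate the full
    gradient sum. The rescaling decoder is an illustrative simplification."""
    v, k = len(FANO), len(FANO[0])
    worker_out = [sum(partition_grads[j] for j in block) for block in FANO]
    survivors = [i for i in range(v) if i not in stragglers]
    raw = sum(worker_out[i] for i in survivors)
    return raw * v / (k * len(survivors))      # average counted partial, times #partitions

grads = [np.random.default_rng(i).standard_normal(3) for i in range(7)]
print(np.linalg.norm(approx_gradient_sum(grads, stragglers={0, 3}) - sum(grads)))
```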
Collage Inference: Using Coded Redundancy for Low Variance Distributed Image Classification
MLaaS (ML-as-a-Service) offerings by cloud computing platforms are becoming
increasingly popular. Hosting pre-trained machine learning models in the cloud
enables elastic scalability as the demand grows. But providing low latency and
reducing the latency variance is a key requirement. Variance is harder to
control in a cloud deployment due to uncertainties in resource allocations
across many virtual instances. We propose the collage inference technique which
uses a novel convolutional neural network model, collage-cnn, to provide
low-cost redundancy. A collage-cnn model takes a collage image formed by
combining multiple images and performs multi-image classification in one shot,
albeit at slightly lower accuracy. We augment a collection of traditional
single image classifier models with a single collage-cnn classifier which acts
as their low-cost redundant backup. Collage-cnn provides backup classification
results if any single image classification requests experience slowdown.
Deploying the collage-cnn models in the cloud, we demonstrate that the 99th
percentile tail latency of inference can be reduced by 1.2x to 2x compared to
replication based approaches while providing high accuracy. Variation in
inference latency can be reduced by 1.8x to 15x.
Comment: 10 pages, Under submission.
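A minimal sketch of the redundancy pattern the abstract describes: each image goes to its own single-image classifier, one backup model classifies a collage of all of them, and any request that misses a latency deadline is answered from its collage slot. The predictor callables, the deadline handling, and the side-by-side "collage" are stand-ins for illustration; only the overall backup pattern follows the abstract.

```python
import numpy as np

def collage_inference(images, single_predict, collage_predict, deadline_s=0.05):
    """Dispatch each image to a single-image classifier and, in parallel, classify one
    collage built from all the images; at the deadline, fill any slow result from the
    collage model's per-slot prediction. (A real system would issue the single-image
    requests asynchronously; here the stub itself reports its latency.)"""
    n = len(images)
    collage = np.concatenate(images, axis=1)              # toy collage: tile images side by side
    backup = collage_predict(collage, n_slots=n)          # one label per collage slot
    results = []
    for i, img in enumerate(images):
        label, latency = single_predict(img)
        results.append(label if latency <= deadline_s else backup[i])
    return results

# Toy stubs so the sketch runs end to end.
rng = np.random.default_rng(0)
imgs = [rng.random((32, 32)) for _ in range(4)]
single_stub = lambda img: ("cat", float(rng.exponential(0.03)))      # (label, simulated latency)
collage_stub = lambda c, n_slots: ["cat (from collage backup)"] * n_slots
print(collage_inference(imgs, single_stub, collage_stub))
```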
Robust Gradient Descent via Moment Encoding with LDPC Codes
This paper considers the problem of implementing large-scale gradient descent
algorithms in a distributed computing setting in the presence of {\em
straggling} processors. To mitigate the effect of the stragglers, it has been
previously proposed to encode the data with an erasure-correcting code and
decode at the master server at the end of the computation. We, instead, propose
to encode the second-moment of the data with a low density parity-check (LDPC)
code. The iterative decoding algorithms for LDPC codes have very low
computational overhead and the number of decoding iterations can be made to
automatically adjust with the number of stragglers in the system. We show that
for a random model for stragglers, the proposed moment encoding based gradient
descent method can be viewed as the stochastic gradient descent method. This
allows us to obtain convergence guarantees for the proposed solution.
Furthermore, the proposed moment encoding based method is shown to outperform
the existing schemes in a real distributed computing setup.
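The construction rests on the fact that, for least squares, the gradient touches the data only through its first and second moments, so redundancy can be added to the moments once, up front. A brief statement of that identity in standard notation; $S$ below is a generic encoding matrix standing in for the LDPC-based encoder:

```latex
% Least-squares objective and its gradient written via the data moments:
f(w) = \tfrac{1}{2}\lVert Xw - y \rVert^2, \qquad
\nabla f(w) = X^{\top}X\,w - X^{\top}y = M w - b,
\qquad M := X^{\top}X, \quad b := X^{\top}y.
% Moment encoding distributes row-blocks of (S M \,|\, S b); worker i returns
% (S M)_i\,w - (S b)_i, and the master assembles an (approximate) gradient from
% whichever encoded rows arrive in time, with decoding effort that adjusts to
% the number of stragglers.
```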
Optimal Server Selection for Straggler Mitigation
The performance of large-scale distributed compute systems is adversely
impacted by stragglers when the execution time of a job is uncertain. To manage
stragglers, we consider a multi-fork approach for job scheduling, where
additional parallel servers are added at forking instants. In terms of the
forking instants and the number of additional servers, we compute the job
completion time and the cost of server utilization when the task processing
times are assumed to have a shifted exponential distribution. We use this study
to provide insights into the scheduling design of the forking instants and the
associated number of additional servers to be started. Numerical results
demonstrate orders of magnitude improvement in cost in the regime of low
completion times as compared to prior works.
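For the shifted-exponential model the abstract assumes, the no-forking baseline has a simple closed form that makes the comparison point concrete: with $n$ parallel tasks each taking $\Delta + \mathrm{Exp}(\mu)$, the expected job completion time is $\Delta + H_n/\mu$, with $H_n$ the $n$-th harmonic number. A small helper for that baseline (the function name is illustrative; the paper's multi-fork analysis is not reproduced here):

```python
import numpy as np

def expected_completion_no_fork(n, delta, mu):
    """Expected completion time of a job whose n parallel tasks have i.i.d.
    shifted-exponential runtimes Delta + Exp(mu): E[max of n tasks] = Delta + H_n / mu."""
    harmonic = np.sum(1.0 / np.arange(1, n + 1))
    return delta + harmonic / mu

# Baseline without forking; doubling n only adds about ln(2)/mu to the expectation.
print(expected_completion_no_fork(100, delta=1.0, mu=2.0))   # ~ 1.0 + 5.19 / 2 ~ 3.59
```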
Chronos: A Unifying Optimization Framework for Speculative Execution of Deadline-critical MapReduce Jobs
Meeting desired application deadlines in cloud processing systems such as
MapReduce is crucial as the nature of cloud applications is becoming
increasingly mission-critical and deadline-sensitive. It has been shown that
the execution times of MapReduce jobs are often adversely impacted by a few
slow tasks, known as stragglers, which result in high latency and deadline
violations. While a number of strategies have been developed in existing work
to mitigate stragglers by launching speculative or clone task attempts, none of
them provides a quantitative framework that optimizes the speculative execution
for offering guaranteed Service Level Agreements (SLAs) to meet application
deadlines. In this paper, we bring several speculative scheduling strategies
together under a unifying optimization framework, called Chronos, which defines
a new metric, Probability of Completion before Deadlines (PoCD), to measure the
probability that MapReduce jobs meet their desired deadlines. We systematically
analyze PoCD for popular strategies including Clone, Speculative-Restart, and
Speculative-Resume, and quantify their PoCD in closed-form. The result
illuminates an important tradeoff between PoCD and the cost of speculative
execution, measured by the total (virtual) machine time required under
different strategies. We propose an optimization problem to jointly optimize
PoCD and execution cost in different strategies, and develop an algorithmic
solution that is guaranteed to be optimal. Chronos is prototyped on Hadoop
MapReduce and evaluated against three baseline strategies using both
experiments and trace-driven simulations, achieving 50% net utility increase
with up to 80% PoCD and 88% cost improvements.
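To make the PoCD metric concrete, the sketch below estimates it by simulation for a Clone-style strategy: every task is launched with r copies up front, a task finishes when its fastest copy does, and the job meets the deadline only if all tasks do. This is an illustrative model with a user-supplied runtime sampler, not Chronos's closed-form analysis.

```python
import numpy as np

def pocd_clone(n_tasks, r, deadline, sample_runtime, n_trials=20000, seed=0):
    """Monte Carlo estimate of the Probability of Completion before Deadline (PoCD)
    and the mean machine-time cost for a Clone strategy: r copies per task, with all
    copies of a task cancelled as soon as one of them finishes."""
    rng = np.random.default_rng(seed)
    met, cost = 0, 0.0
    for _ in range(n_trials):
        copies = sample_runtime(rng, size=(n_tasks, r))
        task_finish = copies.min(axis=1)
        met += task_finish.max() <= deadline
        cost += (r * task_finish).sum()          # every copy runs until its task completes
    return met / n_trials, cost / n_trials

# Example: 50 tasks with shifted-exponential runtimes, 2 clones each, deadline of 4.0.
runtime = lambda rng, size: 1.0 + rng.exponential(1.0, size=size)
print(pocd_clone(50, r=2, deadline=4.0, sample_runtime=runtime))
```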
Communication-Computation Efficient Gradient Coding
This paper develops coding techniques to reduce the running time of
distributed learning tasks. It characterizes the fundamental tradeoff to
compute gradients (and more generally vector summations) in terms of three
parameters: computation load, straggler tolerance and communication cost. It
further gives an explicit coding scheme that achieves the optimal tradeoff
based on recursive polynomial constructions, coding both across data subsets
and vector components. As a result, the proposed scheme allows one to minimize the
running time for gradient computations. Implementations are made on Amazon EC2
clusters using Python with mpi4py package. Results show that the proposed
scheme maintains the same generalization error while reducing the running time
by compared to uncoded schemes and compared to prior coded
schemes focusing only on stragglers (Tandon et al., ICML 2017).
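For orientation, here is a hedged sketch of the simplest exact gradient code in the line of work this paper builds on (the fractional-repetition construction of Tandon et al., cited above): workers are split into groups of s+1, every worker in a group computes the same group-local sum of partial gradients, and the master recovers the exact gradient sum from any one survivor per group, tolerating any s stragglers. The paper's own scheme additionally codes across vector components to cut communication; that part is not reproduced here.

```python
import numpy as np

def frc_gradient_sum(partial_grads, n_workers, s, stragglers):
    """Fractional-repetition gradient code (baseline sketch): group g consists of
    workers g*(s+1), ..., g*(s+1)+s, all assigned the same data partitions, so one
    surviving worker per group suffices to rebuild the exact gradient sum."""
    assert n_workers % (s + 1) == 0
    n_groups = n_workers // (s + 1)
    data_groups = np.array_split(np.arange(len(partial_grads)), n_groups)
    worker_out = {w: sum(partial_grads[j] for j in data_groups[w // (s + 1)])
                  for w in range(n_workers) if w not in stragglers}
    total = 0.0
    for g in range(n_groups):
        survivor = next(w for w in range(g * (s + 1), (g + 1) * (s + 1)) if w in worker_out)
        total = total + worker_out[survivor]
    return total

grads = [np.random.default_rng(i).standard_normal(4) for i in range(12)]
# 6 workers, s = 1: any single straggler leaves one survivor per group, so recovery is exact.
print(np.linalg.norm(frc_gradient_sum(grads, n_workers=6, s=1, stragglers={3}) - sum(grads)))
```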
Straggler Mitigation in Distributed Optimization Through Data Encoding
Slow running or straggler tasks can significantly reduce computation speed in
distributed computation. Recently, coding-theory-inspired approaches have been
applied to mitigate the effect of straggling, through embedding redundancy in
certain linear computational steps of the optimization algorithm, thus
completing the computation without waiting for the stragglers. In this paper,
we propose an alternate approach where we embed the redundancy directly in the
data itself, and allow the computation to proceed completely oblivious to
encoding. We propose several encoding schemes, and demonstrate that popular
batch algorithms, such as gradient descent and L-BFGS, applied in a
coding-oblivious manner, deterministically achieve sample path linear
convergence to an approximate solution of the original problem, using an
arbitrarily varying subset of the nodes at each iteration. Moreover, this
approximation can be controlled by the amount of redundancy and the number of
nodes used in each iteration. We provide experimental results demonstrating the
advantage of the approach over uncoded and data replication strategies.
Comment: appeared at NIPS 2017.
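A minimal sketch of the coding-oblivious idea as the abstract describes it: the data (X, y) is encoded once as (S X, S y), the encoded rows are split across nodes, and plain gradient descent for least squares runs each iteration on whatever subset of nodes responds. The Gaussian choice of S, the node model, and the step-size handling are illustrative assumptions, not the paper's specific encoding schemes.

```python
import numpy as np

def encoded_gd(X, y, redundancy=2.0, n_nodes=8, lr=0.5, n_iters=300, p_straggle=0.3, seed=0):
    """Gradient descent on encoded data (S X, S y), oblivious to the encoding: each
    iteration averages local least-squares gradients from the nodes that responded,
    treating stragglers as simply absent. S is a random Gaussian matrix here."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = int(redundancy * n)
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # with redundancy > 1, (S X)^T (S X) ~ X^T X
    Xe, ye = S @ X, S @ y
    node_rows = np.array_split(np.arange(m), n_nodes)
    w = np.zeros(d)
    for _ in range(n_iters):
        alive = [i for i in range(n_nodes) if rng.random() > p_straggle] or [0]
        g, rows_used = np.zeros(d), 0
        for i in alive:                            # each responding node returns its local gradient
            r = node_rows[i]
            g += Xe[r].T @ (Xe[r] @ w - ye[r])
            rows_used += len(r)
        w -= lr * g / rows_used                    # average over the encoded rows that responded
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = X @ np.arange(5.0) + 0.1 * rng.standard_normal(200)
print(np.linalg.norm(encoded_gd(X, y) - np.linalg.lstsq(X, y, rcond=None)[0]))
```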