Slack Squeeze Coded Computing for Adaptive Straggler Mitigation
In today's cloud-based platforms, execution speed variations among compute nodes can significantly degrade performance and create straggler bottlenecks. Coded computation
techniques leverage coding theory to inject computational redundancy and
mitigate stragglers in distributed computations. In this paper, we propose a
dynamic workload distribution strategy for coded computation called Slack Squeeze Coded Computation. The scheme squeezes the compute slack (i.e., overhead) that is built into coded computing frameworks by efficiently assigning work to both fast and slow nodes according to their speeds, without needing to redistribute data. We implement an LSTM-based speed prediction algorithm to predict the speeds of compute nodes. We evaluate the scheme on linear algebraic algorithms, gradient descent, graph ranking, and graph filtering algorithms, demonstrating a 19% to 39% reduction in total computation latency compared to job replication and conventional coded computation. We further show how the scheme can be applied beyond matrix-vector multiplication.
Comment: 13 pages, SC 201
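The core scheduling idea, assigning each node a share of the rows in proportion to its predicted speed rather than an equal share, can be sketched in a few lines of Python. This is a minimal illustration under simplified assumptions, not the authors' implementation; the worker count and predicted speeds are made up.

import numpy as np

def proportional_split(num_rows, predicted_speeds):
    """Split num_rows rows of work across workers in proportion to their
    predicted speeds, so faster nodes get more rows and no slack is wasted."""
    speeds = np.asarray(predicted_speeds, dtype=float)
    counts = np.floor(speeds / speeds.sum() * num_rows).astype(int)
    counts[-1] += num_rows - counts.sum()   # hand the rounding remainder to the last worker
    return np.split(np.arange(num_rows), np.cumsum(counts)[:-1])

# Toy usage: four workers, one predicted to run at half speed.
assignments = proportional_split(1000, predicted_speeds=[1.0, 1.0, 1.0, 0.5])
print([len(a) for a in assignments])        # [285, 285, 285, 145]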
Straggler Mitigation by Delayed Relaunch of Tasks
Redundancy for straggler mitigation, originally in data download and more recently in the distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant attention, and numerous papers have studied the pain and gain of redundancy under various service models and assumptions on the straggler characteristics. Here we present a cost (pain) vs. latency (gain) analysis of
using simple replication or erasure coding for straggler mitigation in
executing jobs with many tasks. We quantify the effect of the tail of task
execution times and discuss tail heaviness as a decisive parameter for the cost
and latency of using redundancy. Specifically, we find that coded redundancy achieves a better cost vs. latency tradeoff than simple replication and can yield a reduction in both cost and latency under less heavy-tailed execution times. We show that delaying redundancy alone is not effective in reducing cost, while delayed relaunch of stragglers can yield a significant reduction in both cost and latency. We validate these observations with simulations that use empirical distributions extracted from Google cluster data.
Comment: Accepted for IFIP WG 7.3 Performance 2017, Nov. 14-16, 2017, New York, NY, USA
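The delayed-relaunch policy can be illustrated with a quick Monte Carlo sketch: every task still running after a fixed delay is killed and restarted, and the job latency (set by the slowest task) and total machine-time cost are compared against never relaunching. The Pareto runtimes and all parameters below are illustrative stand-ins, not the Google-trace distributions used in the paper.

import numpy as np

rng = np.random.default_rng(0)

def simulate_relaunch(n_tasks=100, delay=2.0, trials=2000):
    """Monte Carlo estimate of job latency (gain) and total machine time
    (cost, pain) when every task still running at time `delay` is killed
    and relaunched from scratch."""
    latency, cost = [], []
    for _ in range(trials):
        t_orig = 1.0 + rng.pareto(2.0, n_tasks)   # original task runtimes
        t_new = 1.0 + rng.pareto(2.0, n_tasks)    # runtimes of relaunched copies
        finish = np.where(t_orig <= delay, t_orig, delay + t_new)
        latency.append(finish.max())              # the job waits for its slowest task
        cost.append(finish.sum())                 # machine time paid across all tasks
    return np.mean(latency), np.mean(cost)

print("relaunch stragglers at t=2:", simulate_relaunch(delay=2.0))
print("never relaunch:            ", simulate_relaunch(delay=np.inf))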
Efficient Straggler Replication in Large-scale Parallel Computing
In a cloud computing job with many parallel tasks, the tasks on the slowest
machines (straggling tasks) become the bottleneck in the job completion.
Computing frameworks such as MapReduce and Spark tackle this by replicating the
straggling tasks and waiting for any one copy to finish. Although replication is widely adopted in practice, there is little analysis of how it affects the latency and the cost of additional computing resources. In this paper, we provide a
framework to analyze this latency-cost trade-off and find the best replication
strategy by answering design questions such as: 1) when to replicate straggling
tasks, 2) how many replicas to launch, and 3) whether to kill the original copy
or not. Our analysis reveals that for certain execution time distributions, a
small amount of task replication can drastically reduce both latency as well as
the cost of computing resources. We also propose an algorithm to estimate the
latency and cost based on the empirical distribution of task execution time.
Evaluations using samples from the Google Cluster Trace suggest further latency and cost reductions compared to the existing replication strategy used in MapReduce.
Comment: Submitted to ACM Transactions on Modeling and Performance Evaluation of Computing Systems
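In the spirit of the estimation algorithm mentioned above, the latency-cost trade-off of a simple single-fork policy (one extra replica for every task still running at a chosen fork time, with or without killing the original copy) can be estimated by resampling an empirical runtime distribution. The sketch below is an illustration under simplified assumptions, not the paper's algorithm, and its synthetic 'empirical' samples are made up.

import numpy as np

rng = np.random.default_rng(1)

def estimate_latency_cost(samples, n_tasks=100, fork_time=3.0,
                          kill_original=False, trials=1000):
    """Bootstrap estimate of job latency and computing cost for a simple
    single-fork policy: any task still running at `fork_time` gets one extra
    replica, drawn i.i.d. from the empirical runtime `samples`; redundant
    copies are cancelled as soon as one copy of the task finishes."""
    latencies, costs = [], []
    for _ in range(trials):
        t = rng.choice(samples, n_tasks)          # original runtimes
        r = rng.choice(samples, n_tasks)          # replica runtimes
        straggler = t > fork_time
        if kill_original:                         # original killed when the replica starts
            finish = np.where(straggler, fork_time + r, t)
            spent = finish
        else:                                     # both copies run until one finishes
            finish = np.where(straggler, np.minimum(t, fork_time + r), t)
            spent = np.where(straggler, 2 * finish - fork_time, t)
        latencies.append(finish.max())
        costs.append(spent.sum())
    return np.mean(latencies), np.mean(costs)

# Toy usage with a synthetic stand-in for the empirical distribution.
empirical = 1.0 + rng.exponential(2.0, 10_000)
print("keep original:", estimate_latency_cost(empirical))
print("kill original:", estimate_latency_cost(empirical, kill_original=True))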
Speeding Up Distributed Machine Learning Using Codes
Codes are widely used in many engineering applications to offer robustness
against noise. In large-scale systems there are several types of noise that can
affect the performance of distributed machine learning algorithms -- straggler
nodes, system failures, or communication bottlenecks -- but there has been
little interaction cutting across codes, machine learning, and distributed
systems. In this work, we provide theoretical insights on how coded solutions
can achieve significant gains compared to uncoded ones. We focus on two of the
most basic building blocks of distributed learning algorithms: matrix
multiplication and data shuffling. For matrix multiplication, we use codes to
alleviate the effect of stragglers, and show that if the number of homogeneous workers is $n$, and the runtime of each subtask has an exponential tail, coded computation can speed up distributed matrix multiplication by a factor of $\log n$. For data shuffling, we use codes to reduce communication bottlenecks, exploiting the excess in storage. We show that when a constant fraction $\alpha$ of the data matrix can be cached at each worker, and $n$ is the number of workers, \emph{coded shuffling} reduces the communication cost by a factor of $(\alpha + \frac{1}{n})\gamma(n)$ compared to uncoded shuffling, where $\gamma(n)$ is the ratio of the cost of unicasting $n$ messages to $n$ users to multicasting a common message (of the same size) to $n$ users. For instance, $\gamma(n) \simeq n$ if multicasting a message to $n$ users is as cheap as unicasting a message to one user. We also provide experiment results,
corroborating our theoretical gains of the coded algorithms.
Comment: This work is published in IEEE Transactions on Information Theory and presented in part at the NIPS 2015 Workshop on Machine Learning Systems and the IEEE ISIT 201
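The coded matrix multiplication idea is easiest to see in its smallest instance: split A into two row blocks and add one parity block, giving an (n, k) = (3, 2) code from which A x can be recovered using any two of the three worker results. A toy numerical sketch with made-up data:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)

# Encode: split A row-wise into k = 2 blocks and add one parity block,
# giving an (n, k) = (3, 2) code over the blocks.
A1, A2 = A[:3], A[3:]
tasks = {1: A1, 2: A2, 3: A1 + A2}

# Each worker multiplies its (coded) block by x; suppose worker 2 straggles,
# so only the results of workers 1 and 3 arrive.
results = {w: tasks[w] @ x for w in (1, 3)}

# Decode: A2 @ x = (A1 + A2) @ x - A1 @ x, then assemble the full product.
y = np.concatenate([results[1], results[3] - results[1]])
assert np.allclose(y, A @ x)
print(y)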
Collage Inference: Using Coded Redundancy for Low Variance Distributed Image Classification
MLaaS (ML-as-a-Service) offerings by cloud computing platforms are becoming
increasingly popular. Hosting pre-trained machine learning models in the cloud
enables elastic scalability as demand grows. But providing low latency and reducing latency variance are key requirements. Variance is harder to control in a cloud deployment due to uncertainties in resource allocations
across many virtual instances. We propose the collage inference technique which
uses a novel convolutional neural network model, collage-cnn, to provide
low-cost redundancy. A collage-cnn model takes a collage image formed by
combining multiple images and performs multi-image classification in one shot,
albeit at slightly lower accuracy. We augment a collection of traditional
single-image classifier models with a single collage-cnn classifier that acts as their low-cost redundant backup. The collage-cnn provides backup classification results if any single-image classification request experiences a slowdown.
Deploying the collage-cnn models in the cloud, we demonstrate that the 99th
percentile tail latency of inference can be reduced by 1.2x to 2x compared to
replication-based approaches while providing high accuracy. Variation in inference latency can be reduced by 1.8x to 15x.
Comment: 10 pages, Under submission
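A rough sketch of the two ingredients described above: tiling several images into one collage that a multi-image classifier can process in one shot, and falling back to the collage prediction for any single-image request that misses its latency deadline. The grid size, image shapes, and labels below are hypothetical, and the collage-cnn model itself is not shown.

import numpy as np

def make_collage(images, grid=(2, 2)):
    """Tile grid[0] * grid[1] equally sized (H, W, C) images into one collage
    image, the input for a multi-image (collage) classifier."""
    rows, cols = grid
    assert len(images) == rows * cols
    return np.concatenate(
        [np.concatenate(images[r * cols:(r + 1) * cols], axis=1)
         for r in range(rows)], axis=0)

def classify_with_backup(single_preds, collage_preds, deadline_missed):
    """Use the collage prediction for any image whose single-image request
    missed its latency deadline (i.e., straggled)."""
    return [collage_preds[i] if deadline_missed[i] else single_preds[i]
            for i in range(len(single_preds))]

# Toy usage: four 32x32 RGB images; request #2 straggles.
imgs = [np.full((32, 32, 3), i, dtype=np.uint8) for i in range(4)]
print(make_collage(imgs).shape)                     # (64, 64, 3) backup input
print(classify_with_backup(['cat', 'dog', None, 'car'],
                           ['cat', 'dog', 'ship', 'car'],
                           deadline_missed=[False, False, True, False]))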
Effective Straggler Mitigation: Which Clones Should Attack and When?
Redundancy for straggler mitigation, originally in data download and more recently in the distributed computing context, has been shown to be effective both in theory and practice. Analysis of systems with redundancy has drawn significant attention, and numerous papers have studied the pain and gain of redundancy under various service models and assumptions on the straggler characteristics. Here we present a cost (pain) vs. latency (gain) analysis of
using simple replication or erasure coding for straggler mitigation in
executing jobs with many tasks. We quantify the effect of the tail of task
execution times and discuss tail heaviness as a decisive parameter for the cost
and latency of using redundancy. Specifically, we find that coded redundancy achieves a better cost vs. latency tradeoff and a larger achievable latency-cost region than replication, and can yield a reduction in both cost and latency under less heavy-tailed execution times. We show that delaying redundancy is not effective in reducing cost.
Comment: Published at MAMA Workshop in conjunction with ACM Sigmetrics, June 5, 201
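The replication-versus-coding comparison can be mimicked with a small simulation: either run two copies of every task and cancel the extra copy when one finishes, or spend the same redundancy on parity tasks and stop once any k of the coded tasks complete, cancelling the rest. The Pareto runtimes and job size are illustrative, not the paper's model.

import numpy as np

rng = np.random.default_rng(2)

def replication(k=20, trials=5000):
    """Two copies of each of the k tasks; the extra copy is cancelled as soon
    as one copy finishes. Returns (mean job latency, mean machine time)."""
    t = 1.0 + rng.pareto(1.5, (trials, k, 2))      # heavy-tailed task runtimes
    first = t.min(axis=2)                          # each task ends when its faster copy does
    return first.max(axis=1).mean(), (2 * first).sum(axis=1).mean()

def coded(k=20, trials=5000):
    """The same redundancy spent on k extra coded (parity) tasks: the job is
    done once any k of the 2k tasks finish; the rest are cancelled then."""
    t = np.sort(1.0 + rng.pareto(1.5, (trials, 2 * k)), axis=1)
    latency = t[:, k - 1]                          # time when the k-th task finishes
    cost = t[:, :k].sum(axis=1) + k * latency      # finished work + cancelled stragglers
    return latency.mean(), cost.mean()

print("replication:", replication())
print("coding:     ", coded())

Which scheme wins, and by how much, depends on the tail of the runtime distribution, which is exactly the dependence the analysis quantifies.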
Optimal Server Selection for Straggler Mitigation
The performance of large-scale distributed compute systems is adversely
impacted by stragglers when the execution time of a job is uncertain. To manage
stragglers, we consider a multi-fork approach for job scheduling, where
additional parallel servers are added at forking instants. In terms of the
forking instants and the number of additional servers, we compute the job
completion time and the cost of server utilization when the task processing
times are assumed to have a shifted exponential distribution. We use this study
to provide insights into the scheduling design of the forking instants and the
associated number of additional servers to be started. Numerical results
demonstrate orders of magnitude improvement in cost in the regime of low
completion times as compared to prior works.
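For reference, the shifted exponential model that underlies this kind of analysis has simple closed forms for the extremes of n copies launched in parallel; these are the standard building blocks of such completion-time and cost expressions, though the paper's own expressions additionally account for the forking instants and the number of servers added at each instant:

\[
\Pr[T > t] = e^{-\mu (t - \Delta)}, \qquad t \ge \Delta,
\]
\[
\mathbb{E}\Big[\min_{1 \le i \le n} T_i\Big] = \Delta + \frac{1}{n\mu},
\qquad
\mathbb{E}\Big[\max_{1 \le i \le n} T_i\Big] = \Delta + \frac{H_n}{\mu},
\qquad H_n = \sum_{j=1}^{n} \frac{1}{j},
\]

where $T_1, \dots, T_n$ are i.i.d. shifted exponential task times with shift $\Delta$ and rate $\mu$.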
"Short-Dot": Computing Large Linear Transforms Distributedly Using Coded Short Dot Products
Faced with saturation of Moore's law and increasing dimension of data, system
designers have increasingly resorted to parallel and distributed computing.
However, distributed computing is often bottlenecked by a small fraction of slow processors called "stragglers" that reduce the speed of computation
because the fusion node has to wait for all processors to finish. To combat the
effect of stragglers, recent literature introduces redundancy in computations
across processors, e.g.,~using repetition-based strategies or erasure codes.
The fusion node can exploit this redundancy by completing the computation using
outputs from only a subset of the processors, ignoring the stragglers. In this
paper, we propose a novel technique -- that we call "Short-Dot" -- to introduce
redundant computations in a coding theory inspired fashion, for computing
linear transforms of long vectors. Instead of computing long dot products as
required in the original linear transform, we construct a larger number of
redundant and short dot products that can be computed faster and more
efficiently at individual processors. In reference to comparable schemes that
introduce redundancy to tackle stragglers, Short-Dot reduces the cost of
computation, storage, and communication, since shorter portions are stored and computed at each processor, and shorter portions of the input are
communicated to each processor. We demonstrate through probabilistic analysis
as well as experiments that Short-Dot offers significant speed-up compared to
existing techniques. We also derive trade-offs between the length of the
dot-products and the resilience to stragglers (the number of processors to wait for) for any such strategy, and compare it to that achieved by our strategy.
Comment: Presented at NIPS 2016, Barcelona, Spain
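The trade-off that Short-Dot exploits, somewhat longer per-processor dot products than a plain split of the coordinates in exchange for only having to wait for K of the P processors, can be eyeballed with a toy Monte Carlo. The timing model (a fixed cost per stored entry plus a straggling delay that does not scale with length) and every constant below, including the 40% dot-product length, are illustrative assumptions rather than the paper's construction or analysis.

import numpy as np

rng = np.random.default_rng(3)

def wait_for_k(length_per_proc, n_proc, k, c=0.001, trials=20_000):
    """Expected time until k of n_proc processors return, assuming each spends
    c seconds per stored entry plus an i.i.d. exponential straggling delay that
    does not depend on the dot-product length (an illustrative model only)."""
    times = c * length_per_proc + rng.exponential(1.0, (trials, n_proc))
    return np.sort(times, axis=1)[:, k - 1].mean()

N, P = 1000, 10
# Uncoded split: each processor handles N/P coordinates; wait for all P.
uncoded = wait_for_k(N / P, P, k=P)
# Short-Dot style: longer coded pieces per processor (40% of N is only an
# illustrative value; the paper derives the achievable lengths), but the
# fusion node needs just any 7 of the 10 outputs.
coded = wait_for_k(0.4 * N, P, k=7)
print(f"uncoded split: {uncoded:.2f}   short coded dot products: {coded:.2f}")

With these toy numbers, ignoring the three slowest processors more than pays for the longer pieces; the paper's probabilistic analysis characterizes when this holds and how short the dot products can actually be made.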
Efficient Replication of Queued Tasks for Latency Reduction in Cloud Systems
In cloud computing systems, assigning a job to multiple servers and waiting
for the earliest copy to finish is an effective method to combat the
variability in response time of individual servers. Although adding redundant
replicas always reduces service time, the total computing time spent per job
may be higher, thus increasing waiting time in queue. The total time spent per
job is also proportional to the cost of computing resources. We analyze how different redundancy strategies, e.g., the number of replicas and the times when they are issued and canceled, affect the latency and computing cost. We get the
insight that the log-concavity of the service time distribution is a key factor
in determining whether adding redundancy reduces latency and cost. If the
service distribution is log-convex, then adding maximum redundancy reduces both latency and cost, whereas if it is log-concave, having fewer replicas and canceling the redundant requests early is more effective.
Comment: Presented at Allerton 2015. arXiv admin note: substantial text overlap with arXiv:1508.0359
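The role of log-concavity can be seen in a small simulation that sends each task to r servers and cancels the extra copies as soon as one finishes, using a shifted exponential as a stand-in for a log-concave service time and a hyperexponential mixture for a log-convex one (all parameters made up):

import numpy as np

rng = np.random.default_rng(4)

def latency_and_cost(sampler, r, trials=200_000):
    """Latency E[min of r copies] and cost r * E[min] when a task is sent to r
    servers and the extra copies are cancelled as soon as one finishes."""
    t = sampler((trials, r)).min(axis=1)
    return t.mean(), r * t.mean()

def shifted_exp(size):
    return 1.0 + rng.exponential(1.0, size)                 # log-concave service times

def hyper_exp(size):
    fast = rng.exponential(0.5, size)
    slow = rng.exponential(10.0, size)
    return np.where(rng.random(size) < 0.9, fast, slow)     # log-convex mixture

for r in (1, 2, 4):
    print("r =", r,
          " log-concave:", latency_and_cost(shifted_exp, r),
          " log-convex:", latency_and_cost(hyper_exp, r))

With these distributions, the cost r * E[min] grows with r in the log-concave case but shrinks in the log-convex case, matching the insight above.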
A Survey on Large Scale Metadata Server for Big Data Storage
Big Data is defined as a high volume and variety of data with an exponential growth rate. Data are amalgamated to generate revenue, which results in large data silos; data are the oil of the modern IT industry and are growing at an exponential pace. The access mechanism of these data silos is defined by metadata. The metadata are decoupled from the data servers for various reasons, for instance, ease of maintenance, and are stored in a metadata server (MDS). The study of the MDS is therefore essential in the design of a large-scale storage system. Many parameters shape the MDS architecture, which depends on the storage system's requirements, so MDSs are categorized in various ways according to the underlying architecture and design methodology. This article surveys the various kinds of MDS architectures, designs, and methodologies. It emphasizes clustered MDS (cMDS), and the reports are prepared based on a) Bloom filter-based MDS, b) Client-funded MDS, c) Geo-aware MDS, d) Cache-aware MDS, e) Load-aware MDS, f) Hash-based MDS, and g) Tree-based MDS. Additionally, the article presents the issues and challenges of MDS for mammoth-sized data.
Comment: Submitted to ACM for possible publication
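To give one of the surveyed designs a concrete flavour: in a Bloom filter-based MDS, each metadata server advertises a compact filter of the paths it owns, so a client or routing layer can test locally which server probably holds a path instead of querying every MDS. The sketch below is a generic illustration of that idea, not any particular system; the server names and paths are hypothetical.

import hashlib

class BloomFilter:
    """Tiny Bloom filter: a byte array with k hash probes per key."""
    def __init__(self, m=8192, k=4):
        self.m, self.k, self.bits = m, k, bytearray(m)

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def __contains__(self, key):
        return all(self.bits[p] for p in self._positions(key))

# Hypothetical usage: each metadata server advertises a filter of the paths it
# owns; a client probes the filters locally instead of querying every MDS.
filters = {"mds0": BloomFilter(), "mds1": BloomFilter()}
filters["mds0"].add("/data/logs/part-0001")
filters["mds1"].add("/data/images/cat.png")
candidates = [name for name, f in filters.items() if "/data/images/cat.png" in f]
print(candidates)   # most likely ['mds1']; Bloom filters allow false positives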