13 research outputs found
Straggler Resilient Serverless Computing Based on Polar Codes
We propose a serverless computing mechanism for distributed computation based
on polar codes. Serverless computing is an emerging cloud-based computation
model that lets users run their functions on the cloud without provisioning or
managing servers. Our proposed approach is a hybrid computing framework that
carries out computationally expensive tasks such as linear algebraic operations
involving large-scale data using serverless computing and does the rest of the
processing locally. We address the limitations and reliability issues of
serverless platforms such as straggling workers using coding theory, drawing
ideas from recent literature on coded computation. The proposed mechanism uses
polar codes to ensure straggler-resilience in a computationally effective
manner. We provide extensive evidence showing that polar codes outperform other coding methods in this setting. We have designed a sequential decoder specifically for polar codes in erasure channels with full-precision inputs and outputs. In addition,
we have extended the proposed method to the matrix multiplication case where
both matrices being multiplied are coded. The proposed coded computation scheme
is implemented for AWS Lambda. Experimental results are presented in which the performance of the proposed coded computation technique is tested on optimization via gradient descent. Finally, we introduce the idea of partial polarization, which reduces the computational burden of encoding and decoding at the expense of straggler resilience.
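To make the coded-computation pattern concrete, here is a minimal numpy sketch of the generic idea: encode row blocks of a matrix with the real-valued polar transform, let workers multiply their coded blocks by a vector, and recover from the non-stragglers. It is only an illustration: the information-position choice is ours, and a generic least-squares solve stands in for the paper's sequential decoder.

```python
import numpy as np

def polar_transform(n):
    """Kronecker power of the 2x2 polar kernel [[1, 0], [1, 1]] over the reals."""
    G = np.array([[1.0, 0.0], [1.0, 1.0]])
    T = np.array([[1.0]])
    for _ in range(n):
        T = np.kron(T, G)
    return T

rng = np.random.default_rng(0)
n, k = 3, 5
N = 2 ** n                                   # 8 coded tasks from 5 data blocks
T = polar_transform(n)
info = [2, 3, 5, 6, 7]                       # illustrative information positions
blocks = [rng.standard_normal((4, 6)) for _ in range(k)]   # row blocks of A
x = rng.standard_normal(6)

# Encode: zero-pad the "frozen" positions, then mix the blocks with T.
U = np.zeros((N, 4, 6))
for pos, B in zip(info, blocks):
    U[pos] = B
coded = np.tensordot(T, U, axes=1)           # coded[j] goes to worker j

# Each worker j returns coded[j] @ x; drop three stragglers.
results = {j: coded[j] @ x for j in range(N)}
for straggler in (0, 1, 4):
    results.pop(straggler)

# Decode: solve the surviving linear system for the data-block products.
rows = sorted(results)
M = T[np.ix_(rows, info)]
Y = np.stack([results[j] for j in rows])
recovered, *_ = np.linalg.lstsq(M, Y, rcond=None)
assert np.allclose(np.concatenate(recovered), np.concatenate(blocks) @ x)
```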
OverSketched Newton: Fast Convex Optimization for Serverless Systems
Motivated by recent developments in serverless systems for large-scale
computation as well as improvements in scalable randomized matrix algorithms,
we develop OverSketched Newton, a randomized Hessian-based optimization
algorithm to solve large-scale convex optimization problems in serverless
systems. OverSketched Newton leverages matrix sketching ideas from Randomized
Numerical Linear Algebra to compute the Hessian approximately. These sketching
methods lead to inbuilt resiliency against stragglers that are a characteristic
of serverless architectures. Depending on whether the problem is strongly
convex or not, we propose different iteration updates using the approximate
Hessian. For both cases, we establish convergence guarantees for OverSketched
Newton and empirically validate our results by solving large-scale supervised
learning problems on real-world datasets. Experiments demonstrate a reduction
of ~50% in total running time on AWS Lambda, compared to state-of-the-art
distributed optimization schemes.Comment: 37 pages, 12 figure
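The core sketched-Hessian idea can be illustrated on a single machine. The snippet below is a hedged sketch rather than the paper's serverless implementation: it uses a plain Gaussian sketch (one simple choice, not necessarily the scheme OverSketch uses) to approximate the Hessian of regularized least squares, and the sketch size and iteration count are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, s, lam = 2000, 50, 200, 1e-2
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

x = np.zeros(d)
for it in range(10):
    grad = A.T @ (A @ x - b) + lam * x            # exact gradient
    S = rng.standard_normal((s, n)) / np.sqrt(s)  # Gaussian sketch of A
    SA = S @ A
    H_approx = SA.T @ SA + lam * np.eye(d)        # sketched Hessian A^T S^T S A
    x -= np.linalg.solve(H_approx, grad)          # approximate Newton step
    print(it, 0.5 * np.linalg.norm(A @ x - b) ** 2 + 0.5 * lam * x @ x)
```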
An Application of Storage-Optimal MatDot Codes for Coded Matrix Multiplication: Fast k-Nearest Neighbors Estimation
We propose a novel application of coded computing to the problem of nearest neighbor estimation using MatDot codes [Fahim et al., 2017], which are known to be optimal for matrix multiplication in terms of recovery threshold under storage constraints. In approximate nearest neighbor algorithms, it is
common to construct efficient in-memory indexes to improve query response time.
One such strategy is Multiple Random Projection Trees (MRPT), which reduces the
set of candidate points over which Euclidean distance calculations are
performed. However, this may result in a high memory footprint and possibly
paging penalties for large or high-dimensional data. Here we propose two techniques to parallelize MRPT that exploit data and model parallelism, respectively, by dividing both the data storage and the computational effort
among different nodes in a distributed computing cluster. This is especially
critical when a single compute node cannot hold the complete dataset in memory.
We also propose a novel coded computation strategy based on MatDot codes for
the model-parallel architecture that, in a straggler-prone environment,
achieves the storage-optimal recovery threshold, i.e., the number of nodes that
are required to serve a query. We experimentally demonstrate that, in the
absence of straggling, our distributed approaches require less query time than
execution on a single processing node, providing near-linear speedups with
respect to the number of worker nodes. Through our experiments on real systems
with simulated straggling, we also show that our strategy achieves a faster
query execution than the uncoded strategy in a straggler-prone environment.
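The data-parallel route can be sketched in a few lines: shard the corpus across workers, take a local top-k per shard, and merge at the fusion node. The sketch below uses brute-force distances rather than the MRPT index and omits the MatDot-coded model-parallel variant; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((10_000, 32))      # corpus
q = rng.standard_normal(32)                # query
k, W = 5, 4                                # neighbors, workers

shards = np.array_split(np.arange(len(X)), W)
candidates = []
for ids in shards:                         # each loop body runs on one worker
    d = np.linalg.norm(X[ids] - q, axis=1)
    top = np.argsort(d)[:k]                # local top-k for this shard
    candidates += [(d[i], ids[i]) for i in top]

# Fusion node: merge the W*k candidates into the global top-k.
global_top = sorted(candidates)[:k]
exact = np.argsort(np.linalg.norm(X - q, axis=1))[:k]
assert [i for _, i in global_top] == list(exact)
```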
Factored LT and Factored Raptor Codes for Large-Scale Distributed Matrix Multiplication
We propose two coding schemes for distributed matrix multiplication in the
presence of stragglers. These coding schemes are adaptations of LT codes and
Raptor codes to distributed matrix multiplication and are termed \emph{factored
LT (FLT) codes} and \emph{factored Raptor (FR) codes}. Empirically, we show
that FLT codes have near-optimal recovery thresholds when the number of worker
nodes is very large, and that FR codes have excellent recovery thresholds when the number of worker nodes is moderately large. FLT and FR codes have better recovery thresholds than Product codes, are expected to have better numerical stability than Polynomial codes, and can be decoded with a low-complexity decoding algorithm.
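The rateless flavor of these codes can be illustrated with a toy peeling decoder over real-valued blocks. In the sketch below the subsets are hand-picked so that peeling provably succeeds; actual LT/FLT codes draw subset degrees from a (robust) soliton distribution and exploit the factored structure, which we do not reproduce.

```python
import numpy as np

rng = np.random.default_rng(3)
k = 4
blocks = [rng.standard_normal((3, 5)) for _ in range(k)]   # row blocks of A
x = rng.standard_normal(5)

# Encode: worker j computes (sum of its subset of blocks) @ x.
subsets = [{0}, {0, 1}, {1, 2}, {2, 3}, {0, 2}, {1, 3}, {0, 1, 2}]
equations = [(S, sum(blocks[i] for i in S) @ x) for S in subsets]
equations = equations[:5]            # the last two workers straggle

# Peeling decoder: repeatedly resolve equations with one unknown block.
decoded = {}
progress = True
while progress and len(decoded) < k:
    progress = False
    for S, y in equations:
        unknown = S - decoded.keys()
        if len(unknown) == 1:
            i = unknown.pop()
            decoded[i] = y - sum(decoded[j] for j in S - {i})
            progress = True

assert all(np.allclose(decoded[i], blocks[i] @ x) for i in range(k))
```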
A Survey of Coded Distributed Computing
Distributed computing has become a common approach for large-scale
computation of tasks due to benefits such as high reliability, scalability,
computation speed, and cost-effectiveness. However, distributed computing faces
critical issues related to communication load and straggler effects. In
particular, computing nodes need to exchange intermediate results with each
other in order to calculate the final result, and this significantly increases
communication overheads. Furthermore, a distributed computing network may
include straggling nodes that intermittently run slower. This results in a
longer overall time needed to execute the computation tasks, thereby limiting
the performance of distributed computing. To address these issues, coded
distributed computing (CDC), i.e., a combination of coding theoretic techniques
and distributed computing, has been recently proposed as a promising solution.
Coding-theoretic techniques have proved effective in WiFi and cellular systems for dealing with channel noise, and CDC applies them to distributed computing to significantly reduce the communication load, alleviate the effects of stragglers, and provide fault tolerance, privacy, and security. In this survey, we first introduce the
fundamentals of CDC, followed by basic CDC schemes. Then, we review and analyze
a number of CDC approaches proposed to reduce the communication costs, mitigate
the straggler effects, and guarantee privacy and security. Furthermore, we
present and discuss applications of CDC in modern computer networks. Finally,
we highlight important challenges and promising research directions related to
CDC.
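The most basic CDC scheme for straggler mitigation is easy to show concretely: a (3, 2) MDS code for matrix-vector multiplication, where a third worker computes a parity block so that any two of the three results suffice. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)
A1, A2 = A[:3], A[3:]                # split A into two halves

tasks = {"w1": A1 @ x, "w2": A2 @ x, "w3": (A1 + A2) @ x}
tasks.pop("w2")                      # w2 straggles

y1, y3 = tasks["w1"], tasks["w3"]
y2 = y3 - y1                         # recover A2 @ x from the parity result
assert np.allclose(np.concatenate([y1, y2]), A @ x)
```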
Straggler Mitigation through Unequal Error Protection for Distributed Approximate Matrix Multiplication
Large-scale machine learning and data mining methods routinely distribute
computations across multiple agents to parallelize processing. The time
required for the computations at the agents is affected by the availability of
local resources and/or poor channel conditions, giving rise to the "straggler problem". As a remedy to this problem, we employ Unequal Error Protection (UEP) codes to obtain an approximation of the matrix product in the distributed computation setting, providing stronger protection for the blocks that have a larger effect on the final result. We characterize the performance of the proposed
approach from a theoretical perspective by bounding the expected reconstruction
error for matrices with uncorrelated entries. We also apply the proposed coding
strategy to the gradient computations of the back-propagation step in training a Deep Neural Network (DNN) for an image classification task. Our numerical experiments show that producing approximations of matrix products with UEP codes yields significant improvements in the overall time required for DNN training to converge in the presence of stragglers.
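The UEP principle can be illustrated with replication, its simplest special case: blocks that contribute more to the product get more redundancy, and blocks lost to stragglers are zeroed out in the approximation. The sketch below is ours and does not reproduce the paper's UEP code construction or error bounds.

```python
import numpy as np

rng = np.random.default_rng(5)
blocks = [rng.standard_normal((4, 8)) * s for s in (10.0, 3.0, 1.0, 0.3)]
x = rng.standard_normal(8)

# Redundancy budget: replicate important blocks (by norm) more often.
norms = np.array([np.linalg.norm(B) for B in blocks])
replicas = np.maximum(1, np.round(6 * norms / norms.sum())).astype(int)

# Simulate stragglers: each replica independently fails with prob 0.4;
# a block is recovered if at least one of its replicas returns.
fail = rng.random(replicas.sum()) < 0.4
got, pos = [], 0
for r in replicas:
    got.append(not fail[pos:pos + r].all())
    pos += r

approx = np.concatenate([B @ x if ok else np.zeros(4)
                         for B, ok in zip(blocks, got)])
exact = np.concatenate([B @ x for B in blocks])
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```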
On the Optimal Recovery Threshold of Coded Matrix Multiplication
We provide novel coded computation strategies for distributed matrix-matrix
products that outperform the recent "Polynomial code" constructions in recovery
threshold, i.e., the required number of successful workers. When a $1/m$-th fraction of each matrix can be stored in each worker node, Polynomial codes require $m^2$ successful workers, while our MatDot codes only require $2m-1$ successful workers, albeit at a higher communication cost from each worker to
the fusion node. We also provide a systematic construction of MatDot codes.
Further, we propose "PolyDot" coding that interpolates between Polynomial codes
and MatDot codes to trade off communication cost and recovery threshold.
Finally, we demonstrate a coding technique for multiplying $n$ matrices ($n \geq 3$) by applying MatDot and PolyDot coding ideas.
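For $m = 2$ the MatDot construction is short enough to show in full: split A into column blocks and B into row blocks, have each worker multiply evaluations of two matrix polynomials, and interpolate AB as the coefficient of z in the product polynomial from any $2m-1 = 3$ results. A small numpy demo, with evaluation points chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 5))
A1, A2 = A[:, :3], A[:, 3:]           # column blocks of A
B1, B2 = B[:3, :], B[3:, :]           # row blocks of B

points = np.array([1.0, 2.0, 3.0, 4.0])
# Worker at point z computes p_A(z) @ p_B(z) = (A1 + A2 z)(B1 z + B2).
jobs = {z: (A1 + A2 * z) @ (B1 * z + B2) for z in points}
jobs.pop(2.0)                         # one straggler; 3 results remain

zs = np.array(sorted(jobs))
V = np.vander(zs, 3, increasing=True)               # rows are [1, z, z^2]
C = np.stack([jobs[z] for z in zs])                 # stacked evaluations
coeffs = np.tensordot(np.linalg.inv(V), C, axes=1)  # polynomial coefficients
assert np.allclose(coeffs[1], A @ B)  # z^1 coefficient is A1 B1 + A2 B2 = AB
```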
Optimal Load Allocation for Coded Distributed Computation in Heterogeneous Clusters
Recently, coding has been a useful technique to mitigate the effect of
stragglers in distributed computing. However, coding in this context has been
mainly explored under the assumption of homogeneous workers, although real-world computing clusters are often composed of heterogeneous workers with different computing capabilities. Uniform load allocation that ignores this heterogeneity can cause a significant loss in latency.
In this paper, we suggest the optimal load allocation for coded distributed
computing with heterogeneous workers. Specifically, we focus on the scenario in which workers with the same computing capability can be treated as a group for analysis. We derive a lower bound on the expected latency and obtain the optimal load allocation by showing that our proposed allocation achieves the minimum of this bound for a sufficiently large number of workers. Numerical simulations show that, under group heterogeneity, our load allocation reduces the expected latency by orders of magnitude compared to the existing load allocation scheme.
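The benefit of heterogeneity-aware allocation is easy to see in a toy Monte-Carlo experiment. The model below, with shifted-exponential worker times and an "enough coded rows received" completion rule, is a common assumption in this literature but is our illustrative stand-in, not the paper's exact scheme or its optimal allocation.

```python
import numpy as np

rng = np.random.default_rng(7)
speeds = np.repeat([4.0, 2.0, 1.0], 10)      # three groups of 10 workers each
n, total_rows = len(speeds), 3000            # rows needed to decode

def latency(loads, trials=2000):
    # Worker i takes loads[i] / speeds[i] time units, inflated by a
    # shifted-exponential factor; the job finishes once coded rows from
    # the fastest responders sum to at least total_rows.
    t = loads / speeds * (1.0 + rng.exponential(1.0, size=(trials, n)))
    out = []
    for row in t:
        order = np.argsort(row)
        done = np.cumsum(loads[order]) >= total_rows
        out.append(row[order][done.argmax()])
    return float(np.mean(out))

budget = 1.5 * total_rows                    # total coded rows (50% redundancy)
uniform = np.full(n, budget / n)
proportional = budget * speeds / speeds.sum()
print("uniform allocation: ", latency(uniform))
print("speed-proportional: ", latency(proportional))
```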
CodeNet: Training Large Scale Neural Networks in Presence of Soft-Errors
This work proposes the first strategy to make distributed training of neural
networks resilient to computing errors, a problem that has remained unsolved
despite being first posed in 1956 by von Neumann. He also speculated that the
efficiency and reliability of the human brain is obtained by allowing for low
power but error-prone components with redundancy for error-resilience. It is
surprising that this problem remains open, even as massive artificial neural
networks are being trained on increasingly low-cost and unreliable processing
units. Our coding-theory-inspired strategy, "CodeNet," solves this problem by
addressing three challenges in the science of reliable computing: (i) Providing
the first strategy for error-resilient neural network training by encoding each
layer separately; (ii) Keeping the overheads of coding (encoding/error-detection/decoding) low by obviating the need to re-encode the updated parameter matrices from scratch after each iteration; (iii) Providing a
completely decentralized implementation with no central node (which is a single
point of failure), allowing all primary computational steps to be error-prone.
We theoretically demonstrate that CodeNet has higher error tolerance than
replication, which we leverage to speed up computation time. Simultaneously,
CodeNet requires lower redundancy than replication, and equal computational and communication costs in a scaling sense. We first demonstrate the benefits of
CodeNet in reducing expected computation time over replication when accounting
for checkpointing. Our experiments show that CodeNet achieves the best
accuracy-runtime tradeoff compared to both replication and uncoded strategies.
CodeNet is a significant step towards biologically plausible neural network training, which could hold the key to orders-of-magnitude efficiency improvements.
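A building block behind this style of error-resilient computation is the classic algorithm-based fault tolerance (ABFT) checksum test, sketched below: carry a checksum row through a matrix multiply and flag any mismatch. This shows only the detection primitive, not CodeNet's layer-wise encoding or decentralized training scheme.

```python
import numpy as np

rng = np.random.default_rng(8)
W = rng.standard_normal((4, 6))          # layer weights
X = rng.standard_normal((6, 3))          # activations

Wc = np.vstack([W, W.sum(axis=0)])       # append a column-sum checksum row
Y = Wc @ X                               # last row should equal the column sums

Y[2, 1] += 0.5                           # inject a soft error into the result
ok = np.allclose(Y[:-1].sum(axis=0), Y[-1])
print("error detected:", not ok)         # True: the checksum no longer matches
```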
Codes for Distributed Machine Learning
The problem considered is that of distributing the machine learning operations of matrix multiplication and multivariate polynomial evaluation among computing nodes, a.k.a. worker nodes, some of which do not return their outputs or return erroneous outputs. The thesis can be divided into three major parts.
In the first part of the thesis, a fault-tolerant setup in which t worker nodes return erroneous values is considered. For an additive random Gaussian error model, it is shown that for all t < N − K, errors can be corrected with probability 1 for polynomial codes.
In the second part of the thesis, a class of codes called random Khatri-Rao-Product (RKRP) codes for distributed matrix multiplication in the presence of stragglers is proposed. The main advantage of the proposed codes is that decoding of RKRP codes is significantly more numerically stable than decoding of Polynomial codes [67] and of the recently proposed OrthoPoly codes [18]. It is shown that RKRP codes are maximum distance separable with probability 1.
In the third part of the thesis, the problem of distributed multivariate polynomial evaluation (DPME) is considered, for which Lagrange Coded Computing (LCC) [66] was proposed as a coded computation scheme to provide resilience against stragglers. A variant of the LCC scheme, termed Product Lagrange Coded Computing (PLCC), is proposed by combining ideas from classical product codes and LCC. The main advantage of PLCC is that it is more numerically stable than LCC.
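Plain LCC, which PLCC builds on, fits in a short sketch: encode the data points into a low-degree interpolating polynomial u, have each worker evaluate f on one coded point, and interpolate f(u(z)) from any deg(f)(k-1)+1 survivors. All concrete values below (f, the alphas, and the betas) are illustrative choices of ours, not from the thesis.

```python
import numpy as np
from numpy.polynomial import Polynomial

def f(z):
    return z ** 2 + 3.0 * z        # toy polynomial of degree 2

k, deg_f = 3, 2
data = np.array([1.5, -0.7, 2.2])               # the k inputs to evaluate f on
alphas = np.array([0.0, 1.0, 2.0])              # encoding points: u(alpha_i) = data_i
betas = np.linspace(3.0, 9.0, 7)                # one evaluation point per worker

u = Polynomial.fit(alphas, data, k - 1)         # degree-(k-1) encoding polynomial

# Workers evaluate f on coded points; f(u(z)) has degree deg_f * (k - 1) = 4,
# so any 5 of the 7 results suffice. Drop two stragglers.
results = {b: f(u(b)) for b in betas}
for b in (betas[1], betas[4]):
    results.pop(b)

# Decoder: interpolate f(u(z)) from the survivors, then evaluate at the alphas.
zs = sorted(results)
g = Polynomial.fit(zs, [results[z] for z in zs], deg_f * (k - 1))
assert np.allclose(g(alphas), f(data))
```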