    Straggler Resilient Serverless Computing Based on Polar Codes

    We propose a serverless computing mechanism for distributed computation based on polar codes. Serverless computing is an emerging cloud-based computation model that lets users run their functions on the cloud without provisioning or managing servers. Our proposed approach is a hybrid computing framework that carries out computationally expensive tasks, such as linear algebraic operations involving large-scale data, using serverless computing and does the rest of the processing locally. We address the limitations and reliability issues of serverless platforms, such as straggling workers, using coding theory, drawing ideas from recent literature on coded computation. The proposed mechanism uses polar codes to ensure straggler resilience in a computationally effective manner. We provide extensive evidence showing that polar codes outperform other coding methods. We have designed a sequential decoder specifically for polar codes in erasure channels with full-precision inputs and outputs. In addition, we have extended the proposed method to the matrix multiplication case where both matrices being multiplied are coded. The proposed coded computation scheme is implemented for AWS Lambda. Experimental results are presented in which the performance of the proposed coded computation technique is tested in optimization via gradient descent. Finally, we introduce the idea of partial polarization, which reduces the computational burden of encoding and decoding at the expense of straggler resilience. Comment: New results added in the new version. More discussion on serverless computing.
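
    As a concrete illustration of the scheme's structure, the sketch below encodes row blocks of A with the polar transform (the n-fold Kronecker power of [[1,0],[1,1]]) and recovers A·x despite a straggler. The sizes, the data-index set, and the generic least-squares decoder are illustrative assumptions; the paper selects the data positions by channel polarization and uses a purpose-built sequential erasure decoder instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, K = 3, 8, 4                  # N = 2**n coded workers, K data blocks
rows, cols = 40, 20
A = rng.standard_normal((K * rows, cols))
x = rng.standard_normal(cols)

F = np.array([[1.0, 0.0], [1.0, 1.0]])
G = F
for _ in range(n - 1):             # polar transform: n-fold Kronecker power
    G = np.kron(G, F)

data_pos = [3, 5, 6, 7]            # assumed "good" positions for this sketch
M = G[data_pos].T                  # N x K map from data blocks to workers

blocks = A.reshape(K, rows, cols)
coded = np.tensordot(M, blocks, axes=(1, 0))   # worker i stores coded[i]

returned = [0, 1, 2, 3, 4, 6, 7]   # worker 5 straggles (erasure)
Y = np.stack([coded[i] @ x for i in returned])

# Recover the K partial products A_j @ x from the returned coded products.
# (The paper uses a sequential polar decoder here; least squares stands in.)
Z, *_ = np.linalg.lstsq(M[returned], Y, rcond=None)
assert np.allclose(Z.reshape(-1), A @ x)
```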

    OverSketched Newton: Fast Convex Optimization for Serverless Systems

    Motivated by recent developments in serverless systems for large-scale computation as well as improvements in scalable randomized matrix algorithms, we develop OverSketched Newton, a randomized Hessian-based optimization algorithm to solve large-scale convex optimization problems in serverless systems. OverSketched Newton leverages matrix sketching ideas from Randomized Numerical Linear Algebra to compute the Hessian approximately. These sketching methods lead to inbuilt resiliency against stragglers that are a characteristic of serverless architectures. Depending on whether the problem is strongly convex or not, we propose different iteration updates using the approximate Hessian. For both cases, we establish convergence guarantees for OverSketched Newton and empirically validate our results by solving large-scale supervised learning problems on real-world datasets. Experiments demonstrate a reduction of ~50% in total running time on AWS Lambda, compared to state-of-the-art distributed optimization schemes. Comment: 37 pages, 12 figures.
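
    The core iteration is easy to state in a single-machine sketch: replace the exact Hessian with one computed from a sketched factor. The Gaussian sketch, the logistic-regression objective, and all sizes below are assumptions for illustration; OverSketched Newton itself performs the sketched products with OverSketch on serverless workers, which is where the straggler resiliency comes from.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5000, 20, 400            # m = sketch size, m << n
A = rng.standard_normal((n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-A @ rng.standard_normal(d)))) * 1.0
lam = 1e-3
w = np.zeros(d)

for _ in range(20):
    p = 1 / (1 + np.exp(-A @ w))
    grad = A.T @ (p - y) / n + lam * w           # exact gradient
    B = np.sqrt(p * (1 - p))[:, None] * A        # exact H = B.T @ B / n + lam*I
    S = rng.standard_normal((m, n)) / np.sqrt(m) # Gaussian sketch
    SB = S @ B                                   # sketched factor
    H_approx = SB.T @ SB / n + lam * np.eye(d)   # approximate Hessian
    w -= np.linalg.solve(H_approx, grad)         # Newton-style update

print("train accuracy:", np.mean((A @ w > 0) == (y == 1)))
```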

    An Application of Storage-Optimal MatDot Codes for Coded Matrix Multiplication: Fast k-Nearest Neighbors Estimation

    We propose a novel application of coded computing to the problem of nearest neighbor estimation using MatDot codes [Fahim et al., 2017], which are known to be optimal for matrix multiplication in terms of recovery threshold under storage constraints. In approximate nearest neighbor algorithms, it is common to construct efficient in-memory indexes to improve query response time. One such strategy is Multiple Random Projection Trees (MRPT), which reduces the set of candidate points over which Euclidean distance calculations are performed. However, this may result in a high memory footprint and possibly paging penalties for large or high-dimensional data. Here we propose two techniques to parallelize MRPT that exploit data and model parallelism, respectively, by dividing both the data storage and the computation efforts among different nodes in a distributed computing cluster. This is especially critical when a single compute node cannot hold the complete dataset in memory. We also propose a novel coded computation strategy based on MatDot codes for the model-parallel architecture that, in a straggler-prone environment, achieves the storage-optimal recovery threshold, i.e., the number of nodes that are required to serve a query. We experimentally demonstrate that, in the absence of straggling, our distributed approaches require less query time than execution on a single processing node, providing near-linear speedups with respect to the number of worker nodes. Through our experiments on real systems with simulated straggling, we also show that our strategy achieves a faster query execution than the uncoded strategy in a straggler-prone environment. Comment: Accepted for publication at IEEE Big Data 2019.
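
    The MatDot construction underlying this work is short enough to sketch directly: encode the column blocks of A and row blocks of B as matrix polynomials, have each worker multiply its two evaluations, and interpolate. Sizes, evaluation points, and the straggler pattern below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
m, P = 3, 8                        # m inner blocks, P workers
r, s, t = 30, 30, 30               # A: r x s, B: s x t, with m | s
A = rng.standard_normal((r, s))
B = rng.standard_normal((s, t))
A_blk = np.split(A, m, axis=1)     # column blocks of A
B_blk = np.split(B, m, axis=0)     # row blocks of B

xs = np.linspace(-1, 1, P)         # distinct evaluation points
enc_A = [sum(Ak * x**k for k, Ak in enumerate(A_blk)) for x in xs]
enc_B = [sum(Bk * x**(m - 1 - k) for k, Bk in enumerate(B_blk)) for x in xs]

# Worker p computes enc_A[p] @ enc_B[p]; any 2m - 1 = 5 results suffice.
done = [0, 2, 3, 5, 7]             # the other workers straggle
prods = {p: enc_A[p] @ enc_B[p] for p in done}

# Interpolate the degree-(2m-2) matrix polynomial entrywise; its
# coefficient of x^(m-1) is exactly A @ B.
V = np.vander(xs[done], 2 * m - 1, increasing=True)   # 5 x 5 Vandermonde
Y = np.stack([prods[p].reshape(-1) for p in done])
coeffs = np.linalg.solve(V, Y)
AB_hat = coeffs[m - 1].reshape(r, t)
assert np.allclose(AB_hat, A @ B)
```

    Any 2m − 1 of the P workers suffice, which is the storage-optimal recovery threshold the abstract refers to.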

    Factored LT and Factored Raptor Codes for Large-Scale Distributed Matrix Multiplication

    We propose two coding schemes for distributed matrix multiplication in the presence of stragglers. These coding schemes are adaptations of LT codes and Raptor codes to distributed matrix multiplication and are termed factored LT (FLT) codes and factored Raptor (FR) codes. Empirically, we show that FLT codes have near-optimal recovery thresholds when the number of worker nodes is very large, and that FR codes have excellent recovery thresholds when the number of worker nodes is moderately large. FLT and FR codes have better recovery thresholds than Product codes, they are expected to have better numerical stability than Polynomial codes, and they can be decoded with a low-complexity decoding algorithm.
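
    A toy version of the fountain-coding idea conveys the decoding style: each coded task is a sum of a random subset of row blocks, and a peeling decoder repeatedly resolves degree-one results. The crude degree distribution below is an assumption for illustration only; FLT/FR codes use factored constructions with tuned distributions, and LT decoding succeeds only with high probability, which the sketch makes explicit.

```python
import numpy as np

rng = np.random.default_rng(3)
K, P = 8, 14                       # K source blocks, P coded workers
rows, cols = 10, 6
A = rng.standard_normal((K * rows, cols))
x = rng.standard_normal(cols)
blocks = list(A.reshape(K, rows, cols))

# Each coded task sums a random subset of blocks (crude degree distribution).
degrees = rng.choice([1, 2, 3, 4], size=P, p=[0.25, 0.45, 0.2, 0.1])
neighbors = [set(rng.choice(K, size=d, replace=False)) for d in degrees]
coded = [sum(blocks[j] for j in nbrs) @ x for nbrs in neighbors]

# Peeling decoder on the results that came back (workers 11-13 straggle).
recv = {p: (set(neighbors[p]), coded[p].copy()) for p in range(11)}
decoded, progress = {}, True
while progress and len(decoded) < K:
    progress = False
    for p in list(recv):
        nbrs, val = recv[p]
        for j in list(nbrs & decoded.keys()):
            val -= decoded[j]          # subtract already-recovered blocks
            nbrs.discard(j)
        if len(nbrs) == 1:             # degree one: block is revealed
            decoded[nbrs.pop()] = val
            del recv[p]
            progress = True

if len(decoded) == K:
    assert np.allclose(np.concatenate([decoded[j] for j in range(K)]), A @ x)
else:
    print("peeling stalled; request more coded results")
```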

    A Survey of Coded Distributed Computing

    Distributed computing has become a common approach for large-scale computation of tasks due to benefits such as high reliability, scalability, computation speed, and cost-effectiveness. However, distributed computing faces critical issues related to communication load and straggler effects. In particular, computing nodes need to exchange intermediate results with each other in order to calculate the final result, and this significantly increases communication overheads. Furthermore, a distributed computing network may include straggling nodes that intermittently run slower. This results in a longer overall time needed to execute the computation tasks, thereby limiting the performance of distributed computing. To address these issues, coded distributed computing (CDC), i.e., a combination of coding theoretic techniques and distributed computing, has recently been proposed as a promising solution. Coding theoretic techniques have proved effective in WiFi and cellular systems for dealing with channel noise. Likewise, CDC may significantly reduce communication load, alleviate the effects of stragglers, and provide fault tolerance, privacy, and security. In this survey, we first introduce the fundamentals of CDC, followed by basic CDC schemes. Then, we review and analyze a number of CDC approaches proposed to reduce the communication costs, mitigate the straggler effects, and guarantee privacy and security. Furthermore, we present and discuss applications of CDC in modern computer networks. Finally, we highlight important challenges and promising research directions related to CDC.
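
    The basic CDC scheme the survey builds from fits in a few lines: with a (3, 2) MDS code for a matrix-vector product, any two of the three workers determine the answer, so one straggler costs nothing. This minimal example is ours, not the survey's.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)
A1, A2 = A[:3], A[3:]

tasks = {1: A1 @ x, 2: A2 @ x, 3: (A1 + A2) @ x}   # worker 3 holds the parity
del tasks[2]                                        # worker 2 straggles
y1, y3 = tasks[1], tasks[3]
Ax = np.concatenate([y1, y3 - y1])                  # recover A2 @ x = y3 - y1
assert np.allclose(Ax, A @ x)
```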

    Straggler Mitigation through Unequal Error Protection for Distributed Approximate Matrix Multiplication

    Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for the computations at the agents is affected by the availability of local resources and/or poor channel conditions, giving rise to the "straggler problem". As a remedy, we employ Unequal Error Protection (UEP) codes to obtain an approximation of the matrix product in the distributed computation setting, providing higher protection for the blocks with a higher effect on the final result. We characterize the performance of the proposed approach from a theoretical perspective by bounding the expected reconstruction error for matrices with uncorrelated entries. We also apply the proposed coding strategy to the gradient evaluations in the back-propagation step of training a Deep Neural Network (DNN) for an image classification task. Our numerical experiments show that it is indeed possible to obtain significant improvements in the overall time required to achieve DNN training convergence by producing approximations of matrix products using UEP codes in the presence of stragglers. Comment: 16 pages. arXiv admin note: text overlap with arXiv:2011.0274
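
    The intuition can be mimicked with plain replication: give extra copies to the partial products with the largest norms, so the terms most likely to be missing are the least important ones. The sketch below is a naive stand-in using unequal replication rather than actual UEP codes, with all sizes and the straggling probability assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
m = 4                                   # inner blocks: A @ B = sum_k A_k B_k
A = rng.standard_normal((20, 4 * m)) * (2.0 ** np.repeat(np.arange(m), 4))
B = rng.standard_normal((4 * m, 20))
A_blk = np.split(A, m, axis=1)
B_blk = np.split(B, m, axis=0)

terms = [Ak @ Bk for Ak, Bk in zip(A_blk, B_blk)]
weight = np.array([np.linalg.norm(t) for t in terms])
replicas = 1 + (weight > np.median(weight)).astype(int)  # 2 copies if heavy

# Each replica is a worker task; each worker straggles independently w.p. 0.3.
straggle = rng.random(int(replicas.sum())) < 0.3
task_of = np.repeat(np.arange(m), replicas)
arrived = {k for k, s in zip(task_of, straggle) if not s}

approx = sum(terms[k] for k in arrived) if arrived else np.zeros((20, 20))
err = np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B)
print(f"arrived terms: {sorted(arrived)}, relative error: {err:.3f}")
```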

    On the Optimal Recovery Threshold of Coded Matrix Multiplication

    We provide novel coded computation strategies for distributed matrix-matrix products that outperform the recent "Polynomial code" constructions in recovery threshold, i.e., the required number of successful workers. When an m-th fraction of each matrix can be stored in each worker node, Polynomial codes require m^2 successful workers, while our MatDot codes require only 2m − 1 successful workers, albeit at a higher communication cost from each worker to the fusion node. We also provide a systematic construction of MatDot codes. Further, we propose "PolyDot" coding that interpolates between Polynomial codes and MatDot codes to trade off communication cost and recovery threshold. Finally, we demonstrate a coding technique for multiplying n matrices (n ≥ 3) by applying MatDot and PolyDot coding ideas. Comment: Extended version of the paper that appeared at Allerton 2017 (October 2017), including full proofs and further results. Submitted to IEEE Transactions on Information Theory.
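
    The threshold gap comes from one line of degree counting, restated here in notation consistent with the abstract:

```latex
p_A(x) = \sum_{k=0}^{m-1} A_k\, x^{k}, \qquad
p_B(x) = \sum_{\ell=0}^{m-1} B_\ell\, x^{\,m-1-\ell}, \qquad
p_A(x)\, p_B(x) = \sum_{k=0}^{m-1}\sum_{\ell=0}^{m-1}
                  A_k B_\ell\, x^{\,k+m-1-\ell}.
```

    The coefficient of x^{m−1} collects exactly the k = ℓ terms, i.e., Σ_k A_k B_k = AB, and a matrix polynomial of degree 2m − 2 is determined by any 2m − 1 evaluations, giving MatDot's threshold. Polynomial codes instead recover all m^2 pairwise products A_i B_j as separate coefficients, hence their m^2 threshold.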

    Optimal Load Allocation for Coded Distributed Computation in Heterogeneous Clusters

    Recently, coding has become a useful technique to mitigate the effect of stragglers in distributed computing. However, coding in this context has mainly been explored under the assumption of homogeneous workers, although real-world computing clusters are often composed of heterogeneous workers with different computing capabilities. Uniform load allocation that ignores this heterogeneity can cause a significant loss in latency. In this paper, we derive the optimal load allocation for coded distributed computing with heterogeneous workers. Specifically, we focus on the scenario where workers with the same computing capability can be regarded as a group for analysis. We establish a lower bound on the expected latency and obtain the optimal load allocation by showing that our proposed allocation achieves the minimum of the lower bound for a sufficiently large number of workers. In numerical simulations with group heterogeneity, our load allocation reduces the expected latency by orders of magnitude over the existing load allocation scheme.
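
    A quick Monte Carlo experiment conveys why heterogeneity awareness matters. The shifted-exponential runtime model, the two-group cluster, the 30% redundancy, and the rate-proportional allocation below are all illustrative assumptions, not the paper's optimal allocation; they merely show the direction of the effect.

```python
import numpy as np

rng = np.random.default_rng(6)
k = 600                                    # coded rows needed to decode
rates = np.array([1.0] * 10 + [4.0] * 10)  # group 1 slow, group 2 fast
trials = 2000

def latency(loads):
    # Worker i finishes its loads[i] rows at loads[i] * (shift + Exp(rate_i)).
    t = loads[:, None] * (0.5 + rng.exponential(1.0, (len(loads), trials))
                          / rates[:, None])
    idx = np.argsort(t, axis=0)                  # finish order per trial
    t_sorted = np.take_along_axis(t, idx, axis=0)
    enough = np.cumsum(loads[idx], axis=0) >= k  # accumulated decoded rows
    first = enough.argmax(axis=0)                # first time k rows are in
    return np.take_along_axis(t_sorted, first[None, :], axis=0).mean()

total = 1.3 * k                                  # 30% coded redundancy
uniform = np.full(len(rates), total / len(rates))
aware = total * rates / rates.sum()              # load proportional to speed
print("uniform    :", latency(uniform))
print("rate-aware :", latency(aware))
```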

    CodeNet: Training Large Scale Neural Networks in Presence of Soft-Errors

    This work proposes the first strategy to make distributed training of neural networks resilient to computing errors, a problem that has remained unsolved despite being first posed in 1956 by von Neumann. He also speculated that the efficiency and reliability of the human brain are obtained by allowing for low-power but error-prone components, with redundancy for error resilience. It is surprising that this problem remains open, even as massive artificial neural networks are being trained on increasingly low-cost and unreliable processing units. Our coding-theory-inspired strategy, "CodeNet," solves this problem by addressing three challenges in the science of reliable computing: (i) providing the first strategy for error-resilient neural network training by encoding each layer separately; (ii) keeping the overheads of coding (encoding/error-detection/decoding) low by obviating the need to re-encode the updated parameter matrices from scratch after each iteration; and (iii) providing a completely decentralized implementation with no central node (which would be a single point of failure), allowing all primary computational steps to be error-prone. We theoretically demonstrate that CodeNet has higher error tolerance than replication, which we leverage to speed up computation time. Simultaneously, CodeNet requires lower redundancy than replication, and equal computational and communication costs in a scaling sense. We first demonstrate the benefits of CodeNet in reducing expected computation time over replication when accounting for checkpointing. Our experiments show that CodeNet achieves the best accuracy-runtime tradeoff compared to both replication and uncoded strategies. CodeNet is a significant step towards biologically plausible neural network training that could hold the key to orders-of-magnitude efficiency improvements. Comment: Currently under review.
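
    CodeNet's layer-wise encoding is in the lineage of checksum-based (ABFT-style) fault tolerance, whose smallest instance is easy to show: two parity rows appended to a weight matrix detect, locate, and correct a single corrupted entry of one layer's product. The sketch below is that classic scheme, not CodeNet's decentralized training construction.

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.standard_normal((8, 5))
a = rng.standard_normal(5)
idx = np.arange(1, 9)

W_coded = np.vstack([W, W.sum(axis=0), idx @ W])  # two parity rows
y = W_coded @ a                                   # possibly faulty compute
y[3] += 2.5                                       # inject a soft error

r1 = y[:8].sum() - y[8]                           # plain checksum residual
r2 = idx @ y[:8] - y[9]                           # weighted checksum residual
if abs(r1) > 1e-8:                                # error detected
    i = int(round(r2 / r1)) - 1                   # ... and located at row i
    y[i] -= r1                                    # ... and corrected
assert np.allclose(y[:8], W @ a)
```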

    Codes for Distributed Machine Learning

    The problem considered is that of distributing the machine learning operations of matrix multiplication and multivariate polynomial evaluation among computer nodes, a.k.a. worker nodes, some of which do not return their outputs or return erroneous outputs. The thesis can be divided into three major parts. In the first part, a fault-tolerant setup where t worker nodes return erroneous values is considered. For an additive random Gaussian error model, it is shown that for all t < N − K, errors can be corrected with probability 1 for polynomial codes. In the second part, a class of codes called random Khatri-Rao-product (RKRP) codes for distributed matrix multiplication in the presence of stragglers is proposed. The main advantage of the proposed codes is that decoding of RKRP codes is highly numerically stable in comparison to decoding of Polynomial codes [67] and of the recently proposed OrthoPoly codes [18]. It is shown that RKRP codes are maximum distance separable with probability 1. In the third part, the problem of distributed multivariate polynomial evaluation (DPME) is considered, for which Lagrange Coded Computing (LCC) [66] was proposed as a coded computation scheme providing resilience against stragglers. A variant of the LCC scheme, termed Product Lagrange Coded Computing (PLCC), is proposed by combining ideas from classical product codes and LCC. The main advantage of PLCC is that it is more numerically stable than LCC.
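
    Since the third part builds on Lagrange Coded Computing, a compact sketch of plain LCC is useful context: data matrices are embedded in a low-degree matrix polynomial via Lagrange interpolation, workers apply the polynomial function f to their encoded points, and the master interpolates f applied to the data. Sizes, points, the choice f(X) = XᵀX, and the entrywise Vandermonde decoder below are illustrative assumptions (and, as the thesis notes for LCC, not the most numerically stable choice).

```python
import numpy as np

rng = np.random.default_rng(8)
K, N = 3, 7                             # data blocks, workers
Xs = [rng.standard_normal((4, 4)) for _ in range(K)]
f = lambda X: X.T @ X                   # degree-2 polynomial in the entries

betas = np.array([0.0, 0.4, 0.8])       # interpolation points for the data
alphas = np.linspace(1.1, 2.3, N)       # distinct worker points

def lagrange_coeffs(z, pts):
    # Lagrange basis evaluated at z for the interpolation points pts.
    return np.array([np.prod([(z - p) / (q - p) for p in pts if p != q])
                     for q in pts])

# Encoder: worker j receives u(alpha_j), where u interpolates u(beta_i) = X_i.
enc = [sum(c * X for c, X in zip(lagrange_coeffs(a, betas), Xs))
       for a in alphas]
results = {j: f(enc[j]) for j in [0, 1, 3, 4, 6]}   # workers 2 and 5 straggle

# Decoder: f(u(z)) has degree deg(f) * (K - 1) = 4, so 5 results suffice.
pts = alphas[list(results)]
V = np.vander(pts, len(results), increasing=True)
coef = np.linalg.solve(V, np.stack([results[j].reshape(-1) for j in results]))
for i, b in enumerate(betas):
    fX = (np.vander([b], len(results), increasing=True) @ coef).reshape(4, 4)
    assert np.allclose(fX, f(Xs[i]))
```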