
    Numerically stable coded matrix computations via circulant and rotation matrix embeddings

    Several recent works have used coding-theoretic ideas for mitigating the effect of stragglers in distributed matrix computations (matrix-vector and matrix-matrix multiplication) over the reals. In particular, a polynomial code based approach distributes matrix-matrix multiplication among n worker nodes by means of polynomial evaluations. This allows for an "optimal" recovery threshold whereby the intended result can be decoded as long as at least (n − s) worker nodes complete their tasks; s is the number of stragglers that the scheme can handle. However, a major issue with these approaches is the high condition number of the corresponding Vandermonde-structured recovery matrices. This presents serious numerical precision issues when decoding the desired result. It is well known that the condition number of real Vandermonde matrices grows exponentially in n. In contrast, the condition numbers of Vandermonde matrices with parameters on the unit circle are much better behaved. In this work we leverage the properties of circulant permutation matrices and rotation matrices to obtain coded computation schemes with significantly lower worst-case condition numbers; these matrices have eigenvalues that lie on the unit circle. Our scheme is such that the associated recovery matrices have a condition number corresponding to Vandermonde matrices with parameters given by the eigenvalues of the corresponding circulant permutation and rotation matrices. We demonstrate an upper bound on the worst-case condition number of these matrices which grows as ≈ O(n^(s+6)). In essence, we leverage the well-behaved conditioning of complex Vandermonde matrices with parameters on the unit circle, while still working with computation over the reals. Experimental results demonstrate that our proposed method has condition numbers that are several orders of magnitude better than prior work.
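    The conditioning gap that motivates this work is easy to observe numerically. The sketch below (illustrative parameters, not the paper's construction) compares the condition number of a Vandermonde matrix with real, equally spaced evaluation points against one whose parameters are the n-th roots of unity on the unit circle; the latter is a scaled DFT matrix and is essentially perfectly conditioned.

```python
import numpy as np

n = 16  # number of worker nodes (illustrative choice)

# Vandermonde matrix with real, equally spaced evaluation points:
# its condition number grows exponentially in n.
real_pts = np.linspace(-1, 1, n)
V_real = np.vander(real_pts, n, increasing=True)

# Vandermonde matrix with parameters on the unit circle
# (the n-th roots of unity): this is a scaled DFT matrix,
# so its condition number is 1.
circle_pts = np.exp(2j * np.pi * np.arange(n) / n)
V_circle = np.vander(circle_pts, n, increasing=True)

print(f"real points:        cond = {np.linalg.cond(V_real):.3e}")
print(f"unit-circle points: cond = {np.linalg.cond(V_circle):.3e}")
```

    Already at n = 16 the real-parameter matrix is ill-conditioned by several orders of magnitude, which is why decoding with such recovery matrices loses precision; the circulant/rotation embeddings inherit the unit-circle conditioning while keeping the computation real.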

    Straggler-Resistant Distributed Matrix Computation via Coding Theory: Removing a Bottleneck in Large-Scale Data Processing

    The current Big Data era routinely requires the processing of large scale data on massive distributed computing clusters. Such large scale clusters often suffer from the problem of stragglers, which are defined as slow or failed nodes. The overall speed of a computational job on these clusters is typically dominated by stragglers in the absence of a sophisticated assignment of tasks to the worker nodes. In recent years, approaches based on coding theory (referred to as coded computation) have been effectively used for straggler mitigation. Coded computation offers significant benefits for specific classes of problems such as distributed matrix computations (which play a crucial role in several parts of the machine learning pipeline). The essential idea is to create redundant tasks so that the desired result can be recovered as long as a certain number of worker nodes complete their tasks. In this survey article, we overview recent developments in the field of coding for straggler-resilient distributed matrix computations.
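    The "redundant tasks" idea can be illustrated with the smallest possible example (a minimal sketch, not taken from the survey): split a matrix-vector product across two workers and add a third, coded worker, so that any two of the three responses suffice to recover the full result.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

# Split A row-wise into two blocks and add one redundant (coded) task.
A1, A2 = A[:2], A[2:]
tasks = {"w1": A1, "w2": A2, "w3": A1 + A2}

# Each worker computes its block times x; suppose worker w2 straggles.
results = {w: Ai @ x for w, Ai in tasks.items() if w != "w2"}

# Recover the missing block from the coded response:
# (A1 + A2) x - A1 x = A2 x.
y1 = results["w1"]
y2 = results["w3"] - y1
y = np.concatenate([y1, y2])

assert np.allclose(y, A @ x)
```

    Here three workers carry the load of two, and any single straggler can be tolerated; the schemes surveyed generalize this to many blocks and stragglers with more sophisticated codes.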

    Codes for Distributed Machine Learning

    The problem considered is that of distributing the machine learning operations of matrix multiplication and multivariate polynomial evaluation among computing nodes, a.k.a. worker nodes, some of which do not return their outputs or return erroneous outputs. The thesis can be divided into three major parts. In the first part of the thesis, a fault tolerant setup where t worker nodes return erroneous values is considered. For an additive random Gaussian error model, it is shown that for all t < N − K, errors can be corrected with probability 1 for polynomial codes. In the second part of the thesis, a class of codes called random Khatri-Rao-Product (RKRP) codes for distributed matrix multiplication in the presence of stragglers is proposed. The main advantage of the proposed codes is that decoding of RKRP codes is highly numerically stable in comparison to decoding of Polynomial codes [67] and decoding of the recently proposed OrthoPoly codes [18]. It is shown that RKRP codes are maximum distance separable with probability 1. In the third part of the thesis, the problem of distributed multivariate polynomial evaluation (DPME) is considered, for which Lagrange Coded Computing (LCC) [66] was proposed as a coded computation scheme to provide resilience against stragglers. A variant of the LCC scheme, termed Product Lagrange Coded Computing (PLCC), is proposed by combining ideas from classical product codes and LCC. The main advantage of PLCC is that it is more numerically stable than LCC.
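    The polynomial codes that both this thesis and the works above build on encode the data blocks as coefficients of a polynomial, hand each worker one evaluation, and decode by interpolation from any K responses. A minimal matrix-vector sketch with illustrative parameters (K = 2 blocks, N = 4 workers, real evaluation points chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 2, 4                        # K data blocks, N workers; tolerates N - K stragglers
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
A0, A1 = A[:2], A[2:]

# Encode: worker i receives the evaluation A(z_i) = A0 + A1 * z_i
# and computes A(z_i) @ x.
zs = np.array([1.0, 2.0, 3.0, 4.0])
outputs = {i: (A0 + A1 * z) @ x for i, z in enumerate(zs)}

# Decode from any K = 2 responses: interpolate the degree-(K-1)
# polynomial p(z) = (A0 x) + (A1 x) z from its values at two points.
i, j = 0, 3                        # pretend the middle two workers straggled
zi, zj = zs[i], zs[j]
A1x = (outputs[j] - outputs[i]) / (zj - zi)
A0x = outputs[i] - A1x * zi
y = np.concatenate([A0x, A1x])

assert np.allclose(y, A @ x)
```

    The decoding step is exactly a Vandermonde solve in the evaluation points z_i, which is where the numerical stability concerns (and the motivation for RKRP, OrthoPoly, and the unit-circle constructions above) come from.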
