15 research outputs found

    Erasure coding for distributed matrix multiplication for matrices with bounded entries

    Distributed matrix multiplication is widely used in several scientific domains. It is well recognized that computation times on distributed clusters are often dominated by the slowest workers (called stragglers). Recent work has demonstrated that straggler mitigation can be viewed as a problem of designing erasure codes. For matrices $\mathbf A$ and $\mathbf B$, the technique essentially maps the computation of $\mathbf A^T \mathbf B$ into the multiplication of smaller (coded) submatrices. The stragglers are treated as erasures in this process. The computation can be completed as long as a certain number of workers (called the recovery threshold) complete their assigned tasks. We present a novel coding strategy for this problem when the absolute values of the matrix entries are sufficiently small. We demonstrate a tradeoff between the assumed absolute value bounds on the matrix entries and the recovery threshold. At one extreme, we are optimal with respect to the recovery threshold and on the other extreme, we match the threshold of prior work. Experimental results on cloud-based clusters validate the benefits of our method.
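The mapping of $\mathbf A^T \mathbf B$ onto coded submatrix products, with stragglers treated as erasures, can be illustrated with a minimal sketch. It assumes a polynomial-evaluation encoding in the style of prior work, not the bounded-entry strategy this abstract proposes. The function names (`encode`, `worker`, `decode`), the block counts `m` and `n`, and the real evaluation points are all illustrative assumptions. With this encoding, any `m * n` finished workers suffice to recover the product.

```python
import numpy as np

def encode(A, B, m, n, points):
    """Build one coded task per worker (hypothetical helper).

    A is split into m column blocks A_j and B into n column blocks B_k.
    Worker i receives sum_j A_j x_i^j and sum_k B_k x_i^(k*m).
    """
    A_blocks = np.split(A, m, axis=1)
    B_blocks = np.split(B, n, axis=1)
    tasks = []
    for x in points:
        A_enc = sum(A_j * x**j for j, A_j in enumerate(A_blocks))
        B_enc = sum(B_k * x**(k * m) for k, B_k in enumerate(B_blocks))
        tasks.append((A_enc, B_enc))
    return tasks

def worker(task):
    """Each worker multiplies its two coded submatrices."""
    A_enc, B_enc = task
    return A_enc.T @ B_enc

def decode(results, points, m, n):
    """Recover A^T B from any m*n finished workers.

    `results` maps worker index -> coded product. The product computed by
    worker i is a polynomial in x_i whose coefficient of x^(j + k*m) is
    A_j^T B_k, so decoding is a Vandermonde solve.
    """
    threshold = m * n
    if len(results) < threshold:
        raise ValueError("fewer finished workers than the recovery threshold")
    idx = sorted(results)[:threshold]
    V = np.vander(np.array([points[i] for i in idx]), N=threshold, increasing=True)
    Y = np.stack([results[i] for i in idx])
    coeffs = np.linalg.solve(V, Y.reshape(threshold, -1)).reshape(Y.shape)
    rows = [np.hstack([coeffs[j + k * m] for k in range(n)]) for j in range(m)]
    return np.vstack(rows)

# Usage: 8 workers, recovery threshold 2 * 3 = 6, workers 1 and 4 straggle.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
B = rng.standard_normal((6, 6))
points = np.linspace(-1.0, 1.0, 8)
tasks = encode(A, B, 2, 3, points)
finished = {i: worker(t) for i, t in enumerate(tasks) if i not in (1, 4)}
product = decode(finished, points, 2, 3)
```

The decode step solves a real Vandermonde system, which becomes ill-conditioned as the number of workers grows; the sketch is only meant for small examples.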

    Universally Decodable Matrices for Distributed Matrix-Vector Multiplication

    Coded computation is an emerging research area that leverages concepts from erasure coding to mitigate the effect of stragglers (slow nodes) in distributed computation clusters, especially for matrix computation problems. In this work, we present a class of distributed matrix-vector multiplication schemes that are based on codes in the Rosenbloom-Tsfasman metric and universally decodable matrices. Our schemes take into account the inherent computation order within a worker node. In particular, they allow us to effectively leverage partial computations performed by stragglers (a feature that many prior works lack). An additional main contribution of our work is a companion matrix-based embedding of these codes that allows us to obtain sparse and numerically stable schemes for the problem at hand. Experimental results confirm the effectiveness of our techniques. Comment: 6 pages, 1 figure

    Distributed Matrix-Vector Multiplication: A Convolutional Coding Approach

    Distributed computing systems are well known to suffer from the problem of slow or failed nodes; these are referred to as stragglers. Straggler mitigation (for distributed matrix computations) has recently been investigated from the standpoint of erasure coding in several works. In this work we present a strategy for distributed matrix-vector multiplication based on convolutional coding. Our scheme can be decoded using a low-complexity peeling decoder. The recovery process enjoys excellent numerical stability as compared to Reed-Solomon coding based approaches (which exhibit significant problems owing to their badly conditioned decoding matrices). Finally, our schemes are better matched to the practically important case of sparse matrix-vector multiplication as compared to many previous schemes. Extensive simulation results corroborate our findings.

    Algebraic approaches for coded caching and distributed computing

    This dissertation examines the power of algebraic methods in two areas of modern interest: caching for large scale content distribution and straggler mitigation within distributed computation. Caching is a popular technique for facilitating large scale content delivery over the Internet. Traditionally, caching operates by storing popular content closer to the end users. Recent work within the domain of information theory demonstrates that allowing coding in the cache and coded transmission from the server (referred to as coded caching) to the end users can allow for significant reductions in the number of bits transmitted from the server to the end users. The first part of this dissertation examines problems within coded caching. The original formulation of the coded caching problem assumes that the server and the end users are connected via a single shared link. In Chapter 2, we consider a more general topology where there is a layer of relay nodes between the server and the users. We propose novel schemes for a class of such networks that satisfy a so-called resolvability property and demonstrate that the performance of our scheme is strictly better than previously proposed schemes. Moreover, the original coded caching scheme requires that each file hosted in the server be partitioned into a large number (i.e., the subpacketization level) of non-overlapping subfiles. From a practical perspective, this is problematic as it means that prior schemes are only applicable when the size of the files is extremely large. In Chapter 3, we propose a novel coded caching scheme that enjoys a significantly lower subpacketization level than prior schemes, while only suffering a marginal increase in the transmission rate. We demonstrate that several schemes with subpacketization levels that are exponentially smaller than the basic scheme can be obtained. The second half of this dissertation deals with large scale distributed matrix computations. 
Distributed matrix multiplication is an important problem, especially in domains such as deep learning of neural networks. It is well recognized that the computation times on distributed clusters are often dominated by the slowest workers (called stragglers). Recently, techniques from coding theory have found applications in straggler mitigation in the specific context of matrix-matrix and matrix-vector multiplication. The computation can be completed as long as a certain number of workers (called the recovery threshold) complete their assigned tasks. In Chapter 4, we consider matrix multiplication under the assumption that the absolute values of the matrix entries are sufficiently small. Under this condition, we present a method with a significantly smaller recovery threshold than prior work. Moreover, prior work suffers from serious numerical issues owing to the condition number of the corresponding real Vandermonde-structured recovery matrices; this condition number grows exponentially in the number of workers. In Chapter 5, we present a novel approach that leverages the properties of circulant permutation matrices and rotation matrices for coded matrix computation. Our scheme has an optimal recovery threshold, and we demonstrate an upper bound on the worst case condition number of our recovery matrices that grows polynomially in the number of workers.

    Numerically stable coded matrix computations via circulant and rotation matrix embeddings

    Several recent works have used coding-theoretic ideas for mitigating the effect of stragglers in distributed matrix computations (matrix-vector and matrix-matrix multiplication) over the reals. In particular, a polynomial code based approach distributes matrix-matrix multiplication among $n$ worker nodes by means of polynomial evaluations. This allows for an "optimal" recovery threshold whereby the intended result can be decoded as long as at least $(n - s)$ worker nodes complete their tasks; $s$ is the number of stragglers that the scheme can handle. However, a major issue with these approaches is the high condition number of the corresponding Vandermonde-structured recovery matrices. This presents serious numerical precision issues when decoding the desired result. It is well known that the condition number of real Vandermonde matrices grows exponentially in $n$. In contrast, the condition numbers of Vandermonde matrices with parameters on the unit circle are much better behaved. In this work we leverage the properties of circulant permutation matrices and rotation matrices to obtain coded computation schemes with significantly lower worst case condition numbers; these matrices have eigenvalues that lie on the unit circle. Our scheme is such that the associated recovery matrices have a condition number corresponding to Vandermonde matrices with parameters given by the eigenvalues of the corresponding circulant permutation and rotation matrices. We demonstrate an upper bound on the worst case condition number of these matrices which grows as $\approx O(n^{s+6})$. In essence, we leverage the well-behaved conditioning of complex Vandermonde matrices with parameters on the unit circle, while still working with computation over the reals. Experimental results demonstrate that our proposed method has condition numbers that are several orders of magnitude better than prior work.
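The contrast between real Vandermonde matrices and those with parameters on the unit circle can be checked numerically. The sketch below is only an illustration of that contrast, not the scheme or the bound from this abstract. The helper names and the choice of evaluation points (equally spaced reals in [-1, 1] versus the full set of roots of unity) are assumptions made for the example. The full set of roots of unity is the best case; the recovery matrices in a straggler setting use subsets of such points and are less well conditioned.

```python
import numpy as np

def vandermonde_condition(nodes):
    """2-norm condition number of the square Vandermonde matrix on `nodes`."""
    V = np.vander(nodes, increasing=True)
    return np.linalg.cond(V)

def real_nodes(n):
    """n equally spaced real points in [-1, 1] (an illustrative choice)."""
    return np.linspace(-1.0, 1.0, n)

def unit_circle_nodes(n):
    """The n-th roots of unity, which all lie on the unit circle."""
    return np.exp(2j * np.pi * np.arange(n) / n)

# Print the two condition numbers side by side for a few matrix sizes.
for n in (5, 10, 15, 20):
    print(n, vandermonde_condition(real_nodes(n)),
          vandermonde_condition(unit_circle_nodes(n)))
```

With all n-th roots of unity the Vandermonde matrix is a scaled discrete Fourier transform matrix, so its condition number stays near 1, while the real-node values grow rapidly with n.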