
    Numerically stable coded matrix computations via circulant and rotation matrix embeddings

    Several recent works have used coding-theoretic ideas for mitigating the effect of stragglers in distributed matrix computations (matrix-vector and matrix-matrix multiplication) over the reals. In particular, a polynomial code based approach distributes matrix-matrix multiplication among n worker nodes by means of polynomial evaluations. This allows for an "optimal" recovery threshold whereby the intended result can be decoded as long as at least (n − s) worker nodes complete their tasks; s is the number of stragglers that the scheme can handle. However, a major issue with these approaches is the high condition number of the corresponding Vandermonde-structured recovery matrices. This presents serious numerical precision issues when decoding the desired result. It is well known that the condition number of real Vandermonde matrices grows exponentially in n. In contrast, the condition numbers of Vandermonde matrices with parameters on the unit circle are much better behaved. In this work we leverage the properties of circulant permutation matrices and rotation matrices to obtain coded computation schemes with significantly lower worst-case condition numbers; these matrices have eigenvalues that lie on the unit circle. Our scheme is such that the associated recovery matrices have a condition number corresponding to Vandermonde matrices with parameters given by the eigenvalues of the corresponding circulant permutation and rotation matrices. We demonstrate an upper bound on the worst-case condition number of these matrices which grows as ≈ O(n^(s+6)). In essence, we leverage the well-behaved conditioning of complex Vandermonde matrices with parameters on the unit circle, while still working with computation over the reals. Experimental results demonstrate that our proposed method has condition numbers that are several orders of magnitude better than prior work.
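    The conditioning gap that motivates this work is easy to observe numerically. The sketch below (illustrative parameters, not the paper's construction) compares the condition number of a Vandermonde matrix with real, equally spaced evaluation points against one whose parameters are the n-th roots of unity on the unit circle; the latter is a scaled DFT matrix and is essentially perfectly conditioned.

```python
import numpy as np

n = 16  # number of worker nodes (illustrative choice)

# Vandermonde matrix with real, equally spaced evaluation points:
# its condition number grows exponentially in n.
real_pts = np.linspace(-1, 1, n)
V_real = np.vander(real_pts, n, increasing=True)

# Vandermonde matrix with parameters on the unit circle
# (the n-th roots of unity): this is a scaled DFT matrix,
# so its condition number is 1.
circle_pts = np.exp(2j * np.pi * np.arange(n) / n)
V_circle = np.vander(circle_pts, n, increasing=True)

print(f"real points:        cond = {np.linalg.cond(V_real):.3e}")
print(f"unit-circle points: cond = {np.linalg.cond(V_circle):.3e}")
```

    Already at n = 16 the real-parameter matrix is ill-conditioned by several orders of magnitude, which is why decoding with such recovery matrices loses precision; the circulant/rotation embeddings inherit the unit-circle conditioning while keeping the computation real.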

    Straggler-Resistant Distributed Matrix Computation via Coding Theory: Removing a Bottleneck in Large-Scale Data Processing

    The current Big Data era routinely requires the processing of large scale data on massive distributed computing clusters. Such large scale clusters often suffer from the problem of stragglers, which are defined as slow or failed nodes. The overall speed of a computational job on these clusters is typically dominated by stragglers in the absence of a sophisticated assignment of tasks to the worker nodes. In recent years, approaches based on coding theory (referred to as coded computation) have been effectively used for straggler mitigation. Coded computation offers significant benefits for specific classes of problems such as distributed matrix computations (which play a crucial role in several parts of the machine learning pipeline). The essential idea is to create redundant tasks so that the desired result can be recovered as long as a certain number of worker nodes complete their tasks. In this survey article, we overview recent developments in the field of coding for straggler-resilient distributed matrix computations.
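    The "redundant tasks" idea can be illustrated with the smallest possible example (a minimal sketch, not taken from the survey): split a matrix-vector product across two workers and add a third, coded worker, so that any two of the three responses suffice to recover the full result.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

# Split A row-wise into two blocks and add one redundant (coded) task.
A1, A2 = A[:2], A[2:]
tasks = {"w1": A1, "w2": A2, "w3": A1 + A2}

# Each worker computes its block times x; suppose worker w2 straggles.
results = {w: Ai @ x for w, Ai in tasks.items() if w != "w2"}

# Recover the missing block from the coded response:
# (A1 + A2) x - A1 x = A2 x.
y1 = results["w1"]
y2 = results["w3"] - y1
y = np.concatenate([y1, y2])

assert np.allclose(y, A @ x)
```

    Here three workers carry the load of two, and any single straggler can be tolerated; the schemes surveyed generalize this to many blocks and stragglers with more sophisticated codes.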

    Codes for Distributed Machine Learning

    The problem considered is that of distributing the machine learning operations of matrix multiplication and multivariate polynomial evaluation among computing nodes, a.k.a. worker nodes, some of which do not return their outputs or return erroneous outputs. The thesis can be divided into three major parts. In the first part of the thesis, a fault tolerant setup where t worker nodes return erroneous values is considered. For an additive random Gaussian error model, it is shown that for all t < N − K, errors can be corrected with probability 1 for polynomial codes. In the second part of the thesis, a class of codes called random Khatri-Rao-Product (RKRP) codes for distributed matrix multiplication in the presence of stragglers is proposed. The main advantage of the proposed codes is that decoding of RKRP codes is highly numerically stable in comparison to decoding of Polynomial codes [67] and decoding of the recently proposed OrthoPoly codes [18]. It is shown that RKRP codes are maximum distance separable with probability 1. In the third part of the thesis, the problem of distributed multivariate polynomial evaluation (DPME) is considered, for which Lagrange Coded Computing (LCC) [66] was proposed as a coded computation scheme to provide resilience against stragglers. A variant of the LCC scheme, termed Product Lagrange Coded Computing (PLCC), is proposed by combining ideas from classical product codes and LCC. The main advantage of PLCC is that it is more numerically stable than LCC.
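    The polynomial codes that both this thesis and the works above build on encode the data blocks as coefficients of a polynomial, hand each worker one evaluation, and decode by interpolation from any K responses. A minimal matrix-vector sketch with illustrative parameters (K = 2 blocks, N = 4 workers, real evaluation points chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 2, 4                        # K data blocks, N workers; tolerates N - K stragglers
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
A0, A1 = A[:2], A[2:]

# Encode: worker i receives the evaluation A(z_i) = A0 + A1 * z_i
# and computes A(z_i) @ x.
zs = np.array([1.0, 2.0, 3.0, 4.0])
outputs = {i: (A0 + A1 * z) @ x for i, z in enumerate(zs)}

# Decode from any K = 2 responses: interpolate the degree-(K-1)
# polynomial p(z) = (A0 x) + (A1 x) z from its values at two points.
i, j = 0, 3                        # pretend the middle two workers straggled
zi, zj = zs[i], zs[j]
A1x = (outputs[j] - outputs[i]) / (zj - zi)
A0x = outputs[i] - A1x * zi
y = np.concatenate([A0x, A1x])

assert np.allclose(y, A @ x)
```

    The decoding step is exactly a Vandermonde solve in the evaluation points z_i, which is where the numerical stability concerns (and the motivation for RKRP, OrthoPoly, and the unit-circle constructions above) come from.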
