Distributed matrix multiplication is widely used in several scientific
domains. It is well recognized that computation times on distributed clusters
are often dominated by the slowest workers (called stragglers). Recent work has
demonstrated that straggler mitigation can be viewed as a problem of designing
erasure codes. For matrices $A$ and $B$, the technique essentially maps the
computation of $A^T B$ into the
multiplication of smaller (coded) submatrices. The stragglers are treated as
erasures in this process. The computation can be completed as long as a certain
number of workers (called the recovery threshold) complete their assigned
tasks.
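
To make the erasure-code view concrete, here is a minimal sketch of a polynomial-code-style encoding, a known technique from prior work and not the strategy proposed in this paper. The block counts `m` and `n`, the integer evaluation points, and all function names are assumptions chosen for illustration. $A$ and $B$ are split column-wise into `m` and `n` blocks, each worker multiplies one coded pair, and any `m * n` finished workers suffice to recover $A^T B$ by interpolation.

```python
from fractions import Fraction


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(X, Y):
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def split_columns(M, parts):
    # Split M into `parts` column blocks of equal width (width must divide evenly).
    width = len(M[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in M] for p in range(parts)]


def encode(blocks, x, stride):
    # Coded block: sum_i blocks[i] * x**(i * stride).
    rows, cols = len(blocks[0]), len(blocks[0][0])
    return [[sum(blk[r][c] * x ** (i * stride) for i, blk in enumerate(blocks))
             for c in range(cols)] for r in range(rows)]


def worker_task(A_blocks, B_blocks, x, m):
    # One worker's job: multiply a single pair of coded submatrices.
    # The product equals sum_{i,j} A_i^T B_j * x**(i + j*m).
    A_coded = encode(A_blocks, x, 1)
    B_coded = encode(B_blocks, x, m)
    return matmul(transpose(A_coded), B_coded)


def invert(V):
    # Exact Gauss-Jordan inversion over the rationals.
    k = len(V)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(k)]
           for i, row in enumerate(V)]
    for col in range(k):
        pivot = next(r for r in range(col, k) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def decode(points, results, m, n):
    # Interpolate the degree-(m*n - 1) matrix polynomial from m*n evaluations.
    k = m * n
    Vinv = invert([[x ** d for d in range(k)] for x in points])
    rows, cols = len(results[0]), len(results[0][0])
    return [[[sum(Vinv[d][w] * results[w][r][c] for w in range(k))
              for c in range(cols)] for r in range(rows)] for d in range(k)]


def coded_atb(A, B, m, n, finished):
    # `finished` lists the evaluation points of workers that returned in time;
    # all other workers are treated as erasures.
    A_blocks, B_blocks = split_columns(A, m), split_columns(B, n)
    results = {x: worker_task(A_blocks, B_blocks, x, m) for x in finished}
    if len(results) < m * n:
        raise ValueError("fewer finished workers than the recovery threshold")
    points = sorted(results)[:m * n]
    coeffs = decode(points, [results[x] for x in points], m, n)
    out = []
    for i in range(m):  # coefficient i + j*m holds the block A_i^T B_j
        block_row = [coeffs[i + j * m] for j in range(n)]
        for r in range(len(block_row[0])):
            out.append([v for blk in block_row for v in blk[r]])
    return out
```

With `m = n = 2` and six workers at evaluation points 1 through 6, calling `coded_atb(A, B, 2, 2, [1, 3, 4, 6])` reconstructs the product even though the workers at points 2 and 5 never report. The recovery threshold of this sketch is `m * n = 4`. Exact rational arithmetic is used here only to keep the example free of rounding issues.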
We present a novel coding strategy for this problem when the absolute values
of the matrix entries are sufficiently small. We demonstrate a tradeoff between
the assumed absolute value bounds on the matrix entries and the recovery
threshold. At one extreme, our scheme is optimal with respect to the recovery
threshold; at the other extreme, it matches the threshold of prior work.
Experimental results on cloud-based clusters validate the benefits of our
method.