Fault-Tolerant Strassen-Like Matrix Multiplication
In this study, we propose a simple method for fault-tolerant Strassen-like
matrix multiplication. The proposed method uses two distinct Strassen-like
algorithms instead of replicating a single one. We observe that using two
different algorithms gives rise to new check relations, which result in
more local computations. These local computations are found via a computer-
aided search. To improve performance, two special parity (extra) sub-matrix
multiplications (PSMMs) are generated, at the expense of increased
communication/computation cost of the system. Our preliminary results
demonstrate that the proposed method outperforms a Strassen-like algorithm
with two copies and achieves performance very close to the three-copy
version using only 2 PSMMs, reducing the total number of compute nodes by
around 24%, i.e., from 21 to 16.
Comment: 6 pages, 2 figures
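The two-algorithm idea can be illustrated, in a simplified single-machine form, by computing the same product with Strassen's original recurrence and Winograd's variant (both use 7 block multiplications) and treating their agreement as a fault check. This is only a sketch of the general principle; it omits the paper's distributed setting, the derived check relations, and the PSMMs:

```python
import numpy as np

def split(M):
    h = M.shape[0] // 2
    return M[:h, :h], M[:h, h:], M[h:, :h], M[h:, h:]

def strassen(A, B):
    """One level of Strassen's algorithm: 7 block products."""
    A11, A12, A21, A22 = split(A); B11, B12, B21, B22 = split(B)
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

def winograd(A, B):
    """One level of Winograd's variant of Strassen: also 7 block products."""
    A11, A12, A21, A22 = split(A); B11, B12, B21, B22 = split(B)
    S1 = A21 + A22; S2 = S1 - A11; S3 = A11 - A21; S4 = A12 - S2
    T1 = B12 - B11; T2 = B22 - T1; T3 = B22 - B12; T4 = T2 - B21
    M1 = S2 @ T2; M2 = A11 @ B11; M3 = A12 @ B21
    M4 = S3 @ T3; M5 = S1 @ T1; M6 = S4 @ B22; M7 = A22 @ T4
    U1 = M1 + M2; U2 = U1 + M4
    return np.block([[M2 + M3, U1 + M5 + M6],
                     [U2 - M7, U2 + M5]])

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); B = rng.standard_normal((4, 4))
# Agreement of two independent algorithms serves as a simple fault check.
assert np.allclose(strassen(A, B), winograd(A, B))
assert np.allclose(strassen(A, B), A @ B)
```

Because the two algorithms share no intermediate products, a fault injected into any single block multiplication of one of them breaks the agreement check.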
Exploitation of Stragglers in Coded Computation
In cloud computing systems, slow processing nodes, often referred to as
"stragglers", can significantly extend the computation time. Recent results
have shown that error-correction coding can be used to reduce the effect of
stragglers. In this work we introduce a scheme that, in addition to using
error correction to distribute jobs across nodes, is also able to exploit
the work completed by all nodes, including stragglers. We first consider
vector-matrix multiplication and apply maximum distance separable (MDS) codes
to small blocks of sub-matrices. The worker nodes process these blocks
sequentially, transmitting partial per-block results to the master as they
are completed. Sub-blocking makes the completion process more continuous,
which allows us to exploit the work of a much broader spectrum of
processors and reduces computation time. We then apply this technique to
matrix-matrix multiplication using a product code. In this case, we show that the
order of computing sub-tasks is a new degree of design freedom that can be
exploited to reduce computation time further. We propose a novel approach to
analyze the finishing time, which is different from typical order statistics.
Simulation results show that the expected computation time decreases by a
factor of at least two compared to previous methods.
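The MDS-coded vector-matrix scheme with sub-blocking can be sketched in a single-process simulation. The example below uses a simple (3, 2) MDS code (parity block = sum of the two data blocks), splits each worker's block into sub-blocks processed in order, and decodes each sub-block from the first two workers that deliver it; the straggler model and code parameters here are illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))
x = rng.standard_normal(4)

# (3, 2) MDS code over row blocks: the parity block is A1 + A2, so any
# 2 of the 3 worker results suffice to recover A @ x.
A1, A2 = A[:4], A[4:]
blocks = [A1, A2, A1 + A2]

SUB = 2  # rows per sub-block; each worker holds 2 sub-blocks

def worker(block, completed):
    """Process sub-blocks sequentially; return only those finished
    before the (simulated) deadline."""
    return {j: block[j*SUB:(j+1)*SUB] @ x for j in range(completed)}

# Simulated straggling: worker 1 finishes only 1 of its 2 sub-blocks,
# yet its partial work is still used below.
results = [worker(blocks[0], 2), worker(blocks[1], 1), worker(blocks[2], 2)]

y = np.empty(8)
for j in range(2):                       # decode each sub-block separately
    have = [w for w in range(3) if j in results[w]]
    r = {w: results[w][j] for w in have[:2]}   # any two suffice (MDS)
    if 0 in r and 1 in r:
        y1, y2 = r[0], r[1]
    elif 0 in r:
        y1 = r[0]; y2 = r[2] - y1        # parity minus systematic part
    else:
        y2 = r[1]; y1 = r[2] - y2
    y[j*SUB:(j+1)*SUB] = y1
    y[4 + j*SUB:4 + (j+1)*SUB] = y2

assert np.allclose(y, A @ x)
```

Decoding per sub-block, rather than per worker, is what lets the straggler's single finished sub-block contribute to the final result.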
Hierarchical Coded Computation
Coded computation, a method to mitigate "stragglers" in distributed
computing systems through the use of error-correction coding, has lately
received significant attention. First applied to vector-matrix
multiplication, its range of application was later extended to matrix-matrix
multiplication, heterogeneous networks, convolution, and approximate computing.
A drawback of previous results is that they completely ignore the work
completed by stragglers. While stragglers are slower compute nodes, in many
settings the amount of work they complete can be non-negligible. Thus, in this
work, we propose a hierarchical coded computation method that exploits the work
completed by all compute nodes. We partition each node's computation into
layers of sub-computations such that each layer can be treated as a
(distinct) erasure channel. We then design a different erasure code for each
layer so that all layers have the same failure exponent. We propose design guidelines to
optimize the parameters of such codes. Numerical results show that the
proposed scheme improves the expected finishing time by a factor of 1.5
compared to previous work.
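The layered construction can be sketched as follows. Each worker processes its layers in order, so later layers are completed by fewer workers and need lower-rate codes; the sketch encodes each layer with a real-valued Vandermonde generator (any k of its n rows are invertible, giving an (n, k) MDS code over the reals) and decodes by solving a linear system. The layer sizes, code rates, and survivor model are illustrative assumptions; the paper derives the rates from a failure-exponent criterion:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                   # number of workers
A = rng.standard_normal((12, 5))
x = rng.standard_normal(5)

def vandermonde(n, k):
    """Real Vandermonde generator: any k of the n rows are invertible."""
    return np.vander(1.0 + np.arange(n), k, increasing=True)

def encode(block, k, G):
    parts = np.split(block, k)          # k data sub-blocks
    return [sum(G[i, j] * parts[j] for j in range(k)) for i in range(n)]

def decode(G, survivors, results):
    Gs = G[survivors]                   # k x k, invertible by MDS property
    coeffs = np.linalg.solve(Gs, np.stack(results))
    return coeffs.reshape(-1)           # recovered block @ x, in row order

# Layer 1 is reached by most workers -> high-rate (4, 3) code.
# Layer 2 is reached only by fast workers -> low-rate (4, 2) code.
# (Rates chosen for illustration, not via the paper's optimization.)
layers = [(A[:6], 3), (A[6:], 2)]
y = []
for block, k in layers:
    G = vandermonde(n, k)
    coded = encode(block, k, G)
    done = sorted(rng.choice(n, size=k, replace=False))  # surviving workers
    results = [coded[i] @ x for i in done]
    y.append(decode(G, done, results))
y = np.concatenate(y)
assert np.allclose(y, A @ x)
```

Treating each layer as its own erasure channel means each layer's code only has to tolerate the erasure pattern of that layer, which is exactly why deeper (less often reached) layers are given more redundancy.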