Abstract-This study presents a novel coded computation technique for parallel matrix-matrix product computation using hierarchical compute architectures that outperforms well known previous strategies in terms of total end-to-end execution time. The proposed method uses array codes to achieve this performance by distributing the encoding operation over the cluster (slave) nodes at the expense of increased master-slave communication. The matrix multiplication is performed using MDS array Belief Propagation (BP)-decodable codes based on pure XOR operations. The proposed scheme is shown to be configurable and suited for modern hierarchical compute architectures equipped with multiple nodes, each having multiple, independent and less capable processing units. In addition, to address scaling number of strugglers, asymptotic versions of the code is used and latency analysis is conducted. We shall demonstrate that the proposed scheme achieves order-optimal computation in both the sublinear as well as the linear regime in the size of the computed product from an end-to-end delay perspective while ensuring acceptable communication requirements that can be addressed by today's high speed network link infrastructures.
I. INTRODUCTION
As the data science finds its way for ubiquitous use in our most critical business strategies, our communities, social relationships and eventually in our daily lives, the amount of data and computation have demonstrated an unprecedented growth. The natural consequence of this is to judiciously distribute data and computation over large clusters of smaller and less capable processing units. Due to slow/laggy compute nodes however, known as straggles in literature, the main objective of distributed computing might be compromised i.e., total end-toend latency is severely impacted by the slowest workers in the network which may render distributed computation ineffective.
A. Previous Work
The primary study that tackles the stragglers is based on a parameter server concept in which machine learning algorithms are shown to scale on a distributed network [1] with the presence of faulty system behaviors. Later, the alternative idea of coded computation is proposed to provide computation redundancy in robust system design against the stragglers. More specifically, in order to economically use the compute infrastructure, a coded computation framework based on a family of Maximum Distance Separable (MDS) codes is proposed in [2] . Similarly, the same idea is exhaustively exercised in various computing tasks including large matrix multiplications, gradient computing [3] , convolutions [4] and Fourier transforms [5] . Despite the proposed scheme and associated coding considers large scale matrix multiplication as an example, it can be extended to other types of computation tasks that have matrix multiplication (dot products) at the core such as signal transformations and multi-class classifications.
B. Motivation and Contributions
The motivation for this study is two fold. First, most existing works have focused on small to moderate size matrix multiplication operation and thereby improving the worker node task runtime while the encoding/decoding times at the master node are assumed to be negligible. However, although the master node workload may be acceptable for small-scale (few tens of nodes) networks and moderately sized matrices, it shall be extremely prohibitive for large-scale (over thousands of nodes) compute tasks. Hence, a lower total execution time must be the real optimization criterion whereby the encoding and decoding processes need to be low complexity and parallelizable. While achieving better total execution time, the system should not lose recovery threshold performance due to stragglers. In our study, we address this issue by distributing the encoding operation over the compute cluster and allowing belief propagation (BP) -a.k.a. peeling decoder to resolve the actual matrix multiplication operation at the master. Secondly, modern compute nodes are equipped with multiple typically equal-quality cores such as Central Processing Unit (CPU) instances possessing over hundreds of physical cores or Graphical Processing Unit (GPU) instances with thousands of CUDA cores. By considering each core as standalone network processor, and the distinctive cost nature of communications between these cores and with other network nodes, it is intuitive that coding can be exploited for optimal utilization of the underlying infrastructure. Finally, the proposed scheme can also be considered as complementary coding strategy to previous strategies such as d-dimensional product code schemes [6] in which component codes can be constructed using MDS array BP-XOR codes [7] .
One of the main differentiations of this study is that it focuses on end-to-end user delay rather than the pure parallel task time and clearly demonstrates advantages/disadvantages of the proposed coded computation with scaling clusters both for sublinear as well as linear regime in the size of the product at hand. We recognize the fixed number of struggler proofing using the original array BP-XOR code constructions and further proposed to use asymptotical versions to address scaling struggler cases. In addition, we assumed a hierarchically clustered compute architecture with varying degrees of parallelism which aligns with the realistic infrastructures built today with Google and Amazon virtualized environments. Additionally, we consider generic matrix sizes instead of square matrices in this work.
II. CODED COMPUTE CLUSTER MODEL
Let us suppose that we multiply two large matrices A B where A ∈ F s×k and B ∈ F s×b where F denotes any field. For a generic matrix X, we use x i,: to denote the i-th row and x :,j the j-th column for the rest of the paper. Thus, computing A B amounts to kb dot products (not necessarily in F 2 ) which will be performed by n ≥ k nodes each equipped with M ≥ b processors in which only b processors are assigned to task execution by the local scheduler. Note that if M < b, then we can group rows of A and columns of B in such a way to equally distribute computation on the available processors in the network. In other words, for any m > 0, m < M satisfying m|b, b|k, we can multiply fractions of matrices where each processor unit performs small matrix multiplications of sizes mk/b × s and s × m where s is typically assumed to be large.
The summary of the coded compute cluster architecture is illustrated in Fig. 1 . The matrices are provided as inputs to the Generator unit (a.k.a. the master node) where the encoding operation typically takes place on p processors. However, the encode operation can alternatively be distributed over the cluster nodes if the associated coding structure allows so 1 . Encoded rows/columns of the matrices are communicated with the nodes of the cluster in which a total of bn processors compute the matrix multiplication together. We assume the cluster processors are slower than master's by a factor of c. Finally, the Sink unit collects a subset of processor outputs to initiate decoding i.e., putting together the final product in place. Generator and Sink are not necessarily two physical nodes as drawn in this figure. Rather, they are abstracted units which may reside in the same physical (master) node. By construction, having all processor outputs of any k out of n nodes will suffice to reconstruct A B. In the case of MDS array BP-XOR codes, alleviating the constraint of using all b processor outputs of each node will allow many more reconstruction possibilities which is not considered in this preliminary work.
As shown in Fig. 1 , the total execution time in our clustered distributed setting is given by the sum of encoding time T e , master-slave transmission time T ms , the overall slave task time where each slave completes its execution with time T i , slavemaster transmission time T sm and finally the decoding time at the master denoted by T d . Hence, we can express the overall latency similar to [6] as
where S is the set of all minimal decodeable subsets of 1, 2, . . . , nb processors. Most of the past work focus on minimizing the parallel task time while paying little if no attention to encode/decode times. Plus, some of the previous works favor parallel task execution times at the expense of consuming more bandwidth and master system CPU resources [8] . However, as the system size (as well as the matrix sizes) scales, the encode/decode times of the master node will become the main bottleneck of the overall system performance. In our work, T will represent the sum of computation times, i.e., T = T e + min S∈S max i∈S T i + T d which will constitute the main focus of this paper.
A. Array-coded Matrix Multiplication
For an [n, k, b, σ] MDS array BP-XOR code with the degree of each encoding symbol being less than or equal to σ, we encode kb information symbols in to a n-by-b array of coded symbols. The rate of the code is defined to be r = k/n. Just like product codes, we encode computation in two dimensions however with the exception that the encoding is not only in vertical and horizontal directions, both could be in any direction possible, which shall provide more flexibility between the distribution of encoding and the use of bandwidth. In addition, every computation task of the product-coded scheme involves only single dot product whereas our scheme performs a maximum of σ ≥ 1 dot products per processor yielding bσ dot products per node in the worst case. However in the product-coded scheme, master node performs heavy and unbalanced encoding operations which will be shown to be a bottleneck from an overall latency point of view. With the array BP-XOR codes, encoding operation is also distributed over the clustered computation network at the expense of increased bandwidth consumption between master node and the processors of the cluster. On the other hand, distributing the decoding task among the network nodes is still an open research topic.
B. Example: [5, 2, 2, 2] MDS array BP-XOR code
Let us suppose we would like to compute (with b = m = 2)
by using [5, 2, 2, 2] MDS array BP-XOR code given in Table  I with the designated computation distribution among n = 5 nodes each equipped with two processors. Note that as soon as any two out of five nodes complete their processing, master node will initiate a belief propagation decoding to put together A B. Although in this case 4 processor outputs are sufficient to reconstruct the result, we have the constraint to wait all processors to finish their execution for the given node. On the other hand, the recovery threshold of this code can be shown to be 7, i.e., any 7 processor executions will be sufficient to reconstruct the result in the worst case.
For comparison purposes, let us also give an example for a polynomial code [9] with similar parameters i.e., m = 2 and k = 4. Assume a j = a 1,: + ja 2,: and b j = b :,1 + j 2 b :,2 , then the jth processor shall compute the dot product a j b j as shown in Table II . Note that the distribution of the tasks can be done in any order as the computation of each task has the same complexity. Unlike MDS array BP-XOR codes, node processors in this case computes a single dot product given that the master first encodes and generates a j s and b j s. The recovery threshold of polynomial codes is given by m 2 = 4 achieving the minimum possible.
C. Limitations on the Code Rate and Asymptotic Extensions
Unfortunately, for a given (k, b, σ) triple, there is an upper bound on the achievable blocklength n. Let us recall a theorem from [7] that for a general [n, k, b, σ] MDS array BP-XOR code such that the number of dot products satisfying, σ < k + (k − 1)/(b − 1), the blocklength of the code is bounded above by
which for a fixed σ asymptotically (k → ∞) makes the code rate (k/n) tend to 1. Note that fixing σ will help control the complexity of encoding/decoding and as we shall see in the next subsection the overall end-to-end latency. This constraint is later recognized and relaxed by the asymptotically MDS array BP-XOR code constructions in [10] , allowing better flexibility in terms of choosing the right code rate for the given coded computation system (i.e., scaling number of stragglers) at the expense of using slightly more processor work per node. More formally, for a given positive number b satisfying
a is a linear code with i-th column (y i,1 , . . . , y i,bi ) = (x 1 , . . . , x bk )G i for a bk × b i generator matrix G i , i ∈ {1, . . . , n} such that b = (1/n) i b i . Therefore, the generator matrix for C a is given by the following matrix of size
With this code, it is possible to perfectly reconstruct user data matrix from any k column combinations of C a using BP decoding and as b → ∞ we have b → b. Note that the code C a is not in two dimensional standard rectangle form. However, we introduced parameter b to be able to make it analogous to standard MDS array code representations through regular binary matrices.
For a given fixed code rate r = k/n and n, let us define (b, n) to be the maximum coding overhead 2 of C a satisfying b = (1 + (b, n) )b. The asymptotically optimal overhead property implies that as b → ∞ we have (b, n) → 0. The following theorem refines the upper bound on the code block length for asymptotically MDS array BP-XOR Codes. 
where σ = σ(1 + (b, n)) and (b, n) is the coding overhead.
Note that if b → ∞ we will have σ → σ and hence equation (4) becomes identical to equation (2) 
III. LATENCY ANALYSIS
In this section we shall primarily focus on the total computational latency of the proposed coded system. Then, we will shortly touch on communication cost and compare it with other well known schemes.
A. Computation Latency Analysis
Similar to the past studies, our time analysis also focuses on exponential task time for each processor. More specifically, we choose the most basic operation to be the "long dot product" operation (for large s) in our system, distributed exponentially with parameter µ i.e., having the probability density function f (t) = µe −µt and the cumulative distribution function (cdf) F (t) = 1 − e −µt . If i th processor of a cluster node performs σ dot products, then its cdf will be F (t/σ), a scaled version of the original distribution [2] . Also, the processing power of each p master node processors is c times greater than that of the compute cluster which makes the master processor computation rate to be cµ. The parameter c is referred as compute factor for the rest of our discussion. To minimize the workload of processors and maximize the parallelization, we assume m ≥ b and b < k for the rest of this section.
For a given group of b processors, the l th order statistics of (T 1 , . . . In the uncoded case, the average latency characterization is straightforward. Since there is no encode/decode operation at the master, all it takes to compute the product is to distribute kb dot products over kb processors and collect the result for a successful merge. In that case, the slowest processor output will determine the expected latency for the overall product computation i.e., E[T uncoded ] ≈ (1/µ) log(kb) for large k. The following theorem characterizes the asymptotical computation time of both encoding/decoding and parallel task completion for single dimensional [nb, kb] MDS polynomial codes scattered across the compute cluster. [9] to distribute computation over nb processors, the asymptotic latency (k → ∞) is given by
where c is the compute factor of the system.
Proof. Note that in general a j s and b j s (in our previous example) contain (b − 1) dot products each. The encoder performs these dot products for each processor except one (total of nb − 1), giving us a total of 2(b − 1)(nb − 1) ≈ 2nb 2 dot products executed in sequential manner. Decoding is based on the interpolation of a polynomial of degree kb − 1 and the best known algorithm to solve this is on the order kb log 2 (kb) operations [13] . Although operations might be more than a dot product, we estimate the complexity that way to target the best scenario, giving the competitors the best chance of winning. Note also that the master processing is exponentially distributed with parameter cµ and there are p processors in action bringing up the log(p) factor in the expression. On the other hand, the parallel executions perform only single dot product and any kb executions will suffice to recover the multiplication result, amounting to an expect delay of ≈ 1 µ log( n n−k ). Adding the expected encoding/decoding time and the parallel task time, the result follows.
Although polynomial codes provide order optimal parallel task time, the encode/decode time shall be the bottleneck for the overall performance if c (and it practically) does not scale with the increasing matrix sizes. Later studies have shown that MatDot codes can further improve the recovery threshold at the expense of worse T sm performance [8] . The following theorem characterizes its overall computation performance. 
where c is the compute factor of the system. Proof. We realize that the encoding of MatDot codes is very similar to polynomial codes resulting in the same order number of dot products 2nb 2 executed in sequential manner at the master. Similarly, decoding is based on polynomial interpolation but unlike polynomial codes, it suffices to collect k + b − 1 successful processor outputs to reconstruct the multiplication result which reduces the decoder complexity. Hence for kb elements, we have on the order of kb(k + b) log
All encode/decode operations are performed by p processors in parallel and hence the log(p) factor. On the other hand, the parallel executions perform only single dot product and any k + b − 1 executions will suffice to recover the the original result (a polynomial of degree k + b − 2), amounting to an expected delay of ≈ 1 µ log( nb nb−k−b+1 ). Adding the expected encoding/decoding and the parallel task time, the result follows.
We realize that even though the parallel task execution time performance of MatDot codes is better compared to polynomial codes, its total computation time is worse with scaling k. In addition, as we shall see later, it has worse communication cost T ms as well to help reduce T s m. In both computation schemes however, the encode/decode times seem to be the main bottleneck, especially for small c and p.
Next, we provide the latency performance of MDS array BP-XOR codes (AMDS) in the sublinear regime in the size of the product which distributes the encoding operation over the cluster nodes in order to achieve better end-to-end latency performance. 
Proof. In the proposed scheme, encoding does not involve any dot products at the master but it uses more bandwidth for (symbol) communication. Let us first consider the parallel task time in which we need to consider the distribution of order statistics rather than expectation. We initially consider a single network node, then we extend our analysis to expected order statistics for multiple nodes. Unlike expectation, distribution of order statistics is more challenging. For simplicity, we consider asymptotics and recognize that the j th order statistics of T i s i.e., T j:z (for z workers or processors) converges in distribution to a Gaussian if j/z → 1
We recall ith expected order statistics from [11] and through some algebraic manipulations, we can reach at
where α = 0.375. For b processors, we need to wait until all processors complete their job. Also, we need k nodes to finish their task before reconstruction. Thus, we consider T b:b as the limiting case which leads to
We note that the cumulative distribution function of the standard normal Φ(x) can be written as an infinite sum and can be approximated (for x → ∞)
from which it will follow
Let us define x := −2 log(1 − y) where y → 1 − which shall make x → ∞. By replacing x in equation (12), this would imply that we have
Next, we note that y = k−α n−2α+1 → 1 − as k → ∞ and can use equation (13) to approximate
Similarly, the denominator in (9) can be expressed as bf σ µ log (b) = µb 1−σ . Finally, the decoding performs ≈ σkb dot products sequentially in the worst case, leading to an average delay of σkb cµ . However, if we perform these operations in p parallel processors, and we need to wait for all operations to finish, we would obtain a delay of σkb cpµ log(p). Adding the expected decoding time and the parallel task time, the result follows.
We finally note that implication of σ being constant is that the number of backup nodes is sublinear in k, i.e., o(k). However, number of workers, i.e., processors can still grow by increasing the parameter b.
Needless to point out that in all of these theorems, we assumed relatively large p which aligns with the assumption that master nodes are typically more capable.
Next, we provide the latency performance of MDS array BP-XOR codes (AMDS) in linear regime in the size of the product i.e., Θ(k). In other words, the number of stragglers increase linearly with k where we assume (1 + δ)k total number nodes to compute the matrix multiplication for a fixed δ > 0. Note that classical MDS array BP-XOR codes cannot achieve δ > 0 as k → ∞. Hence, the following theorem is applicable to asymptotical version only [10] . Theorem 3.4. Let us use [n, k, b, b , σ ] Asymptotically MDS array BP-XOR code used over n = (1 + δ)k nodes for some δ > 0. Let also each node to be equipped with b i processors for 1 ≤ i ≤ n, doing σ i dot products at most for a fixed integer σ . If we define x i = σ i log(b i ) and x i1 and x in the to be minimum and maximum of x i s, respectively, then the asymptotic latency (k → ∞) can be upper bounded by
where
Proof. Similar to non-asymptotical version, let us first consider the parallel task time where we shall assume Gaussian approximation for the distribution of order statistics. We initially consider a single network node, then we extend our analysis to expected order statistics for multiple nodes. Based on intermediate order statistics i.e., j/b i → 1 and using the same notation for the total latency for the ith node Y i , we shall have
which shall model the delay for the ith cluster node with b i > 1 processors, each executing at most σ i dot products. For perfect reconstruction, we need any k nodes completing their assigned task to be able to reconstruct A B. Through some algebra, the kth expected order statistics can be found to be of the form
where α = 0.375,
where these results easily follow using the normal analysis for order statistics in [11] and for general distributions, the upper bound on the expected value of the kth order statistics as given in [12] . Note that setting b i = b (and hence σ i = σ) for all i satisfying 1 ≤ i ≤ n will lead to (9) . Let x i = σ i log(b i ) and x i1 ≤ x i2 ≤ · · · ≤ x in be the ordered set with x a = 0.5(x i1 +x in ). Based on this definition, We realize that we can bound the square term in equation (19) as follows
where |x i − x a | is at most (x in − x i1 )/2. This leads to
On the other hand, using Jensen's inequality we can upper bound µ b given by the sequence of inequalities
where σ max = max{σ i } and σ min = min{σ i }. Similar to the approximation given in (14) for large k, we can express
Finally, using the bound derived earlier for µ b and σ b , we can rewrite
On the other hand, the decoding performs ≈ σkb dot products sequentially in the worst case, leading to an average delay of σkb cµ . However, if we perform these operations in p parallel processors, and since we would need to wait for all operations to finish, the latency would be σkb cpµ log(p). Adding the expected decoding time and the parallel task time, the result follows.
We note that with b i = b and σ i = σ, this general result will be identical to the result of Theorem 3.4. We also notice that asymptotical version has paralel task time of O( √ k) whereas original version has O( log(k 2 )). This means that although the asyptotical version allows us better flexibility in choosing the number of strugglers in the network, due to unbalanced computation allocation among network nodes, its parallel execution becomes worse. However, the overall execution time is still linear in k, achiving the order optimal computation time from an end-to-end perspective. 
B. Communication Costs
In polynomial codes, after encoding, the generator communicate s symbols for each matrix to each processor, totaling up to 2snb symbols (compare this to uncoded case where a total of 2skb symbols are communicated). The sink collects kb symbols to initiate successful decoding. While using MatDot codes, the generator communicate the same amount of information with the processors i.e., 2snb. However, the decoder has to receive k + b − 1 processor output each is of size k × b. Hence the total number of symbols communicated with the sink for successful decoding is kb(k + b − 1).
Using MDS array BP-XOR codes, the generator will have to communicate 2σs for each processor. Using a total of nb processor, generator communicates 2σsnb symbols. The sink on the other hand has to collect at least kb symbols to initiate the linear-time iterative decoding. Note that in case of asymptotically MDS array BP-XOR codes, numerical quantification of the communication cost requires an explicit formula (and hence explicit code construction) [10] .
IV. NUMERICAL RESULTS
In this section, we provide numerical comparison of the expected end-to-end computational latency of the aforementioned coded schemes. We employ a Monte Carlo simulation to assess the average end-to-end computation time. Since we assume scaling clusters in our study, we assume k tends to large values. The full list of simulated parameters are shown in Table 1 . We have also set
to make sure that an appropriate selection of AMDS code can be made. We plot the expected latency E[T ] as a function of growing k as illustrated in Fig. 2 . In the uncoded case, master does not perform any computations and the matrix multiplication is distributed over kb processors. Due to stragglers, its performance is worse than polynomial codes and AMDS. Another observation is that since encoding/decoding requirements is escalating as k → ∞, the latency performances get worse. However, AMDS codes demonstrate an order of magnitude better latency performance compared to polynomial codes thanks to its suitable structure that allows low complexity decoding and distributed encoding. Although MatDot codes ensures better recovery threshold, the decoding complexity makes its end-to-end latency performance bad. In Fig. 3 , we have fixed k = 1000 and used all values given in Table III except p and c. We have set p = c and varied both between 2 and 100. Hence, as we go from left to right along the abscissa, master's parallel computation capability increases. Note that the uncoded scheme is not effected by master's computation capabilities as it does not call for any encoding/decoding. Also, fixing k implies fixing n = 1006 meaning that the code rate ≈ 1. For a typical choice of p = c = 10, this implies that using only few extra computations, MDS array BP-XOR scheme provides almost five times (20000 versus 4108) better computation time compared to uncoded scheme.
V. CONCLUSION AND FUTURE WORK
A novel coded matrix product scheme is presented under a realistic hierarchical compute cluster model using MDS array BP-XOR codes. The proposed scheme has few novelties: (1) it allows the computation of encoding to be distributed over the cluster nodes at the expense of increased communication cost, (2) it has extremely efficient decoding process based on XOR logic. Furthermore, it can be used as component codes of d-dimensional product coding schemes to allow for more powerful coded computations. Finally, due to iterative nature of decoding, this coding scheme is one of the best candidates for future master-less computation frameworks. One of the other ongoing works is the minimization of the communication cost due to offloading encoding to cluster nodes as well as scaling stragglers with the number of nodes.
