General-purpose hard-error resiliency solutions such as checkpoint-restart severely degrade performance. For numerical linear algebra, more efficient solutions incur lower overhead. Current solutions require a significant increase in the number of processors. Further, they are based on distributed algorithms that guarantee good performance only when the matrices are large enough to fill all the local memories. Otherwise, their inter-processor communication costs are asymptotically larger than the lower bounds dictate.
Introduction
Errors are a serious concern in high performance computing. Given the increase in machine size and decrease in operating voltage, hard errors (component failure) and soft errors (bit flip) are likely to become more frequent. Hardware trends predict two errors per minute up to two per second on exascale machines [2, 24, 11] . Here we address resiliency for hard errors. General-purpose hard-error resiliency solutions such as checkpoint-restart [22] , and disklesscheckpoints [22] successfully meet this challenge but are costly and severely degrade performance. Our methods enable efficient resource utilization and high performance for error-resilient algorithms that are close to the efficiency and performance of non-resilient algorithms.
For numerical linear algebra computations, certain efficient solutions incur considerably lower overhead by combining error correcting codes with matrix computations. These solutions are based on distributed 2D algorithms, and can guarantee high performance only when matrices fill all local memories. Otherwise, their interprocessor communication costs become asymptotically larger than the lower bounds, which degrades performance. Moreover, current solutions require a significant increase in the number of the processors.
Here we present new fault tolerant algorithms for matrix multiplication that reduce the number of additional processors and guarantee good inter processors communication costs. Specifically, for the 2D case, we decrease the number of additional processors from Θ( √ P ) to 1 (or from Θ(h √ P ) to h, where h is the maximum number of simultaneous faults), and we save a log P factor of the latency cost. We attain the bandwidth lower bound when f = O √ P , where f is the total number of faults. When local memories are larger than the minimum needed to store inputs and outputs, we reduce the communication costs using 2.5D technique, with no (or very few) additional processors, at-can occur simultaneously; i.e., the maximum number of faults in one step or iteration of the algorithm, and by f the total number of faults throughout the execution. When comparing an algorithm to a fault tolerant adaptation, we use (P, M, F, BW, L) to denote the resources used by the original algorithm and (P , M , F , BW , L ) to denote the resources used by the fault tolerant adaptation. We express the latter as a function of the former, of h, f , and the input size n. When a fault occurs, the faulty processor loses all its data, and the machine allocates a new processor to replace the faulty one. For simplicity, we assume no faults occur during recovery phases. Note that faults during the recovery phase of any of the algorithms we present may introduce at most a constant factor overhead to the recovery phase, and thus do not affect our analysis.
Previous work
Communication costs of (non-resilient) matrix multiplication. Cannon [10] , Van De Geijn and Watts [27] , and Fox et al. [15] proposed matrix multiplication algorithms that minimize communication when the memory is the minimum needed to store input and output, namely M = Θ , O √ P . Agarwal et al. [1] put forward a 3D algorithm that uses less communication when additional memory is available:
McColl and Tiskin [20] , and Solomonik and Demmel [25] generalized these algorithms to cover the entire relevant range of memory size, namely Θ
where c is the memory redundant factor, namely c = Θ [20] , Ballard et al. [5] , and Demmel et al. [14] used an alternative parallelization technique for recursive algorithms, such as classical and fast matrix multiplication. The communication costs of the 2.5D algorithm and the BFS-DFS scheme applied to classical matrix multiplication are both optimal (up to an O (log P ) factor), since they attain both the lower bound in Irony, Toledo and Tiskin [19] , in the range Θ
, and the lower bound in Ballard et al. [4] for larger M values.
Checkpoints-restart. One general approach to handling faults is checkpoint-restart; all the data and states of the processors are periodically saved to disk. Upon a fault, the machine loads the most recent checkpoint. This solution requires disks, which are expensive, for storing the checkpoints and incurs many I/O operations, which are time consuming.
Plank, Li, and Puening [22] suggested using a local memory for checkpoints instead of disks. This solution does not require additional hardware, and the writing and reading of checkpoints are faster. Still, the periodic write operations, as well as the restart operations significantly slow down the algorithms. Furthermore, this solution takes up some of the available memory from the algorithm. For many algorithms, matrix multiplication included, less memory implies a significant increase in communication cost, hence a slowdown.
Algorithm-based fault tolerance. Huang and Abraham [18] suggested algorithm-based fault tolerancy for classic matrix multiplication. The main idea was to add a row to A which is the sum of rows, and a column to B which is the sum of columns. The product of the two resulting matrices is the matrix A · B with an additional row containing the sum of its rows and an additional column containing the sum of its columns. They addresses soft errors, and showed that by using the sum of rows and columns it is possible to locate and fix one faulty element of C.
Huang and Abraham [18] used a technique that allows recovery from a single fault throughout the entire execution. Gunnels et al. [16] presented fault tolerant matrix multiplication that can detect errors in the input, and distinct between soft errors and round-off errors. Chen and Dongarra showed that by using the technique of [18] , combined with matrix multiplication as in Cannon [10] and Fox [15] does not allow for fault recovery in the middle of the run, but only at the end. This severely restricts the number of faults an algorithm can withstand. Chen and Dongarra [12, 13] adapted the approach described by Huang and Abraham for hard error resiliency, using the outer-product [28] multiplication as a building block. Their algorithm keeps the partially computed matrix C encoded correctly, in the inner steps of the algorithm, not only at the end. By so doing, they were able to recover from faults occurring in the middle of a run without recomputing all the lost data. In [13] they analyzed the overhead when at most one fault occurs at any given time. In [12] they suggested an elegant multiple faults recovery generalization of this algorithm. For this purpose they introduced a new class of useful erasure correcting codes. There algorithm requires 2h· √ P additional processors to be able to deal with up to h simultaneous faults. Further, they analyzed its numerical stability. Wu [29] et al. used outer product for soft error resiliency.
Hakkarinen and Chen [17] presented a fault tolerant algorithm for 2D Cholesky factorization. Bouteiller et al. [9] expanded this approach and obtained hard error fault tolerant 2D algorithms for matrix factorization computations. Moldaschl et al. [21] extended the Huang and Abraham scheme to the case of soft errors with memory redundancy. They considered bit flips in arbitrary segments of the mantissa and the exponent, and showed how to tolerate such errors with small overhead.
1.2 Our contribution. We introduce a new coding technique as well as ways to apply it to both 2D and 2.5D matrix multiplication algorithms. By doing so we obtain fault tolerant algorithms for matrix multiplication. Specifically in the 2D case we use only h additional processors, the minimum possible, and we use even fewer processors for the 2.5D algorithm. The runtime overhead is low, and the algorithms can utilize additional memory for communication minimizing. These algorithms can also handle multiple simultaneous faults. We also obtain a new efficient way for pipelining broadcast and reduce operations.
Paper organization.
In section 2 we provide preliminaries including a new efficient pipelined reduce operation. In Section 3 we focus on the minimum memory case, with resiliency for a single fault (see Table 1 ). In Section 4 we show how to extend this approach to tolerate multiple faults (see Table 2 ). In Section 5 we show how to combine our algorithms with methods that utilize additional memory (see Table 3 ). In Section 6 we compare the algorithms and discuss some open questions.
Preliminaries
2.1 Pipeline reduce operations. We use broadcast and reduce operations in our algorithms. Sanders and Sibeyn [23] showed an efficient algorithm for performing broadcast and reduce.
Lemma 2.1. ( [23] ) Let P be the number of processors, and W the data size of each processor. It is possible to compute a weighed sums of the data of the P processors, using:
We introduce an efficient way to perform l reduce operation in a row. The naïve implementation uses the algorithm above l times and requires (F,
. We pipeline the reduce operations and save latency.
Lemma 2.2. (Efficient multiple weighed sum)
Let P + l be the number of processors, and W the data size on P of them.
It is possible to compute l weighed sums of the data of the P processors on the l other processors with resources:
Proof. We first describe an alternative algorithm for a single weighted sum, then explain how it pipelines efficiently. The algorithm for one weighted sum has two phases. For ease of presentation, assume P is an integer power of 2 (the generalization is straightforward). The first phase reduces the weighed sum but the data remains distributed. The second phase gathers the data to the destination processor. The reduce works as follows. It divides the processors into two sets. Each set performs half of the task. The division involves communicating half of the data. Each set recursively calls the reduce. The base case is when each set contains only one processor. Then each processor holds 1 P fraction of the results. Next we gather the data to the additional processor. The reduction phase costs
, and L = log 2 P . Thus the total cost of the single wighted sum algorithm is:
This algorithm can be efficiently pipelined since the messages size decreases exponentially. Let the names of the processors be a binary string of length log 2 P . In the first phase the communication is between pairs of processors that agree on all the digits aside from the first digit. They communicate the first weighted sum. In the second step the communication is between processors that agree on all digits aside from the second and they send the second step of the first reduce the first step of the second reduce, and so on. Each weighted sum takes at most O (log P ) steps and then the data are sent to one of the l new processors. Therefore at any time at most O (log P ) weighted sums are being computed. The memory required for all the reduces that can occur in parallel is at most log P i=1 W 2 i ≤ 2W , and the memory required for all the gathering is at most
Linear erasure code
We use linear erasure code for recovering faults.
The erasures code we use preserve the original word and add redundant letters. Formally we code a word x of length k to a word y of length n using n − k additional letters such that y k+i = n j=1 E i,j · x j for some (n − k)×k matrix E. That is, the code generating matrix is of the form G = I k E n−k,k .
Minimum memory, single fault
In this section we discuss previous and new fault tolerant algorithms, for M = Θ n 2 /P .
Previous algorithms.
Chen and Dongarra [12, 13] used the Huang and Abraham scheme [18] to tolerate hard errors. Specifically, they added one row of processors that store the sum of the rows 1 of A and similarly for C , and one column of processors that store the sum of the columns of B and similarly for C. They called these rows and columns the check-sum; a matrix that has both is called a fully check-sum matrix.
Chen and Dongarra showed that this approach, applied to 2D algorithms (e.g., Cannon [10] and Fox [15] ), allows for the recovery of C at the end of the matrix multiplication. However these 2D algorithms do not preserve the check-sum during the inner steps of the algorithm. To deal with higher fault rate, that requires recovery of faults during the run of the algorithm, Chen and Dongarra used the outer product as the building block of their algorithm. Thus their algorithm can recover faults throughout the run of the algorithm at the end of each outer product iteration. Lost data of A and C of a faulty processor can be recovered at the end of every outer product step, from the processors of the same column from the processor in the check-sum row. Similarly, the data from B and C can be recovered using the check-sum column.
Theorem 3.1. ( [12, 13] ) Consider a 2D communication optimal matrix multiplication algorithm with resources (P, F, BW, L). Let (P , F , BW , L ) be the resources required for the fault tolerant 2D matrix multiplication algorithm of Chen and Dongarra that can with-stand a single fault at any given time. Let n be the matrix dimension and let f be the total number of faults. Then:
A proof can be found in [8] .
Our algorithms. Since matrices A and B are not modified by the algorithm, lost input data can be easily handled using an erasure code of length P + 1. The main challenge involves recovering C. To this end, we introduce two alternatives: the first algorithm uses the outer product and encoding of the blocks of C with additional processors. This is similar to the approach by Chen and Donagarra. However we use a new coding scheme that decreases the additional processor count from Θ √ P to one. We denote this algorithm by slice-coded algorithm. The second algorithm recovers the lost data of C by recomputing at the end of the run. We denote this the posterior-recovery algorithm.
Theorem 3.2. (Slice-coded)
Consider a 2D communication cost optimal matrix multiplication algorithm with resources (P, F, BW, L). Then there exists a fault tolerant 2D matrix multiplication algorithm that can withstand a single fault at any given time, where n is the matrix dimension and f is the total number of faults, with resources:
Theorem 3.3. (Posterior-recovery) Consider a 2D communication cost optimal matrix multiplication algorithm with resources (P, F, BW, L). Then there exists a fault tolerant 2D matrix multiplication algorithm that can withstand a single fault at any given time, where n is the matrix dimension and f is the total number of faults, with resources
3.2 Slice-coded algorithm. We follow the approach in [12, 13] , and use the outer product matrix multiplication as the basis for the algorithm. However, where they used 2 · √ P + 1 additional processors, for the coded data, we only use one. The additional processor contains the sum of the others. This processor acts similarly to the corner processor in Chen and Dongarra's algorithm (corresponding to the red processor in Figure 1 ). It contains the sum of the blocks of A,B, and C. In the s iteration of the algorithm, it multiplies with at most one simultaneous fault. n is the matrix dimension, P is the number of processors, and f is the total number of faults occurring throughout the run of the algorithm. [10] , SUMMA [27] (No fault)
Algorithm
the sum of the current row with the sum of the current column. Thus it keeps the sum of the blocks of C updated. In each outer-product iteration the algorithm computes the sum of the current outer product. We show in Equation 3.1 that the sum of the blocks of C can be computed by multiplying two sums of blocks. Proof.
[Proof of Theorem 3.2] The algorithm allocates one additional processor for the code, thus P = P + 1.
The algorithm is composed of three steps. In the first step, code creation (CC) the algorithm creates codes for A and B and stores them in the additional processor. The second step is the matrix multiplication (MM). Upon a fault, a recovery (Re) step is performed. Therefore F is composed of three components, namely,
Code creation. In this step, the algorithm computes the sum of the blocks of A and of B, and stores them in the additional processor,using a reduce operation. By Lemma 2.1 this takes:
Matrix multiplication. The matrix multiplication phase is performed as in an outer-product algorithm with a small change: every processor computes its share of the code. To be more precise, in the s th iteration (of √ P iterations) the processors compute the outer product A (:, s) · B (s, :). The processors of the current block column of A and the processors of the current block row of B broadcast them. The processors compute the sum of the current block column of A; specifically each column of processors computes 1/ √ P of this sum. Similarly, the processors compute the sum of the current block row of B. The processors send these two sums to the additional processor. Then each processor multiplies the two blocks.
By Theorem 2.2 the broadcasting (B) takes
, O (log P ) . The reduce operation is distributed among the rows and the columns, where each row and column of processors performs a reduce operation with an n 2 P 3/2 block size. Therefore this reduce operation (R) takes:
The multiplication of two blocks in time is 2n 3 P 3/2 . There are √ P iterations; thus the multiplications takes:
Recovery. Each recovery is a reduce operation. By Lemma 2.1 f recoveries take:
Total costs. Summing up Equations 3.2, 3.3, and 3.4 we have
3 Posterior-recovery. In this algorithm we recover output by re-computation. That is, A and B input matrices are coded, but C is not. A faulty processor incurs the restoration of its share of A and B. But recomputing its lost share of the workload is performed at the end of the algorithm, using all processors. When a fault occurs, the algorithm recovers the lost data of A and B using their code, initializes the lost block of C to zeros, and resumes computations. Definition 3.1. We denote by a cube the set of scalar multiplications defined by the two blocks (sub-matrices) multiplication.
Proof. [Proof of Theorem 3.3] We assume that at each iteration, at most one fault occurs. Therefore the algorithm needs only one additional processor to encode A and B, namely, P = P + 1. 
Code creation. The costs of this phase are as in the Slice-coded algorithm see Section 3.2.
Matrix multiplication. The algorithm performs 2D matrix multiplication (e.g., Cannon's [10] ), thus
Input recovery. By Lemma 2.1 the costs of f recoveries are:
Output recovery. This stage involves communication, multiplication, and reducing the data. We assume that the maximum number of faults in an iteration is 1. Each processor computes √ P cubes. Therefore there are at most P cubes to compute again, as there are √ P iterations. The algorithm distributes the workload of the lost cubes. Each processor gets at most one cube. Since computing a cube is multiplying two blocks of size
flops. The communication cost is due to moving two input blocks and the reduce of C. Thus it takes BW ReOut = O n 2 P , and L ReOut = O (log P ). We have
Total costs. Summing up Equations 3.2, 3.5, 3.6, and 3.7 we have with at most h simultaneous faults. n is the matrix dimension, P is the number of processors, and f is the total number of faults occurring throughout the run of the algorithm. [10] , SUMMA [27] (No fault)
Multiple faults
In this section we extend our algorithms to a number of simultaneous faults.
Previous algorithm.
Theorem 4.1. ( [12] ) Consider a 2D communication cost optimal matrix multiplication algorithm with resources (P, F, BW, L). Let n be the matrix dimension. Then there exists a fault tolerant 2D matrix multiplication algorithm that can withstand h simultaneous faults at any given time, and f total faults, with resources The algorithm adds h rows of processors to A and C and h columns of processors to B and C.
It stores weighted sums of the original processors in the additional processors. Their algorithm uses √ P , √ P + h, h + 1 -code, with a generating matrix
such that every minor of E is invertible. It can therefore recover h simultaneous faults, even if they occur in the same row or column of processors. We provide a proof for Theorem 4.1 in [8] .
Erasure correcting code.
For multiple faults we use erasure code ()recall the definition in Section 2.2). To withstand h simultaneous faults we require a (P, P + h, h + 1)-code. In other words, any P letters suffice to recover the data. This is possible if and only if every minor of size P of the generating matrix G = I P E h×P is invertible. In other words, every minor of E is invertible. Similar to the single fault case, each code processor multiplies a weighed sum of the current block column of A with a weighed sum of the current block row of B, and adds it to the accumulated sum. Thus the weighed sum is of the form:
where w i,j = v i · u j for some vectors v and u. The code used in [12] does not have the above property, and therefore cannot be used for our purpose. We show that there exists a code with the required properties. 
Proof. Consider an erasure code with generating matrix I E , where I = I P , and E is an h × P Vandermonde matrix. Every minor of the Vandermonde matrix is invertible. The i th row of the Vandermonde matrix is of the form r i = α
. By taking Let n be the matrix dimension. Then there exists a fault tolerant 2D matrix multiplication algorithm that can withstand h simultaneous faults at any given time, and f total faults, with resources
Proof. We showed in Section 4.2 how to use h additional processors to obtain a code with distance h + 1; thus P = P + h. The rest of the analysis is similar to the single fault case in the proof of Theorem 3.2. We have F = F CC + F MM + F Re , and similarly for BW and L . Code creation. The algorithm first creates an erasure code for A and B. By Theorem 2.2 as W = n 2 P and l = h this takes:
Matrix multiplication. The multiplication involves broadcasting and reduction of h weighted sums. Each column of processors computes h √ P weighted sums of the blocks of A and each row of processors computes h √ P weighted sums of the blocks of B. The broadcasting and reduction (BR) takes:
The multiplication of two blocks takes O
flops.
There are √ P iterations. Therefore,
Recovery. When faults occur, the portion of A,B, and C of the faulty processor are recovered at the end of the iteration, using the erasure code. Assume that at iteration i that the number of faults is f i . By Theorem 2.2, as W = n 2 /P and l = f i > 0 the recovery takes:
Total costs. Summing up Equations 4.9, 4.10, and 4.12 we have,
Posterior-recovery. This algorithm allocates h processors for encoding A and B. It runs a 2D matrix multiplication (e.g., Cannon [10] not just outer product ones). When a processor faults, the algorithm recovers A and B and proceeds. After the multiplication the algorithm re-computes the lost portion of C. Let n be the matrix dimension. Then there exists a fault tolerant 2D matrix multiplication algorithm that can withstand h simultaneous faults at any given time, and f total faults, with resources
The analysis of this algorithm is very similar to the single fault case. We provide a proof for Theorem 4.3 in [8] .
Memory redundancy
We next present the extension of our two algorithms to the case where redundant memory is available, namely
Recall that a 2.5D algorithm splits the processors into c sets, where each set performs 1 c of the iteration of 2D algorithm. When h is the maximum number of simultaneous faults, each set of processors has to be able to tolerate h simultaneous faults. For ease of analysis we assume that the faults are distributed uniformly among the c sets. If this is not the case the algorithm can divide the computations differently, and assign more computation to a set that has fewer faults. This is possible since each set of processors has sufficient data to perform all these computations.
Slice-coded.
Theorem 5.1. (Slice-coded) Consider a 2.5D communication-cost optimal matrix multiplication algorithm with resources (P, F, BW, L). Let n be the matrix dimension and let M be the local memories size. Let c be the memory redundancy factor, namely c = Θ P ·M n 2 . Then there exists a fault tolerant 2.5D matrix multiplication algorithm that can withstand h simultaneous faults at any given time, and f total faults, with resources:
).
The proof of Theorem 5.1 is similar to the proof of Theorem 4.2 and is provided in [8] . Roughly speaking the fault tolerant 2.5D algorithm splits the processor into c sets (c = Θ P ·M n 2 ). Each set allocates h processors for code, and then creates code for A and B. The matrix multiplication is divided equally among the sets, and each set performs Let n be the matrix dimension and let M be the local memories size.
Let c be the memory redundant factor, namely c = Θ P ·M n 2 . Then there exists a fault tolerant 2.5D matrix multiplication algorithm that can withstand h simultaneous faults at any given time, and f total faults, with resources:
O (L · log c + log P )).
Proof. As explained above P = P . The algorithm does not create code, and similar to the 2D case it recovers the input immediately and re-computes the lost output data after the multiplication ends. Therefore
Matrix multiplication. The algorithm performs a 2.5D matrix multiplication therefore,
Input recovery. The algorithm recovers faults at the end of each iteration. Since c > h there is at least one copy of each block even when h processors fault simultaneously. If k processors that hold the same block of A (or B) fault simultaneously, the algorithm broadcasts this block. Therefore in the worst case, this recovery requires O (log k) messages. Recall that in the i th iteration f i < c processors fault. By Lemma 2.1 it costs:
thus the total recovery costs are: Output recovery. After the 2.5D matrix multiplication is completed, the algorithm computes the lost cubes (recall Definition 3.1). When a processor faults it loses O P/c 3 such cubes. Each processor gets
such cubes for recomputing, and multiplies pairs of them. The block size is
; therefore multiplying two blocks costs
flops. Thus the costs are:
We add log P to the latency because the output recovery may include the broadcast operation of the blocks and the reduce operation of the results. Total costs. Summing up Equations 5.13, 5.14, and 5.15:
= O (L · log c + log P )
Discussion
In this paper we presented two methods for obtaining fault tolerance at lower costs: the slice-coded algorithm and the posterior-recovery algorithm. Both can handle multiple simultaneous faults. When the memory is minimal both algorithms use as few processors as possible; namely h, where h is the maximum number of faults that may occur in one iteration. We showed how to combine these methods with a 2.5D algorithm that utilizes redundant memory, to reduce the communication costs. When the number of fault is not too large our algorithms only marginally increase the number of arithmetic operations and the bandwidth costs. The slice-coded algorithm increases the latency by a factor of log P . If faults occur in every iteration of the posterior recovery algorithm, its latency increases by a factor of log P as well.
The slice-coded algorithm uses the outer-product in each iteration and keeps the code processors updated. The outer product uses up to a constant factor more words, and up to O (log P ) factor more messages. Therefore, the slice-coded algorithm communicates a little more, but it can recover faults quickly at each iteration. In contrast, the posterior recovery communicates less in this phase, but performs more operations to recover faults. Therefore the slice-coded algorithm is more efficient when many faults occur, and useful when quick recovery is needed. For fewer faults, the posterior recovery is more efficient.
The posterior recovery with redundant memory uses the input replication of the 2.5D algorithm. It utilize the redundant memory to reduce communication costs and to reduce the number of required additional processors. We analyzed the case of h < c, where the maximum number of simultaneous faults is smaller than the number of copies of the input. In this case the algorithm does not need to allocate additional processors but rather recovers the input using the existing replication. We do not analyze here the case of h ≥ c, where h − c + 1 additional processors are required, and the recovery run-time depends on the faults distribution. Briefly, in this case, if a code processor faults, the recovery requires computations, whereas when an original processor faults, the recovery uses the input replication, and is very fast. For Strassen's [26] and other fast matrix multiplication, Ballard et al. [5] described a communication optimal parallelization that matches the communication costs lower bound [6] . However, this parallelization technique does not allow for a direct application of either methods introduced in this paper. Recently, we designed a new method that enables fault tolerancy fast matrix multiplication algorithms at low overhead [7] .
Acknowledgments
This work was supported by grants 1878/14, and 1901/14 from the Israel Science Foundation (founded by the Israel Academy of Sciences and Humanities) and grant 3-10891 from the Ministry of Science and Technology, Israel. It was also supported by the Einstein
