Abstract. This paper describes mathematical procedure for designing hexagonal systolic arrays that implement fault-tolerant matrix multiplication. Fault-tolerance is achieved by introducing redundancy at algorithm level by defining three equivalent algorithms with disjoint index spaces. The essence of the proposed method is based on mapping data dependency graph that corresponds to the matrix multiplication algorithm, by an appropriate epimorphism, into a graph with desired properties. Since there is a 1:1 correspondence between the algorithm and it's graph representation, all transformations performed on the graph directly affect the algorithm. Chosen epimorphism depends on the projection direction vector
Introduction
Matrix multiplication is one of the fundamental operation in many scientific and technical applications. It appears as a macro-function within many numerical and non-numerical problems. Typical examples include solving problems in linear algebra, such as finding inverse matrix, determinant, LU factorization, eigenvalues, etc., graph theory, such as transitive closure, finding all-pairs shortest paths, minimal spanning tree, etc. Matrix multiplication is very time consuming since it requires O(n 3 ) computational steps. On the other hand, it is highly regular and posses inherent parallelism and is therefore suitable for implementation on parallel architectures such as processor or systolic arrays.
Nowadays, CMOS VLSI technology made it possible to develop massively parallel computational systems. However, shrinking dimensions of transistors, which bring lower operating voltages and smaller critical charges, make these devices prone to various kind of faults. These faults can be categorized into three types: transient, intermittent, and permanent faults. Transient faults are failures in which the circuit remains functional, but the value it generates is corrupted. Transient faults are caused by strikes from energetic particles, such as alpha particle strikes generated within the package of the chip, or neutron strikes caused by cosmic ray activity in outer space. Intermittent faults are caused by the combination of device aging and by stress created by environmental conditions during chip operation [1, 2] . Permanent errors can occur because of manufacturing imprecision during chip fabrication. When a permanent fault is detected, the logic block must be isolated and disabled. Here we are interested in the toleration of transient and intermittent faults. To protect against these faults, some form of error detection and correction capabilities have to be involved. A variety of techniques have been proposed for handling transient and intermittent faults. Most of them are based on some kind of redundancy, ether information (such as ABFT), hardware (such as TMR), or time (such as RESO). In this proposal, fault-tolerance is achieved through triplicated computation followed by majority voting, but instead of using full hardware redundancy (i.e. triplication) to achieve fault-tolerance, we generate redundant results partially by means of hardware (by adding two extra columns of processing elements in the systolic array) and partially by means of time redundancy by slightly increasing computation time. We are interested in the designing of hexagonal systolic arrays with the optimal number of PEs with respect to the problem size which implement fault-tolerant matrix multiplication. In order to reduce the number of PEs and the execution time, we use the fact that during each computational step, matrix multiplication can be performed in any order without modifying the final result. By involving two additional columns of PEs and the reordering of computational steps, we achieve that our hexagonal arrays have Ω = N 3 (min{N 1 , N 2 } + 2) PEs and perform fault-tolerant matrix multiplication for
The rest of the paper is organized as follows. Section 1 describes the problem of interest which will be considered. Mathematical background that serves as the basis for the synthesis of fault-tolerant systolic arrays is presented in Section 3. Section 4 deals with the procedure for the synthesis of systolic arrays with optimal number of processing elements. For that number of PEs the execution time is minimized. Section 5 concentrates on determining initial data arrangement that provides correct implementation of fault-tolerant matrix multiplication algorithm. Performance evaluation of the synthesized arrays which relates to the number of PEs and execution time are given in Section 6. Some concluding remarks are given in Section 7.
Problem Definition
Suppose A = (a ik ) and B = (b k j ) are rectangular matrices of order N 1 × N 3 and N 3 × N 2 , respectively. Their product, C = A · B, C = (c i j ), can be computed according to the following recurrent relations
The following systolic algorithm can be used to compute (1)
with the following identities
Note that computations in Algorithm 1, i.e. eqn. (1), with respect to indices i, j and k, are performed at basic permutations of the sets {1, 2, . . . , N 1 }, {1, 2, . . . , N 2 }, and {1, 2, . . . , N 3 }, respectively. However, this is not necessary. The computations can be performed at arbitrary permutations of the aforementioned sets (see [8] ). We call this property permutability.
Directed coordinated lattice graph G = (P int ∪ P in , D) can be joined to the Algorithm 1. This graph is placed in a three-dimensional Cartesian space generated by the unity vectors
T } defines vertices of graph G where active computations from Algorithm 1 are performed, while 
These vectors also define propagation direction of elements of matrices B, A and C (0) , respectively. Systolic array synthesis procedure, employed in this paper, is based on mapping the graph G on the plain orthogonal to the projection vector µ = [µ 1 µ 2 µ 3 ]
T (see [5] ). By this mapping a directed pseudo-graph in two-dimensional Cartesian space is obtained. This graph corresponds to the desired SA. Here we are interested in the synthesis of two-dimensional hexagonal systolic arrays (2D HSA), suitable for fault-tolerant matrix multiplication. The first condition for SA to be suitable for fault-tolerant computations is that it realizes Algorithm 1 with pipeline period λ ≥ 3. Pipeline period is defined as a time interval between two successive computations in a processing element (see [9, 11, 12] ). Hexagonal array obtained according to the direction µ = [1 1 1] T satisfies this condition and it is well studied in the literature (see [7, 12] ). Here we are interested in hexagonal arrays obtained for the directions µ = 1 1 −1
. These arrays realize Algorithm 1 with pipeline period λ = 1, and therefore were not considered as suitable candidate architectures for achieving fault-tolerance. However, these arrays have shorter execution time of matrix multiplication algorithm than the array obtained by the projection vector µ = [1 1 1] T . We are going to show that the fault-tolerant matrix multiplication can be efficiently implemented on these arrays as well, with shorter execution time than that of the one obtained by the projection vector µ = [1 1 1] T . In this proposal, without the loss of any generality, we are going to synthesize 2D HSA obtained by the projection direction vector µ = [1 1 − 1]
T , which implements fault-tolerant matrix multiplication algorithm. The other two arrays can be synthesized by the same methodology. Apart from being suitable for faulttolerant computation, the synthesized array should have optimal (i.e. minimal) number of processing elements (PEs) for the given matrix dimensions. Also, the execution time of the array should be as small as possible for the given number of PEs. This requires substantial modification of the systolic array synthesis procedure described so far in the literature [3] [4] [5] [6] [7] .
Mathematical Background
Results obtained in this section serve as a background for the synthesis procedure which will be described in the next one. We start our analysis from the directed graph G = (P int ∪ P in , D) which corresponds to the Algorithm 1. Depending on the relation between N 1 and N 2 , we map vertices of graph G, defined by set P int , either into set
. This is performed by the isomorphisms ϕ 1 or ϕ 2 , defined by the following equalities
for each p ∈ P int . In this way we have obtained new directed graphs
Now, for the projection vector
We use mappings H 1 and H 2 to obtain setsP
, respectively. These sets are obtained according to the following equalities
Thus we have obtained two directed graphs
Some important features of mappings H 1 and H 2 and corresponding directed graphs
and
, are given in the subsequent theorems. Proof The first part of the assertion of Theorem 3.1 directly follows from the equality det M i = µ i = 1 0, i = 1, 2. The second part of the assertion will be proved by the contradiction on the example of mapping H 1 .
Suppose, contrary to the assertion of Theorem 3.1, that there are at least two different nodes of graph G
that are mapped by H 1 into the same node of graphḠ (0) . Let
p (2) , be position vectors of these nodes. In that case, according to (2), the following would be valid
From (8) it follows i 1 = i 2 . Accordingly, from (9) and (10) it follows j 1 = j 2 and k 1 = k 2 , so we have p (1) = p (2) , which is in contradiction with the assumption that p (1) p (2) .
On the basis of Theorem 3.1 we conclude that graphs G (0) and
) have equal number of nodes which are differently arranged in the coordinated space. Since Proof Without loss of generality we will conduct the proof for the graph Note that in practice a plane that is orthogonal to line l is usually taken. According to Theorem 3.2 and Corollaries 3.3 and 3.4 one can get a geometrical idea of mapping
2 ) which corresponds to the respective 2D SA. However we need explicit analytical procedure for obtaining graph Γ 1 (i.e. Γ 2 ). First we define epimorphism S which is joined to the projection vector
where S 
2. In order to avoid mapping 3D graph G ) into 1D graph, the rank of matrix S must be 2, i.e. rankS = 2. 3. To preserve the adjacency of nodes that were present in graph G Conditions (1) to (3) do not necessarily provide one of the crucial requirements, which is that graph Γ 1 (i.e. Γ 2 ) has N 2 N 3 (i.e . N 1 N 3 ) vertices. The next theorem gives necessary and sufficient conditions for epimorphism S so that this requirement is always fulfilled. 
In order to have exactly N 2 N 3 ordered pairs of the form (s 12 j + s 13 k, s 22 j + s 23 k), necessary and sufficient conditions are the ones given by (13) .
Similarly, the following theorem can be proved. Note that epimorphism S can not be uniquely determined. For example, if
T is an epimor-
T . Therefore, without loss of generality, according to Theorems 3.5 and 3.6, and conditions (1)-(3), we can take the following epimorphisms
respectively.
Systolic Array Synthesis
Based on the results obtained in the previous section, we are going to synthesize 2D HSA that realizes fault-tolerant matrix multiplication.
Since fault tolerance is achieved by triplicated computation of the same problem instance followed by majority voting, we first have to derive three matrix multiplication algorithms equivalent to Algorithm 1. To do that, we perform the shifting of vertices in the directed graph
along the coordinate axes to obtain two additional graphs
, D), r = 1, 2 with the following set of vertices
Note that this shifting is not unique. The only thing we should take care of is that sets P (r) int , (i.e.P (r) int ), r = 0, 1, 2 are mutually disjoint.
If we apply mapping H 1 (i.e. H 2 ), defined by (7), on the set of vertices of G (r) (i.e.G (r) ), r = 1, 2, we obtain directed graphs defined by the following set of vertices
for r = 1, 2. Now we are able to define systolic algorithms that correspond to graphs G ), r = 0, 1, 2, obtained by epimorphism S 1 (i.e. S 2 ). This image is located in the 2D cartesian space and corresponds to the 2D HSA which implements the fault-tolerant matrix multiplication algorithm -Algorithm 2 (i.e. Algorithm 3). Denote the corresponding array by SA1 (i.e. SA2). Then (x, y) locations of the processing elements in the SA1 array (i.e vertices of Γ 1 ) are determined from the following equalities
for r = 0, 1, 2, j = 1, 2, . . . , N 2 and k = 1, 2, . . . , N 3 . Directed edges in Γ 1 , which correspond to communication channels in SA1, are determined by column vectors of matrix ∆ 1
Similarly, for the array SA2 we obtain that PE locations are determined by
for r = 0, 1, 2, i = 1, 2, . . . , N 1 and k = 1, 2, . . . , N 3 , and communication channels by the column vectors of matrix ∆ 2 
Obtaining Initial Data Arrangement
So far we have obtained locations of the PEs in the x-y plane and directions of communication channels between them. Now we have to determine initial data arrangement that provides correct computation of the fault-tolerant matrix multiplication algorithm on the systolic array. For the sake of brevity we will do this only for the array SA1.
From Algorithm 2, i.e. corresponding directed graphs G
, initial data arrangement of elements of matrices A, B and C (0) in a three-dimensional space is defined by the set P
(c), where
However this data arrangement does not provide correct execution of Algorithm 2 on the array SA1. This is illustrated in the Figure 1 . To simplify, consider a directed graph with four vertices, P1 to P4, which belong to the internal computation space, P int , where MAC (multiply with accumulation) operation is performed. With a, b, c, and d input data belonging to the initial computation space, P in are denoted. For these inputs the following products ac, ad, bc and bd should be obtained. However, we will obtain only two products, namely ac and bd, since data are pipelined through the array (Fig. 1a) ). Therefore we have to rearrange the space of initial data. In essence it is necessary to skew input data in time as illustrated in Fig.  1 b) . 
Skewing of P (r) in
is performed by the timing function, t( p) = Π T p + α (see [6] ). This function defines time instance when the computation in the point p ∈P , for some constant r, where computations from Algorithm 2 are performed at the same time instance. To narrow down choices, the following constraints for elements of vector Π are introduced:
• The norm of Π, | Π| 2 = |t 11 | 2 + |t 12 | 2 + |t 13 | 2 , t i j ∈ Z, should be as small as possible, and
Suppose that computation in Algorithm 2 is performed at the vertex p ∈ P
Having in mind the above mentioned constraints for the elements of vector Π, we can take
Constant α is determined by the appropriate choice of vertex in graph G
where the first computation, defined by Algorithm 2, should occur. Let it be a vertex defined by a position vector
T , where
, p p 0 , we obtain α = −3N 1 + 1, so we have that timing function is defined as
is performed by the following mapping
and vectors e 3 γ , γ ∈ {a, b, c} defined by (4) . The rearranged space of initial computations
Finally, by mapping the setP (r) in using epimorphism S defined in (14), initial data arrangement of input elements A, B and C (0) in the (x, y)-plane at the beginning of the computation in the array SA1 is obtained. The arrangement of input data in the (x, y)-plane is given by
for r = 0, 1, 2, i = 1, 2, . . . , N 1 , j = 1, 2, . . . , N 2 and k = 1, 2, . . . , N 3 . We assume that the following identities in (19) are valid:
. . , N 2 and t = 1, 2, . . . , N 3 . An example of the SA1 array that implements fault-tolerant matrix algorithm for the case N 1 = 4, N 2 = 3 and N 3 = 2 is presented in Fig. 2 By the similar procedure one can obtain initial data arrangement of input elements A, B and C (0) in the (x, y)-plane at the beginning of the computation in the array SA2. It is given by
An example of the array SA2 that implements fault-tolerant matrix algorithm for the case N 1 = 4, N 2 = 3 and N 3 = 2 is presented in Fig. 2 . As can be seen from Figs. 2 and 3 the array SA1 has fewer PEs than SA2 since N 1 > N 2 . A schematic of the voting logic is presented in Fig. 4 . It consists of 3 * By the proposed scheme single transient and intermittent faults can be corrected. A number of multiple fault patterns can be tolerated also, if they do not affect the same element of the resulting matrix. Fault detection and location are not necessary for fault-tolerance, errors are masked concurrently with normal operation of the systolic array.
Discussion
According to (15) and (17) the following result can be proved. T that implements Algorithm 2 (i.e. Algorithm 3) consists of
processing elements.
According to (21) it follows that optimal number of PEs in the SA depends on mutual relation between N 1 and N 2 . This also means that this relation directly determines whether Algorithm 2 or Algorithm 3 should be used. T that implements Algorithm 2, i.e. Algorithm 3, has
Note that hexagonal SA synthesized for the direction µ = [1 1 1] T that implements fault-tolerant matrix multiplication obtained in [7] has the same number of PEs.
Total execution time, T tot , of matrix multiplication algorithm on systolic array includes time to load data into the array, T in , active execution time, T exe , and time to drain the array, T out , i.e. T tot = T in + T exe + T out . If G = (P, E) is directed graph that corresponds to an algorithm, and t( p) is timing function, then active execution time can be determined from the equation T exe = max t( p) − min t( p) + 1, p ∈P r int [10] . In our case the following result holds. 
Proof Timing function of the array SA1 that implements Algorithm 2 is t( p) = u + v + w − 3N 1 + 1. Consequently, active execution time is T exe = 3N 1 + N 2 + N 3 − 4. In [4] hexagonal SA with optimal number of PEs which computes matrix product was synthesized for the direction µ = [1 1 − 1]
T . It consists of N 3 × N 2 PEs, and has total execution time T tot = T exe , i.e. T in = T out = 0. The array SA1 synthesized in this paper represents an extension of the one obtained in [4] for 2N 3 PEs. Namely, two PEs are added in each row of PEs. Since elements of matrix A propagate in that direction, the initialization time is increased for two time units, i.e. T in = 2, while T out remains 0. Accordingly, we have that total execution time of Algorithm 2 on the array SA1, synthesized by the equations (15), (16) and (19), is
Similarly, it can be concluded that total execution time of Algorithm 3 on the array SA2, synthesized by equations (17), (18) and (20), is
Now, statement of the theorem directly follows from the equations (24) and (25).
Hexagonal array synthesized in [7] has total execution time equal to T tot = 3 max{N 1 , N 2 }+2 min{N 1 , N 2 }+ N 3 −3. This means that the array synthesized in this paper has shorter execution time for ∆T tot = min{N 1 , N 2 } time units which is achieved by the same number of PEs as in [7] .
Conclusion
Systolic arrays are suitable accelerator architectures for matrix multiplication. They are complex CMOS VLSI circuits that exploit massive parallelism at data level. However, shrinking the dimensions of transistors make this circuits prone to various kind of faults. Reliable operation of these circuits can be achieved by some kind of redundancy, which in general involves additional hardware or time, that as a consequence decreases system throughput. Therefore the main design challenges are oriented toward decreasing hardware complexity and increasing the throughput. In this paper we describe a systematic procedure for the designing of 2D hexagonal arrays that can tolerate single transient and intermittent faults during matrix multiplication (MM). In this proposal we start from the hexagonal array which implement MM algorithm with pipeline period λ = 1, and by involving appropriate modifications at algorithm level, obtain the array with λ = 3. Then two redundant computations are introduced so that fault-tolerance is achieved by triplicated computation followed by majority voting. This is achieved with minimal hardware and time overhead. The proposed procedure is formally described by explicit formulas and can be used as a software tool for automatic synthesis of fault-tolerant arrays.
