Abstract: Although the recurrence equation for the Winograd algorithm is uniform, no unified approach has been proposed to design parallel Winograd algorithms. In the paper the authors propose a unified approach to designing parallel Winograd algorithms. Using this approach, several parallel algorithms are designed. These algorithms are executed on regular arrays including conventional systolic arrays and nonplanar regular arrays. A comparison of their performance is given.
Introduction
There are many sequential algorithms for computing a matrix product, such as the standard multiplication algorithm [1-3], Winograd's algorithm [4, 5], and Strassen's algorithm [SI. Among these algorithms, the equations for the standard multiplication algorithm, i.e. $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$, and the equations for the Winograd algorithm, i.e. $c_{ij} = \sum_{k=1}^{n/2} (a_{i,2k-1} + b_{2k,j})(a_{i,2k} + b_{2k-1,j}) - \alpha_i - \beta_j$ with $\alpha_i = \sum_{k=1}^{n/2} a_{i,2k-1} a_{i,2k}$ and $\beta_j = \sum_{k=1}^{n/2} b_{2k-1,j} b_{2k,j}$, for $1 \le i, j \le n$, are suitable for execution on regular arrays because they are repetitive and iterative. Based on the equations for the standard multiplication algorithm, extensive research on the design of parallel matrix multiplication algorithms has been carried out. These parallel algorithms address not only the matrix multiplication problem but also other matrix product-type problems such as band matrix multiplication [1, 14], bit-level matrix-vector multiplication, continuous matrix multiplication [15, 16], and the discrete Fourier transform [17]. However, only a few papers have used the equations for the Winograd algorithm to design parallel algorithms on regular arrays, because its recurrence equation is less regular than that of the standard multiplication algorithm.
IEE, 1994
Paper 99818 (C2
In this paper we use the number of processors, the total execution time, and the utilisation of each processor as criteria to compare the performance of various parallel Winograd algorithms. From this comparison, we conclude that the torus array algorithms have the shortest execution time (excluding the loading and draining time) and the highest processor utilisation.
Design methodology
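Let n be even for the sake of simplicity. In the Winograd algorithm, the product C = A × B is computed as

$$c_{ij} = \sum_{k=1}^{n/2} (a_{i,2k-1} + b_{2k,j})(a_{i,2k} + b_{2k-1,j}) - \alpha_i - \beta_j, \quad 1 \le i, j \le n,$$

where $\alpha_i = \sum_{k=1}^{n/2} a_{i,2k-1} a_{i,2k}$ and $\beta_j = \sum_{k=1}^{n/2} b_{2k-1,j} b_{2k,j}$.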
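As a sequential reference point, the computation can be sketched in a few lines of Python. This is a minimal illustration rather than any of the parallel designs discussed here: the function names `winograd_matmul` and `standard_matmul` are hypothetical, matrices are plain nested lists with 0-based indices, and n is assumed to be even.

```python
def standard_matmul(A, B):
    """Reference product using the standard inner-product equations."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def winograd_matmul(A, B):
    """Multiply two n x n matrices (n even) with Winograd's scheme."""
    n = len(A)
    assert n % 2 == 0, "this sketch assumes an even n"
    half = n // 2
    # Row coefficients (alpha) and column coefficients (beta), each computed once.
    alpha = [sum(A[i][2 * k] * A[i][2 * k + 1] for k in range(half))
             for i in range(n)]
    beta = [sum(B[2 * k][j] * B[2 * k + 1][j] for k in range(half))
            for j in range(n)]
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0
            for k in range(half):
                # 0-based form of (a_{i,2k-1} + b_{2k,j}) * (a_{i,2k} + b_{2k-1,j})
                acc += (A[i][2 * k] + B[2 * k + 1][j]) * (A[i][2 * k + 1] + B[2 * k][j])
            C[i][j] = acc - alpha[i] - beta[j]
    return C
```

Each inner step uses one multiplication instead of two, at the cost of extra additions and the one-off row and column coefficients.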
The advantage of this algorithm is that the coefficient $\alpha_i$ ($\beta_j$) needs to be evaluated only once, although it is used for the whole row i (column j) of the matrix D. For convenience of analysis, we may rewrite the above equation as follows:

We consider only the scheduling and processor assignment of the shaded circular nodes and ignore the time requirements for computing $\alpha_{ik}$ and $\beta_{kj}$ when evaluating the total execution time of the parallel algorithm. To give a unified approach to the design of various types of parallel Winograd algorithms, we adopt the design method described in Reference 13. In Reference 13, the timing schedule and processor assignment of nodes in a dependence graph (DG) are represented by a timing level table (TLT) and a processor assignment table (PAT), respectively. The TLT is a three-dimensional array and the PAT is a two-dimensional array. Let r, s, and q be the first, second, and third dimensions of the TLT, respectively. Depending on the chosen projection direction (k, j or i), (r s q) is set to (i j k), (i k j) or (k j i), respectively. The number $t_{rsq}$ at position (r, s, q) of the TLT specifies that the computation of $e_{ijk}$ is performed at time $t_{rsq}$, and the number $p_{\gamma\delta}$ at position ($\gamma$, $\delta$) of the PAT specifies that this computation is performed by processor $p_{\gamma\delta}$.

Before introducing various designs for parallel algorithms, we first provide some definitions.
The utilisation U of processors in an algorithm is the average fraction of time that the processors are busy performing operations. Utilisation is computed as follows.
Let K be the number of processors, T be the execution time of the algorithm in units of $\tau$, N be the number of primitive operations in the algorithm, and $\tau$ be the computation time of a primitive operation. Then

$$U = \frac{N\tau}{KT}$$
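The definition translates directly into a small helper. The sketch below assumes integer inputs so that the result can be kept as an exact fraction; the function name `utilisation` and the numbers in the example call are hypothetical and are not taken from the paper's tables.

```python
from fractions import Fraction


def utilisation(num_ops, op_time, num_procs, exec_time):
    """Return U = N * tau / (K * T) as an exact fraction (integer inputs assumed)."""
    return Fraction(num_ops * op_time, num_procs * exec_time)


# Hypothetical figures: 32 unit-time operations on 8 processors over 8 time units.
print(utilisation(32, 1, 8, 8))  # 1/2
```

A value of 1 means that every processor is busy in every time step.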
We use the following naming convention to specify various parallel algorithms. We divide the name into two parts. The first part specifies the type of algorithm and the second part specifies the selected projection direction. 
Systolic array Sk
There have been many papers dealing with the design of conventional systolic arrays, so we omit the details. A possible design instance (n = 4) of the TLT and PAT is shown in Table 1 and Table 2. From Table 2, we know that the utilisation of each processor is very low.
Systolic array Si
To increase the utilisation of each processor, we can use the i-direction as the projection direction. With Tables 3a and b as the TLT and PAT, we obtain the parallel algorithm Algorithm Si shown in Fig. 3. This algorithm is also implemented on a conventional systolic array. Fig. 4 shows the operations performed by the processors. Circular processors compute the $\alpha_{ik}$'s and rectangular processors compute both the $\beta_{kj}$'s and the $e_{ijk}$'s. The total execution time is also 5n/2 - 2 time steps, so the utilisation of each processor is n/(5n/2 - 2). This design is similar to the Winograd matrix multiplication array designed by Jagadish and Kailath [SI.
Cylindrical array Ck
Then, we find a PAT compatible with the TLT we have just constructed. After determining the TLT and PAT, we can obtain a parallel algorithm. A possible design instance of the TLT and PAT is shown in Tables 4a and b. It corresponds to the parallel algorithm Algorithm Ck shown in Fig. 5. This algorithm is implemented on a cylindrical array. The total execution time is now reduced to 3n/2 - 1 time steps. The utilisation of each processor is (n/2)/(3n/2 - 1).
Two-layered mesh array Xk
If we use Table 4a and Table 5 as the TLT and PAT, then we obtain the parallel algorithm Algorithm Xk shown in Fig. 6. The total execution time and the utilisation of each processor are the same as for Algorithm Ck, but this architecture uses local connections instead of global connections. The execution sequence of this algorithm is shown in Table 6.
Modified two-layered mesh array MXk
Because the array of Fig. 6 is symmetrical about its central horizontal line, we can apply the cut-and-pile method [20] along that line to obtain the algorithm Algorithm MXk shown in Fig. 7, where we add vertical links to drain out $c_{ij}$ of C from the array. This algorithm is the same as that proposed by Benaini and Robert [4]. The execution sequence of this algorithm is shown in Table 7. Comparing Table 6 with Table 7, we see that the TLT of Algorithm MXk is the same as that of Algorithm Xk, but the utilisation of each processor for Algorithm MXk is n/(3n/2 - 1), which is twice that achieved by Algorithm Xk.
Cylindrical array Ci
If we use Tables 8a and b as the TLT and PAT, then we obtain the parallel algorithm Algorithm Ci shown in Fig. 8. The number of time steps required for this algorithm is 2n - 1. The utilisation of each processor is n/(2n - 1).
Torus array Tk
If we use Tables 9a and b as the TLT and PAT, then we obtain the parallel algorithm Algorithm Tk shown in Fig. 9. The steps for designing a torus array algorithm are as follows:
(i) Construct a TLT t such that $[t_{rs}]$ is an ordered or a permuted Latin square and t is a Latin cube [19].
(ii) According to the data flow dependence graph, we can find a PAT compatible with the above TLT t. After deciding the TLT and PAT, we obtain a torus array algorithm for the parallel Winograd algorithm. The number of time steps required for this algorithm is n. The utilisation of each processor is i.
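To illustrate the Latin-square and Latin-cube conditions placed on the TLT, the following sketch builds a cyclic cube and checks it. It rests on several assumptions: the cyclic rule t[r][s][q] = (r + s + q) mod m + 1 is only one possible construction and is not claimed to match Tables 9a and b, the cube is taken to have the same order m in all three dimensions, and all function names are hypothetical.

```python
def cyclic_latin_cube(m):
    """Build t[r][s][q] = (r + s + q) mod m + 1, a cyclic Latin cube of order m."""
    return [[[(r + s + q) % m + 1 for q in range(m)] for s in range(m)]
            for r in range(m)]


def is_latin_square(square):
    """True if every row and every column holds each of 1..m exactly once."""
    m = len(square)
    symbols = set(range(1, m + 1))
    rows_ok = all(set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(m)} == symbols for c in range(m))
    return rows_ok and cols_ok


def is_latin_cube(t):
    """True if every axis-aligned slice of t is a Latin square."""
    m = len(t)
    fixed_r = all(is_latin_square(t[r]) for r in range(m))
    fixed_s = all(is_latin_square([[t[r][s][q] for q in range(m)] for r in range(m)])
                  for s in range(m))
    fixed_q = all(is_latin_square([[t[r][s][q] for s in range(m)] for r in range(m)])
                  for q in range(m))
    return fixed_r and fixed_s and fixed_q
```

With such a table, the nodes along any line of the cube are scheduled at m distinct time steps, so no two of them compete for the same step.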
Torus array Ti
If we use Tables 10a and b as the TLT and PAT, then we obtain the parallel algorithm Algorithm Ti shown in Fig. 10. The number of time steps required for this algorithm is n. The utilisation of each processor is 1.
Fig. 6 Two-layered mesh array for the Winograd algorithm
The performance of these algorithms is compared in Table 11. The results show that although systolic arrays (which execute systolic algorithms) have simpler wiring, their execution times are longer and their processor utilisation is lower than those of the other arrays. Nonplanar arrays, such as the cylindrical array and the torus array, perform better than systolic arrays but require more complex wiring. Among these arrays, the torus array Ti is the most efficient, because each processor of the array is fully utilised. In Table 11, the estimation of time is based on the assumption that the operations are synchronised at the cell level. In the near future, the proposed approach will be adopted to design parallel Winograd algorithms executed on arrays synchronised at the operator level.
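The closed-form figures quoted in the individual sections can be tabulated for a concrete n. The sketch below is an illustration only and does not reproduce Table 11: it covers just the algorithms for which both the number of time steps and the utilisation are stated in closed form in the text (Si, Ck, Xk, MXk, Ci and Ti), and the function name `performance` is hypothetical.

```python
from fractions import Fraction


def performance(n):
    """Map algorithm name to (time steps, utilisation) for an even n."""
    assert n % 2 == 0, "the text assumes an even n"
    t_systolic = 5 * n // 2 - 2   # Si: 5n/2 - 2 time steps
    t_short = 3 * n // 2 - 1      # Ck, Xk, MXk: 3n/2 - 1 time steps
    t_cyl_i = 2 * n - 1           # Ci: 2n - 1 time steps
    return {
        "Si": (t_systolic, Fraction(n, t_systolic)),
        "Ck": (t_short, Fraction(n // 2, t_short)),
        "Xk": (t_short, Fraction(n // 2, t_short)),
        "MXk": (t_short, Fraction(n, t_short)),
        "Ci": (t_cyl_i, Fraction(n, t_cyl_i)),
        "Ti": (n, Fraction(1)),
    }


for name, (steps, util) in performance(4).items():
    print(f"{name:>3}: {steps} steps, utilisation {util}")
```

For n = 4 this gives 8 steps at utilisation 1/2 for Si and 4 steps at utilisation 1 for Ti, in line with the ranking described in the text.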
Fig. 7 Modified two-layered mesh array for the Winograd algorithm

