Abstract
Introduction
Linear algebra and matrix computation play a central role in modern digital signal processing. Representative areas in which matrix computation is important include beamforming, direction finding, spectrum analysis, image processing, etc. For applications with substantial computational requirements, e.g. when a real time implementation is aimed at, parallel algorithms and architectures are indispensable. Systolic or wavefront arrays then provide a means of exploiting large amounts of parallelism and hence achieving orders of magnitude improvement in performance, consistent with the requirements 1 .
Practical parallel algorithms for most of the key matrix operations are by now available. As for the matrix decomposition problems, most often methods are selected that are based on locally computed plane transformations. All of these are inspired by Jacobi's algorithm for the symmetric eigenvalue decomposition (SEVD). Parallel Jacobitype algorithms are available for the Schur decomposition 2 , the singular value decomposition (SVD) 3 , generalized SVD's, the QR decomposition, etc.
In the next section, we first briefly review the conventional systolic implementation for Jacobi-type algorithms. The processor utilization is seen to be particularly low, as each processor is active only one fourth of the time. Next, we show how a different partitioning results in a completely different array, which is over four times more efficient.
Systolic arrays for Jacobi-type algorithms * This work was sponsored (partly) by the BRA 3280 project of the EC.
Jacobi-type algorithms are mostly iterative algorithms, that solve matrix decomposition problems by repeatedly solving smaller 2 × 2 subproblems 2 . Figure 1 exhibits the prototype systolic implementation for various such Jacobi-type algorithms. Each dot represents a matrix element. Without loss of generality, we assume that the matrix is reduced to triangular form 3 . For square matrices, the lower triangular part is merely a replica of the upper triangular part.
The array operations are then as follows 2 : -On the main diagonal relevant 2 ×2 transformations are computed, e.g. for an SVD annihilating the available off-diagonal element. With each 2 × 2 block, a pivot index is associated, which equals the row index (column index) of the element in the upper left corner. Row transformation parameters are passed on to the right, while column transformation parameters are passed on upwards.
-Off-diagonal blocks only apply these transformations and propagate them to the blocks next outward.
In Phase c, the first set of rotations has moved far enough to allow the next set of rotations to be generated, etc. Alternately, 2 × 2 blocks corresponding to odd and even pivot indices are selected (odd-even ordering).
From this description, it seems natural to think of a computational network that consists of a triangular array of storage cells (dots in Figure 2) , each corresponding to one matrix element, together with a grid of processors (squares in Figure 2 ), corresponding to 2 × 2 subblocks 2;4 . Each processor has access to four surrounding storage cells. The number of processors is roughly half the number of matrix elements. The processors are interconnected horizontally and vertically in order to allow the transformations to pass through the matrix. Each processor is active during only one of the phases a, b, c or d (Figure 1 ).
An hexagonally connected grid
As each processor in the original configuration is thus active for only one fourth of the time, it is obvious that four contiguous processors can be merged into one new processor. Remarkably enough, this merger completely changes the outlook of the array. Figure 3 exhibits how processors can be merged, and indicates "activity regions" for the new processors. Each processor now has access to twelve storage cells. It alternately performs operations on the left-most, right-most, lower and upper 2 × 2 block in its activity region, corresponding to phases a-b-c-d in Figure 1 respectively. As each processor has 4 matrix elements "of its own", plus 8 elements shared with one other processor, the number of processors equals one eight of the number of matrix elements
What is remarkable about this partitioning, is that the original orthogonal processor configuration is turned into an hexagonally connected grid. From Figure 3 , it follows that each processor has six neighbours, with which it exchanges either column transformations (neighbour 2 and 5), row transformations (neighbour 3 and 6), or both row and column transformations (neighbour 1 and 4). The resulting grid is seen to be hexagonally connected, with a principal axis parallel to the main diagonal (Figure 4) . Apart from the boundary processors which associate with fewer storage cells, all processors are active throughout, so that the overall efficiency is roughly 100%.
Discussion
The alternative partitioning provides a means of doing the same job in the same time period, with less hardware. There are still a few additional savings as follows :
-Due to the fact that half of the matrix elements are associated with only one processor, the corresponding storage cells actually represent internal memory. The number of external memory accesses is therefore roughly halved.
-The row transformation for phase a is used in the same processor in phase b. In between phases a and b, thus only column transformation parameters need to be communicated from one processor to another. The same holds true for phases c and d, where only row transformations need to be communicated. As a result, the interprocessor communication time is reduced to roughly 75%.
-Finally, let us only mention that due to the fact that each processor is active throughout, it is possible to locally pipeline or parallelize certain operations within each processor.
In conclusion -and somewhat contradictory-it is seen that an increased throughput can be attained with roughly four times less hardware. 
