Introduction
In this paper, we are concerned with the parallel implementation on distributed memory MIND parallel computers of the LAPACK routines for performing the reduction to Hessenberg form and the reduction to tridiagonal form. These reductions are an important first step in the computation of the eigenvalues of matrices.
The LAPACK project is an effort to update the classical linear algebra software packages LIN-PACK and EISPACK to allow more efficient use of shared memory or traditional supercomputers. Efficiency is attained by writing these routines as much as possible in Level 2 and 3 BLAS [S, 61, reducing the ratio of memory accesses to floating point operations executed and allowing for encapsulation of parallel operations on shared memory architectures.
While parallel implementations of algorithms for solving linear systems have been widely studied [4, 91, the reduction to condensed form has not enjoyed the same attention. A p a r d e l unblocked Hessenberg reduction algorithm based on column wrapped storage is given in [lo, 111. In [8], a reduction based on Gaussian transformations is reported. The reduction of symmetric matrices assuming row wrapped and grid wrapped storage is addressed in [2, 31. Our approach is different in that we start with highly efficient sequential code [7] . Efficiency on each node is attained by use of Level 1, 2, and 3 BLAS. Communication is through a proposed communication library, the Basic Linear Algebra Communication Subprograms (BLACS) [l] , which makes the code portable.
The paper is organized as follows: Assumptions and notation are given in Section 2. As an introduction t o the parallel implementation of blocked algorithms, unblocked algorithms and their parallel implementation are given in Section 3. Blocked versions are discussed in Section 4. Results from experiments on the Intel Touchstone Delta system can be found in Section 5. Concluding remarks are given in the final section.
Assumptions and Notation
We wiU assume that our multicomputer consists of p nodes, labeled Po,. . . , Pp-l which are connected by some communication network that allows broadcasting of messages and combining of global data (in the form of global summation).
For our formulae, we adopt the following notation: Scalars, vectors, and matrices are denoted by lower case Greek, lower case, and upper case arabic letters, respectively. The ith element of a vector is denoted by a corresponding greek letter with subscript i (x;, vi, Y;, and u; for vectors z, y, a, and o, respectively). Given a vector z, the vector consisting of its elements i, . . , j is denoted by xi:,. Given matrix A , the submatrix consisting of elements of rows i, . . . , j and columns k, .. . We will use the following mapping of matrices to nodes: Given A E Rnxn and panel width m 2 1, assume for simplicity that n = T t m and partition where Aj (k) E Rnx" is a panel of width m. The panel-wrapped storage scheme assigns A?' to node P(j-l)mo+. I.e., A;+1, A i + p + l , . . . are assigned to P;. If m = 1, the result is the familar coIurnnwrapped storage scheme [9] . For notational convenience, we define j E P; to be true if and only if column j of the matrix is assigned to node Pi.
The basic operations utilized by the reduction algorithms are the computation and application of Householder transformations:
Theorem 1 Given a vector x E R", one can find a vector u E K" and scalar p s.t.
(1 -PuaT)z = ( X I , . . . , X k , f?), 0,. . -7 o)T where 77 = (lzk+1:nl12-Indeed, u = (0, . . . , 0, x k + l q , x k + 2 , . . . , x , )~ and p = 2/uTu will give the desired result. The sign is chosen to correspond to the sign of X k + l , thereby minimizing roundoff error in the computation of u.
The transformation I -PuuT will subsequently be denoted by H ( k ) ( z ) , where here the superscript indicates that elements XI,. . . , Xk are not affected. This notation is consistent with the previous use of superscripts since in the reduction algorithms the Householder transformation computed during the kth iteration has this property. We will also use the pair (u,,f?) to denote the transformation, Le., (u,,f?) = H ( k ) ( s ) will denote the vector u and scalar p s.t. H ( k ) ) z )
Since u and p are not uniquely defined, we wjU always take u to be normalized so that it has a unit kth element.
Unblocked Algorithms
In this section, we explain how simple algorithms for the reductions to Hessenberg and tridiagonal forms for the eigenvalue computation can be implemented on sequential and parallel architectures.
Sequential Implementation: Hessenberg Reduction
The reduction of matrix A(') = A to Hessenberg form can be written as A("-'), where where vT = and
This yields the following algorithm for reducing a matrix to Hessenberg form:
3.2
If A is symmetric, then Equations (1) can be replaced by y = PAu and = w = y -1/2,f?uTyu, and the matrix is being reduced t o tridiagonal form. In this case, it is only necessary to update the lower triangular part of matrix A at each iteration.
Sequential Implement at ion: Tridiagonal Reduct ion

Parallel Implementation: Hessenberg Reduction
Given p processing nodes P o , . . . , Pp-l, our parallel implementation will assume that the columns of A have been assigned to the nodes in column-wrapped fashion. This choice of assignment allows us to parallelize Algorithm 2 as follows:
1. For all k, updating of column j of matrix A is performed by node P(j-l)modp.
2.
During the kth iteration, the computation of ( u , p ) is performed by P; such that k E P;, , i.e., P(k-l)modp, after which it is distributed to all nodes.
3.
Subtracting the j t h column of PuvT from column j requires only j t h element of v, u j , to be known to the node that owns column j. This is convenient, since uj = uT[A]*,j, which can be formed by this node once u has been received. This means can be computed in parallel, leaving the different elements of II on the nodes that computed them.
4.
Subtracting the j t h column of PwuT from column j requires both vj and w = Au to be known to node P(j-l)modp. Vector w E Rn is computed as follows: Let B; equal the columns of A that are assigned to node P;. If the corresponding elements of u are appropriately packed into a vector U T , then Av = x&,odgy;, where y; = B;uT. Hence Au can be formed by first computing partial results y; in parallel on all nodes, followed by a global summation of the partial results, leaving Au on all nodes. Next, uTAu = vTy and w can be formed. Notice that there is some (insignificant) redundant computation in this last step, since all nodes perform the same computation.
The resulting parallel implementation of Algorithm 2 is given by the following pseudo-code that drives each node P;:
( 1 7 gsum w = y;
(15)
so that all processors participate in subtracting P(uTy)u before the global summation.
Parallel Implementation: Tridiagonal Reduction
Parallel implementation of the reduction to tridiagonal form for a symmetric A proceeds similarly, with one major difference: Since only the lower triangular part of matrix A contains useful information, we compute y as follows: Let A = L +-R, where L and R equal the lower triangular and strictly upper triangular parts of A, respectively. Notice that RT equals the strictly lower triangular portion of L , and hence both are assigned to nodes in column-wrapped fashion. Now y = Au = Lu + Ru can be computed by:
Blocked Algorithms
In [7] it is shown how reorganizing portions of the above algorithms in terms of Level 3 BLAS yields algorithms that perform considerably better on computers with vector processors and/or hierarchical memories. In this section we discuss sequential blocked algorithms for reduction to Hessenberg and tridiagond form as well as their parallel implementation.
4.1
We first consider how the application of rn Householder transformations can be combined:
Sequential Implementation: Blocked Hessenberg Reduction
The general strategy for reorganizing Algorithm 2 now becomes:
1. Partition the matrix into panels of width rn. Notice that the third step can now be written as two matrix-matrix operations. The bulk of the formation of the matrices requires m matrix-vector operations.
4.2
The blocked algorithm for the reduction to tridiagond form for the symmetric problem is reorganized similarly, except that in this case W = V , so Equation 
4.3
We now describe the parallel implementation of the blocked reduction t o Hessenberg form. We will use panel-wrapped storage, where the panel width corresponds to rn, the width of the panel used for the sequential blocked algorithm.
Understanding how to perform the computation in parallel is closely related to how matrices U , V , and W must be distributed in order to be able to perform the update in If we update Aik) on node P(j-l)mo+, then U , W and V, must be known to this node. Hence we must compute these matrices in such a way that U and W eventually reside on all nodes, while VT is panel-wrapped distributed among the nodes.
Finally, we examine how the computation of U , V, and 14' can be distributed among the nodes.
Assume the computation has progressed to where panel s is being reduced, Le., k = (s -1)nz + 1.
Assume the first j columns of U , V , and W have been computed and are distributed as desired.
The computation of the (j -+ 1)st column of these matrices proceeds as follows:
1. On node P(s-l)mo~p, form the ( j t 1)st column of the current panel of A(k+j): 
The first requires partial sums of vectors to be accumulated on each processor, followed by a global summation of the results, leaving the results on all processors. The latter two can either be computed in the same way or they can be computed separately on each processor, leading to redundant computation, but less communication overhead. The space between two curves equals the time spent in the indicated operation. The times for the global sum (GSUM) and broadcast (BCAST) include some idle time that is due to load imbalance.
Parallel Implementation: Blocked Tridiagonal Reduction
The parallel implementation of the reduction to tridiagonal form for symmetric A proceeds similarly. 
Experiments
In this section, we report the performance of the parallel reduction algorithms on the Intel Touchstone Delta system using the Portland Group compiler and assembly coded single precision BLAS routines by Kuck and Associates. The Intel Touchstone Delta system is a distrjbuted-memory, message-passing multicomputer of the Multiple Instruction Multiple Data (MIMD) class developed jointly by the Defense Advanced Research Projects Agency (DARPA) and the Intel Corporation Again, the space between two curves equals the time spent in the indicated operation.
Gigaflops double precision, M 40 Gigaflops single precision, and an aggregate system memory of = 8
Gigabytes. The interconnection network employs a Mesh Routing Chip (MRC), developed at the California Institute of Technology, at each system node. Each MRC provides five channels, one for its associated node and four for its adjacent neighbors in the two-dimensional mesh. The channels are comprised of two, unidirectional buses: one for data flow into the MRC, one for data flow out of the MRC. The peak interprocessor communications bandwidth is z 30 MBytes/s in each direction.
The system supports explicit message-passing, with a latency of w 75 microseconds via worm-hole routing using a packet-based protocol. Interconnect blocking is minimized by interleaving packets associated with distinct messages which need to traverse the same interconnect path. Figure 1 shows the performance of the parallel reduction to Hessenberg form as a function of the problem size n and the block size nb for p = 128. Performance is most influenced by the performance of the Level 2 and 3 BLAS. From this graph, it can be concluded that nb = 3 yields reasonable performance. We will use this block size in subsequent discussions.
Reduction t o Hessenberg Form
Communication overhead is the main contributor to the reduction in performance, as can be seen from Figures 1 and 2. In particular, the global summation and broadcast operations are major contributors to the total execution time. This is not supprising, considering a broadcast of a vector of length O ( n ) and global summation of vectors of length n is required for each column of W that is formed (in addition to the summation of at least one smaller vector).
The performance attained as a function of problem size is clear from Figure 3 . In this graph, nb = 3 and performance is given for various numbers of nodes. The overall performance is somewhat disappointing: The LAPACK reduction routine on a single processor attains about 45 MFLOPS.
DELTk TriRaJnCIZ
i------- 
Reduction to Tridiagonal Form
Figure 1 also shows the execution time for the parallel reduction to tridiagonal form. From this graph, it can be concluded that large block sizes yield better performance. This is due to the fact that during the update given by Equation 2 the submatrix must be updated one panel at a time, since the lower triangular part of the matrix A is wrapped onto the processors. For the same reason, the performance of the matrix-vector product (BLAS2) is affected.
The overall performance of the reduction to tridiagonal form is worse than that of the reduction to Hessenberg form (Figure 3) . This citn be explained as follows: The number of floating point operations is reduced by a factor 2.5 as compared to the reduction to Hessenberg form. The time spent in the broadcast is unchanged. The time spent in the global summation is approximately halved. As a result, the ratio of communication t o computation is higher than for the reduction to Hessenberg form.
Conclusion
We have demonstrated that the LAPACK code for reducing a matrix to Ilessenberg or tridiagonal form can be rewritten for current generation MIMD distributed memory computers in a relatively straight forward manner.
On the Intel Touchstone Delta, efficiency is hampered to a large degree by the cost of communication and the synchronous nature of the algorithm. If larger problems are solved, this becomes less significant. Although the Intel Touchstone Delta system has sufficient memory to store matrices of order 25000, we limited ourselves to problems that required less than 30 minutes to complete.
We have started to investigate different methods for mapping matrices to nodes. In [13], we show that wrapping onto logical tori greatly improves the performance of the LU factorization on the Intel Touchstone Delta. Future work will include the investigation of using this storage method for the reduction algorithms.
[lo] G.W. Juszczak 
100.
101.
102.
103.
104.
105.
