I. INTRODUCTION
SOLUTION of many scientific and engineering problems requires large amounts of computing power. With advances in VLSI and parallel processing technology, it is now feasible to achieve high performance and even reach interactive or real-time speeds in solving complex problems. The drastic reduction in hardware costs has made parallel computers available to many users at affordable prices. However, in order to use these general purpose computers in a specific application, algorithms need to be developed and existing algorithms restructured for the architecture. The finite element method [1] is a powerful numerical technique for solving boundary value problems involving partial differential equations in engineering fields such as heat flow analysis, metal forming, and others. As a result of finite element discretization, linear equations in the form Ax = b are obtained, where A is large, sparse, and banded with proper ordering of the variables.
Methods for solving such equations on sequential computers [6] can be grouped into direct methods and iterative methods. Since the coefficient matrix A is very large in these applications, parallelization by distributing both data and computations has been of interest. The Conjugate Gradient (CG) algorithm is an iterative method for solving sparse matrix equations and is widely used because of its convergence properties. The sparsity of the matrix is preserved throughout the iterations, and the CG algorithm is easily parallelized on distributed-memory multiprocessors [11].
In this paper, the solution of such equations on distributed-memory message-passing multiprocessors implementing the hypercube [2] topology is addressed. In such an architecture, communication and coordination between processors are achieved through the exchange of messages. A d-dimensional hypercube consists of P = 2^d processors (nodes), with each processor directly connected to d other processors. A four-dimensional hypercube with binary encoding of the nodes is shown in Fig. 1. Note that the binary encoding of a processor differs from that of each of its neighbors in exactly one bit. Processors that are not directly connected can communicate through other processors by software or hardware routing. The maximum distance between any two processors in a d-dimensional hypercube is d. It has been shown that many other topologies such as meshes, trees, and rings can be embedded in a hypercube [7].
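Since neighboring node labels differ in exactly one bit, the d neighbors of a node are obtained by complementing each bit of its label in turn. A minimal C sketch of this property (the function name is illustrative, not from the paper):

```c
#include <stdio.h>

/* List the d neighbors of node `id` in a d-dimensional hypercube.
   The neighbor along dimension i is found by flipping bit i of the label. */
void print_neighbors(int id, int d)
{
    for (int i = 0; i < d; i++)
        printf("node %d <-> node %d (dimension %d)\n", id, id ^ (1 << i), i);
}

int main(void)
{
    print_neighbors(5, 4);   /* node 0101 of the four-dimensional cube in Fig. 1 */
    return 0;
}
```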
Achieving speedup through parallelism on such an architecture is not straightforward. The algorithm must be designed so that both computations and data can be distributed to the processors with local memories in such a way that computational tasks can run in parallel, balancing the computational loads of the processors as much as possible [8]. Communication between processors to exchange data must also be considered as part of the algorithm. One important factor in designing parallel algorithms is granularity [9]. Granularity depends on both the application and the parallel machine. On a parallel machine with a high communication latency, the algorithm designer must structure the algorithm so that large amounts of computation are done between communication steps. Another factor affecting parallel algorithms is the ability of the parallel system to overlap communication and computation. The implementation described here achieves efficient parallelization by considering all these points in designing a parallel CG algorithm for hypercubes designed for coarse grain parallelism. In Section III, the communication requirements of different schemes for mapping finite element meshes onto the processors of a hypercube are analyzed with respect to the communication parameters of the architecture.
II. THE SCALED CONJUGATE GRADIENT (SCG) ALGORITHM

Symmetric diagonal scaling transforms the linear system Ax = b into

Â x̂ = b̂   (2)

where Â = D^{-1/2} A D^{-1/2} is the scaled coefficient matrix with unit diagonal entries, x̂ = D^{1/2} x, b̂ = D^{-1/2} b, and D is the diagonal of A. Thus, b is also scaled, and x̂ must be scaled back at the end to obtain x. The eigenvalues of the scaled matrix Â are more likely to be grouped together than those of the unscaled matrix A, thus resulting in a better condition number [12]. Hence, in the Scaled CG (SCG) algorithm, the CG method is applied to (2) obtained after scaling. The scaling process during the initialization phase requires only ≈ 2 × z × N multiplications, where z is the average number of nonzero entries per row of the A matrix. Symmetric scaling increases the convergence rate of the basic CG algorithm by approximately 50 percent for a wide range of sample metal deformation problems. In the rest of the paper, the scaled linear system will be denoted by Ax = b.
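For later reference, one iteration of the (scaled) CG method can be sketched as follows; the step numbering matches the references in Section III (the matrix-vector product as Step 1, the two dot products as Steps 2 and 5, and the three vector updates as Steps 3, 4, and 6). A dense matrix-vector product stands in for the sparse one here; this is a minimal sequential sketch, not the authors' implementation.

```c
#include <stddef.h>

static double dot(const double *u, const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += u[i] * v[i];
    return s;
}

/* One CG iteration on Ax = b; `rr` carries (r_k, r_k) across iterations. */
void cg_iteration(size_t n, const double *A, double *x, double *r,
                  double *p, double *q, double *rr)
{
    /* Step 1: q = A p (a sparse matrix-vector product in the actual solver) */
    for (size_t i = 0; i < n; i++) {
        q[i] = 0.0;
        for (size_t j = 0; j < n; j++) q[i] += A[i * n + j] * p[j];
    }
    /* Step 2: dot product (p, q) and alpha = (r, r) / (p, q) */
    double alpha = *rr / dot(p, q, n);
    /* Steps 3 and 4: vector updates r -= alpha q, x += alpha p */
    for (size_t i = 0; i < n; i++) { r[i] -= alpha * q[i]; x[i] += alpha * p[i]; }
    /* Step 5: dot product (r, r) and beta = (r_new, r_new) / (r_old, r_old) */
    double rr_new = dot(r, r, n);
    double beta = rr_new / *rr;
    *rr = rr_new;
    /* Step 6: vector update p = r + beta p */
    for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
}
```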
III. MAPPING CG COMPUTATIONS ONTO A HYPERCUBE
The effective parallel implementation of the CG algorithm on a hypercube parallel computer requires the partitioning and mapping of the computation among the processors in a manner that results in low interprocessor communication overhead. This section first describes the nature of the communication required, outlines two approaches to mapping the computation onto the hypercube processors, and then evaluates their relative effectiveness as a function of communication parameters of the hypercube multiprocessor system.
A. Communication Requirements of the CG Algorithm
The communication considerations in distributing the CG algorithm among the processors of a distributed-memory parallel computer may be understood with reference to Fig. 2. Fig. 2(b) displays the structure of a sparse matrix resulting from the finite element discretization of the simple rectangular region shown in Fig. 2(a). The discretization uses four-node rectilinear elements. In Fig. 2(a), the diagonals of the finite elements are joined by edges to give a finite element interaction graph, whose structure bears a direct relation to the zero-nonzero structure of the sparse system of equations that characterizes the discretization. Each node in a 2-D finite element graph is associated with two variables, corresponding to two degrees of freedom. Each nodal degree of freedom has a corresponding row in the matrix A and is associated with a component in the vectors x, b, r, p, and q. Furthermore, the nonzeros in that row (column) of A occur only in positions corresponding to finite element nodes directly connected to that node in Fig. 2(a). The figure shows only a single point corresponding to the two degrees of freedom of a node. The matrix A and the vectors x, b, r, p, and q are shown partitioned and assigned to the processors of a two-dimensional hypercube. The partitioning of the matrix A and the vectors can equivalently (and more conveniently) be viewed in terms of the partitioning of the corresponding nodes of the finite element interaction graph itself, as shown in Fig. 2(a).

If the values of α_k and β_k are known at all the processors, the vector updates in Steps 3, 4, and 6 of the CG algorithm can clearly be performed very simply in a distributed fashion without requiring any interprocessor communication. The individual pairwise multiplications for the dot products in Steps 2 and 5 can also be performed locally in each processor. If each processor then forms a partial sum of the locally generated products, a global sum of the accumulated partial sums will yield the required dot product. Considering the arithmetic computations required in Steps 2-6, if the rows of A are evenly distributed among the processors, each processor will perform exactly the same amount of computation per phase of the CG algorithm. Thus, with respect to Steps 2-6 of the algorithm, any balanced mapping of the finite element nodes among the processors is essentially equivalent in terms of the total amount of computation and communication. This, however, is not the case for Step 1, as discussed below.
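In code, such a distributed dot product is a local loop followed by a global sum. The sketch below uses MPI's allreduce in place of the hypercube GS-GB primitives discussed later; `n_local`, the number of locally mapped vector components, is an assumed parameter.

```c
#include <mpi.h>

/* Dot product of distributed vectors: local partial sum, then global sum.
   Every processor ends up with the full (u, v), as needed for alpha and beta. */
double distributed_dot(const double *u, const double *v, int n_local)
{
    double partial = 0.0, total = 0.0;
    for (int i = 0; i < n_local; i++)
        partial += u[i] * v[i];                 /* local pairwise products */
    MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;                               /* global sum of partials  */
}
```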
Step 1 of the algorithm requires a sparse matrix-vector product. This involves forming the sparse dot product of each row of A with the dense vector p, necessitating interprocessor communication to obtain the necessary nonlocal components of the p vector. Due to the relation between the nonzero structure of A and the interconnection structure of the finite element interaction graph, the required interprocessor communication is more easily seen from Fig. 2(a) than directly from Fig. 2(b): two processors need to communicate if any node mapped onto one of them shares an edge with any node mapped onto the other. Thus, the interprocessor communication incurred with a given partitioning of the matrix A and the vectors x, b, r, p, and q can be determined by looking at the edges of the finite element interaction graph that cross between processors. Therefore, in treating the partitioning of the sparse matrix A for its efficient solution using a parallel CG algorithm, in what follows, the structure of the finite element graph or the associated finite element interaction graph is referred to rather than the structure of the A matrix itself.
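The following sketch makes this concrete: given a node-to-processor map `part[]` over the interaction graph and its edge list, the communication partners and message volume of one processor can be read off from the cross edges (all names here are illustrative, not from the paper).

```c
/* Determine the communication structure implied by a partitioning:
   processors p and q must communicate iff some interaction-graph edge
   (a, b) has part[a] = p and part[b] = q with p != q. */
void comm_structure(int me, int n_edges, const int (*edge)[2],
                    const int *part, int *n_partners, int *n_cross_edges)
{
    int partner[1024] = {0};            /* assumes at most 1024 processors */
    *n_partners = *n_cross_edges = 0;
    for (int e = 0; e < n_edges; e++) {
        int pa = part[edge[e][0]], pb = part[edge[e][1]];
        if (pa == pb || (pa != me && pb != me))
            continue;                   /* internal edge, or not ours      */
        int other = (pa == me) ? pb : pa;
        (*n_cross_edges)++;             /* adds to the message volume      */
        if (!partner[other]) {          /* each partner adds a setup cost  */
            partner[other] = 1;
            (*n_partners)++;
        }
    }
}
```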
The time taken to perform an interprocessor communication on the Intel iPSC/2 system is the sum of two components: 1) a setup cost S_c that is relatively independent of the size of the message transmitted, and 2) a transmission cost T_c that is linearly proportional to the size of the message transmitted. Thus,
T_comm = S_c + l × T_c, where l is the number of words transmitted. The setup cost is essentially independent of the distance between the communicating processors, but it is a nontrivial component of the total communication time unless the message is several thousand bytes long. Since an additional setup cost must be paid for each processor communicated with, a mapping that minimizes communication costs must minimize not only the total number of bytes communicated but also the number of distinct processors communicated with. The first of the two mapping schemes described, the 1-D strip-mapping approach [13], minimizes the number of processors that each processor needs to communicate with, while simultaneously keeping the volume of communication moderately low. The second scheme, the 2-D mapping approach [13], lowers the volume of communication but requires more processor pairs to communicate. The two schemes are compared, and it is shown that for the values of the communication parameters of the Intel iPSC/2 and the range of problem sizes of current interest, the 1-D strip mapping scheme is the more attractive one.
B. 1-D Strip Partitioning
The 1-D strip-mapping scheme attempts to partition the finite element graph into strips in such a way that the nodes in any strip are connected only to nodes in the two neighboring strips. By assigning a strip partition to each processor, the maximum number of processors that any processor needs to communicate with is limited to two. The procedure can be understood with reference to Fig. 3. The finite element graph shown has 400 nodes, so a load-balanced mapping of the mesh onto a two-dimensional hypercube with four nodes assigns 100 nodes per processor. Starting at the top of the leftmost column of the mesh, nodes in that column are counted off until 100 nodes are visited. Since the number of nodes in the leftmost column is less than 100, the column immediately to its right is visited next, starting again at the top. By so scanning columns from left to right, 100 nodes are picked off and assigned to P0. Continuing similarly, another strip of 100 nodes is formed and assigned to P1. P2 and P3 are assigned the next two such strips.
Thus, by only using a subset of the links of the hypercube, a linear chain of the processors is formed and adjacent strips generated by the strip mapping are allocated to adjacent processors in the linear chain. If the finite element mesh is large enough, such a load-balanced 1-D strip mapping is generally feasible. The scheme described above can be extended to more general rectilinear finite element graphs that cannot be embedded onto a regular grid; details may be found in [13] .
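For a mesh stored as per-column node counts, the column scan described above reduces to a few lines; the data layout and names below are assumptions for illustration (the generalization to irregular graphs is given in [13]).

```c
/* 1-D strip partitioning by column scan: walk columns left to right,
   assigning nodes to processor p until p's quota of N/P nodes is filled.
   col_count[c] is the number of mesh nodes in column c; strip[] receives
   the processor assigned to each node in column-major visitation order. */
void strip_partition_1d(int n_cols, const int *col_count,
                        int n_nodes, int n_procs, int *strip)
{
    int quota = n_nodes / n_procs;  /* 100 for the 400-node, 4-processor example */
    int p = 0, filled = 0, v = 0;
    for (int c = 0; c < n_cols; c++) {
        for (int i = 0; i < col_count[c]; i++) {
            strip[v++] = p;
            if (++filled == quota && p < n_procs - 1) { p++; filled = 0; }
        }
    }
}
```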
C. 2-D Orthogonal Strip Partitioning
The partitions produced by 1-D strip mapping tend to require a relatively high volume of communication between processors due to the narrow but long shape of typical strips. The 2-D orthogonal partitioning method attempts to create partitions with a smaller number of boundary nodes, thereby reducing the volume of communication required. It involves the generation of two orthogonal 1-D strips. The hypercube parallel computer is now viewed as a P1 × P2 processor mesh.
A P1-way 1-D strip and a P2-way 1-D strip in the orthogonal direction are generated, as illustrated in Fig. 4 for mapping the mesh of Fig. 3 onto a 16-processor system. Partitions are now formed from the intersection regions of the two orthogonal 1-D strips, and can be expected to be more "square" (and consequently have a lower perimeter/area ratio) than those generated by a 16-way 1-D strip mapping. It can easily be shown that the generated partition satisfies the "nearest neighbor" property [13], i.e., each such partition can have connections to at most eight surrounding partitions. Furthermore, by using a synchronized communication strategy between mesh-connected processors, whereby in each iteration all processors first complete communications with horizontally connected processors before communicating with their vertically connected processors, each processor needs to perform at most four communications [13].
While each of the two orthogonal 1-D strip partitions is clearly load balanced, the intersection partitions in Fig. 4 are definitely not. Such a load imbalance among the intersection partitions can in general be expected. Consequently, the 2-D strip partitioning approach employs a second boundary refinement phase following the initial generation of the 2-D orthogonal strip partition. The boundary refinement procedure attempts to perform node transfers at the boundaries of partitions in such a way that the nearest neighbor property of the initial orthogonal partition is retained. The resulting partition after boundary refinement for the chosen example mesh is shown in Fig. 5. Details of the boundary refinement procedure and the generalization of the orthogonal 2-D mapping procedure to nonmesh finite element graphs may be found in [13].
D. Comparison of 1-D Versus 2-D Partitioning
In this subsection, the 1-D and 2-D approaches are compared with respect to the communication costs incurred for the matrix-vector product of Step 1 of the CG algorithm. To facilitate a comparison, first a simple analysis is made for the case of a square mesh finite element graph with m nodes on a side. The communication costs with a 1-D strip partition and a 2-D partition are formulated. A special case of 2-D orthogonal strip partitioning, where a two-way partition is made along one dimension, is also treated [Fig. 6(c)]. This special case is interesting since it requires a maximum of three communications by any processor in any iteration, in contrast to two and four, respectively, for the 1-D and general 2-D cases. Thus, the setup cost incurred with this special 2-D mapping lies between that of the other two, and the communication volume is also in between. This special case of 2-D orthogonal partitioning is hence referred to as a 1.5-D partition.
The load-balanced partitioning of an m × m node finite element mesh is shown in Fig. 6(a), (b), and (c) for the 1-D, 2-D, and 1.5-D cases, respectively. The synchronized interprocessor communication strategy alluded to earlier can be used with the 2-D and 1.5-D cases. By performing all horizontal communications before vertical communications, the values to be transferred diagonally can be transmitted in a store-and-forward fashion, without incurring an additional setup cost for the diagonal communications. The transfers required between diagonally related processors in the mesh are thus added to the volume of the intermediate processor's communication. The number of variables is twice the number of nodes in the sample finite element problems used here. In the 2-D case, then, each interior processor has an additional eight values added to its total communication volume, corresponding to the four diagonal transfers from its neighbors that it facilitates through store-and-forward transmission. By referring to Fig. 6, the communication time per iteration for each scheme can be expressed as
T_1D = 2 S_c + 4m T_c
T_1.5D = 3 S_c + (4 + 2m + 4m/P) T_c
T_2D = 4 S_c + (8 + 4m/P_1 + 4m/P_2) T_c,  for P_1, P_2 > 2.

The relative merit of one scheme over another is a function of P, S_c, T_c, and m. Comparing these expressions,

T_1D < T_1.5D iff m < (S_c/T_c + 4) / (2 − 4/P)
T_1.5D < T_2D iff m < (S_c/(2 T_c) + 2) / (1 + 2/P − 2/P_1 − 2/P_2).

From these inequalities, it is concluded that 1-D strip partitioning is optimal for small m, 1.5-D strip partitioning for intermediate m, and 2-D partitioning for large m.
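These crossover points are easily evaluated numerically. The sketch below tabulates the three model times using the iPSC/2 parameters measured in this section (S_c = 970 μs, T_c = 2.88 μs) for P = 16 and P_1 = P_2 = 4; with these values the 1-D/1.5-D crossover falls near m = 194 and the 1.5-D/2-D crossover near m = 1363.

```c
#include <stdio.h>

/* Per-iteration communication time (microseconds) of the three
   partitionings of an m x m mesh, using the cost model above. */
int main(void)
{
    const double Sc = 970.0, Tc = 2.88;     /* measured iPSC/2 parameters */
    const int P = 16, P1 = 4, P2 = 4;
    for (int m = 100; m <= 1500; m += 100) {
        double t1  = 2 * Sc + 4.0 * m * Tc;
        double t15 = 3 * Sc + (4 + 2.0 * m + 4.0 * m / P) * Tc;
        double t2  = 4 * Sc + (8 + 4.0 * m / P1 + 4.0 * m / P2) * Tc;
        printf("m=%4d  T1D=%8.0f  T1.5D=%8.0f  T2D=%8.0f\n", m, t1, t15, t2);
    }
    return 0;
}
```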
The above simplified analysis precludes the possibility of overlap between multiple out-bound communications from a processor. While such overlap is possible, the setup times for the individual communications are truly additive and cannot be overlapped. A more detailed analysis assuming overlap between the transmission times with succeeding setup times provides results similar to the above simplified analysis with respect to the ranges of m where each of the above schemes is optimal.
Using experimentally measured values of S_c = 970 μs and T_c = 2.88 μs per double-precision number, with P = 16 and P_1 = P_2 = √16 = 4, it is seen that the 1-D approach is superior to the other two for m < 194. This value of m is well above mesh sizes of interest in the context of a practically realistic finite element solution. While the above analysis considered a specific shape of finite element graph, it provides a good estimate of the order of magnitude of the finite element graph size that makes the 1.5-D or 2-D approaches worth using for a parallel finite element solver on the Intel iPSC/2. Table I summarizes the results obtained with
a number of finite element graphs using the three approaches. The total volume of communication required by the partition with the largest boundary is reported, as well as the predicted communication time for the local communication phase in each case. It can be seen that for every one of the examples, the partitions produced by the 1-D approach are clearly superior. As a consequence, only the communication protocol required by 1-D strip partitions was actually implemented on the Intel iPSC/2 system for the parallel CG algorithm. The formulation, implementation, and experimental performance measurement of the parallel CG algorithm are treated in the following sections.
IV. THE COARSE GRAIN SCG (CG-SCG) ALGORITHM

Strong data dependencies exist in the basic SCG algorithm which limit the available concurrency. The distributed inner product computation (p_k, q_k), which is required for the computation of the global scalar α_k, cannot be initiated until the global scalar β_{k−1} is computed. Similarly, the inner product (r_{k+1}, r_{k+1}), which is required for the computation of the global scalar β_k, cannot be computed until the global scalar α_k is computed. During each SCG iteration, three distributed vector updates which involve no communication and one matrix-vector product which involves only local communication cannot be initiated until the updating of these global scalars is completed. Hence, these data dependencies due to the inner product computations impose a fine granularity which degrades the performance of the algorithm on the hypercube.
The SCG algorithm requires the computation of two inner product terms at each iteration step. These inner product calculations require global information and are thus inherently sequential. The Global Sum (GS) and Global Broadcast (GB) operations [7] each require d communication steps and introduce a large amount of interprocessor communication overhead, due to the large setup time for communication. Furthermore, only one floating-point word is transferred and only one inner product term is accumulated per GS-GB step because of the data dependencies in the inner product computations. New formulations of the SCG algorithm that allow the parallel computation of the inner products are discussed in [10] and [14]. The formulation described in [10] is used here with a different motive, namely to reduce the number of setups for communication. This coarse grain SCG algorithm enables the computation of two inner products in one GS-GB step, as described in the next section. Further improvement is obtained by using a global sum algorithm that takes advantage of the two physical links between connected processors in the architecture to overlap communication in the two directions.
A. CG-SCG for s = 1
The rationale behind this formulation is to rearrange the computational steps so that two inner products can be communicated in each GS-GB communication step after computing the distributed sparse matrix-vector product q_k = A p_k. The problem is to find expressions for the global scalars α_k and β_k in terms of these inner products. Using the recurrence relation r_{k+1} = r_k − α_k q_k given in Step 3 of the SCG algorithm, we have

(r_{k+1}, r_{k+1}) = (r_k, r_k) − 2 α_k (r_k, q_k) + α_k² (q_k, q_k).   (10)

Using the recurrence relation defined for p_k in Step 6 of the SCG algorithm, r_k = p_k − β_{k−1} p_{k−1}, and since (p_{k−1}, q_k) = (p_{k−1}, A p_k) = 0 by the A-conjugacy of the direction vectors,

(r_k, q_k) = (p_k, q_k).   (11)

Substituting (11) into (10), and using α_k = (r_k, r_k)/(p_k, q_k) and β_k = (r_{k+1}, r_{k+1})/(r_k, r_k), one can obtain

β_k = α_k (q_k, q_k)/(p_k, q_k) − 1.

Therefore, the inner products (p_k, q_k) = (p_k, A p_k) and (q_k, q_k) = (A p_k, A p_k) are needed to compute the global scalars α_k and β_k. The value of the inner product (r_{k+1}, r_{k+1}), which is required for the computation of the global scalar α_{k+1} on the next iteration, can be computed in terms of the previous inner product (r_k, r_k) using (r_{k+1}, r_{k+1}) = β_k (r_k, r_k).
The initial inner product (r_0, r_0) is computed using the GS-GB algorithm. Hence, the steps of the coarse grain parallel SCG algorithm can be given as follows.
Choose x_0, let r_0 = p_0 = b − A x_0 and compute (r_0, r_0). Then, for k = 0, 1, 2, …
1. form q_k = A p_k
2. form (p_k, q_k) and (q_k, q_k) (in one GS-GB communication step)
3. a) α_k = (r_k, r_k)/(p_k, q_k)
   b) β_k = α_k (q_k, q_k)/(p_k, q_k) − 1
   c) (r_{k+1}, r_{k+1}) = β_k (r_k, r_k)
4. x_{k+1} = x_k + α_k p_k
   r_{k+1} = r_k − α_k q_k
   p_{k+1} = r_{k+1} + β_k p_k.
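A sketch of one CG-SCG iteration follows, with the two partial inner products packed into a single global sum. MPI's allreduce stands in for the GS-GB (or Exchange-Add) step, and `spmv_local`, assumed to include the nearest neighbor halo exchange of Section III, is a placeholder routine.

```c
#include <mpi.h>

extern void spmv_local(const double *p, double *q, int n_local);  /* assumed */

/* One coarse-grain SCG iteration: q = A p needs only local (nearest
   neighbor) communication; the two inner products travel together in a
   single global reduction, halving the number of GS-GB steps. */
void cg_scg_iteration(double *x, double *r, double *p, double *q,
                      int n_local, double *rr /* carries (r_k, r_k) */)
{
    double part[2] = {0.0, 0.0}, glob[2];

    spmv_local(p, q, n_local);               /* Step 1                      */
    for (int i = 0; i < n_local; i++) {      /* Step 2: local partial sums  */
        part[0] += p[i] * q[i];              /*   of (p, q)                 */
        part[1] += q[i] * q[i];              /*   and (q, q)                */
    }
    MPI_Allreduce(part, glob, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    double alpha = *rr / glob[0];                       /* Step 3a */
    double beta  = alpha * glob[1] / glob[0] - 1.0;     /* Step 3b */
    *rr = beta * *rr;                                   /* Step 3c */

    for (int i = 0; i < n_local; i++) {      /* Step 4: three vector updates */
        x[i] += alpha * p[i];
        r[i] -= alpha * q[i];
        p[i]  = r[i] + beta * p[i];
    }
}
```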
Hence, the number of GS-GB communication steps is reduced by a factor of two, from twice to once per iteration. Numerical results for a wide range of sample problems show that the proposed algorithm introduces no numerical instability and that it requires exactly the same number of iterations to converge as the basic SCG (B-SCG) algorithm. The solution times of different size sample finite element problems for the B-SCG and CG-SCG algorithms, on one- to four-dimensional iPSC/2 hypercubes, are given in Table II.
B. CG-SCG for s = 2
The rationale behind this formulation is to accumulate four inner products (p_{2k}, A p_{2k}), (A p_{2k}, A p_{2k}), (A p_{2k}, A²p_{2k}), and (A²p_{2k}, A²p_{2k}) in a single GS-GB communication step after computing two consecutive distributed matrix-vector products q_{2k} = A p_{2k} and u_{2k} = A q_{2k}, and then to compute the four global scalars α_{2k}, β_{2k}, α_{2k+1}, and β_{2k+1}. The derivation of the expressions for these global scalars is too cumbersome and hence is omitted here. The basic steps of the algorithm for s = 2 are given below.

1. form q_{2k} = A p_{2k} and u_{2k} = A q_{2k}
2. form (p_{2k}, q_{2k}), (q_{2k}, q_{2k}), (q_{2k}, u_{2k}), and (u_{2k}, u_{2k}) (in one GS-GB communication step)
3. compute α_{2k}, β_{2k}, α_{2k+1}, and β_{2k+1}
4. a) r_{2k+1} = r_{2k} − α_{2k} A p_{2k}
   b) A r_{2k+1} = A r_{2k} − α_{2k} A²p_{2k}
   c) p_{2k+1} = r_{2k+1} + β_{2k} p_{2k}
   d) A p_{2k+1} = A r_{2k+1} + β_{2k} A p_{2k}
   e) r_{2k+2} = r_{2k+1} − α_{2k+1} A p_{2k+1}
   f) p_{2k+2} = r_{2k+2} + β_{2k+1} p_{2k+1}
   and x_{2k+2} = x_{2k} + α_{2k} p_{2k} + α_{2k+1} p_{2k+1}.
Note that one iteration of the above algorithm corresponds to two iterations of the basic SCG algorithm. Hence, the number of global communication steps is reduced by a factor of four, that is, from twice per iteration to once every two iterations. The scalar computational overhead of this algorithm is 12 multiplications and 10 additions/subtractions per SCG iteration, which can be neglected for sufficiently large n, where n = N/P. However, there is also a single vector update overhead per SCG iteration, giving a percent computational overhead of ≈ (1/(z + 5)) × 100 percent, where z is the average number of nonzero entries per row of the A matrix. For z = 18, where each node interacts with eight other nodes and there are two degrees of freedom per FE node, the overhead is ≈ 4.4 percent. Hence, this algorithm is recommended only for solving sparse linear systems of equations with A matrices having large z, on large-dimensional hypercubes. This approach can be generalized to larger s values. However, the computational overhead increases with increasing s, and furthermore the numerical stability of the algorithm decreases with increasing s.
C. Comparison of B-SCG and CG-SCG
The solution times of the B-SCG and CG-SCG (for s = 1) algorithms for different size test FE problems on a four-dimensional hypercube, the iPSC/2, are given in Table II. The percentage performance improvement, η = [(T_{B-SCG} − T_{CG-SCG})/T_{CG-SCG}] × 100 percent, decreases with increasing problem size, since the ratio of the GS-GB communication time to the total solution time per iteration decreases with increasing problem size. Here, T_{B-SCG} denotes the time taken by the B-SCG algorithm and T_{CG-SCG} the time taken by the CG-SCG algorithm. For example, on the four-dimensional hypercube, η monotonically decreases from 24 percent for the smallest test problem T1 to 5 percent for the largest test problem T7. It can also be observed from Table II that η increases with increasing dimension of the hypercube, since the complexity of the GS-GB algorithm increases linearly with the dimension whereas the granularity of a problem decreases exponentially with the dimension, as the domain is divided among 2^d processors. For example, for the smallest test problem T1, η increases monotonically from 2 percent on the one-dimensional hypercube to 24 percent on the four-dimensional hypercube. Hence, the proposed CG-SCG algorithm is expected to result in significant performance improvement on large-dimensional hypercubes.
D. The Exchange-Add Algorithm
As already mentioned, to compute the inner products in Step 2 of the CG-SCG algorithm, the partial sums computed by each processor must be added to form the global sums ((p_k, q_k) and (q_k, q_k)).
Furthermore, since these values are needed in the next step by all processors, they must be distributed to all processors. The global sum algorithm [7], in which one processor accumulates the sum in d steps and then broadcasts it back in d more steps, incurs this overhead twice. The Exchange-Add (EA) algorithm, shown in Fig. 8, reduces it to d steps. The main idea in this algorithm is that each processor accumulates its own copy of the inner product instead of only one processor accumulating it and then broadcasting. Let the processors of a d-dimensional hypercube be represented by a d-bit binary number (z_{d−1} … z_1 z_0). Also, define channel i as the set of 2^{d−1} bidirectional communication links connecting pairs of neighbor processors whose representations differ only in bit position i. Then, the steps of the Exchange-Add algorithm can be given as follows.
Initially, each processor has its own partial sum. Then, for i = 0, 1, …, d − 1, each processor exchanges its current partial sum with its neighbor across channel i and adds the received value to its own. This algorithm requires d exchange steps, which can be overlapped in the two directions when two physical links are present. At the end of the d exchange steps, each processor has its own copy of the global sum. Fig. 9(a) and (b) illustrates the performance of the GS-GB and EA algorithms with respect to the number of double-precision (DP) words (w) added and the dimension of the hypercube, respectively. The node executive (NX/2) of the iPSC/2 handles short messages (≤ 100 bytes) and long messages (> 100 bytes) differently. Short messages are routed directly, whereas extra handshaking is performed between the two processors participating in the communication to establish the circuit for the transmission of a long message. This extra overhead causes the setup time to increase from S_c = 550 μs to S_c = 970 μs for long messages, which explains the jumps in the curves given in Fig. 9(a). The measured time of the EA algorithm is somewhat higher than the predicted value (T_EA ≈ ½ T_GS-GB for small w). This discrepancy can be attributed to the software overheads for short messages and to internal bus conflicts for long messages.
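A sketch of the Exchange-Add step using MPI point-to-point calls (the channel-i partner is obtained by flipping bit i of the processor label; the two-word payload corresponds to the pair of inner products of Step 2):

```c
#include <mpi.h>

/* Exchange-Add: after d exchange steps along channels 0..d-1, every
   processor holds the global sum of all P = 2^d partial sums. */
void exchange_add(double *vals, int nvals, int d, MPI_Comm comm)
{
    int me;
    double recv[8];                      /* assumes nvals <= 8 */
    MPI_Comm_rank(comm, &me);
    for (int i = 0; i < d; i++) {
        int partner = me ^ (1 << i);     /* neighbor across channel i */
        MPI_Sendrecv(vals, nvals, MPI_DOUBLE, partner, i,
                     recv, nvals, MPI_DOUBLE, partner, i,
                     comm, MPI_STATUS_IGNORE);
        for (int j = 0; j < nvals; j++)  /* add partner's partial sums */
            vals[j] += recv[j];
    }
}
```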
E. Implementation of the Parallel CG-SCG Algorithm
The distributed computations of the CG-SCG algorithm consist of the following operations performed at each iteration: the matrix-vector product q_k = A p_k (Step 1), the inner products (p_k, q_k) and (q_k, q_k) (Step 2), and the three vector updates required in Step 4. All of these basic operations are performed concurrently by distributing the rows of A, and the corresponding elements of the vectors b, x, r, p, and q, among the processors of the Intel iPSC/2 16-node hypercube. Each processor is responsible for updating the values of those elements of the vectors x, r, p, and q assigned to itself. The 1-D approach has been implemented to distribute the problem because of its superior features for the iPSC/2, as discussed earlier. In this scheme, all but the first and last processors in the linear array must communicate with their right and left neighbors at each iteration in order to update their own slice of the distributed q vector. The following communication scheme is implemented to utilize the two serial bidirectional communication channels between processors and to overlap these nearest neighbor communications. A lower bound on the time for this local communication step is 2(S_c + 2m T_c), which holds under perfectly load-balanced conditions, i.e., n_k = N/P variables mapped to each processor with an equal number of FE nodes on its right and left boundaries. Each processor of a pair coupled to communicate issues a nonblocking (asynchronous) send after issuing a nonblocking receive. This ensures that a receive is already pending for the incoming long message, so that it can be copied directly into the user buffer area instead of first being stored in the NX/2 area and then transferred to the indicated user buffer because of a late receive.
The FE nodes mapped to a processor can be grouped as internal nodes and boundary nodes [Fig. 2(a)]. Internal nodes are not connected to any FE node mapped to another processor, and boundary nodes are connected to at least one FE node mapped to a neighbor processor. The internal sparse matrix-vector product, required for updating the elements of the vector q_k corresponding to the internal FE nodes, does not require any elements of the vector p_k mapped to the neighbor processors. Hence, the internal sparse matrix-vector product computation on each processor is initiated immediately after the asynchronous local communication steps given above. Each processor can initiate the sparse matrix-vector product corresponding to its boundary FE nodes only after the two receive operations from its neighbors are completed. Note that synchronization on the asynchronous send operations can be delayed until the distributed vector update operations in Step 4 of the CG-SCG algorithm. This scheme is chosen to overlap communication and computation during the internal sparse matrix-vector product. However, the measured overlap is below 5 percent due to the internal architecture of the processors of the iPSC/2.
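The receive-before-send ordering and the internal/boundary overlap can be sketched as follows, with MPI nonblocking calls in place of the NX/2 primitives; `spmv_internal`, `spmv_boundary`, and the buffer layout are assumptions standing in for the 1-D mapping layer.

```c
#include <mpi.h>

extern void spmv_internal(const double *p, double *q);   /* assumed routines */
extern void spmv_boundary(const double *p, double *q,
                          const double *halo_l, const double *halo_r);

/* Halo exchange for the 1-D strip mapping: receives are posted before the
   sends so long messages land directly in the user buffers; the internal
   matvec proceeds while the boundary values are in flight. */
void matvec_overlapped(const double *p, double *q,
                       const double *send_l, const double *send_r, int nb,
                       double *halo_l, double *halo_r,
                       int left, int right, MPI_Comm comm)
{
    MPI_Request rq[4];
    MPI_Irecv(halo_l, nb, MPI_DOUBLE, left,  0, comm, &rq[0]);
    MPI_Irecv(halo_r, nb, MPI_DOUBLE, right, 0, comm, &rq[1]);
    MPI_Isend(send_l, nb, MPI_DOUBLE, left,  0, comm, &rq[2]);
    MPI_Isend(send_r, nb, MPI_DOUBLE, right, 0, comm, &rq[3]);

    spmv_internal(p, q);                      /* overlaps with communication */

    MPI_Waitall(2, rq, MPI_STATUSES_IGNORE);  /* halos needed for boundary   */
    spmv_boundary(p, q, halo_l, halo_r);
    /* waiting on the sends (rq[2], rq[3]) can be deferred to Step 4 */
}
```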
The EA algorithm described in the previous section is implemented to compute the two inner products in Step 2 of the CG-SCG algorithm. The volume of communication during each of the d concurrent nearest neighbor communication steps of the algorithm is only 16 bytes (2 DP words). Short messages are always stored first in the NX/2 buffer regardless of a pending receive. Hence, in the implementation of the EA algorithm on the iPSC/2, each processor of a pair participating in the exchange operation issues a nonblocking send before the blocking receive in order to prevent delaying the send operation. The updating of the two partial sums is delayed until the completion of the send operation.
Each processor of the hypercube computes its own copies of the global scalars α and β in Step 3 of the algorithm in terms of the two inner products computed in Step 2 of the CG-SCG algorithm. Then, in Step 4 of the algorithm, each processor updates its own slices of the distributed x, r, and p vectors using these global scalars, without any interprocessor communication. Solution times for this implementation are given in Table III.
V. PERFORMANCE RESULTS AND DISCUSSION
This section presents overall performance results for the parallel CG-SCG algorithm. A simple model can be used to estimate an upper bound on achievable speedup. Given a 1-D strip partitioning of a finite element graph onto P processors so that N_k is the number of nodes mapped onto processor P_k and v_k is the number of values to be communicated by P_k, the iteration step time may be estimated as

T_iter = max_k (2 T_var N_k + 2 S_c + v_k T_c) + T_EA

where T_var = (z + 5) T_calc is the execution time required for each locally mapped variable, S_c and T_c are the setup time and per-value transmission time, and T_EA is the global sum evaluation time using the Exchange-Add algorithm. The efficiency e of the parallel implementation is the speedup divided by P. Since the iterations of the algorithm are synchronized, the bottleneck processor is the one with the maximal sum of local execution cost (2 T_var N_k) and communication cost (T_c v_k), where the local communication cost is modeled as described earlier in Section III. In addition to these local execution and communication costs, every iteration incurs the global sum evaluation time T_EA.
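A sketch of this estimate, with the model as reconstructed above; the values of T_calc and of the per-step Exchange-Add time are placeholders for illustration, not measured figures from the paper.

```c
#include <stdio.h>

/* Estimate the CG-SCG iteration time from the model of Section V:
   bottleneck local cost + neighbor communication + Exchange-Add time. */
double iteration_time(int P, int d, const int *N, const int *v,
                      double z, double Tcalc, double Sc, double Tc,
                      double Tea_step)
{
    double Tvar = (z + 5.0) * Tcalc;    /* cost per locally mapped variable */
    double bottleneck = 0.0;
    for (int k = 0; k < P; k++) {
        double t = 2.0 * Tvar * N[k] + 2.0 * Sc + v[k] * Tc;
        if (t > bottleneck) bottleneck = t;
    }
    return bottleneck + d * Tea_step;   /* d exchange steps of the EA sum */
}

int main(void)
{
    int N[16], v[16];
    for (int k = 0; k < 16; k++) { N[k] = 2500; v[k] = 400; }  /* example */
    double t = iteration_time(16, 4, N, v, 18.0,
                              5.0   /* assumed Tcalc, microseconds     */,
                              970.0, 2.88,
                              600.0 /* assumed per-step EA time, us    */);
    printf("estimated iteration time: %.0f microseconds\n", t);
    return 0;
}
```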
