Abstract. Several approaches to finding the connected components of a graph on a hypercube multicomputer are proposed and analyzed. The results of experiments conducted on an NCUBE hypercube are also presented. The experimental results support the analysis.
Introduction
The problem of finding the connected components of an undirected graph arises in several applications. One of these is net extraction from circuit masks. A circuit mask may be modeled by an undirected graph in which the vertices represent mask polygons and edges join pairs of polygons that overlap. The connected components of the resulting graph represent the nets of the circuit realized by the mask. Once these nets have been extracted, they may be compared with a known correct set of nets to verify the correctness of the mask. The number of polygons in large masks is of the order of one million. Consequently, net extraction takes a lot of time on conventional computers. In case an error is found in the mask, the mask is corrected and net extraction is done again. This further increases the overall time spent verifying circuit masks. Because of this, the connected components problem is a good candidate for solution on a multicomputer.
In this paper we explore several ways to compute the connected components of a graph starting from its adjacency matrix representation. The objective is to develop an efficient algorithm for a hypercube multicomputer with a fixed number of processors. The algorithms we propose are first analyzed using conventional measures such as asymptotic complexity, speedup, and efficiency, and also using the recently proposed measure of isoefficiency [Kumar et al. 1988]. The proposed algorithms are then evaluated experimentally on an MIMD hypercube multicomputer. A block diagram of such a computer is given in Figure 1. The multicomputer has a host processor with local memory. The hypercube is attached to this host much as a peripheral device would be. Each hypercube processor (called a node) has its own local memory. The hypercube is MIMD, and all interprocessor communication and synchronization is done by explicit message passing. A program typically consists of a subprogram that runs on the host together with subprograms for each of the hypercube nodes; often the same subprogram is run on each node. Our analysis of various parallel connected component algorithms shows that good performance cannot be expected by adapting the asymptotically efficient algorithms of [Dekel et al. 1981; Hirschberg et al. 1979; Shiloach and Vishkin 1982]. Instead, to obtain good performance we need to use a parallel algorithm whose total amount of work is comparable to that done by the fastest uniprocessor algorithm.
Programming a multicomputer requires one to consider several factors that do not arise when programming a conventional uniprocessor computer. When programming a typical conventional computer, the initial algorithmic abstraction one begins with is, perhaps, the only significant consideration. For a multicomputer, however, many other factors can have considerable impact on the efficiency of the final program. Some of these are [Geist and Heath 1986; Ranka et al. 1988]:
1. Algorithm selection
2. Partitioning and mapping
3. Overlapping computation and communication
4. Load balancing
5. Using the host

Our development of hypercube algorithms for the connected components problem is organized around these factors. Before proceeding to the development of the connected component algorithms, we describe in Section 2 the various measures used to evaluate multicomputer programs and algorithms.
Performance Measures
The performance of uniprocessor algorithms and programs is typically measured by their time and space requirements. These measures are also used for multicomputers. We shall use $t_p$ and $s_p$ to denote, respectively, the time and space required on a $p$ node multicomputer (we reserve $S_p$ for speedup). While $s_p$ will normally be the total amount of memory required by the $p$ nodes, for distributed memory multicomputers such as the hypercube of Figure 1 it is often more meaningful to measure the maximum local memory requirement of any node, since such multicomputers typically have a local memory of the same size on each processor.
To determine the effectiveness with which the multicomputer nodes are being used, one also measures the quantities speedup and efficiency. Let $t_0$ be the time required to solve the given problem on a single node using the conventional uniprocessor algorithm. Then the speedup, $S_p$, using $p$ processors is
$$S_p = \frac{t_0}{t_p}.$$
Note that $t_1$ may differ from $t_0$ since, in arriving at our parallel algorithm, we may not start with the conventional uniprocessor algorithm.
The efficiency, $E_p$, with which the processors are utilized is

$$E_p = \frac{S_p}{p}.$$

Barring any anomalous behavior as reported in [Kumar et al. 1988; Lai and Sahni 1984; Li and Wah 1986; Quinn and Deo 1986], the speedup will be between 0 and $p$ and the efficiency between 0 and 1. To understand the source of anomalous behavior that results in $S_p > p$ and $E_p > 1$, consider the search tree of Figure 2. The problem is to search for a node with the characteristics of C. The best uniprocessor algorithm (i.e., the one that works best on most instances) might explore subtree B before examining C. A two-processor parallelization might explore subtrees B and C in parallel. In this case, $t_2 = 2$ (examine A and C) while $t_0 = k$, where $k - 1$ is the number of nodes in subtree B. So, $S_2 = k/2$ and $E_2 = k/4$.
One may argue that in this case $t_0$ is really not the smallest uniprocessor time; we can do better with a breadth-first search of the tree. In this case, $t_0 = 3$, $t_2 = 2$, $S_2 = 1.5$, and $E_2 = 0.75$. Unfortunately, given a search tree there is no known method to predict the optimal uniprocessor search strategy. Thus, in the example of Figure 2 we could instead be looking for a node D at the bottom of the leftmost path from the root A. So, it is customary to use for $t_0$ the run time of the algorithm one would normally use to solve the problem on a uniprocessor.
While measured speedup and efficiency are useful quantities, neither gives us any information on the scalability of our parallel algorithm when the number of processors/nodes is increased beyond the number currently available. It is clear that, for any fixed problem size, efficiency will decline as the number of nodes increases beyond a certain threshold. This is due to the unavailability of enough work, that is, processor starvation. To use an increasing number of processors efficiently, it is necessary for the work load (that is, $t_0$), and hence the problem size, to increase also [Gustafson 1988]. An interesting property of a parallel algorithm is therefore the amount by which the work load or problem size must increase as the number of processors increases in order to maintain a certain efficiency or speedup. [Kumar et al. 1988] introduced the concept of isoefficiency to measure this property. The isoefficiency, $ie(p)$, of a parallel algorithm/program is the amount by which the work load must increase, as a function of the processor count $p$, to maintain a fixed efficiency.
We illustrate these terms using matrix multiplication as an example. Suppose that two $n \times n$ matrices are to be multiplied. The problem size is $n$. Assume that the product is computed using the classical matrix multiplication algorithm of complexity $O(n^3)$. Then $t_0 = cn^3$ and the work load is $cn^3$. Assume further that $p$ divides $n$. Since the work load is easily distributed evenly over the $p$ processors when $p \le n^2$,

$$t_p = \frac{t_0}{p} + t_{com},$$

where $t_{com}$ represents the time spent in interprocessor communication. So, $S_p = t_0/t_p = p\,t_0/(t_0 + p\,t_{com})$ and $E_p = S_p/p = t_0/(t_0 + p\,t_{com}) = 1/(1 + p\,t_{com}/t_0)$. In order for $E_p$ to be constant, $p\,t_{com}/t_0$ must equal some constant $1/\alpha$. So, $t_0 = \text{work load} = cn^3 = \alpha\,p\,t_{com}$. In other words, the work load must increase at least at the rate $\alpha\,p\,t_{com}$ to prevent a decline in efficiency. If $t_{com}$ is $ap$ ($a$ a constant), then the work load must increase at a quadratic rate. To get a quadratic increase in the work load, the problem size $n$ needs to increase only at the rate $p^{2/3}$ (or, more accurately, $(\alpha a/c)^{1/3} p^{2/3}$). Barring any anomalous behavior, the work load $t_0$ for an arbitrary problem must increase at least linearly in $p$, since otherwise processor starvation will occur for large $p$ and efficiency will decline. Hence, in the absence of anomalous behavior, $ie(p)$ is $\Omega(p)$. Parallel algorithms with smaller $ie(p)$ are more scalable than those with larger $ie(p)$.
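As a worked check (our restatement of the derivation above, using the same symbols), the required growth rates follow directly:

$$E_p = \frac{1}{1 + p\,t_{com}/t_0} = \text{constant} \iff t_0 = \alpha\,p\,t_{com} \text{ for some constant } \alpha.$$

With $t_{com} = ap$ this gives $cn^3 = \alpha a p^2$, so

$$n = \left(\frac{\alpha a}{c}\right)^{1/3} p^{2/3},$$

and the work load $cn^3 = \alpha a p^2$ indeed grows quadratically in $p$ while the problem size grows only as $p^{2/3}$.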
The concept of isoefficiency is useful because it allows one to test parallel programs using a small number of processors and then predict their performance for a larger number of processors. Thus, it is possible to develop parallel programs on small hypercubes and to do a performance evaluation using problem instances smaller than the production instances to be solved when the program is released for commercial use. From this performance evaluation and the isoefficiency analysis, one can obtain a reasonably good estimate of the program's performance in the target commercial environment, where the multicomputer may have many more processors and the problem instances may be much larger. With this technique, we can avoid (or at least predict) the often reported situation in which a parallel program that performed well on a small multicomputer performs poorly when ported to a large multicomputer.
Algorithm Selection
As mentioned in [Ranka et al. 1988], the algorithmic abstraction with which we begin has a significant impact on the resulting hypercube program. The starting point of the program development process could be an existing parallel algorithm developed under the assumption that an unlimited number of processors is available, a parallel algorithm developed for a fixed number of processors, or a uniprocessor algorithm that has yet to be parallelized. In the best of situations, the development of a hypercube program would begin with a parallel hypercube algorithm developed for a fixed number of processors. We know of no such algorithm for the connected components problem.
Many researchers have developed parallel connected component algorithms under the assumption that an unlimited number of processors is available. [Carlson 1987; Gopalakrishnan et al. 1985; Hirschberg et al. 1979; Huang 1986; Nassimi and Sahni 1980; Shiloach and Vishkin 1982] are some examples of such research. None of these algorithms provides a suitable starting point for our work. For example, consider the algorithm of [Shiloach and Vishkin 1982]. Their algorithm finds the connected components of an undirected graph with $n$ vertices and $e$ edges in $O(\log n)$ time using a CRCW shared memory computer with $O(n + 2e)$ processors. It may be run on an $O(n + 2e)$ processor hypercube by using $O(\log^2 n)$ random access read and write algorithms. The complexity of the resulting hypercube algorithm is $O(\log^3 n)$. On a uniprocessor, the connected components can be found in $O(n + e)$ time using either depth- or breadth-first search. For dense graphs, $e = O(n^2)$ and the speedup is $S_{p=n^2} = O(n^2/\log^3 n)$. (In all our speedup computations we use $t_0 = n^2$. This is justified since we assume an adjacency matrix representation; even if an edge representation is used, it is justified if we restrict ourselves to dense graphs.) The efficiency $E_{p=n^2}$ is $O(1/\log^3 n)$. Hence, efficiency declines to zero as $p$ (and hence $n$) increases.
The processor-time product is a measure of the total work (useful and otherwise) done by a parallel algorithm. The processor-time product of the $O(n + 2e)$ processor hypercube simulation of the algorithm of [Shiloach and Vishkin 1982] is $O((n + 2e)\log^3 n)$. For a dense graph, this is $O(n^2 \log^3 n)$. The uniprocessor algorithm does only $O(n^2)$ work. If we assume the constants of proportionality are the same in both cases, then the parallel algorithm is doing $\log^3 n$ times more work. Hence if $n = 1024$, it would take $\log^3 n = 1000$ processors just to break even with the uniprocessor algorithm running on a single processor. In practice, many more processors would be needed to break even, as the constant of proportionality is much larger for the Shiloach-Vishkin hypercube adaptation (this comes from the increased constant factor of their algorithm, the constant factor associated with the random access reads and writes, and the need for interprocessor communication, which is typically far more expensive per unit than a basic arithmetic operation).
Dekel, Nassimi, and Sahni [Dekel et al. 1981] have developed an $O(\log^2 n)$ hypercube algorithm to find a spanning forest of an $n$ vertex graph. It uses $n^3/\log n$ processors. This algorithm may be adapted to find connected components in $O(\log^2 n)$ time. The processor-time product of this adaptation is $O(n^3 \log n)$. For $n = 1024$, approximately $n \log n = 10240$ processors are needed to break even with the uniprocessor algorithm running on a single processor computer.
The starting point for our hypercube program is the relatively simple, low overhead algorithm given in Figure 3. It assumes a dense graph and an adjacency matrix representation. Each hypercube processor begins with a partition of the adjacency matrix and computes a spanning forest under the assumption that the graph has only those edges that are in its partition. The first step of this algorithm is the same as the data reduction step in the connected component algorithm proposed by Huang [1985] for the mesh-of-trees multicomputer. The details of the algorithm for step 1 are provided in Figure 4 (procedure Spanning Forest). The input to this procedure consists of the vertices $V_r$ represented by the rows of the adjacency matrix partition in the hypercube node and the vertices $V_c$ represented by the columns of this partition. The procedure uses a breadth-first traversal [Horowitz and Sahni 1986]; a depth-first traversal could also have been used.
The two steps of the algorithm (Figure 3) are:

Step 1: Each hypercube processor computes a spanning forest based on the information in its adjacency matrix partition. This is done using breadth-first or depth-first search.

Step 2: The hypercube processors merge their spanning forests to obtain the connected components.

Procedure Spanning Forest (Figure 4) is:

Procedure Spanning Forest (Vr, Vc);
{Find spanning forest edges for the partition with row vertices Vr and column vertices Vc}
  initialize queue to empty;
  initialize all n vertices to be unmarked;
  for each vertex I in Vr do
    if vertex I is unmarked then {find a tree for I}
    begin
      add I to queue and mark it;
      while queue not empty do
      begin
        delete first vertex (say j) from queue;
        if j is in Vr then scan row for vertex j
        else if j is in Vc then scan column for vertex j;
        all unmarked vertices k encountered during this scan are marked,
        edge (j,k) is output as part of the spanning forest, and
        vertex k is added to the queue;
      end; {of while}
    end; {of then and for}
end; {Spanning Forest}

The spanning forests define a relation, R, between pairs of vertices: i R j if and only if i and j are in the same tree in at least one forest. The transitive closure of this relation may be computed using the union-find scheme discussed in [Horowitz and Sahni 1986]. This partitions the vertices into equivalence classes, each of which defines a connected component. We shall refer to the process that results in the transitive closure of R as spanning structure merging. Define a spanning structure to be a collection of trees with the property that every graph vertex is in exactly one tree and, if two vertices are in the same tree, then they are in the same connected component of the graph. Note that the edges in a spanning structure are not required to be graph edges. Each of the p spanning forests computed in step 1 of Figure 3 is a spanning structure. In step 2 we begin with these p spanning structures and combine them pairwise (say) until just one spanning structure remains. This final spanning structure has the property that two vertices are in the same tree if and only if they are in the same connected component.

An example illustrating the two steps of our algorithm is given in Figure 5. For this example the final spanning structure consists of two trees: one with vertices 1, 2, 3, and 4 and the other with vertices 5 and 6. Figure 5 shows just one of the possible spanning structures with this property. The correctness of the algorithm of Figure 3 follows from the observation that in step 1 only edges that lie on cycles are eliminated; this does not affect the connected components.
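For concreteness, the following is a minimal C sketch of procedure Spanning Forest for one stripe, assuming the node holds rows lo through hi-1 of an n x n 0/1 adjacency matrix. The storage layout, names, and output convention are ours, not those of the paper's FORTRAN programs.

    #include <stdio.h>
    #include <stdlib.h>

    /* Rows lo..hi-1 are local; bit (r,c) is stored at a[(r - lo)*n + c]. */
    void spanning_forest(const unsigned char *a, int n, int lo, int hi)
    {
        unsigned char *marked = calloc(n, 1);
        int *queue = malloc(n * sizeof *queue);

        for (int i = lo; i < hi; i++) {           /* each vertex in Vr */
            if (marked[i]) continue;
            int head = 0, tail = 0;
            marked[i] = 1;
            queue[tail++] = i;                    /* find a tree for i */
            while (head < tail) {
                int j = queue[head++];
                if (j >= lo && j < hi) {          /* j in Vr: scan row j */
                    for (int k = 0; k < n; k++)
                        if (a[(j - lo)*n + k] && !marked[k]) {
                            marked[k] = 1;
                            printf("edge (%d,%d)\n", j, k);
                            queue[tail++] = k;
                        }
                } else {                          /* j in Vc - Vr: scan column j */
                    for (int r = lo; r < hi; r++)
                        if (a[(r - lo)*n + j] && !marked[r]) {
                            marked[r] = 1;
                            printf("edge (%d,%d)\n", j, r);
                            queue[tail++] = r;
                        }
                }
            }
        }
        free(marked);
        free(queue);
    }

Each vertex is enqueued at most once, so the queue needs only n slots; the row/column dichotomy mirrors the then and else clauses of Figure 4.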
Step 1 requires a total of $O(n^2)$ work. Since a spanning structure is a collection of trees, it can have at most $n - 1$ edges. Combining two such structures takes slightly more than linear time if the union-find scheme of [Horowitz and Sahni 1986] is used; it takes linear time if the equivalence class algorithm of [Horowitz and Sahni 1986] is used. However, the latter scheme requires more memory. This becomes a problem in our case when testing with large graphs, so we do not use it. Since $p - 1$ pairwise merges of spanning structures are performed in step 2, the total work done in this step is $O(n p \alpha(n))$, where $\alpha(n)$ accounts for the fact that union-find takes slightly more than linear time ($\alpha$ is a functional inverse of Ackermann's function). Since $p$ is assumed fixed (or, alternatively, if we assume $p\alpha(n) = O(n)$), the total work load of Figure 3 is $O(n^2)$. So this algorithm has a better potential of exhibiting good speedup for small $p$ than the algorithms of [Shiloach and Vishkin 1982] and [Dekel et al. 1981].
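The pairwise combination step can be sketched with a standard union-find structure. The following minimal C sketch (names ours; the paper uses the union-find scheme of [Horowitz and Sahni 1986]) merges the edges of one spanning structure into another and then labels each vertex with the root of its tree:

    #include <stdio.h>

    #define N 6
    static int parent[N];                 /* parent[v] == v means v is a root */

    static int find(int v)                /* root of v's tree, with path halving */
    {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];
            v = parent[v];
        }
        return v;
    }

    static void merge_edge(int u, int v)  /* one edge of the incoming structure */
    {
        int ru = find(u), rv = find(v);   /* two finds per edge */
        if (ru != rv)
            parent[ru] = rv;              /* union only if the trees differ;   */
    }                                     /* otherwise the edge closes a cycle */

    int main(void)
    {
        for (int v = 0; v < N; v++) parent[v] = v;
        int edges[][2] = { {0,1}, {1,2}, {2,3}, {4,5} };  /* incoming edges */
        for (int i = 0; i < 4; i++)
            merge_edge(edges[i][0], edges[i][1]);
        for (int v = 0; v < N; v++)       /* component id = root of v's tree */
            printf("vertex %d is in component %d\n", v, find(v));
        return 0;
    }

This is the source of the 2(n - 1) finds and n - 1 unions per pairwise merge counted in the analysis of the next section.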
Partitioning and Mapping
In this section, we shall consider two refinements of the algorithm of Figure 3 . In both, steps 1 and 2 are done in sequence (that is, step 2 commences after step 1 has completed).
In a later section, we consider another refinement in which steps 1 and 2 are done in parallel.
The two partitioning schemes of this section were used in [Jenq and Sahni 1987] for the all pairs shortest paths problem. Since in our hypercube model the memory is distributed across the nodes of the hypercube and it takes less time for a node to access its local memory than that of another node, it is necessary to distribute the adjacency matrix across the processor memories. The distribution schemes studied here, in effect, partition the matrix. However, a partitioning is not always as effective as a data distribution scheme that allows some data replication [Ranka et al. 1988] . Along with a data partitioning, one needs to provide a mapping of the data partitions to the processor memories. 
Partitioning By Stripes
In this case, an $n \times n$ adjacency matrix is partitioned into $p$ stripes, each composed of $n/p$ contiguous rows. Figure 6 shows the partitioning and processor mapping for the case $n = 32$ and $p = 8$. In this figure, $P_i$ denotes processor $i$ of the hypercube. To compute the connected components, each processor first computes a spanning forest of the given $n$ vertex graph. This spanning forest is computed using procedure Spanning Forest (Figure 4) with $V_r = \{in/p, \ldots, (i+1)n/p - 1\}$ for processor $P_i$, $0 \le i < p$, and $V_c = \{0, 1, \ldots, n-1\}$ for all processors. The merging of the spanning structures is done pairwise, as indicated in Figure 7 for the case $p = 8$; Figure 8 shows the hypercube communication paths. Processor $P_0$ is involved in three stages of merging. First, it merges its step 1 structure with that of $P_1$; for this, $P_1$ must transmit its spanning structure information to $P_0$. Next, it merges the resulting spanning structure with the merge of the step 1 spanning structures of $P_2$ and $P_3$; for this, $P_2$ communicates the appropriate information to $P_0$. Finally, $P_0$ merges the merged step 1 spanning structure of $P_0$ through $P_3$ with that of $P_4$ through $P_7$. The overall spanning structure resides in $P_0$. At this point, each vertex determines the root of the spanning structure tree that contains it; this is its connected component identifier. Notice that the number of active processors halves following each merge stage.
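The merge order of Figure 7 is a binomial tree reduction. A small C sketch (our naming; it assumes p is a power of two and processors are numbered 0 through p - 1) prints who sends to whom at each stage:

    #include <stdio.h>

    int main(void)
    {
        int p = 8;                                   /* number of processors */
        for (int step = 1; step < p; step <<= 1)     /* merge stages 1, 2, 4, ... */
            for (int q = 0; q < p; q++)
                if ((q & (2*step - 1)) == step)      /* q mod 2*step == step */
                    printf("stage %d: P%d sends its structure to P%d\n",
                           step, q, q - step);
        return 0;
    }

Sender and receiver differ in exactly one bit at every stage, so each transmission uses a direct hypercube link (cf. Figure 8), and the number of active processors halves after each stage.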
When merging two spanning structures A and B, we take the at most $n - 1$ edges in one (say A) and merge them with those of the other (say B). For each edge of A, two finds are performed to determine whether the two endpoints of the edge are already in the same tree of B. If they are not, the two trees of B that contain these vertices are united; if they are, no union is performed. Hence a pairwise merge requires at most $2(n-1)$ finds and $n-1$ unions. Since a processor's adjacency matrix partition has $n/p$ rows of $n$ bits each, step 1 takes $O(n^2/p)$ time. More accurately, in the worst case the $n/p$ rows of the partition are scanned in the then clause of Figure 4 and the $n - n/p$ columns that correspond to the $n - n/p$ vertices in $V_c - V_r$ are scanned in the else clause. So, a total of $n^2/p + (n - n/p)n/p = 2n^2/p - n^2/p^2$ accesses to the processor's adjacency matrix partition are made. Hence, step 1 takes $(n^2/p)(2 - 1/p)t_s$ time ($t_s$ is a constant). There are $\log p$ merge stages, each taking at most $(n-1)t_m$ time (for simplicity, we assume that $2(n-1)$ finds and $n-1$ unions can be done in $O(n)$ time; the union-find algorithms described in [Horowitz and Sahni 1986] take slightly more time, while linear time can be achieved using the equivalence class algorithm of [Horowitz and Sahni 1986]; $t_m$ is a constant). Since a spanning structure of an $n$ vertex graph/subgraph can contain at most $n - 1$ edges, each communication of a spanning structure takes at most $\alpha + (n-1)t_c$ time in the worst case, where $\alpha$ is the communication startup time and $t_c$ is a constant. Hence the worst case time for the stripes method is

$$t_{stripes} = \frac{n^2}{p}\left(2 - \frac{1}{p}\right)t_s + (n-1)(t_m + t_c)\log p + \alpha \log p.$$

For the efficiency $E_p^{stripes}$ to be constant,

$$\frac{p\,[(n-1)(t_m + t_c)\log p + \alpha \log p]}{n^2 t_s}$$

must be constant. For this, the problem size, $n$, must grow at the rate $\Omega(p \log p)$. The work load, $n^2$, must therefore grow at the rate $\Omega(p^2 \log^2 p)$. Hence, the isoefficiency is $\Omega(p^2 \log^2 p)$. The worst case bounds apply to graphs that require the scanning of all $n/p$ rows of $V_r$ and all $n - n/p$ columns of $V_c - V_r$ in step 1. These graphs have the property that in at least one stripe each vertex in $V_c - V_r$ is adjacent to at least one vertex in $V_r$. Note that as the edge density increases, the probability of this happening also increases. Further, if this property is satisfied, the graph is connected; however, a connected graph need not satisfy this property. For graphs with this property, $E_p^{stripes} < 2/3$ for $p = 2$, $4/7$ for $p = 4$, $8/15$ for $p = 8$, etc. On graphs that do not satisfy the stated property the efficiency can be higher. For sufficiently dense graphs these bounds can be expected to apply, as such graphs satisfy the above property.
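These bounds follow from the step 1 term alone (a worked check, ours; taking $t_0 = n^2 t_s$ for the uniprocessor scan, and noting that the merge and communication terms only lower the efficiency further):

$$E_p^{stripes} \le \frac{n^2 t_s}{p \cdot \frac{n^2}{p}\left(2 - \frac{1}{p}\right) t_s} = \frac{1}{2 - 1/p} = \frac{p}{2p - 1},$$

which evaluates to $2/3$, $4/7$, and $8/15$ for $p = 2, 4, 8$.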
Partitioning By Rectangles
Partitioning by rectangles is an alternative to partitioning by stripes. The adjacency matrix is partitioned into $p$ rectangles of size $(n/2^{\lceil d/2 \rceil}) \times (n/2^{\lfloor d/2 \rfloor})$, where $p = 2^d$. Figure 9 shows the partitioning and processor mapping for the case $n = 32$ and $p = 8$. The mapping is designed to optimize the spanning structure mergings of step 2. While this partitioning is the same as that used in [Jenq and Sahni 1987] for the all pairs shortest paths problem, the mapping to processors is different. The spanning structure merge order is shown in Figure 10; this merge order minimizes the spanning structure size following each merge. Summing the spanning structure sizes over the $\log p$ merge stages shows that, when $d$ is even, fewer than $6n$ edges are transmitted in total by any node. This bound is quite loose, since the individual terms of the sum are bounded using subgraph sizes (such as $2n$) that exceed the $n - 1$ edge maximum of any spanning structure. A similar analysis shows that $6n$ also bounds the total data transmission when $d$ is odd. Further, since at most $n - 1$ edges may be in a spanning structure, we obtain $\min\{6n, (n-1)\log p\}$ as a bound on the total number of edges transmitted by any one node. Hence the worst case time complexity is

$$t_{rectangles} = \frac{2n^2}{p}t_s + \min\{6n, (n-1)\log p\}\,t_m + \min\{6n, (n-1)\log p\}\,t_c + \alpha \log p.$$
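As an illustration only, the following C sketch enumerates one row-major assignment of processors to rectangles under the stated dimensions; the paper's actual mapping (Figure 9) is chosen to optimize the step 2 mergings and need not be this row-major order.

    #include <stdio.h>

    int main(void)
    {
        int n = 32, d = 3;                           /* p = 2^d = 8 processors */
        int block_rows = 1 << ((d + 1) / 2);         /* 2^ceil(d/2) rows of blocks */
        int block_cols = 1 << (d / 2);               /* 2^floor(d/2) cols of blocks */
        int bh = n / block_rows, bw = n / block_cols;

        for (int q = 0; q < (1 << d); q++) {
            int br = q / block_cols, bc = q % block_cols;
            printf("P%d: rows %2d..%2d, cols %2d..%2d\n",
                   q, br*bh, br*bh + bh - 1, bc*bw, bc*bw + bw - 1);
        }
        return 0;
    }

For n = 32 and d = 3 this yields eight 8 x 16 rectangles, matching the stated dimensions.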
Comparing with $t_{stripes}$, we see that the worst case step 1 time for the stripes method is less than that for the rectangles method by $(n^2/p^2)t_s$. The step 2 time for the stripes method is never less than that of the rectangles method. In fact, when $6n < (n-1)\log p$ (approximately, when $p > 64$, since this requires $\log p > 6$), the step 2 time for the rectangles method is strictly less than that for the stripes method. As noted earlier, our $6n$ bound is quite loose, and we expect the step 2 time for the rectangles method to be less than that for the stripes method even when $p$ is less than 64.
The speedup and efficiency for the rectangles method are $S_p^{rectangles} = t_0/t_{rectangles}$ and $E_p^{rectangles} = S_p^{rectangles}/p$. For the efficiency to remain constant,

$$\frac{p\,[\min\{6n, (n-1)\log p\}(t_m + t_c) + \alpha \log p]}{n^2 t_s}$$

must be constant. When $\min\{6n, (n-1)\log p\} = (n-1)\log p$, the isoefficiency is the same, $\Omega(p^2 \log^2 p)$, as that of the stripes method. When $\min\{6n, (n-1)\log p\} = 6n$,

$$\frac{p\,[6n(t_m + t_c) + \alpha \log p]}{n^2 t_s}$$

must be constant. For $n \gg \log p$, this requires that $n$ grow as $\Omega(p)$. Hence the work load, $n^2$, must grow as $\Omega(p^2)$. Thus the isoefficiency of the rectangles method is between $\Omega(p^2)$ and $\Omega(p^2 \log^2 p)$.
From the equation for $E_p^{rectangles}$ we see that for worst case data, $E_p^{rectangles} < 1/2$. We expect this bound to apply for sufficiently dense graphs.
Experimental Results
FORTRAN programs to find connected components using the stripes and rectangles partitioning schemes were run on an NCUBE hypercube multicomputer. For each $n$, 30 random graphs with edge density ranging from 70% to 90% were generated. The average efficiencies are given in the tables of Figures 11 (stripes partitioning) and 12 (rectangles partitioning). As predicted by our isoefficiency analysis, the problem size $n$ needs to more than double each time the number of processors doubles in order for the efficiency not to deteriorate. For example, the stripes method has an efficiency of 0.2 when $n = 64$ and $p = 8$; to obtain this same efficiency when $p = 16$, we need $n$ greater than 128. The problem size increase required by the rectangles method is not as great as that required by the stripes method (though $n$ still must more than double each time $p$ doubles).
Our analysis indicated that the efficiency would be less than 2/3 for the stripes method when the test graphs require the examination of all rows of $V_r$ and all columns of $V_c - V_r$. The table of Figure 11 has a few entries with efficiency greater than this, indicating that our average test graph did not require all these rows and columns to be examined. While not shown in the table, we observed that the efficiency came closer to that predicted by our analysis as the edge density was increased. As $p$ increases, the efficiency declines because of increased interprocessor communication overhead. For the rectangles method, again, some efficiencies exceed the 0.5 bound expected for worst case data; this reflects the fact that our test graphs were not worst case graphs.
Also, our analysis indicated that the step 1 time for the stripes method is less than that for the rectangles method, while the step 2 time for the rectangles method is generally less. This differential in step 2 time increases with $p$. Hence, we expect the stripes method to outperform the rectangles method for large $n$ and small $p$. This expectation is reflected in the data of Figures 13-15.
Overlapping Computation and Communication
The refinements of the preceding section make no attempt to overlap the time spent computing with that spent transmitting data. Figure 16 shows the activities of processor $P_0$ of Figure 7. We can attempt to reduce the overall time by overlapping data transmission and computation. For this, the odd-numbered leaf processors of Figure 7 must transmit their spanning structure edges in packets, concurrently with the computation of the spanning structure. If packets of size $s$ edges are sent, $s < n$, then as soon as $s$ structure edges have been selected, a transmit is initiated. This requires a slight modification of the merging process so that it commences as soon as the first packet is received. Similarly, during each merging step, the merged structure is transmitted as a series of packets. If $n - 1$ edges are to be transmitted in packets of size $s$, the total transmission time becomes $(n-1)(\alpha + s t_c)/s$. While this is larger than the $\alpha + (n-1)t_c$ time needed to send the $n - 1$ edges as one packet, we can accomplish a reduction in the overall run time, since the transmission may be substantially overlapped with the step 1 time and the merge times. A reduction will be seen only if the total wait time decreases.
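A minimal sketch of the packetized transmission, written against an MPI-style interface purely for illustration (the paper's NCUBE programs use that machine's native message-passing calls; the packet size S and all names here are ours):

    #include <mpi.h>

    #define S 500                         /* packet size, in edges */

    /* Send nedges edges to dest in packets of S, so the receiver can begin
       merging as soon as the first packet arrives. */
    void send_edges_packetized(int (*edges)[2], int nedges,
                               int dest, MPI_Comm comm)
    {
        for (int off = 0; off < nedges; off += S) {
            int cnt = (nedges - off < S) ? (nedges - off) : S;
            MPI_Send(edges[off], 2*cnt, MPI_INT, dest, 0, comm);
        }
        MPI_Send(NULL, 0, MPI_INT, dest, 1, comm);   /* tag 1 marks end of stream */
    }

Sending n - 1 edges this way costs (n - 1)(alpha + s t_c)/s rather than alpha + (n - 1)t_c, but the individual packets can be overlapped with the step 1 computation and the merges.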
For the connected components problem, the total wait time is $O(n \log p)$ while the computation time is $O(n^2/p)$. So, even if the wait time were reduced to zero, there would not be much difference in the overall time. Figure 17 shows the percentage change in the run times of the two schemes of Section 4 when the overlapping strategy is implemented; the packet size used was 500 edges. As is evident, the overlapping strategy does not have much impact on the total run time. In fact, a reduction is seen only for large $n$. For the stripes method we were unable to make $n$ sufficiently large to observe a run time reduction except for the case $p = 2$. When $n$ is large, the step 1 time is large and transmitting by packets effectively overlaps the computation of the spanning structure; when $n$ is small, the step 1 time is too small to hide the wait time. It should be emphasized that in problems where the communication and computation times are comparable, successful overlapping can significantly reduce the overall run time. Won and Sahni [1987], for instance, report a 23% reduction for the maze routing problem.
Load Balancing
The strategy of overlapping computation and communication may be taken one step further by performing steps 1 and 2 of Figure 3 in parallel. For this, some of the processors are assigned the task of finding spanning structures and the others the task of merging spanning structures. Since the strategy of Section 5 transmits spanning structures in packets as they are generated, the merge processors can begin their work before the step 1 computation of a spanning structure has been completed.
Let us take a closer look at how this may be done for the stripes partition. One possibility is to partition the adjacency matrix into $(n/p) \times (n/p)$ squares as in Figure 18. Since the adjacency matrix of an undirected graph is symmetric, only those squares on or above the main diagonal are needed. The processors are grouped into pairs and each processor is assigned a row of squares as in Figure 18. The processors begin by computing a spanning structure for their diagonal square. Then the even processors transmit their structures to the odd processor in their respective pairs. The odd processors merge spanning structures while the even ones continue to process their squares (only diagonal squares and those to their right are processed) and transmit the resulting spanning structure edges to their odd partners. This processing of squares consists of two steps: (a) perform a breadth-first traversal of the square, retaining edges that form a spanning structure for the square; (b) process these spanning structure edges using the union-find algorithm of [Horowitz and Sahni 1986] to eliminate edges that form a cycle when considered together with the edges already transmitted to the odd partners.
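Step (b) can be sketched as a filter over a square's spanning structure edges. A minimal, self-contained C version (names ours) keeps only the edges that do not close a cycle with what has already been transmitted:

    #include <stdio.h>

    #define N 64
    static int parent[N];

    static int find(int v)                /* root of v's tree, path halving */
    {
        while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
        return v;
    }

    /* Keep only edges that do not close a cycle with edges already sent;
       these are the edges an even processor actually transmits. */
    static int filter_edges(int (*e)[2], int ne, int (*out)[2])
    {
        int kept = 0;
        for (int i = 0; i < ne; i++) {
            int ru = find(e[i][0]), rv = find(e[i][1]);
            if (ru != rv) {               /* joins two different trees: keep */
                parent[ru] = rv;
                out[kept][0] = e[i][0];
                out[kept][1] = e[i][1];
                kept++;
            }                             /* else: closes a cycle, drop it */
        }
        return kept;
    }

    int main(void)
    {
        for (int v = 0; v < N; v++) parent[v] = v;
        int sq[][2] = { {0,1}, {1,2}, {0,2}, {2,3} };   /* (0,2) closes a cycle */
        int out[4][2];
        printf("%d of 4 edges survive\n", filter_edges(sq, 4, out));  /* 3 */
        return 0;
    }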
Only edges that survive step (b) above are actually transmitted to the odd partners. The even processors do not transmit the spanning structures of their last $q$ squares to their odd partners (the optimal value of $q$ is determined experimentally). On completing all merges, the odd processors begin to process their squares and transmit spanning structures to their even partners. The even processors begin to merge after completing the processing of their remaining squares. Once an even processor has finished all its work, the resulting spanning structure is transmitted to the even processor in the upper adjacent pair (i.e., processor $P_{2i}$ transmits to processor $P_{2(i-1)}$) for merging. Figure 19 illustrates the data transmission sequence for the case $p = 8$. Notice that since $P_6$ will finish first, some or all of its transmission to $P_4$ will be overlapped with work still being done by $P_4$. This overlapping could also take place between $P_2$ and $P_4$, and between $P_0$ and $P_2$.
The time, $t_{squares}$, required by this strategy is given by

$$t_{squares} = T_0 + T_{wait} + T_{merge},$$

where $T_0$ is the time taken by $P_0$ to finish its work for the pair $(P_0, P_1)$; $T_{wait}$ is the time $P_0$ has to wait following $T_0$ for the merged data to arrive from $P_2$; and $T_{merge}$ is the time to merge this data. The success of this strategy depends considerably on the matching of interprocessor communication times with intraprocessor computation times. Unfortunately, for large $n$, the time needed to compute the spanning structure of a square is much greater than the time needed to transmit and merge the structure; hence the merging processors are often idle. To remedy this load imbalance, the group size may be increased to $k$, $k > 2$, processors. In each group of $k$ processors, one processor merges while the remaining $k - 1$ compute spanning structures. This also reduces the length of the rightmost path of Figure 19; notice that this length is $p/k$. When $k > 2$, the merge processor of a group only merges structures, and the adjacency matrix data for the group is distributed evenly over the remaining $k - 1$ processors in the group (again, only data in the upper triangle is needed). Figure 20 gives the ratio $t_{squares}/t_{stripes}$ for the graphs used in the experiments of Section 4.3. The number in parentheses is the optimal value of $k$. Note that when $k = 2$, the pairing strategy described at the beginning of this section is used. It was experimentally determined that the best value of $q$ is $p$, the number of hypercube processors; in this case, the even processor in each pair obtains a spanning forest for all its squares together. That $q = p$ gave the best performance may be attributed to the relatively high cost of interprocessor communication. The odd processors work one square at a time and transmit edges to their even partners.
For any fixed hypercube size, the optimal $k$ increases as $n$ increases. This is because the time needed to compute the initial spanning structures increases quadratically in $n$, while the merge time increases only linearly in $n$; a merge processor can therefore handle more merge load in the time required by the spanning structure processors to compute these structures. Because of the unpredictable nature of the computation/communication overlap in this scheme, the scheme is hard to analyze with a view to predicting the optimal $k$. The efficiency table is given in Figure 21. Since the spanning structure time increases asymptotically faster than the merge time, for any fixed number $p$ of processors the ratio $t_{squares}/t_{stripes}$ first decreases as $n$ increases and then increases. While the ratio is decreasing, the overlapping of computation and communication is the dominating factor; eventually, however, the increased computation load of the squares method dominates and the ratio begins to increase.

The squares scheme just described uses only the upper triangle of the adjacency matrix. One may consider developing a program that does this without performing steps 1 and 2 of Figure 3 in parallel. In some sense, this represents the case $k = 1$ with the final merging stage replaced by a binary tree merge as in Figure 7. Since the number of bits in the upper triangle is

$$\sum_{i=1}^{n-1} i = \frac{n(n-1)}{2}$$

(note that all diagonal bits are 0 and need not be considered), for good load balancing $n(n-1)/(2p)$ bits are resident in each processor initially. This also equalizes the processor memory requirements.
The worst case merging and communication time requirements of this balanced triangle scheme are the same as those of the stripes method. The worst case step 1 (cf. Figure 3) time is $(n(n-1)/p)t_s$, since each bit in a node's partition may be examined twice. This results from the need to implement Figure 4 so that, when a partition row is scanned, the row segment to the left of the diagonal is obtained by scanning the corresponding column segment above the diagonal. So, we obtain

$$t_{triangle} = \frac{n(n-1)}{p}t_s + (n-1)(t_m + t_c)\log p + \alpha \log p.$$

For the efficiency to be constant,

$$\frac{p\,[(n-1)(t_m + t_c)\log p + \alpha \log p]}{n(n-1)t_s}$$

is required to be constant (assuming $n - 1 \approx n$). Hence the isoefficiency is $\Omega(p^2 \log^2 p)$. This is the same as that for the stripes method.
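One way to realize the balanced triangle storage is the standard packed upper-triangle layout sketched below in C; the indexing formula is standard, while the names and layout choice are ours.

    #include <stdio.h>

    /* Linear index of bit (i,j), i < j, in a row-major packed upper triangle
       of an n x n matrix with the diagonal omitted. */
    long tri(long i, long j, long n)
    {
        return i*n - i*(i + 1)/2 + (j - i - 1);
    }

    int main(void)
    {
        long n = 6;
        printf("upper triangle holds %ld bits\n", n*(n - 1)/2);
        /* a bit below the diagonal is read by swapping the indices */
        printf("(0,1) -> %ld, (1,3) -> %ld, (4,5) -> %ld\n",
               tri(0, 1, n), tri(1, 3, n), tri(4, 5, n));
        return 0;
    }

With this layout, the bit for an entry below the diagonal is obtained by swapping the indices, which is exactly the double examination of each bit counted in the $n(n-1)/p$ bound above.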
Using the same test set as before, we obtain the efficiencies given in Figure 22. The plots of Figures 13-15 are extended in Figures 23-25 to include the speedups for the squares and balanced triangle schemes. For large $n$, the balanced triangle method is the fastest for small $p$ ($p \le 16$) and the rectangles method is fastest for large $p$ ($p > 16$).
Using the Host
One may consider utilizing the processing capabilities of the host to assist in the computation of the connected components. One possibility is to let the host perform the merging step (step 2) of Figure 3 . The hypercube processors perform step 1 and transmit the spanning structures to the host in packets. A packet is transmitted as soon as it is created. As a result, the host begins merging sooner than step 2 can commence when the stripes or rectangles method is used as in Section 4. Further, the transmission of the spanning structures is overlapped with their computation and merging. For small n and p, we do not expect this utilization of the host to perform better than the raw schemes of Section 4 because of the overhead of communicating with the host and the lack of sufficient merging work. For large n, the merging load is too large for the single host processor to outperform merging by p processors. However, there may be an intermediate range where using the host results in improved performance.
We experimented with the above scheme using both the stripes and rectangles partitionings of Section 4. The results of our experiments are given in Figure 26. As is evident, utilizing the host improves performance for $n$ in a suitable range, and this range itself changes with $p$. For larger $p$ ($\ge 32$ for stripes and $\ge 64$ for rectangles) we found no $n$ for which the host could be used in the above manner to improve performance. The optimal packet size was found experimentally; it increases with $p$.
Conclusions
We have studied several ways to compute the connected components of an undirected graph on a hypercube multicomputer. Starting from the same algorithmic abstraction, one can arrive at programs with different performance depending on the manner in which one partitions and maps the problem, whether or not one attempts to overlap computation and communication, and the attention one pays to load balancing. Of the various methods studied, the balanced triangle scheme of Section 6 performed best. Since our programs have good isoefficiency, we expect them to perform well on hypercubes of much larger size than tested here, provided problems of sufficiently larger size are solved. The required problem size may be predicted using the isoefficiency of the algorithm.
