Abstract: Mesh-connected computers (MCCs) are an important class of parallel architectures due to their simple and regular interconnections. However, their performance is restricted by their large diameters. Various augmenting mechanisms have been proposed to enhance the communication efficiency of MCCs. One major approach is to add nonconfigurable buses for improved broadcasting. A typical example is the mesh-connected computer with multiple buses (MMB). We propose a new class of generalized MMBs, the improved generalized MMBs (IMMBs). We compare IMMBs with MMBs and a class of previously proposed generalized MMBs (GMMBs). We show the power of IMMBs by considering semigroup and prefix computations. Specifically, as our main result we show that for any constant $0 < \epsilon < 1$, one can construct an $N^{1/2} \times N^{1/2}$ square IMMB on which semigroup and prefix computations on $N$ operands can be carried out in $O(N^{\epsilon})$ time, while maintaining $O(1)$ broadcasting time. Compared with the previous best complexities $O(N^{1/8})$ and $O(N^{1/16})$ achieved on a rectangular MMB and GMMB, respectively, for the same computations, our results show that IMMBs are more powerful than MMBs and GMMBs.
Index Terms: Bus, mesh-connected computer, mesh-connected computer with multiple buses, parallel algorithm, parallel architecture, parallel computing, processor array.
INTRODUCTION
Among various parallel architectures, mesh-connected computers (MCCs) have received considerable attention. The processors in an MCC are arranged as a processor array, and each processor is connected to its nearest neighbors. Due to its simple and regular interconnection pattern, an MCC is feasible for hardware implementation and suitable for solving many problems such as matrix manipulation and image processing. However, the relatively large diameter of an MCC causes a long communication delay between processors that are far apart. The time complexities of algorithms running on an MCC are lower bounded by its diameter. To overcome this problem, various augmenting mechanisms have been proposed to enhance the communication efficiency of MCCs. One major approach is to add buses for improved broadcasting [1], [7], [8], [9], [13], [15], [19], [21], [22], [23], [24]. A typical example is the mesh-connected computer with multiple broadcasting (MMB) [21]. A two-dimensional MMB is a two-dimensional (2D) MCC with a bus for each row and each column. Fig. 1 shows a $4 \times 4$ 2D MMB.
In this paper, we propose a class of improved generalized mesh-connected computers with multiple buses (IMMBs). We compare the performances of IMMBs and MMBs by considering parallel semigroup and prefix computations. Semigroup computations are an important class of computation problems. Examples include computing sum, product, minimum/maximum, Boolean parity, AND, and OR. Prefix computations are related to semigroup computations; they have a wide range of applications such as processor allocation, data distribution and alignment, data compaction, job scheduling, sorting, packet routing, string matching, lexical analysis, matrix computation, linear recurrence, polynomial evaluation, graph algorithms, general Horner expressions, and general arithmetic formulae. See [2], [6], [16], [17] for these applications. Efficient semigroup and prefix algorithms serve as important primitives for parallel computing.
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 12, NO. 3, MARCH 2001 293

Various algorithms for semigroup and prefix computations on different machine models have been proposed in the literature. Kumar and Raghavendra [21] showed that semigroup computations on $N$ operands can be performed using an $N^{1/2} \times N^{1/2}$ square MMB in $O(N^{1/6})$ time. Chen et al. [11] later showed that prefix computations on $N$ operands can be performed on an $N^{1/2} \times N^{1/2}$ square MMB in the same amount of time. They also showed that if the MMB has a rectangular shape, i.e., the sizes of the two dimensions are not the same, better complexity can be achieved. In particular, they showed that semigroup and prefix computations on $N$ operands can be performed using an $N^{5/8} \times N^{3/8}$ MMB in $O(N^{1/8})$ time. Chung [13] proposed a generalized MMB (GMMB) architecture in which $k^2$ $n_1 \times n_2$ MMBs, arranged as a $k \times k$ array, are connected by local links (defined in the next section). By selecting $k = N^{1/10}$, $n_1 = N^{1/2}$, and $n_2 = N^{3/10}$, he showed that semigroup and prefix computations can be performed using an $N^{3/5} \times N^{2/5}$ GMMB in $O(N^{1/10})$ time, which is the best possible because the diameter of this GMMB is $O(N^{1/10})$. The major drawback of this GMMB is that, compared with an MMB, it buys better semigroup and prefix performance with more buses at the cost of broadcasting performance, which degrades by a factor polynomial in $N$. Semigroup and prefix computations using $d$-dimensional MMBs and GMMBs have also been considered. In [3], [11], it was shown that semigroup and prefix computations can be performed on $N$ operands using an $N^{\frac{d2^{d-1}+1}{d2^d}} \times \cdots$
In [13] , it is shown that semigroup and prefix computations can be performed on x operands using an x dP dÀI P
Define the aspect ratio of an $n_1 \times n_2 \times \cdots \times n_d$ $d$-dimensional mesh as
We call a $d$-dimensional mesh a square mesh if its aspect ratio is 1. The performances claimed in [3], [11], [13] are only valid for meshes with very large aspect ratios. Dighe, Vaidyanathan, and Zheng proposed a multiple-bus architecture called the bus-connected ringed tree (BRT) in [15]. In a BRT, each processor is connected to two buses, and all the buses have the same size, which is the number of processors connected to a bus. They showed that a 2D BRT, which is a 2D processor array with multiple buses, can simulate a mesh-of-trees efficiently. Based on the prefix sum algorithm of [17] for a tree, a 2D BRT can carry out a prefix computation in $O(\log^2 N)$ time. Better performances are possible if switches are added to make buses reconfigurable. For example, prefix and semigroup computations can be easily carried out in $O(\log N)$ time using an MMB whose buses can be partitioned into segments by switches [22] or a reconfigurable mesh [19]. As a special case, Olariu et al. [20] showed that the prefix sum operation on $N$ integer operands in the range $[0, N-1]$ can be performed on an $N \times N$ reconfigurable mesh in $O(1)$ time. In this algorithm, the number of processors used is much larger than the number of operands, and dynamically reconfigured paths are used as an integral part of the computation. This technique cannot be generalized to solve general semigroup and prefix problems with the same time complexity.
Like the MMBs of [21] and the GMMBs of [13], the buses of the IMMBs proposed in this paper do not have switches on them. The major difference between our IMMB architectures and existing MMB-like architectures is that the buses in our architectures are partitioned into levels, while each processor is still connected to exactly two orthogonal buses. In an $l$-level IMMB, buses are partitioned into $l$ levels of different spans. The diameter of a $d$-dimensional $l$-level IMMB (called a $(d, l)$-IMMB) is $dl$. In Section 2, we define the two-dimensional IMMBs. In Section 3, we show that semigroup and prefix computations on $N$ operands can be performed using an $N^{1/2} \times N^{1/2}$ square (2, 2)-IMMB in $O(N^{1/16})$ time. We would like to point out that a (2, 2)-IMMB can simulate its corresponding GMMB proposed in [13] with a constant factor of slowdown, while having fewer buses. In terms of number and size of buses, an IMMB is a trade-off between an MMB and a GMMB. The performance of an IMMB is better than that of an MMB and a GMMB. Further performance improvement can be achieved by increasing the number of levels. In Section 4, we show that for any constant $0 < \epsilon < 1$, there exists a multilevel $N^{1/2} \times N^{1/2}$ square IMMB on which semigroup and prefix computations on $N$ operands can be carried out in $O(N^{\epsilon})$ time, while maintaining $O(1)$ broadcasting time. We also show how to construct an $l$-level $N^{1/2} \times N^{1/2}$ square IMMB, where $l = O(\log N)$, on which semigroup and prefix computations on $N$ operands, and data broadcasting, all take $O(\log N)$ time. The generalization of these results to $d$-dimensional IMMBs is discussed in Section 5. The major result presented in that section is that one can perform semigroup and prefix computations on $N$ operands on an $N$-processor $(d, l)$-IMMB in $O(N^{\frac{1}{ld2^d}})$ time. By selecting $l = d$, one can always obtain a $d$-dimensional square IMMB. We conclude the paper in Section 6 by discussing the implications of our results.
TWO-DIMENSIONAL IMMBs
A two-dimensional IMMB is a two-dimensional mesh-connected computer (MCC) augmented with buses. We call the links for the mesh connections local links. The added buses are divided into $l$ levels, which form a hierarchical structure. A 2D IMMB is formally defined as follows.
An $I(1; n_{1,1}, n_{1,2})$, a one-level IMMB, is an $n_{1,1} \times n_{1,2}$ MMB. Processors that are in the boundary rows and columns are called boundary processors. [...] to connect boundary processors of level-$(l-1)$ submeshes to enforce nearest-neighbor mesh connections. For easy reference, these local links are referred to as level-$l$ bridge local links. For each row (respectively, column) of the $n_{l,1} \times n_{l,2}$ array of level-$(l-1)$ submeshes, we do the following: merge the topmost (respectively, leftmost) row (respectively, column) buses, which were level-$(l-1)$ buses, of these IMMBs into one bus. The buses obtained by these merging operations are called the level-$l$ buses of $I(l; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2}, \ldots, n_{l,1}, n_{l,2})$, and they are no longer level-$(l-1)$ buses. The remaining level-$k$ buses, $1 \le k \le l-1$, of the $n_{l,1} \times n_{l,2}$ component level-$(l-1)$ submeshes are called the level-$k$ buses of $I(l; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2}, \ldots, n_{l,1}, n_{l,2})$. An $l$-level 2D IMMB is also referred to as a (2, $l$)-IMMB. To avoid degeneracy, we assume that $n_{i,1} \ge 3$ and $n_{i,2} \ge 3$ for $1 \le i \le l$. Define $\beta_1 = n_{1,1} + n_{1,2}$ and

$$\beta_i = n_{i,1} n_{i,2} \beta_{i-1} - n_{i,1}(n_{i,2} - 1) - n_{i,2}(n_{i,1} - 1) = n_{i,1} n_{i,2} (\beta_{i-1} - 2) + n_{i,1} + n_{i,2}$$

for $1 < i \le l$. Clearly, $\beta_l$ is the number of buses in the IMMB.

Fig. 2 shows the structure of $I(3; 3, 3, 3, 3, 3, 3)$, a (2, 3)-IMMB. An IMMB is represented by a hypergraph $G$, whose vertices and hyperedges correspond to the processors and connections (buses and local links), respectively. Using hypergraph theory [4], one can derive many topological properties of IMMBs. We refer readers to [26] for the analysis of hypergraph-based interconnection structures.
In an $I(l; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2}, \ldots, n_{l,1}, n_{l,2})$, there are $N = \prod_{i=1}^{l} n_{i,1} n_{i,2}$ processors arranged as a $\prod_{i=1}^{l} n_{i,1} \times \prod_{i=1}^{l} n_{i,2}$ processor array. It contains a $\prod_{i=1}^{l} n_{i,1} \times \prod_{i=1}^{l} n_{i,2}$ MCC (connected by local links) as a substructure; i.e., after removing the buses, we obtain an MCC. In addition to local links, each processor is connected to exactly two buses.
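The counting formulas above can be checked numerically. The following is a minimal sketch, assuming the bus-count recurrence reads $\beta_1 = n_{1,1} + n_{1,2}$ and $\beta_i = n_{i,1} n_{i,2} \beta_{i-1} - n_{i,1}(n_{i,2}-1) - n_{i,2}(n_{i,1}-1)$; the function names `immb_shape` and `immb_bus_count` are hypothetical and only serve this illustration.

```python
from math import prod


def immb_shape(dims):
    """Return (rows, cols, processors) of a 2D IMMB.

    dims is a list of (n_i1, n_i2) pairs for levels 1..l (an assumed
    encoding of the parameter list of the IMMB).
    """
    rows = prod(a for a, _ in dims)
    cols = prod(b for _, b in dims)
    return rows, cols, rows * cols


def immb_bus_count(dims):
    """Number of buses, following the recurrence quoted in the lead-in."""
    a, b = dims[0]
    buses = a + b  # a one-level IMMB is an MMB: one bus per row and per column
    for a, b in dims[1:]:
        # a*b copies of the lower-level structure; each of the a submesh rows
        # merges b row buses into one, and each of the b submesh columns
        # merges a column buses into one.
        buses = a * b * buses - a * (b - 1) - b * (a - 1)
    return buses


if __name__ == "__main__":
    levels = [(3, 3)] * 3  # the (2, 3)-IMMB of Fig. 2
    print(immb_shape(levels), immb_bus_count(levels))
```

For the three-level example with all parameters equal to 3, the sketch yields a $27 \times 27$ array of 729 processors and 366 buses, compared with 54 buses for a $27 \times 27$ MMB.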
Each bus belongs to a unique level. It is important to note that a level-$k$ bus, $k > 1$, is shared by processors from several level-$(k-i)$ submeshes, where $1 \le i \le k-1$, so it cannot support concurrent data transmissions among the level-$(k-i)$ submeshes connected by it. In some situations, as will be shown shortly, concurrent data transmissions on such a bus can be simulated using other buses with only a small constant slowdown factor.
Throughout this paper, we adopt the same assumptions that have been used in all previous work on MMBs and their variants. We assume that it takes constant time to broadcast a message on a bus, as in [1], [7], [8], [9], [13], [15], [19], [21], [22], [23], [24]. To ensure conflict-free accesses of buses, each processor is equipped with off-line circuitry so that bus allocations, although operated dynamically, are predetermined by an off-line scheduling algorithm which is known at compilation time. The bus accesses are compiled in advance so that no two processors attempt to use the same bus at the same time. With these assumptions, bus arbitration overheads are ignored, algorithm analysis is simplified, and comparing different algorithms becomes easier. The complexity of an algorithm is measured by the total number of parallel computation steps and parallel communication steps. These assumptions are adopted in all previous work on processor arrays with synchronous buses.

Let $m = \prod_{i=1}^{l} n_{i,1}$ and $n = \prod_{i=1}^{l} n_{i,2}$. We use $P_{i,j}$, $1 \le i \le m$ and $1 \le j \le n$, to denote the processor in row $i$ and column $j$. Consider the diameter of $I(l; n_{1,1}, n_{1,2}, \ldots, n_{l,1}, n_{l,2})$.
[...] there is a path of length at most 2 using level-$l$ buses. Routing a message from a processor $P_{i',j'}$ to a processor $P_{i'',j''}$ using buses can be done as follows: If $P_{i',j'}$ and $P_{i'',j''}$ are in the same level-$(l-1)$ submesh, then we only need to consider routing within this submesh. If they are in two different level-$(l-1)$ submeshes, then we find a path from the level-$(l-1)$ submesh $M'$ that contains
Our two-level IMMBs closely resemble the GMMBs proposed in [13] , which are constructed using multiple copies of MMB by only introducing additional local links.
A $9 \times 9$ two-dimensional GMMB is shown in Fig. 3, and its corresponding IMMB is shown in Fig. 4. Define an $r \times s$ logical MMB on a subset of $rs$ processors of an IMMB as a substructure of the IMMB, with these $rs$ processors as its processors, that can simulate each parallel step of an $r \times s$ MMB in constant time. We can use a (2, 2)-IMMB, $I(2; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$, to simulate its corresponding 2D GMMB of [13] as follows. In each level-1 submesh, we use the level-1 bus that connects the second processor row (respectively, column) to simulate a level-1 bus that connects all processors in the first row (respectively, column). For example, suppose that processor $P_{i,j}$ in the first row (respectively, column) of a level-1 submesh wants to broadcast a message to all processors in the same row (respectively, column) of the same submesh. It can first send the message to processor $P_{i+1,j}$ (respectively, $P_{i,j+1}$) in the second row (respectively, column) using a local link, and then broadcast it to all processors in that row (respectively, column) using the level-1 bus that connects them. After this, each processor in the second row (respectively, column) sends the received message to the corresponding processor in the first row (respectively, column) via a local link. This scheme implies that there exist $n_{2,1} n_{2,2}$ disjoint $n_{1,1} \times n_{1,2}$ logical MMBs defined on the level-1 submeshes, and each $n_{1,1} \times n_{1,2}$ MMB substructure of the 2D GMMB can be simulated by a logical MMB. This leads to the following claim.

Theorem 1. A (2, 2)-IMMB can simulate its corresponding 2D GMMB with a constant-factor slowdown.
Obviously, the converse of this theorem is not true. This is because the diameter of the 2D GMMB corresponding to $I(2; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$ is $2(n_{2,1} + n_{2,2} - 1)$, whereas the diameter of $I(2; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$ is 4. We will show that IMMBs are more powerful than MMBs and GMMBs by considering semigroup and prefix computations.
Let us compare the structures of two-dimensional MMB, GMMB, and IMMB of the same size. A common feature of MMB, GMMB, and IMMB is that all of them contain an MCC as a substructure, and each processor is connected to exactly two buses. Define the size of a bus as the number of processors connected by the bus. All buses in an MMB and a GMMB are of the same size, whereas the buses in an IMMB have variable sizes. The largest bus size of a GMMB is the same as the size of buses in an MMB. In terms of semigroup and prefix computations, a GMMB improves the performance of an MMB by adding more buses, and IMMB improves the performance of a GMMB by allowing variable bus sizes. It is important to note that the improved performance of an IMMB over a GMMB can even be achieved by using a smaller number of buses, as in the case of two-level IMMBs.
SEMIGROUP AND PREFIX COMPUTATIONS ON A (2, 2)-IMMB
In this section, we consider semigroup and prefix computations using a (2, 2)-IMMB, a 2D two-level IMMB. A semigroup computation is formally defined by a tuple $(\oplus, S)$, where $\oplus$ is an associative operator, and $S = \{a_1, a_2, \ldots, a_N\}$ is a set of operands. This tuple specifies the computation $a_1 \oplus a_2 \oplus \cdots \oplus a_N$. A prefix computation is also defined by a tuple $(\oplus, S)$, where $\oplus$ is an associative operator, and $S = (a_1, a_2, \ldots, a_N)$ is a sequence of operands. This tuple specifies the computations $s_i = a_1 \oplus a_2 \oplus \cdots \oplus a_i$ for $1 \le i \le N$.
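As a sequential reference point for these two definitions, the following minimal sketch computes a semigroup result and all prefixes for an arbitrary associative operator. The function names `semigroup` and `prefix` are hypothetical; only the Python standard library is used.

```python
from functools import reduce
from itertools import accumulate


def semigroup(op, operands):
    """Combine all operands with the associative operator op."""
    return reduce(op, operands)


def prefix(op, operands):
    """Return the list of all prefixes, combined left to right with op."""
    return list(accumulate(operands, op))


if __name__ == "__main__":
    data = [5, 3, 8, 1]
    print(semigroup(min, data))  # minimum of all operands
    print(prefix(min, data))     # running minimum
```

The last prefix always equals the semigroup result, which is why the remainder of the section only needs to treat prefix computations.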
We assume that the operation $\oplus$ performed on two operands takes constant time. Since the result $s_N$ of a prefix computation is a result of a semigroup computation, any algorithm for prefix computations can be used for a semigroup computation with the same complexity. Thus, we only need to discuss prefix computations. By Theorem 1 and the result of [13] on GMMBs, semigroup and prefix computations on $N$ operands can be done using a (2, 2)-IMMB in the $O(N^{1/10})$ time achieved on the corresponding GMMB.

Consider $I(2; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$ such that $n_{1,1} = n_{2,2} = n_1$ and $n_{1,2} = n_{2,1} = n_2$. Clearly, $I(2; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$ consists of $N = n^2$ processors arranged as an $n \times n$ square array, where $n = n_1 n_2$. In another view, $I(2; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$ is an $n_2 \times n_1$ array of $n_1 \times n_2$ level-1 submeshes. We call the processor in such a submesh that has the largest index according to lexicographical order the leader of the submesh. We observe that the leaders of level-1 submeshes are processors $P_{in_1, jn_2}$, where $1 \le i \le n_2$ and $1 \le j \le n_1$, and they form an $n_2 \times n_1$ array. It is simple to see that the leaders $P_{in_1, kn_2}$ and $P_{in_1, (k+1)n_2}$ (respectively, $P_{kn_1, jn_2}$ and $P_{(k+1)n_1, jn_2}$) of two adjacent level-1 submeshes are connected by the following path:
$$P_{in_1,\,kn_2} \xrightarrow{\text{bridge link}} P_{in_1,\,kn_2+1} \xrightarrow{\text{level-1 bus}} P_{in_1,\,(k+1)n_2}$$

(respectively,

$$P_{kn_1,\,jn_2} \xrightarrow{\text{bridge link}} P_{kn_1+1,\,jn_2} \xrightarrow{\text{level-1 bus}} P_{(k+1)n_1,\,jn_2}).$$

Furthermore, the $k$th level-2 row (respectively, column) bus can be used to simulate a bus that connects the $k$th leader row (respectively, column). For example, suppose that a submesh leader wants to broadcast a message to all leader processors in its row (respectively, column). It can send the message to the processor in the first row (respectively, column) of its submesh via the level-1 column (respectively, row) bus it is connected to, and use the level-2 bus to broadcast the message to all processors in this row (respectively, column). Then, processors in this row (respectively, column) send the received messages to their corresponding processors (leaders) in the last row (respectively, column) of the submesh via level-1 column (respectively, row) buses. Thus, $I(2; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$ contains a logical $n_2 \times n_1$ MMB defined on the level-1 leaders.
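Let $(a_1, a_2, \ldots, a_n)$ be a sequence of $n = n_1 \times n_2$ operands for a prefix computation, and $\mathcal{A}$ be a prefix algorithm that runs in $O(t(n))$ time on an $n_1 \times n_2$ MMB with each processor holding one operand. Suppose that $f: \{i \mid 1 \le i \le n\} \to \{(j, k) \mid 1 \le j \le n_1, 1 \le k \le n_2\}$ is the function used by algorithm $\mathcal{A}$ to map the $a_i$'s and $s_i$'s to processors; i.e., if $f(i) = (j, k)$, input $a_i$ and result $s_i = a_1 \oplus a_2 \oplus \cdots \oplus a_i$ are in $P_{j,k}$ before and after the computation, respectively. We want to perform prefix computation on a sequence $A = (a_1, a_2, \ldots, a_{n^2})$ of $n^2$ operands, partitioned into $n$ subsequences $A_i = (a_{(i-1)n+1}, \ldots, a_{in})$, $1 \le i \le n$, using $I(2; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$.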
Recall that $I(2; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$ consists of an $n_2 \times n_1$ array of level-1 submeshes, each being an $n_1 \times n_2$ processor array. Denote these submeshes as $M_{k,j}$'s, where $1 \le k \le n_2$ and $1 \le j \le n_1$. Submesh $M_{k,j}$ consists of processors $P_{u,v}$, $(k-1)n_1 + 1 \le u \le kn_1$ and $(j-1)n_2 + 1 \le v \le jn_2$. Define $g: \{i \mid 1 \le i \le n\} \to \{(k, j) \mid 1 \le j \le n_1, 1 \le k \le n_2\}$ such that $g(i) = (k, j)$ if $f(i) = (j, k)$. We use $g(i)$ to map the $A_i$'s to the $M_{k,j}$'s. For each $A_i$, we map its $a_{(i-1)n+q}$ to the processor at local position $(j_q, k_q)$ of its submesh, where $1 \le q \le n$ and $(j_q, k_q) = f(q)$. In other words, initially processor $P_{(k-1)n_1+j',\,(j-1)n_2+k'}$, which is in submesh $M_{k,j}$, stores $a_{(i-1)n+i'}$, where $(k, j) = g(i)$, $(j', k') = f(i')$, and $1 \le i, i' \le n$. With this data distribution, the prefix computation using an $I(2; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$ can be carried out in the following four steps.
1. Execute algorithm $\mathcal{A}$ on the $n$ level-1 submeshes concurrently to compute local prefixes. For each submesh $M_{k,j}$, where $(k, j) = g(i)$, store the result $\beta_i = a_{(i-1)n+1} \oplus \cdots \oplus a_{in}$ computed for its whole subsequence in its leader processor. By Theorem 1, the IMMB can simulate $n$ disjoint $n_1 \times n_2$ MMBs with only a constant-factor slowdown. Hence, these operations take $O(t(n))$ time.
2. Execute algorithm $\mathcal{A}$ on a logical $n_2 \times n_1$ MMB with the leaders of level-1 submeshes as its processors. The computed result stored in the leader of $M_{k,j}$ is $\gamma_i = \beta_1 \oplus \beta_2 \oplus \cdots \oplus \beta_i$, where $(k, j) = g(i)$. This step takes $O(t(n))$ time.
3. The leader of each submesh $M_{k,j}$, where $(k, j) = g(i)$, broadcasts the value $\gamma_{i-1}$, which is obtained from the results of Steps 1 and 2, to all processors in $M_{k,j}$. Operating in parallel, this can be done in $O(1)$ time since $I(2; n_{1,1}, n_{1,2}, n_{2,1}, n_{2,2})$ can simulate $n$ independent MMBs.
4. Each processor performs operation $\oplus$ on the local prefix computed in Step 1 and the value it received in Step 3. This takes $O(1)$ time.

In summary, we have the following result.
Theorem 2. If a prefix (respectively, semigroup) operation on $n$ operands can be carried out in $O(t(n))$ time using an $n$-processor MMB, then the same prefix (respectively, semigroup) operation on $n^2$ operands can be carried out in $O(t(n))$ time using an $n^2$-processor square (2, 2)-IMMB.
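The data flow of the four steps can be sketched sequentially as follows. This is only an illustration of the two-level decomposition, not a model of the bus hardware or of the mappings $f$ and $g$: it assumes the operands are stored as a flat list split into $n$ consecutive blocks of $n$ operands, and the function name `two_level_prefix` is hypothetical.

```python
from itertools import accumulate


def two_level_prefix(op, operands, n):
    """Prefix computation on n*n operands in two levels (sequential sketch)."""
    if len(operands) != n * n:
        raise ValueError("expected exactly n*n operands")
    blocks = [operands[i * n:(i + 1) * n] for i in range(n)]
    # Step 1: local prefixes inside every block (every level-1 submesh).
    local = [list(accumulate(block, op)) for block in blocks]
    # Step 2: prefix computation over the block totals held by the leaders.
    leader_prefix = list(accumulate((p[-1] for p in local), op))
    result = list(local[0])  # the first block has no preceding blocks
    for i in range(1, n):
        # Step 3: the combined value of all preceding blocks is broadcast.
        carry = leader_prefix[i - 1]
        # Step 4: each processor combines the carry with its local prefix;
        # the carry goes on the left, so op need not be commutative.
        result.extend(op(carry, value) for value in local[i])
    return result


if __name__ == "__main__":
    print(two_level_prefix(lambda x, y: x + y, list("abcdefghi"), 3))
```

Using string concatenation as the operator in the usage line exercises the non-commutative case, where the order of the two arguments in Step 4 matters.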
Since prefix and semigroup computations on $n$ operands can be performed using an $n^{5/8} \times n^{3/8}$ MMB in $O(n^{1/8})$ time, Theorem 2 implies that these computations on $N$ operands can be performed using an $N^{1/2} \times N^{1/2}$ square (2, 2)-IMMB in $O(N^{1/16})$ time.

If we let each processor hold more than one operand, semigroup and prefix computations may be performed more efficiently. To see this, let us distribute $n^2 t(n)$ operands to $n^2$ processors such that each processor holds $t(n)$ operands. In parallel, each processor performs a prefix (or semigroup) computation on its own $t(n)$ operands sequentially. The total parallel time for this process is $O(t(n))$. Then, the parallel operations described above are performed on the partial results. Since the product of time and the number of processors is $O(n^2 t(n))$, this computation is cost optimal.
Theorem 3. If semigroup and prefix computations on $n$ operands can be carried out in $O(t(n))$ time using an $n$-processor MMB, then the same computations on $n^2 t(n)$ operands can be carried out using an $n^2$-processor square IMMB in $O(t(n))$ time, which is cost optimal.
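As a worked instance of Theorems 2 and 3, the substitution below takes the rectangular-MMB bound $t(n) = O(n^{1/8})$ quoted in the introduction and sets $N = n^2$; writing the operand count of Theorem 3 as $N'$ is a notational assumption made only for this sketch.

```latex
\begin{align*}
N = n^2 \;&\Longrightarrow\; n = N^{1/2}, \qquad
  t(n) = O\!\left(n^{1/8}\right) = O\!\left(N^{1/16}\right)
  && \text{(Theorem 2)} \\
N' = n^2\, t(n) \;&\Longrightarrow\;
  \underbrace{n^2}_{\text{processors}} \cdot \underbrace{O(t(n))}_{\text{time}}
  = O\!\left(n^2\, t(n)\right) = O(N')
  && \text{(Theorem 3)}
\end{align*}
```

The second line is the cost-optimality argument: the processor-time product matches, up to a constant factor, the number of operands, each of which must be touched at least once.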
SEMIGROUP AND PREFIX COMPUTATIONS ON (2, l)-IMMBS
The algorithms presented in the previous section can be extended to run on (2, $l$)-IMMBs, the 2D $l$-level IMMBs, where $l > 2$. Without loss of generality, we only consider IMMBs with $n_{l,1} = n_{l,2} = n_{l-1,1} = n_{l-1,2} = \cdots = n_{1,1} = n_{1,2} = n$. For easy reference, we denote this special IMMB by (2, $l$)-IMMB($n$).
Clearly, there are $n^{2l}$ processors in (2, $l$)-IMMB($n$), and these processors form an $n^l \times n^l$ processor array. This processor array can be viewed as an $n^{l-k-1} \times n^{l-k-1}$ array of $n^{k+1} \times n^{k+1}$ arrays, each being a level-$(k+1)$ submesh. A level-$(k+1)$ submesh is in turn considered as an $n \times n$ array of level-$k$ submeshes. Let $M_{i,j,k+1}$, where $1 \le i, j \le n^{l-k-1}$, denote a level-$(k+1)$ submesh. [...] Its level-$(k+1)$ buses, a subset of its level-$k$ buses, and a subset of its level-$(k+1)$ bridge local links (bridge links for short) form an $n \times n$ logical MMB defined on its level-$k$ leaders.
It is easy to verify that each level-$k$ bus is used in at most two of these paths, and each level-$(k+1)$ bus is used in at most one of these paths. Using these paths, the leaders of level-$k$ submeshes can simulate an $n \times n$ MCC. There are $n - 1$ horizontal (respectively, vertical) level-$(k+1)$ buses in a level-$(k+1)$ submesh $M_{i,j,k+1}$ for $1 \le k < l$ [...], $1 \le j' \le n$, with a constant-factor slowdown. For $k = l - 1$, there are $n$ horizontal level-$(k+1)$ buses and $n$ vertical level-$(k+1)$ buses, and all of them can be used to simulate the buses connecting level-$(l-1)$ leaders. For example, for the (2, 3)-IMMB(3) shown in Fig. 2, the connections used to simulate $3 \times 3$ logical MMBs of level-1 leaders are shown in Fig. 5, and the connections used to simulate the $3 \times 3$ logical MMB of level-2 leaders are shown in Fig. 6. The shaded processor in Fig. 6 is the leader of a level-3 submesh. This IMMB is used to construct a (2, 4)-IMMB(3). Summarizing the above discussions, we have the following lemma.

Lemma 1. For any level-$(k+1)$ submesh $M_{i,j,k+1}$, where $1 \le k < l$ and $1 \le i, j \le n^{l-k-1}$, in (2, $l$)-IMMB($n$), there is a logical $n \times n$ MMB on its level-$k$ submesh leaders using all its level-$(k+1)$ buses, a subset of its level-$k$ buses, and a subset of its level-$(k+1)$ bridge links.

The following fact is also useful in our prefix algorithm. For brevity, we omit its proof.

Lemma 2. For any level-$(k+1)$ submesh $M_{i,j,k+1}$, where $1 \le k < l$ and $1 \le i, j \le n^{l-k-1}$, in (2, $l$)-IMMB($n$), there is a path that consists of a level-$k$ bus and a level-$(k+1)$ bus from its leader to any of its level-$k$ leaders.
Let $(a_1, a_2, \ldots, a_{n^2})$ be a sequence of $n^2$ operands for the prefix computation and $\mathcal{A}$ be a prefix algorithm that runs in $O(t(n^2))$ time on an $n \times n$ MMB with each processor holding one operand. Suppose that $f: \{i \mid 1 \le i \le n^2\} \to \{(j, k) \mid 1 \le j, k \le n\}$ is the function used by algorithm $\mathcal{A}$ to map the $a_i$'s and $s_i$'s to processors in an $n \times n$ MMB; i.e., if $f(i) = (r, c)$, input $a_i$ and result $s_i = a_1 \oplus a_2 \oplus \cdots \oplus a_i$ are in $P_{r,c}$ before and after the computation, respectively. We want to perform prefix computation on a sequence $A = (a_1, a_2, \ldots, a_{n^{2l}})$ using (2, $l$)-IMMB($n$). We partition $A$ into $n^2$ subsequences $A_i$, $1 \le i \le n^2$, each having $n^{2(l-1)}$ operands, such that $A_i = (a_{(i-1)n^{2(l-1)}+1}, a_{(i-1)n^{2(l-1)}+2}, \ldots, a_{in^{2(l-1)}})$. We use $f(i)$ to map $A_i$ to the level-$(l-1)$ submesh $M_{r,c,l-1}$. [...] Each unit $A_{i_{l-1}, i_{l-2}, \ldots, i_{k+1}, i_k}$ has an associated value, called its active value.

Our algorithm consists of three phases. The first phase has $l - 1$ iterations. In the first iteration, algorithm $\mathcal{A}$ is performed on all logical $n \times n$ MMBs defined on level-1 submeshes, and the active values of these submeshes are stored in their leaders. In the $k$th iteration, $1 < k \le l - 1$, algorithm $\mathcal{A}$ is performed on all logical $n \times n$ MMBs defined on the leaders of level-$(k-1)$ submeshes, and the active values of all level-$(k+1)$ submeshes are routed to their respective leaders. By Lemma 1 and Lemma 2, each iteration takes $O(t(n^2))$ time, which is the time required for prefix computation on $n^2$ operands using an $n \times n$ MMB. The total time for the first phase is $O(l\,t(n^2))$.
In the second phase, the leader of a level-$k$ submesh broadcasts its active value to the level-1 leaders in the submesh. This broadcasting operation involves 1) sending its active value to one of its level-$(k-1)$ leaders, 2) broadcasting the item to all its level-$(k-1)$ leaders using the logical $n \times n$ MMB defined on these level-$(k-1)$ leaders, and 3) recursively broadcasting to leaders of lower levels. A bus conflict problem arises: broadcasting from the leader of a level-$k$ submesh to all its level-$(k-1)$ leaders and broadcasting from the leader of a level-$(k-1)$ submesh to all its level-$(k-2)$ leaders may need to use the same level-$k$ buses. To avoid such conflicts, we use a "pipelining" strategy, which consists of $l - 2$ steps, Step 1 through Step $l - 2$. In the $j$th step, level-$(l - 2i - j)$ leaders broadcast to the level-$(l - 2i - j - 1)$ leaders in their level-$(l - 2i - j)$ submesh, where $0 \le i \le (l - j)/2$, using logical $n \times n$ MMBs defined on the level-$(l - 2i - j - 1)$ leaders. By Lemma 1 and Lemma 2, there is no conflict in the use of buses in this process. It is easy to verify that after $l - 2$ steps, all data at higher level leaders are broadcast to level-1 leaders. Then, each level-1 leader broadcasts all data it has received to the processors in its level-1 submesh. The overall running time for the second phase is $O(l)$.
The task of the third phase is for each processor to update its prefix value using the data it received in the second phase. The time for this phase is obviously $O(l)$. In summary, the total time for this three-phase prefix algorithm is $O(l\,t(n^2))$, assuming that algorithm $\mathcal{A}$ takes $O(t(n^2))$ time. The first phase of this algorithm can be used to perform a semigroup operation. By the results of [11], [21], semigroup and prefix computations on $n^2$ operands can be carried out using an $n \times n$ MMB in $O(n^{1/3})$ time. [...]

Theorem 4. For any constant $0 < \epsilon < 1$, there exists an $N^{1/2} \times N^{1/2}$ square (2, $l$)-IMMB($n$) using which semigroup and prefix operations on $N$ operands can be carried out in $O(N^{\epsilon})$ time.
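The level-by-level structure of the three-phase algorithm can be illustrated with a recursive sequential sketch. It is a simplification under stated assumptions: carries are applied immediately at every level instead of being deferred and pipelined as in the second phase, the operands are stored as a flat list, and `fanout` stands for the $n^2$ units combined at each level. The function name `multilevel_prefix` is hypothetical.

```python
from itertools import accumulate


def multilevel_prefix(op, operands, fanout):
    """Hierarchical prefix computation (sequential sketch).

    Intended for len(operands) == fanout ** l for some l >= 1.
    """
    size = len(operands)
    if size <= fanout:
        # Lowest level: one prefix computation on a single logical MMB.
        return list(accumulate(operands, op))
    if size % fanout:
        raise ValueError("operand count must be divisible by fanout")
    part = size // fanout
    # Solve each lower-level unit independently (concurrent on the IMMB).
    parts = [
        multilevel_prefix(op, operands[i * part:(i + 1) * part], fanout)
        for i in range(fanout)
    ]
    # Prefix computation over the values held by the unit leaders.
    leader_prefix = list(accumulate((p[-1] for p in parts), op))
    result = list(parts[0])
    for i in range(1, fanout):
        carry = leader_prefix[i - 1]  # combined value of all earlier units
        result.extend(op(carry, value) for value in parts[i])
    return result


if __name__ == "__main__":
    # fanout 4 corresponds to n = 2 here purely for a small demonstration.
    print(multilevel_prefix(lambda x, y: x + y, list(range(1, 17)), 4))
```

The recursion depth equals the number of levels, which mirrors the factor $l$ in the running time of the first phase.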
For any constant $0 < \epsilon < 1$, let $N^{1-\epsilon} = n^{2l}$. We select $l$ [...] If we distribute $N$ operands to the $N^{1-\epsilon}$ processors of this (2, $l$)-IMMB($n$), semigroup and prefix operations on these $N$ operands can be carried out in $O(N^{\epsilon})$ time. Hence, we have the following claim.
Corollary 3. For any constant $0 < \epsilon < 1$, there exists an $N^{\frac{1-\epsilon}{2}} \times N^{\frac{1-\epsilon}{2}}$ square (2, $l$)-IMMB($n$) using which semigroup and prefix operations on $N$ operands can be carried out in $O(N^{\epsilon})$ time, which is cost optimal.
If we let $n$ be a constant, say $n = 3$, then semigroup and prefix computations on $N$ operands can be performed on a square (2, $l$)-IMMB(3) in $O(l)$ time, which leads to the following claim.
Theorem 5. Semigroup and prefix computations on $N$ operands can be performed using an $O(\log N)$-level $N^{1/2} \times N^{1/2}$ square IMMB in $O(\log N)$ time.
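To see why the number of levels is logarithmic when $n$ is constant, recall that a (2, $l$)-IMMB($n$) has $n^{2l}$ processors. The sketch below, with the hypothetical helper `levels_for`, finds the smallest $l$ whose processor count reaches a given number of operands, assuming one operand per processor.

```python
def levels_for(num_operands, n=3):
    """Smallest l such that n ** (2 * l) >= num_operands (integer arithmetic)."""
    levels, capacity = 1, n * n
    while capacity < num_operands:
        capacity *= n * n  # each additional level multiplies the size by n^2
        levels += 1
    return levels


if __name__ == "__main__":
    for operands in (9, 729, 10 ** 6):
        print(operands, levels_for(operands))
```

With $n = 3$ each level multiplies the processor count by 9, so $l$ grows as $\log_9 N$.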
Letting each processor hold $\log N$ operands, the following corollary is straightforward. [...]

There is no contradiction between Theorem 4 and Theorem 5. The better performance claimed in Theorem 5 is achieved by using more buses.
EXTENSION TO d-DIMENSIONAL IMMBS
[...] where $1 \le i_j \le n_j$ and $1 \le j \le d$. In a way similar to the definition of 2D IMMBs, we can formally define $(d, l)$-IMMBs in a recursive fashion, assuming that a $d$-dimensional MMB is a 1-level $d$-dimensional IMMB, and its buses are level-1 buses. For a $(d, l-1)$-IMMB, we call its level-$(l-1)$ buses that connect the processor with the smallest index according to lexicographical order (i.e., $P_{1,1,\ldots,1}$) its representative level-$(l-1)$ buses. Clearly, there are exactly $d$ representative level-$k$ buses, one for each dimension. We use $n = \prod_{i=1}^{d} n_{l,i}$ copies of a $(d, l-1)$-IMMB to construct a $(d, l)$-IMMB as follows. Arrange these $(d, l-1)$-IMMBs, which are referred to as level-$(l-1)$ submeshes, as an $n_{l,1} \times n_{l,2} \times \cdots \times n_{l,d}$ array. Boundary processors are connected by bridge local links, and representative level-$(l-1)$ buses are merged into level-$l$ buses. For brevity, we omit the detailed formal definition. We illustrate the construction of a $(3, l)$-IMMB from $(3, l-1)$-IMMBs in Fig. 7. In Fig. 7a, we show a $(3, l-1)$-IMMB with thick line segments designating its three representative level-$(l-1)$ buses. In Fig. 7b, we illustrate the construction of a $(3, l)$-IMMB from nine copies of the $(3, l-1)$-IMMB shown in Fig. 7a. In this case, $n_{l,1} = n_{l,2} = n_{l,3} = 3$. The representative level-$(l-1)$ buses of level-$(l-1)$ 3D submeshes are merged into level-$l$ buses, as shown in Fig. 7b, and the three thicker line segments designate the representative level-$l$ buses of the constructed $(3, l)$-IMMB. If the bus merge operation is not performed in the construction of a 2-level $d$-dimensional IMMB, we obtain a $d$-dimensional GMMB. By Theorem 1, it is easy to verify that the diameter of a $(d, l)$-IMMB is $dl$. It is also straightforward that any $d$-dimensional GMMB can be simulated by its corresponding 2-level $d$-dimensional IMMB (which has fewer buses) with a constant-factor slowdown.
It was shown in [3], [11] that a prefix computation on $N$ operands can be performed in [...] 3D GMMB. The aspect ratio of this GMMB is $N^{1/3}$. Since a 3D GMMB of [13] can be simulated by a 2-level 3D IMMB obtained by merging a subset of its buses with a constant-factor slowdown, semigroup and prefix computations on $N$ operands can be performed in [...]

Let $n_1 = n^{13/24}$, $n_2 = n^{7/24}$, $n_3 = n^{1/6}$, and $n = n_1 n_2 n_3$. We construct a (3, 2)-IMMB by arranging $n$ copies of [...]

This theorem is a generalization of Theorem 2. Note that a (3, 2)-IMMB can be considered as constructed from a three-dimensional GMMB by merging some of its buses. This result is an improvement over the best known [...]

[...] 3D IMMB can be recursively constructed from an arbitrary $n_1 \times n_2 \times n_3$ MMB. First, we construct a 2-level $n_2 n_1 \times n_3 n_2 \times n_1 n_3$ IMMB by arranging $n = n_1 n_2 n_3$ copies of an $n_1 \times n_2 \times n_3$ MMB, which are level-1 3D submeshes, as an $n_2 \times n_3 \times n_1$ 3D array, and properly adding bridge links and merging some level-1 buses into level-2 buses. Then, we construct a 3-level $n_3 n_2 n_1 \times n_1 n_3 n_2 \times n_2 n_1 n_3$ IMMB by arranging $n$ copies of this $n_2 n_1 \times n_3 n_2 \times n_1 n_3$ IMMB, which are level-2 3D submeshes, as an $n_3 \times n_1 \times n_2$ 3D array, and properly adding bridge links and merging some level-2 buses into level-3 buses. The resulting structure is a 3-level $n \times n \times n$ IMMB. Let $n_1 = n^{13/24}$, $n_2 = n^{7/24}$, $n_3 = n^{1/6}$. By properly selecting leaders of 3D submeshes at every level, we can identify disjoint logical MMBs and run the prefix algorithm of [3], [11] on these MMBs in a way similar to that described in the previous section. It is simple to verify that semigroup and prefix computations on $n^3$ operands on this IMMB take [...]

In [3], [11], it was shown that semigroup and prefix computations can be performed on $N$ operands using an [...]
In [13], it is shown that semigroup and prefix computations can be performed on $N$ operands using an [...]
[...] given in [3], [11] on logical MMBs defined on these levels.
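As a sanity check on the recursive construction of the 3-level $n \times n \times n$ IMMB described above, the side lengths at each level can be tracked as exponents of $n$. The sketch below assumes the exponents $13/24$, $7/24$, and $1/6$ for $n_1$, $n_2$, and $n_3$, and uses hypothetical helper names (`rotate_left`, `build_levels`). It verifies only the dimension arithmetic, not the bus structure or any running time.

```python
from fractions import Fraction

# Exponents of n for n_1, n_2, n_3 (n_i = n ** e_i), as stated in the text.
E = (Fraction(13, 24), Fraction(7, 24), Fraction(1, 6))

def rotate_left(t, k):
    """Cyclically rotate tuple t left by k positions."""
    k %= len(t)
    return t[k:] + t[:k]

def build_levels(exps):
    """Return the side-length exponents of the level-1, -2, and -3 structures."""
    levels = [exps]
    sides = exps
    for level in (2, 3):
        # Level 2 arranges submeshes as n_2 x n_3 x n_1 (rotation by 1),
        # level 3 as n_3 x n_1 x n_2 (rotation by 2).
        arrangement = rotate_left(exps, level - 1)
        # Multiplying side lengths corresponds to adding exponents.
        sides = tuple(s + a for s, a in zip(sides, arrangement))
        levels.append(sides)
    return levels

levels = build_levels(E)
for lvl, sides in enumerate(levels, start=1):
    print(lvl, [str(s) for s in sides])
```

Every side of the level-3 structure ends up with exponent 1, that is, length $n$. Using exact fractions rather than floating-point numbers avoids rounding error when the exponents are compared.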
It is not difficult, though tedious, to prove the following generalization of Theorems 6 and 7.
Theorem 8. Semigroup and prefix computations on $N$ operands can be performed using an $N$-processor $(d, l)$-IMMB in $O(N^{[\ldots]})$ [...]
Consider constructing a $(d, d)$-IMMB recursively as follows. The $(d, l)$-IMMB, $1 \le l \le d$, is constructed by an [...]
where $n = \prod_{i=1}^{d} n_i$. Choosing $n_i$'s such that $n_i = N^{\frac{d 2^{d-i} + 1}{d 2^{d}}}$, we have the following extension of Theorem 8.
Theorem 9. Semigroup and prefix computations on $N$ operands can be performed using a $d$-level [...]
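The choice of the $n_i$ values can be checked numerically. The sketch below assumes that the exponent of $n_i$ is $(d\,2^{d-i} + 1)/(d\,2^{d})$, which is one reading of the formula above; the function name `side_exponents` is hypothetical. It confirms that the exponents sum to 1 for several values of $d$, so the product of the $n_i$ equals the base, and that $d = 3$ reproduces the exponents $13/24$, $7/24$, and $1/6$ used in the 3D construction.

```python
from fractions import Fraction

def side_exponents(d):
    """Exponents e_1..e_d with e_i = (d * 2**(d - i) + 1) / (d * 2**d)."""
    denom = d * 2**d
    return [Fraction(d * 2**(d - i) + 1, denom) for i in range(1, d + 1)]

for d in range(2, 7):
    exps = side_exponents(d)
    # The product of the n_i equals the base exactly when the exponents sum to 1.
    assert sum(exps) == 1
    print(d, [str(e) for e in exps])
```

For $d = 2$ the same formula gives the exponents $5/8$ and $3/8$.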
CONCLUDING REMARKS
In this paper, we proposed a generalization of mesh-connected computers with multiple buses, the IMMBs. Processors in an IMMB form a hierarchy of clusters (submeshes) of different sizes, and buses are partitioned into levels for fast data movement among processor clusters at different levels. The IMMBs provide more design alternatives, since in addition to $d$, the number of dimensions, one can also select $l$, the number of bus levels. For a practical value of $N$, the number of processors, both $l$ and $d$ can be selected as small constants no greater than 3, though theoretically it was shown that larger $l$ and $d$ values lead to better performance for sufficiently large $N$. In our examples, $l$ is fixed for all dimensions. It is possible to select different $l$ values for different dimensions. The best semigroup and prefix time complexities achieved on MMBs and GMMBs so far were developed for MMBs and GMMBs with large aspect ratios. We showed that $l$ can be used to control aspect ratios, in addition to improving performance. In particular, we showed how to construct IMMBs with aspect ratios equal to 1 that have very good semigroup and prefix computation performance. This is not possible for either MMBs or GMMBs.
In a real implementation of an IMMB, buses at higher levels may be implemented with larger bandwidths, by using either more wires or more expensive technology, to alleviate the increased congestion. This approach is similar to the fat trees proposed in [18] and implemented in the CM-5 machine [25]. Fiber optics may be used to implement buses at the highest level. Three dimensions may be preferred to two dimensions for reducing the number of processors attached to a bus.
Solving other problems using MMBs and GMMBs has been considered [5], [9], [10], [12], [14], [21]. For example, Chen et al. [10] designed an $O(N^{1/8} \log N)$-time median-finding algorithm for an $N^{3/8} \times N^{5/8}$ MMB, and Bhagavathi et al. [5] designed a selection algorithm with the same complexity. Chung [14] designed an $O(N^{1/10} \log N)$-time selection algorithm for an $N^{3/5} \times N^{2/5}$ GMMB. It is not difficult to show that the time complexities of selection algorithms using IMMBs are within a factor of $\log N$ of the time complexities of our semigroup and prefix algorithms using IMMBs.
Si Qing Zheng received the PhD degree in computer science from the University of California, Santa Barbara, in 1987. After having been on the faculty of Louisiana State University for 11 years (since 1987), he joined the University of Texas at Dallas, where he is currently a professor of computer science. Dr. Zheng's research interests include algorithms, computer architectures, networks, parallel and distributed processing, telecommunication, and VLSI design. He has published extensively in these areas. He served as the program committee chairman of numerous international conferences and the editor of several professional journals.
Keqin Li received the BS degree in computer science from Tsinghua University, China, in 1985, and the PhD degree in computer science from the University of Houston in 1990. He is currently a full professor of computer science at the State University of New York at New Paltz. Dr. Li's research interests are mainly in design and analysis of algorithms and parallel and distributed computing, with particular interests in approximation algorithms, parallel algorithms, job scheduling, task dispatching, load balancing, performance evaluation, dynamic tree embedding, scalability analysis, and parallel computing using optical interconnects. His pioneering work on processor allocation and job scheduling on partitionable meshes has inspired extensive subsequent work by numerous researchers and created a very active and productive research field. He has published more than 140 journal and refereed conference papers. He has also coedited six international conference proceedings and a book entitled Parallel Computing Using Optical Interconnections (Kluwer Academic, 1998).
