An extensive comparative analysis is carried out of various mesh-connected architectures that contain sparse broadcast buses for low-cost, high-performance parallel computing. The two basic architectures differ in the implementation of bus intersections. The first architecture simply allows row/column bus crossovers, whereas the second architecture implements such intersections with switches that introduce further flexibility. Both architectures have lower cost than the mesh with multiple broadcast, which has buses spanning each row and each column, yet they maintain to a high extent the powerful properties of that mesh. The architecture that employs switches for the creation of separable buses is even shown to often perform better than the higher-cost mesh with multiple broadcast. Architectures with separable buses that employ store-and-forward routing often perform better than architectures with contiguous buses that employ the high-cost wormhole routing technique. These architectures are evaluated with respect to cost and to efficiency in implementing several important operations and application algorithms. The results prove that these architectures are very promising alternatives to the mesh with multiple broadcast, while their implementation is cost-effective and feasible.
I. INTRODUCTION
The mesh architecture is used frequently in parallel processing because of its low VLSI complexity and its support for scalability. However, its main drawbacks, namely large diameter and large average internode distance, dramatically affect its communication capabilities. Although other popular interconnection networks, such as the direct binary hypercube, have smaller values for the latter pair of parameters, their major drawback is that they do not permit the application of incremental growth techniques [7] and their VLSI implementation becomes a Herculean task for massively parallel systems [2, 8]. To allow the efficient implementation of distant data transfers, several enhancements have been proposed for the mesh. The addition of a single global bus is such an enhancement [12, 13]. Although it is often assumed that the propagation time of messages on the global bus is independent of the size of the mesh, this assumption may not be acceptable for practical systems [3].
*Tel.: (973) 596-5651, Fax: (973) 596-5680. e-mail: ziavras@megahertz.njit.edu
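As a rough illustration of the diameter and average internode distance mentioned above, the following Python sketch computes both by brute force for an n × n mesh without wraparound links and for a binary hypercube. The function names and the choice of averaging over distinct PE pairs are assumptions made for this example; they are not definitions taken from the paper.

```python
from itertools import product


def mesh_metrics(n):
    """Diameter and average distance over distinct PE pairs of an n x n mesh.

    Assumes no wraparound links, so the distance is the Manhattan distance.
    """
    nodes = list(product(range(n), repeat=2))
    dists = [abs(r1 - r2) + abs(c1 - c2)
             for (r1, c1) in nodes
             for (r2, c2) in nodes
             if (r1, c1) != (r2, c2)]
    return max(dists), sum(dists) / len(dists)


def hypercube_metrics(d):
    """Same two metrics for a binary hypercube with 2**d nodes.

    The distance between two nodes is the Hamming distance of their labels.
    """
    size = 2 ** d
    dists = [bin(a ^ b).count("1")
             for a in range(size)
             for b in range(size)
             if a != b]
    return max(dists), sum(dists) / len(dists)


if __name__ == "__main__":
    # Both systems below contain 64 PEs.
    print("8 x 8 mesh      :", mesh_metrics(8))
    print("6-cube (64 PEs) :", hypercube_metrics(6))
```

For 64 PEs, the 8 × 8 mesh has diameter 14, whereas the 6-dimensional hypercube has diameter 6; the gap widens as the systems grow.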
To avoid bottlenecks caused by the single global bus, the mesh parallel computer can instead be augmented by adding multiple broadcast buses, where each bus connects a subset of PEs (processing elements) in the mesh. One such popular architecture is the mesh with multiple broadcast [1, 21, 23]. It is a mesh-connected parallel computer where all PEs on each row and each column are connected to a shared row bus and a shared column bus, respectively. The performance of this architecture is comparable to that of the pyramid computer for several image processing problems. Rectangular meshes with multiple broadcast may perform better than square ones with the same number of PEs, when row and column buses are considered, as the former systems contain more buses [3, 15, 29]. The mesh with multiple broadcast is the most relevant architecture in this paper. The rest of this section briefly describes other important variations of the mesh architecture.
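The benefit of row and column buses for distant transfers can be sketched with a small simulation of a global broadcast. The two-step schedule used here (the source drives its row bus, then every PE of that row drives its column bus) is an assumption made for illustration, as are the function names; the sketch also assumes that each bus carries one value per step and that every attached PE reads it.

```python
def broadcast_with_buses(n, src):
    """Simulate a global broadcast on an n x n mesh with row and column buses.

    Returns the grid of PEs that hold the value and the number of bus steps.
    """
    row, col = src
    received = [[False] * n for _ in range(n)]
    received[row][col] = True

    # Bus step 1: the source PE drives its row bus.
    for c in range(n):
        received[row][c] = True

    # Bus step 2: every PE in the source row drives its own column bus.
    for c in range(n):
        if received[row][c]:
            for r in range(n):
                received[r][c] = True

    return received, 2


def broadcast_steps_plain_mesh(n, src):
    """Worst-case number of nearest-neighbor hops on a mesh without buses."""
    row, col = src
    return max(row, n - 1 - row) + max(col, n - 1 - col)


if __name__ == "__main__":
    grid, bus_steps = broadcast_with_buses(8, (0, 0))
    print("all PEs reached:", all(all(line) for line in grid))
    print("bus steps:", bus_steps)
    print("plain-mesh hops:", broadcast_steps_plain_mesh(8, (0, 0)))
```

The step counts are not directly comparable in time, because a bus transfer and a nearest-neighbor transfer may have different durations.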
The CHiP parallel computer [4] consists of a mesh (grid) of PEs with programmable switches interposed between neighboring PEs. The local memory in switches stores interconnection patterns to be implemented at run time. The reconfigurable mesh, or mesh with reconfigurable bus, is a square mesh of PEs where all PEs are connected to a global broadcast bus that spans all rows and columns [14, 22]. Switches are located at all column/row bus intersections, and PEs control their neighboring switches, which can divide the global bus into subbuses. The cost of this architecture may be prohibitively high because of its large number of switches, whereas the assumption of fixed-time data transfers may be unrealistic.
Many algorithms have good theoretical performance on the reconfigurable mesh [17]. In the non-crossover model, the four communication ports of a PE can be connected together to form only planar connections [27]. In the higher-cost crossover model, non-planar connections can also be formed, as a PE may connect together its north-south and east-west ports independently.
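The difference between the two models can be made concrete by enumerating the ways in which a PE can group its four ports. The Python sketch below is illustrative only: it assumes that a connection pattern is a partition of the ports N, E, S, W (listed in clockwise order), that a block with a single port is an unconnected port, and that "planar" means that no two blocks cross. Under these assumptions, exactly one of the 15 patterns is non-planar, namely the one that joins north-south and east-west independently.

```python
from itertools import combinations

PORTS = ("N", "E", "S", "W")  # clockwise order around the PE (assumed)


def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    items = list(items)
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Either `first` forms a block on its own ...
        yield [[first]] + part
        # ... or it joins one of the existing blocks.
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def is_crossing(partition):
    """True if two blocks of the partition cross when drawn inside the PE."""
    pos = {port: i for i, port in enumerate(PORTS)}
    for a, b in combinations(partition, 2):
        for x1, x2 in combinations(sorted(pos[p] for p in a), 2):
            for y1, y2 in combinations(sorted(pos[p] for p in b), 2):
                # Two chords cross when exactly one endpoint of the second
                # lies strictly between the endpoints of the first.
                if (x1 < y1 < x2) != (x1 < y2 < x2):
                    return True
    return False


if __name__ == "__main__":
    patterns = list(set_partitions(PORTS))
    crossing = [p for p in patterns if is_crossing(p)]
    print("patterns:", len(patterns))
    print("non-planar patterns:", crossing)
```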
The PEs in the polymorphic torus are located at the vertices of a two-dimensional torus network [5, 16, 25]. Switches [18]. The network is dynamically reconfigured to support either NEWS or diagonal connections. The custom chip of BLITZEN contains 128 one-bit PEs arranged in an 8 × 16 array [18]. The standard configuration of this system contains 16,384 PEs arranged in a 128 × 128 array. The mesh of trees is constructed from a grid of PEs by adding PEs and wires to form on top of it a complete binary tree on each row and each column [9]. The cost of these additional PEs may be a drawback in the implementation of this architecture. Several slight variations of the mesh of trees have also been introduced [11].
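The cost of the additional PEs in the mesh of trees can be estimated with a short calculation. The sketch below assumes an n × n grid with n a power of two, and assumes that the grid PEs serve as the leaves of the row and column trees, so that each tree contributes n - 1 internal PEs; the function name is hypothetical.

```python
def mesh_of_trees_pe_count(n):
    """Return (grid PEs, additional tree PEs, total) for an n x n mesh of trees.

    Assumes n is a power of two and that grid PEs are the leaves of every
    row tree and column tree.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    grid = n * n
    internal_per_tree = n - 1        # complete binary tree with n leaves
    extra = 2 * n * internal_per_tree  # n row trees plus n column trees
    return grid, extra, grid + extra


if __name__ == "__main__":
    for n in (4, 16, 128):
        print(n, mesh_of_trees_pe_count(n))
```

Under these assumptions the additional PEs outnumber the grid PEs for every n greater than 2, which is consistent with the cost concern raised above.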
Other mesh-connected parallel computers are obtained by superimposing one or more global meshes on an underlying mesh of PEs [6]. The mesh with a single global mesh is constructed starting with several regular meshes and connecting together with a global mesh the lead PEs at the top leftmost corners of these meshes. Additional links between PEs in the original meshes can be used to form a single underlying mesh that contains all PEs. The creation of a mesh with l global meshes is also possible. The first global mesh connects the lead PEs of a set of regular meshes. The second global mesh connects the lead PEs of a subset of meshes with a single global mesh. This process is repeated, and finally, the l-th global mesh connects a subset of meshes containing l-1 global meshes.
To improve the performance of the mesh with multiple broadcast, processor-controlled switches can be used to partition the row and column buses for reconfiguration purposes [26] . For example, a switch can be inserted after every other PE on each row or each column bus in order to produce a mesh with separable buses. A lower-cost modification of this architecture does not require that all PEs be connected to row and column broadcast buses [24] . The [24, 29] . However, we assume here only square meshes because our objective is to evaluate architectural differences of systems as they are related to data transfer operations.
The mesh with multiple broadcast and its aforementioned variation with separable buses achieve very good performance at the expense of very high hardware cost. Their existing cost-performance comparisons with alternative architectures are very limited. The objective of this paper is to show that families of low-cost mesh-connected architectures can achieve performance comparable to that of the higher-cost mesh with multiple broadcast. More specifically, this paper investigates in detail two families of mesh-connected architectures with sparse broadcast buses (i.e., buses that do not cover every row and every column of the mesh). One of the architectures employs switches for the implementation of separable buses. In contrast to the work in [24] that assumes hierarchical sectioning of broadcast buses, these switches are located at all bus intersections and also connect to both row and column buses. Thus, they reduce the hardware cost further while improving the system's flexibility for many operations. Also, we assume that the underlying structure is a single/complete mesh, while [24] [21], denoted here by MB(n), is identical to the MORB(n,n). The broadcasting structure of the MORB mesh is similar to a special instance of the mesh with separable buses that has only one sectioning level for row and column broadcast buses [24].
The store-and-forward and wormhole-routing switching techniques are chosen throughout this paper. The analysis of broadcasting first assumes the store-and-forward technique. Based on Theorem 1, the worst-case time for a global broadcast on the MORB(n,p) is $T_{\mathrm{global}}(n,p) = r\,t_{\mathrm{NEWS}} + 3\,t_{\mathrm{bus}}$ and $T_{\mathrm{global}}(n,p) = r\,t_{\mathrm{NEWS}} + 3 \log p \; t_{\mathrm{bus}}$ under the UTD and LTD models, respectively, for $p < n$. For $p = n$, the values are $T_{\mathrm{global}}(n,n) = 2\,t_{\mathrm{bus}}$ and $T_{\mathrm{global}}(n,n) = 2 \log p \; t_{\mathrm{bus}}$, respectively, for the two models.
Let us define the cost of implementing global broadcasting on the MORB(n,p) as the product $cost_{\mathrm{global}}(n,p) = PE\_buses(n,p) \times T_{\mathrm{global}}(n,p)$, where PE_buses stands for the total number of PEs attached to broadcast buses and is given in Proposition 2. This cost is in reference to the regular n × n mesh without broadcast buses, and can be used to compare the MORB and MB meshes for this operation; the MB(n) mesh has the same PE_buses as the MORB(n,n), that is, $2n^2$. If $q = t_{\mathrm{bus}}/t_{\mathrm{NEWS}}$, the cost ratio $cost_{\mathrm{global}}(n,p)/cost_{\mathrm{global}}(n,n)$ becomes $p(r + 3q)/(2nq)$ and $p(r + 3q \log p)/(2nq \log p)$ under the UTD and LTD models, respectively. The smaller the value of this ratio, the better the MORB(n,p) system is for global broadcasting (i.e., it has a better balance between hardware cost and performance). Under the UTD model, this ratio is approximately equal to $(r + 3q)/(2rq)$. Similarly, the time ratio $T_{\mathrm{global}}(n,p)/T_{\mathrm{global}}(n,n)$ is approximated by $(r + 3q)/(2q)$ for the UTD model.
If $t_{\mathrm{bus}} = t_{\mathrm{NEWS}} = t_{\mathrm{ALU}}$, the cost ratios $cost_{\mathrm{all\_OR}}(n,p)/cost_{\mathrm{all\_OR}}(n,n)$ of the two families of MORB meshes under the UTD model are shown in Figure 6 for practical values of n and p. Figure 6 shows that the cost ratio is very small for both families of MORB meshes.
The last algorithm also can be used to find in the same amount of time the maximum or minimum of n mesh was presented in [10]. The In the second phase, an algorithm that employs binary tree emulation is used [11].
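The cost and time ratios for global broadcasting stated above can be evaluated numerically with a short Python sketch. It implements only the ratio expressions as they appear in the text; the function names are hypothetical, the logarithm is assumed to be base 2, and r (defined in Theorem 1, which is not reproduced here) is passed in as a parameter. The sample call uses r = n/p purely as an assumption, suggested by the approximation (r + 3q)/(2rq).

```python
import math


def cost_ratio_global(n, p, r, q, model="UTD"):
    """Cost ratio cost_global(n,p) / cost_global(n,n) for global broadcasting.

    q is t_bus / t_NEWS. Under UTD the ratio is p(r + 3q) / (2nq); under LTD
    it is p(r + 3q log p) / (2nq log p), with log taken base 2 (assumption).
    """
    if model == "UTD":
        factor = 1.0
    elif model == "LTD":
        factor = math.log2(p)
    else:
        raise ValueError("model must be 'UTD' or 'LTD'")
    return p * (r + 3 * q * factor) / (2 * n * q * factor)


def time_ratio_global_utd(r, q):
    """Time ratio T_global(n,p) / T_global(n,n) under the UTD model."""
    return (r + 3 * q) / (2 * q)


if __name__ == "__main__":
    n, p, q = 64, 8, 1.0
    r = n / p  # assumption made for this example only
    print("UTD cost ratio:", cost_ratio_global(n, p, r, q, "UTD"))
    print("LTD cost ratio:", cost_ratio_global(n, p, r, q, "LTD"))
    print("UTD time ratio:", time_ratio_global_utd(r, q))
```

A cost ratio below 1 indicates that, for these parameter values, the MORB(n,p) offers a better balance between hardware cost and broadcast time than the MB(n), even though its broadcast takes longer.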
Figure 9 shows the speedup T'grenx(n,n)/Tp'grex(n,p) for the comparison of the MORB(n,p) with the higher-cost MB(n) mesh with broadcast buses; we assume that $t_{\mathrm{ALU}} = 1$
