Abstract
Introduction
The mesh architecture is one of the promising models for parallel computing. Its natural structure is suitable for VLSI implementation and allows a high degree of integration. However, the mesh has a crucial drawback: its communication diameter is quite large because it lacks a broadcasting mechanism. To overcome this problem, many researchers have considered adding broadcasting buses to the mesh [1, 2, 3, 8, 9, 10, 11]. In this paper, we deal with two such models: the meshes with separable buses (MSB) [3, 9] and a variant of the meshes with partitioned buses [1, 2] called the meshes with multiple partitioned buses (MMPB).
The MSB and the MMPB are mesh-connected computers enhanced by the addition of broadcasting buses along every row and column. The broadcasting buses of the MSB, called separable buses, can be dynamically sectioned into smaller bus segments by program control, while those of the MMPB, called partitioned buses, are statically partitioned in advance and cannot be dynamically reconfigured. In the MSB model, each row/column has only one separable bus, while in the MMPB model, each row/column has L partitioned buses (L ≥ 2).
In this paper, we show that the MMPB of size n × n can simulate the MSB of size n × n in $O(n^{1/(2L)})$ steps. This extends our previous result that the MMPB of size n × n with L = 1 can simulate the MSB of size n × n in $O(n^{1/3})$ steps [4, 5]. From a theoretical viewpoint, since we have shown that the MSB of size n × n can simulate the reconfigurable mesh [7, 11] (or PARBS, the processor array with reconfigurable bus systems) of size n × n in $O(\log^2 n)$ steps [6], we can show that any problem that is solved in T steps by the reconfigurable mesh of size n × n can be solved in $O(T n^{1/(2L)} \log^2 n)$ steps by the MMPB of size n × n. Since it has been argued that the reconfigurable mesh can be used as a universal chip capable of simulating any equivalent-area architecture without loss in time [7], our result gives upper bounds on the time the MMPB needs to simulate other equivalent-area architectures. Furthermore, we also consider the scaling-simulation problem of the MSB, in which the MMPB of size m × m simulates the MSB of size n × n (m < n).
This paper is organized as follows. Section 2 describes the MSB and the MMPB models. Section 3 presents an algorithm that simulates the n × n MSB on the n × n MMPB, and Section 4 gives a scaling-simulation algorithm that simulates the n × n MSB using the m × m MMPB (m < n). Finally, Section 5 offers concluding remarks.
Models
An n × n mesh consists of $n^2$ identical SIMD processors or processing elements (PE's) arranged in a two-dimensional grid with n rows and n columns. The PE located at the grid point (i, j), denoted as PE[i, j], is connected via bi-directional unit-time communication links to those PE's at (i ± 1, j) and (i, j ± 1), provided they exist.
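The local-link structure can be sketched in a few lines of Python. This is only an illustration: the function name `mesh_neighbors` is hypothetical, and rows and columns are assumed to be indexed from 0 to n − 1.

```python
def mesh_neighbors(i, j, n):
    """Return the grid points linked to PE[i, j] in an n x n mesh.

    Assumes indices run from 0 to n - 1; neighbors outside the grid
    do not exist and are dropped.
    """
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < n and 0 <= c < n]
```

A corner PE has two neighbors, a border PE three, and an interior PE four, which is why data moving over local links alone needs a number of steps proportional to n to cross the mesh.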
An n × n mesh with separable buses (MSB) and an n × n mesh with multiple partitioned buses (MMPB) are n × n meshes enhanced with the addition of broadcasting buses along every row and column. The broadcasting buses of the MSB, called separable buses, can be dynamically sectioned through PE-controlled switches during the execution of programs, while those of the MMPB are statically partitioned in advance into segments of a fixed length. In the MSB model, each row/column has only one separable bus (Figure 1), while in the MMPB model each row/column has L partitioned buses for L ≥ 2 (Figure 2). The L partitioned buses of the MMPB are indexed as level-1, level-2, . . ., level-L, respectively. For each level-l, the value $\ell_l$ denotes the length of a bus segment of the partitioned bus in level-l.
Broadcast substep: Every PE changes its switch configuration by local decision (this operation applies only to the MSB). Then, along each broadcasting bus segment, several of the PE's connected to the bus send data to the bus, and several of the PE's on the bus receive the data transmitted on the bus.
Compute substep: Every PE executes some local computation.
The bus accessing capability is similar to that of the Common-CRCW PRAM model. If there is a write-conflict on a bus, the PE's on the bus receive a special value ⊥ (i.e., PE's can detect whether there is a write-conflict on a bus or not). If no data is transmitted on a bus, the PE's on the bus receive a special value φ (i.e., PE's can tell whether data is transmitted on a bus or not).
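The bus access rule can be pictured with a small Python sketch. It rests on one assumption that the text does not settle: any two or more simultaneous senders on a segment are treated as a write-conflict, even when they send the same value. The names `PHI`, `BOT`, and `bus_value` are hypothetical.

```python
PHI = object()  # stands for the special value φ: no data on the bus
BOT = object()  # stands for the special value ⊥: write-conflict


def bus_value(sent):
    """Value received by every PE on one bus segment.

    `sent` is the list of data items written to the segment in this
    broadcast substep. Assumption: two or more writers always conflict.
    """
    if len(sent) == 0:
        return PHI
    if len(sent) == 1:
        return sent[0]
    return BOT
```

Under this reading a receiver can distinguish three situations in a single substep (silence, exactly one sender, several senders), which is what lets PE's use the bus for detection as well as for data transfer.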
Simulation of the n × n MSB by the n × n MMPB
In this section, we consider how to simulate a single step of the n × n MSB using the n × n MMPB. Given a single step of the simulated MSB in such a way that each PE[i, j] of the simulating MMPB knows only how the corresponding PE[i, j] of the MSB behaves at this step, we consider how to achieve the same computational task of the step on the MMPB. We assume that the computing power of the PE's, the bandwidth of the local links, and that of the broadcasting buses are equivalent in both the MSB and the MMPB.
In what follows, we focus on how to simulate the broadcast substep of the MSB using the MMPB, because the local communication and the compute substeps of the MSB can easily be simulated in constant time by the MMPB.
To begin with, we consider the case where L = 2.
Lemma 1 For any single step of the n × n MSB, the broadcasts taken on the separable bus in row i (resp. column i) of the MSB can be simulated in row
Proof : Assume $\ell_1 = n$ and $\ell_2 = n^{1/2}$, and take any single step S of the MSB and any row index i ∈ {0, 1, . . . , n − 1}. We consider only the simulation of the broadcasts taken on the separable bus along row i of the MSB; those on the bus along column i of the MSB can be simulated similarly. To simplify the exposition, let P j and P j respectively denote
First, we define some notation to describe the broadcasts to be simulated. To distinguish the two ports through which a PE has access to the separable bus, we refer to the port on the left side of the sectioning switch as port L and the other as port R, as shown in Figure 1. Then, the broadcasts are carried out in the following way: (1) several of $P_0, P_1, \ldots, P_{n-1}$ section the bus, (2) several of $P_0, P_1, \ldots, P_{n-1}$ send data to the bus through port L and/or R, and (3) several of $P_0, P_1, \ldots, P_{n-1}$ receive data from the bus through port L and/or R. With respect to these broadcasts performed on the row-separable bus of the MSB, we define the following for each $P_j$ and each port $x$:
• $C^x_j = \{(k, y) \mid$ port $x$ of $P_j$ and port $y$ of $P_k$ belong to the same bus segment after the broadcasting bus is sectioned$\}$.
• $s^x_j = a$ if $P_j$ sends data $a$ to port $x$; otherwise $s^x_j = \phi$.
• $r^x_j$ = (the data received by $P_j$ from port $x$).
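The connectivity sets defined above can be computed directly from the switch settings, as in the following Python sketch. It assumes a wiring that the text only implies: port R of one PE is permanently wired to port L of the next PE in the row, and ports L and R of the same PE are joined unless that PE sections the bus. The function name `bus_segments` and the boolean list `cut` are hypothetical.

```python
def bus_segments(cut):
    """Compute the connectivity sets of one separable bus.

    cut[j] is True when PE j sections the bus at its switch.
    Returns a dict mapping each port (j, x), x in {"L", "R"}, to the
    set of ports (k, y) on the same bus segment (the port itself included).
    """
    segments = []
    current = []
    for j in range(len(cut)):
        current.append((j, "L"))
        if cut[j]:
            # An open switch ends the segment between port L and port R.
            segments.append(current)
            current = []
        current.append((j, "R"))
    segments.append(current)
    return {port: set(seg) for seg in segments for port in seg}
```

For example, with three PE's where only the middle one sections the bus, ports L and R of the middle PE end up in different segments, so that PE can send and receive different data on its two sides in the same substep.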
To describe each $r^x_j$ using $C^*_*$ and $s^*_*$, we define a binary commutative operator ⊕ in such a way that it satisfies the following equations for any x and y:
It is not difficult to confirm that ⊕ is well-defined and associative. Then, each $r^x_j$ can be described using $C^*_*$, $s^*_*$, and ⊕.
We divide $P_0, P_1, \ldots, P_{n-1}$ into $n^{1/2}$ disjoint blocks $B_0, B_1, \ldots, B_{n^{1/2}-1}$ in such a way that each $B_p$ consists of 
Lemma 2 For any single step of the n × n MSB, the broadcasts taken on the separable bus in row
Proof : We consider only the simulation of the broadcasts taken on the separable bus along row i of the MSB; those on the bus along column i of the MSB can be simulated similarly. Let $T_k(n)$ denote the time cost for simulating the broadcasts taken along the separable bus in row i of the n × n MSB using row i of the n × n MMPB with L = k (k ≥ 2). We prove the lemma by mathematical induction on k.
• Base case: For k = 2, from Lemma 1, we have
• Inductive case: For k > 2, we prove that the bound holds for L = k, assuming that it holds for L = k − 1. We let $\ell_1 = n$. We modify the 3-phase simulation algorithm used in the proof of Lemma 1 in such a way that we divide $P_0, P_1, \ldots,$
Then, with the k − 1 partitioned buses other than the level-1 bus, Phase 1 and 3 can be executed in
Phase 2 can be completed in O(
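Hence, the conclusion follows.
Since the local communication and compute substeps of the MSB can obviously be simulated in constant time by the MMPB, Lemma 2 immediately implies the following theorem:
Theorem 1 Any single step of the n × n MSB can be simulated in $O(n^{1/(2L)})$ steps by the n × n MMPB (L ≥ 2).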
Scaling-Simulation of the n × n MSB by the m × m MMPB
In this section, we consider simulating a single step of the n × n MSB by the m × m MMPB (m < n). Let $P_{x,y}$ denote PE[x, y] of the MSB. To simplify the exposition, we assume that n mod m = 0, so that each PE[i, j] of the MMPB simulates $(n/m)^2$ PE's $P_{x,y}$ of the MSB under a fixed processor mapping. Given a single step of the simulated MSB in such a way that each PE of the MMPB knows only how the PE's of the MSB mapped to it behave at this step, we consider how to achieve the same computational task of the step using the MMPB. We assume that the computing power of the PE's, the bandwidth of the local links, and that of the broadcasting buses are equivalent in both the MSB and the MMPB.
By the results in the previous section, we can prove the following lemma.
Lemma 3
For any single step of the n × n MSB, the broadcasts taken on the separable bus in row i (resp. column i) of the MSB can be simulated in row
Proof : Here, we consider only the simulation of the broadcasts taken on the separable bus along row i of the MSB using the corresponding row of the MMPB; those on the bus along column i of the MSB can be simulated similarly.
P j , C 
Concluding Remarks
We have considered the simulation and the scaling-simulation problems of the MSB by the MMPB, and obtained the following results:
1. The MMPB of size n × n can simulate the MSB of size n × n in $O(n^{1/(2L)})$ steps (L ≥ 2).
From a practical viewpoint, compared to the MSB, the MMPB model has the advantage that the propagation delay introduced by the length of the bus (signal propagation delay) and by the switch elements inserted into the bus (device propagation delay) can be kept small; hence our simulation algorithms are useful when the mesh size becomes so large that the delay cannot be neglected.
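As a rough numerical illustration of result 1, the following Python sketch evaluates the growth term $n^{1/(2L)}$ for a few values of L. It ignores the constants hidden by the O-notation, so the numbers indicate only how the bound scales; the function name `growth_term` and the sample mesh size are hypothetical choices.

```python
def growth_term(n, L):
    """Growth term n^(1/(2L)) of the simulation bound, constants ignored."""
    return n ** (1.0 / (2 * L))


if __name__ == "__main__":
    n = 2 ** 24  # sample side length, chosen so the roots come out evenly
    for L in (2, 3, 4):
        print(f"L = {L}: n^(1/(2L)) is about {growth_term(n, L):.1f}")
```

For this sample size the term drops from about 64 at L = 2 to about 16 at L = 3 and about 8 at L = 4, which shows how each additional level of partitioned buses reduces the simulation overhead.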
