ELSEVIER Theoretical Computer Science 205 (1998) 231-242
Simulating the CRCW PRAM on reconfigurable networks
Biing-Feng Wang* 
Department of Computer Science, National Tsing Hua University Hsinchu, Taiwan 30043, ROC 
Received October 1995 
Communicated by O. Watanabe
Abstract 
This paper addresses the problem of simulating the CRCW PRAM on reconfigurable
networks. Let $N$ and $M$, respectively, be the numbers of processors and memory cells contained
in the CRCW PRAM. We first show that a two-dimensional $N \times MN^{1/r}$ reconfigurable
network can simulate any operation performed on the CRCW PRAM in O(1) time, where $r \ge 2$
is a constant. Then, if $N \le M$, we further show that any operation performed on the CRCW
PRAM can also be simulated in O(1) time on an $r$-dimensional
$N^{1/(r-1)} \times N^{1/(r-1)} \times \cdots \times N^{1/(r-1)} \times (M/N^{(r-2)/(r-1)})$
reconfigurable network, where $r \ge 2$ is a constant.
© 1998 Elsevier Science B.V. All rights reserved
Keywords: Simulation; Parallel algorithms; CRCW PRAM; Reconfigurable buses 
1. Introduction 
Two general models of parallel computation are usually considered when designing
parallel algorithms: one in which the processors share a common memory and one in
which the memory is distributed among the processors [1, 7]. The Concurrent Read,
Concurrent Write (CRCW) Parallel Random Access Machine (PRAM) is an example
of the shared memory model, in which no communication or memory access restrictions
exist. In contrast, bounded-degree interconnection networks such as the tree, ring,
and mesh are examples of the distributed memory model. The processors are interconnected
in a bounded-degree network, in which each processor has a relatively small
(constant size) local memory. Not all processors can communicate directly, which
constrains the flow of data and increases the running time of many algorithms. It
should be noted that bounded-degree interconnection networks are practical models
that are well suited for current technology, whereas the CRCW PRAM is an idealized
model which remains technologically infeasible to implement.
*Tel.: 886-35-742805; fax: 886-35-723694; e-mail: bfwang@cs.nthu.edu.tw.
This research is supported by the National Science Council of the Republic of China under Grant
NSC-83-0408-E-007-013.
PII S0304-3975(97)00096-0
The problem of simulating the PRAM by bounded-degree interconnection networks
has been addressed extensively in the literature. In a sequence of papers [2, 5, 6,
10, 18], the probabilistic and deterministic upper bounds for simulating each operation
performed on a PRAM with $N$ processors and $M$ memory locations were reduced
from O(N) to $O(\log N)$ (with probability tending to 1) in the probabilistic case and
$O(\log^2 N)$ in the deterministic case, where $M$ is polynomial in $N$.
Recently, some parallel computers have been equipped with reconfigurable bus
systems to solve problems more efficiently [3, 12, 13, 15]. Bus automata [3],
polymorphic-torus networks [12], and reconfigurable meshes [15] are three examples.
Conceptually, they all belong to the family of two-dimensional processor arrays
with reconfigurable bus systems (abbreviated to 2-D PARBSs in the rest of this paper).
It has been shown that many important problems (such as the sorting problem
[4, 8, 14, 17], the transitive closure problem [19], and the longest common subsequence
problem [3]) can be solved in O(1) time on the PARBS. A comprehensive catalogue of
published papers on reconfigurable bus architectures can be found in [16].
In this paper, fast algorithms for simulating the CRCW PRAM on the PARBS are
proposed. Let $C$ be a problem that can be solved in O(T) time on a CRCW PRAM
that obeys the dynamic priority conflict-resolution rule [11]. Suppose that $N$ processors
and $M$ memory locations are used when we solve $C$ on the PRAM. In [20], by
simulation, Wang and Chen showed that $C$ can also be solved in O(T) time on
a 2-D $N \times MN$ PARBS. In this paper, also by simulation, we first show that the
problem $C$ can be solved in O(T) time on a 2-D $N \times MN^{1/r}$ PARBS, where $r \ge 2$
is a constant. Then, if $N \le M$, we further show that $C$ can be solved in O(T) time on
an $r$-D $N^{1/(r-1)} \times N^{1/(r-1)} \times \cdots \times N^{1/(r-1)} \times (M/N^{(r-2)/(r-1)})$ PARBS, where $r \ge 2$ is
a constant. Hence, we show that solving the problem $C$ in O(T) time using $O(N^{\varepsilon}M)$
processors, where $\varepsilon = 1/(r-1)$, can be realized by adopting PARBSs of high dimensions.
The remainder of this paper is organized as follows. In the next section, we 
introduce the PRAM and the PARBS. In Section 3, fast algorithms for simulating the 
CRCW PRAM on the PARBS are proposed. Finally, in Section 4, we conclude with 
some final remarks. 
2. The CRCW PRAM and the PARBS 
2.1. The CRCW PRAM
The CRCW PRAM can contain an arbitrary number of processors, each of which is
identified by a unique index. All the processors share an arbitrary-size common
memory and are allowed to read from or write into a memory location of the common
memory simultaneously. Each memory access requires one time unit. When more
than one processor simultaneously writes into a memory location, there are several
rules for conflict resolution [11]. Throughout this paper, we assume that the CRCW
PRAM obeys the dynamic priority conflict-resolution rule, in which processors are
assigned different priorities dynamically and the value written by the processor with
the highest priority is retained in the memory location after simultaneous writing. We
also assume that all processors execute the same instruction at any time, i.e., the single
instruction stream, multiple data streams (SIMD) model.
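As a minimal illustration of the dynamic priority rule described above, the following Python sketch resolves one concurrent write step. The function name `crcw_priority_write`, the tuple layout of a request, and the convention that a smaller priority number means a higher priority are assumptions made for this example (the last one matches the convention adopted later in the paper for the write simulation); priorities are assumed to be distinct.

```python
def crcw_priority_write(memory, requests):
    """Apply one concurrent write step under a priority rule.

    memory   -- list of cell contents (modified in place and returned)
    requests -- iterable of (address, priority, value) tuples, one per
                writing processor; a smaller priority number is assumed
                to mean a higher priority, and priorities are distinct
    """
    winners = {}  # address -> (priority, value) of the best request so far
    for address, priority, value in requests:
        best = winners.get(address)
        if best is None or priority < best[0]:
            winners[address] = (priority, value)
    # Only the highest-priority value is retained in each written cell.
    for address, (_, value) in winners.items():
        memory[address] = value
    return memory
```

For instance, if two processors write to cell 2 with priorities 3 and 1, only the value of the processor with priority 1 survives in that cell.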
2.2. The PARBS 
A 2-D $N_1 \times N_2$ PARBS consists of an $N_1 \times N_2$ array of processors arranged as a 2-D
grid and connected by a reconfigurable bus system. Processors of the $N_1 \times N_2$ array
are denoted by $P_{i,j}$, $i = 0, 1, \ldots, N_1 - 1$ and $j = 0, 1, \ldots, N_2 - 1$. Each processor has
four ports, denoted by $I^-$, $I^+$, $J^-$, and $J^+$. The processors are connected to the
reconfigurable bus system through the ports. In Fig. 1, we depict a 2-D $6 \times 4$ PARBS.
The configuration of the bus system is dynamically changeable by adjusting the local
connection between ports within each processor. For example, by connecting port $I^-$
to port $I^+$ within each processor, straight buses in the I-direction can be established to
connect the processors in the same row together (see Fig. 2). Each processor can
communicate with other processors by broadcasting values on the bus system. Each
broadcast requires one time unit. When more than one processor attempts to
broadcast values on the same bus simultaneously, a collision occurs and the
value received is unpredictable.
Fig. 1. A 2-D $6 \times 4$ PARBS.
Fig. 2. Straight buses in the I-direction are established by connecting port $I^-$ to $I^+$ within each processor.
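The broadcast model can be made concrete with a small sequential sketch. It assumes the straight-bus configuration of Fig. 2, so that all processors sharing the same index j sit on one bus; the function name, the `COLLISION` sentinel, and the dictionary-based interface are hypothetical choices for this example, not part of the paper.

```python
COLLISION = object()  # sentinel: the received value is unpredictable


def broadcast_on_straight_buses(n2, writers):
    """Simulate one broadcast step on straight buses in the I-direction.

    n2      -- number of buses (one per value of the index j)
    writers -- dict mapping (i, j) to the value processor P[i][j] broadcasts
    Returns a list `bus` of length n2: bus[j] is None if nobody wrote,
    the broadcast value if exactly one processor wrote, and COLLISION
    if several processors wrote on the same bus.
    """
    bus = [None] * n2
    count = [0] * n2
    for (i, j), value in writers.items():
        count[j] += 1
        bus[j] = value if count[j] == 1 else COLLISION
    return bus
```

The algorithms in the later sections are arranged so that at most one processor writes on each bus in any step, which is the situation in which this model returns a well-defined value.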
The construction of the 2-D PARBS can be easily extended to higher dimensions.
For example, in a 3-D $N_1 \times N_2 \times N_3$ PARBS, processors are denoted by $P_{i,j,k}$,
$i = 0, 1, \ldots, N_1 - 1$, $j = 0, 1, \ldots, N_2 - 1$, and $k = 0, 1, \ldots, N_3 - 1$, and each processor
has six ports, denoted by $I^-$, $I^+$, $J^-$, $J^+$, $K^-$, and $K^+$. For convenience of description,
in this paper, we use $P_{*,*,k}$, $0 \le k \le N_3 - 1$, to represent the set of processors $P_{i,j,k}$,
$i = 0, 1, \ldots, N_1 - 1$ and $j = 0, 1, \ldots, N_2 - 1$, of a 3-D $N_1 \times N_2 \times N_3$ PARBS. Indeed,
$P_{*,*,k}$ forms a 2-D $N_1 \times N_2$ sub-PARBS. Also, we define $P_{*,j,*}$ and $P_{i,*,*}$ similarly.
3. Simulating the CRCW PRAM on the PARBS 
In this section, fast algorithms for simulating the CRCW PRAM on the PARBS are 
proposed. 
Lemma 3.1. The maximum of $N$ values can be determined in O(1) time on a 2-D
$N \times N^{1/r}$ PARBS, where $r \ge 1$ is a constant. Initially, the $N$ values are stored in
processors $P_{i,0}$, $i = 0, 1, \ldots, N - 1$.
Proof. We partition the 2-D $N \times N^{1/r}$ PARBS into $N^{(r-1)/r}$ 2-D $N^{1/r} \times N^{1/r}$ sub-PARBSs.
Our algorithm consists of $r$ phases. In the first phase, we let each sub-PARBS
determine the maximum of the $N^{1/r}$ values stored in it. Using Miller et al.'s
O(1) time algorithm for finding the maximum [15], this phase requires O(1) time.
Then, we only need to determine the maximum of the remaining $N^{(r-1)/r}$ values. By
moving the remaining values to appropriate processors and then repeatedly applying
the same idea $r - 1$ times (in phases $2, 3, \ldots, r$), the maximum of the $N$ input values
can be determined. Since $r$ is a constant, O(1) time is required in total. □
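The phase structure of this proof can be traced with a short sequential sketch. It is only a model of the data flow, under two assumptions that the proof leaves open: $N$ is taken to be a perfect $r$-th power, and the surviving values are regrouped in consecutive blocks of $N^{1/r}$. The names `all_pairs_max` and `phased_max` are hypothetical, and the all-pairs comparison merely stands in for the constant-time maximum algorithm of [15].

```python
def all_pairs_max(values):
    """Maximum of k values using k*k comparisons.

    This mimics the constant-time step on a k x k sub-PARBS, where every
    pair of values can be compared in parallel.
    """
    k = len(values)
    for a in range(k):
        beaten = any(
            values[b] > values[a] or (values[b] == values[a] and b < a)
            for b in range(k)
        )
        if not beaten:  # values[a] wins every comparison
            return values[a]


def phased_max(values, r):
    """Reduce N = k**r values to their maximum in r phases of group size k."""
    n = len(values)
    k = round(n ** (1.0 / r))
    assert k ** r == n, "this sketch assumes N is a perfect r-th power"
    phases = 0
    while len(values) > 1:
        # One phase: every group of k survivors is reduced to its maximum.
        values = [all_pairs_max(values[s:s + k]) for s in range(0, len(values), k)]
        phases += 1
    return values[0], phases
```

With 64 values and r = 3, each phase works on groups of 4 values, and the number of survivors shrinks from 64 to 16 to 4 to 1 over three phases.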
Using Lemma 3.1 to improve Wang and Chen’s simulation in [20], the following 
result can be obtained immediately. 
Theorem 3.2. Any operation performed on a CRCW PRAM that obeys the dynamic
priority conflict-resolution rule can be simulated in O(1) time on a 2-D $N \times MN^{1/r}$
PARBS, where $N$ and $M$ are the numbers of processors and memory locations contained
in the PRAM, and $r \ge 1$ is a constant.
In Theorem 3.2, $O(MN^{1+1/r})$ processors are used to simulate the CRCW PRAM.
In the following, we show that the number of processors used can be reduced if
$N \le M$.
Lemma 3.3 (Chen and Chen [4], Jang and Prasanna [8] and Lin et al. [14]). Sorting
$N$ data items can be completed in O(1) time on a 2-D $N \times N$ PARBS.
Theorem 3.4. If $N \le M$, any operation performed on a CRCW PRAM that obeys the
dynamic priority conflict-resolution rule can be simulated in O(1) time on a 2-D $N \times M$
PARBS, where $N$ and $M$ are the numbers of processors and memory locations contained
in the PRAM.
Proof. At any time, the CRCW PRAM may perform a read operation, a write 
operation, or an arithmetic (logical) operation. Certainly, any arithmetic (logical) 
operation of the CRCW PRAM can be executed in O(1) time on the 2-D PARBS. It 
has been shown [20] that any read operation of the CRCW PRAM can be simulated 
in O(1) time on the 2-D PARBS. To complete the proof, we need only to show that 
any write operation of the CRCW PRAM can be simulated in O(1) time on the 2-D 
PARBS. With the help of Lemma 3.3, such a simulation can be obtained by implementing
the algorithm that will be proposed later for Case 2 in the proof of Theorem 3.11.
The implementation is not very difficult and is thus omitted here. □
In the following, we show that by adopting a 3-D PARBS, the number of processors 
used to simulate a CRCW PRAM can be further reduced. 
Lemma 3.5 (Chen and Chen [4]). Sorting $N$ data items can be completed in O(1) time on
a 3-D $N^{1/2} \times N^{1/2} \times N^{1/2}$ PARBS. Initially, the $N$ data items are stored in the 2-D
sub-PARBS $P_{*,*,0}$. After sorting, the sorted sequence is stored in row-major order on
the sub-PARBS.
Lemma 3.6 (Miller et al. [15]). Permutation routing of $N$ data items can be performed in
O(1) time on a 2-D $N \times N$ PARBS. Initially, the $N$ data items are stored in processors
$P_{i,0}$, $i = 0, 1, \ldots, N - 1$.
With a slight modification, we can easily obtain the following lemma from the
routing algorithm proposed in [15] for Lemma 3.6.
Lemma 3.7. On a 2-D $N_1 \times N_2$ PARBS, any subset $S$ of processors $P_{i,0}$,
$i = 0, 1, \ldots, N_1 - 1$, can transmit data items (one data item by each processor) to another
subset $D$ of processors $P_{0,j}$, $j = 0, 1, \ldots, N_2 - 1$, in O(1) time, if no two data items have
the same destination (note that $|D| = |S|$).
Lemma 3.8. Let $X = (x_0, x_1, \ldots, x_{N-1})$ be a sequence stored in row-major order in the
2-D sub-PARBS $P_{*,*,0}$ of a 3-D $N^{1/2} \times N^{1/2} \times N^{1/2}$ PARBS. In O(1) time, we can
rearrange the sequence such that it is stored in snake-like row-major order in the 2-D
sub-PARBS $P_{*,*,0}$.
Proof. The rearrangement can be completed by letting each 2-D sub-PARBS $P_{*,j,*}$,
$0 \le j \le N^{1/2} - 1$ and $j$ odd, reverse the sub-sequence $x_{jN^{1/2}}, x_{jN^{1/2}+1}, \ldots,
x_{(j+1)N^{1/2}-1}$. By Lemma 3.6, the above permutation routing requires only O(1)
time. □
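The effect of this rearrangement is easy to see in a sequential sketch: every odd-numbered row of the row-major layout is reversed, so that consecutive items of the sequence always end up in neighboring cells. The function name `row_major_to_snake` and the list-of-rows representation are assumptions made for this illustration; `side` plays the role of $N^{1/2}$.

```python
def row_major_to_snake(seq, side):
    """Lay out seq (length side*side) in snake-like row-major order.

    Returns a list of rows; row j holds the items of positions
    j*side .. (j+1)*side - 1, reversed when j is odd.
    """
    assert len(seq) == side * side
    rows = [list(seq[j * side:(j + 1) * side]) for j in range(side)]
    return [row[::-1] if j % 2 == 1 else row for j, row in enumerate(rows)]
```

For the sequence 0, 1, ..., 8 and side 3, the rows become [0, 1, 2], [5, 4, 3], [6, 7, 8], so items 2 and 3, and items 5 and 6, are vertical neighbors.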
Lemma 3.9. If $N \le M$, on a 3-D $N \times N \times M$ PARBS, any subset $S$ of processors $P_{i,j,0}$,
$i = 0, 1, \ldots, N - 1$ and $j = 0, 1, \ldots, N - 1$, can transmit data items to another subset
$D$ of processors $P_{0,u,v}$, $u = 0, 1, \ldots, N - 1$ and $v = 0, 1, \ldots, M - 1$, in O(1) time, if no
two data items have the same destination.
Proof. Suppose that each processor $P_{i,j,0} \in S$ attempts to transmit a value $d_{i,j}$ to the
processor with index $(0, u_{i,j}, v_{i,j})$. We prove this lemma by presenting an O(1) time
algorithm as follows. (We explain the algorithm by an example with $N = M = 3$,
$S = \{P_{0,0,0}, P_{1,0,0}, P_{2,0,0}, P_{1,1,0}, P_{0,2,0}, P_{2,2,0}\}$, and $((u_{0,0}, v_{0,0}), (u_{1,0}, v_{1,0}), (u_{2,0}, v_{2,0}),
(u_{1,1}, v_{1,1}), (u_{0,2}, v_{0,2}), (u_{2,2}, v_{2,2})) = ((0, 1), (2, 0), (2, 1), (2, 2), (1, 0), (1, 1))$.)
Step 1: Each processor $P_{i,j,0}$, $0 \le i, j \le N - 1$, generates a three-item record [DES_U,
DES_V, VAL], called a message-record, where
DES_U = $u_{i,j}$, DES_V = $v_{i,j}$, and VAL = $d_{i,j}$ if $P_{i,j,0} \in S$, and
DES_U = $N$, DES_V = $M$, and VAL = 0 otherwise.
The message-records generated by processors in $S$ are said to be active and the
others are said to be dummy (see Fig. 3(a)).
In the following steps, we will transmit each active message-record [DES_U,
DES_V, VAL] to the processor with index (0, DES_U, DES_V).
Step 2: Sort the $N^2$ message-records in nondecreasing order of their DES_V values.
After sorting, dummy records are discarded by the processors holding them
(see Fig. 3(b)). Since $M \ge N$, by Lemma 3.5, this step can be completed in O(1)
time.
Step 3: By establishing straight buses in the K-direction, each processor $P_{i,j,0}$, $0 \le i,
j \le N - 1$, that holds an active message-record [DES_U, DES_V, VAL]
transmits the message-record to processor $P_{i,j,k}$, where $k$ = DES_V, through the established
bus to which it is connected (see Fig. 3(c)).
Step 4: By establishing straight buses in the J-direction, each processor $P_{i,j,k}$, $0 \le i,
j \le N - 1$ and $0 \le k \le M - 1$, that holds an active message-record [DES_U,
DES_V, VAL] transmits the message-record to processor $P_{i,0,k}$ through the
established bus to which it is connected (see Fig. 3(d)). As we will see later in
Claim 3.10, after Step 3, for each $k$, $0 \le k \le M - 1$, on each column of the 2-D
sub-PARBS $P_{*,*,k}$, at most one processor holds an active message-record.
Thus, Step 4 is conflict-free (i.e., no two processors transmit message-records
through the same established bus in the J-direction).
After this step, each active message-record [DES_U, DES_V, VAL] is stored
in one of the processors in row 0 of the sub-PARBS $P_{*,*,k}$, where $k$ = DES_V.
Fig. 3. An illustrative example for Lemma 3.9: (a) after Step 1; (b) after Step 2; (c) after Step 3; (d) after Step 4; (e) after Step 5.
Step 5: On each 2-D $N \times N$ sub-PARBS $P_{*,*,k}$, $0 \le k \le M - 1$, each processor $P_{i,0,k}$
that holds an active message-record [DES_U, DES_V, VAL] transmits the
active message-record to processor $P_{0,u,k}$, where $u$ = DES_U (see Fig. 3(e)).
By Lemma 3.7, this requires O(1) time. □
Claim 3.10. After Step 3 of the algorithm proposed in the proof of Lemma 3.9, for each $k$,
$0 \le k \le M - 1$, on each column of the 2-D sub-PARBS $P_{*,*,k}$, at most one processor
holds an active message-record.
Proof. Note that each active message-record [DES_U, DES_V, VAL] has a unique
pair (DES_U, DES_V). Thus, for each $k$, $0 \le k \le M - 1$, there are at most $N$ active
message-records with DES_V = $k$, since DES_U takes at most $N$ distinct values. After Step 2, the active
message-records are sorted in nondecreasing order of their DES_V values. Since the
sorted sequence of active message-records is stored in row-major order, in rows of $N$ processors,
and there are at most $N$ active message-records with DES_V = $k$, we
can conclude that after Step 2 no two active message-records
with DES_V = $k$ are stored in the same column of $P_{*,*,0}$ (see Fig. 3(b)).
Step 3 moves each record along the K-direction without changing its column, so this fact establishes the claim. □
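The five routing steps of Lemma 3.9, together with the conflict check of Claim 3.10, can be traced with the following sequential sketch. It is a model of where the records travel, not a bus-level implementation: the sort of Step 2 is delegated to Python's built-in sort, and the function name `route_to_row_zero` and the dictionary interface are assumptions made for this example. The assertion in Step 4 fails if two records would ever share a J-direction bus.

```python
def route_to_row_zero(n, m, sources):
    """Sequentially mimic the five routing steps on an n x n x m PARBS.

    sources maps (i, j) -> (u, v, value): processor P[i][j][0] wants to
    send value to processor P[0][u][v]. Destinations must be distinct.
    Returns a dict mapping each destination (0, u, v) to its value.
    """
    dummy = (n, m, 0)
    # Step 1: every processor of layer k = 0 creates a message-record,
    # listed here in row-major order (position p = i + j*n).
    records = [sources.get((i, j), dummy) for j in range(n) for i in range(n)]
    # Step 2: sort by DES_V; dummies (DES_V = m) end up last and are dropped.
    records.sort(key=lambda rec: rec[1])
    layer0 = {(s % n, s // n): rec for s, rec in enumerate(records) if rec[1] < m}
    # Step 3: K-direction buses move each record to layer k = DES_V.
    after3 = {(i, j, rec[1]): rec for (i, j), rec in layer0.items()}
    # Step 4: J-direction buses move each record to P[i][0][k].
    after4 = {}
    for (i, j, k), rec in after3.items():
        assert (i, k) not in after4, "two records share a J-direction bus"
        after4[(i, k)] = rec
    # Step 5: within each layer, route from P[i][0][k] to P[0][u][k].
    delivered = {}
    for (i, k), (u, v, value) in after4.items():
        assert (0, u, k) not in delivered, "duplicate destination"
        delivered[(0, u, k)] = value
    return delivered
```

A small hypothetical instance with n = m = 3 and six senders, for example `{(0, 0): (0, 1, "d00"), (1, 0): (2, 0, "d10"), ...}`, passes through all five steps without triggering either assertion as long as the destinations are distinct.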
Theorem 3.11. If $N \le M$, any operation performed on a CRCW PRAM that obeys the
dynamic priority conflict-resolution rule can be simulated in O(1) time on a 3-D
$N^{1/2} \times N^{1/2} \times (M/N^{1/2})$ PARBS, where $N$ and $M$ are the numbers of processors and
memory locations contained in the PRAM.
Proof. Denote by $A_p$, $p = 0, 1, \ldots, N - 1$, and $B_q$, $q = 0, 1, \ldots, M - 1$, the $N$ processors
and $M$ memory locations of the CRCW PRAM. We let each $P_{i,j,0}$, $0 \le i,
j \le N^{1/2} - 1$, act for $A_p$, where $p = i + jN^{1/2}$, and a memory location in $P_{0,u,v}$,
$0 \le u \le N^{1/2} - 1$ and $0 \le v \le (M/N^{1/2}) - 1$, act for $B_q$, where $q = u + vN^{1/2}$. In the
following, two cases are discussed with respect to read and write operations, respectively.
Case 1: Read operation. Suppose the memory location read by $A_p$, $0 \le p \le N - 1$,
is $B_{h(p)}$, $0 \le h(p) \le M - 1$. An O(1) time algorithm for simulating the read operation is
as follows. (We explain the algorithm by an example with $N = M = 9$ and
(h(O), h(l), . . . , h(8)) = (2, 1,6,2,2, 4 29% 2)). 
Step 1: Each processor $P_{i,j,0}$, $0 \le i, j \le N^{1/2} - 1$, generates a two-item record [DES,
SRC], called a quest-record, where DES = $h(p)$, SRC = $p$, and $p = i + jN^{1/2}$
(see Fig. 4(a)).
Step 2: Sort the $N$ quest-records in increasing order. The major and secondary keys for
sorting are DES and SRC, respectively. Then, rearrange the sorted quest-records
(which are currently stored in row-major order in the 2-D sub-PARBS
$P_{*,*,0}$) such that they are stored in snake-like row-major order in the 2-D
sub-PARBS $P_{*,*,0}$ (see Fig. 4(b)).
By Lemmas 3.5 and 3.8, this step requires O(1) time. Let $QR_s$
(= $[DES_s, SRC_s]$), $0 \le s \le N - 1$, denote the $s$th smallest quest-record.
After Step 2, $QR_s$ and $QR_{s+1}$, $0 \le s \le N - 2$, are stored in two neighboring
processors.
Fig. 4. An illustrative example for Case 1 in the proof of Theorem 3.11: (a) after Step 1; (b) after Step 2; (c) Step 3; (d) after Step 5; (e) Step 6; (f) after Step 6.
Step 3: For each $s$, $0 \le s \le N - 2$, the processor holding $QR_s$ transmits one copy of
$DES_s$ to the processor holding $QR_{s+1}$ (see Fig. 4(c)). Then, the processor
holding $QR_{s+1}$ sets a flag $f_{s+1}$ to 1 if $DES_s \ne DES_{s+1}$, and to 0 otherwise. The
processor holding $QR_0$ sets its flag $f_0$ to 1.
Note that $f_s = 1$, $0 \le s \le N - 1$, iff $QR_s$ was generated (in Step 1) by the
processor with the smallest index $p$ among those processors that attempt to read
the value of $B_u$, where $u = DES_s$, simultaneously.
Step 4: For each $s$, $0 \le s \le N - 1$, the processor holding $QR_s$ transmits its i- and
j-index to processor $P_{0,u,v}$ if $f_s = 1$, where $u = (DES_s \bmod N^{1/2})$ and
$v = (DES_s \operatorname{div} N^{1/2})$.
By Lemma 3.9, this step can be done in O(1) time.
Step 5: For each pair $(j, k)$, $0 \le j \le N^{1/2} - 1$ and $0 \le k \le (M/N^{1/2}) - 1$, if processor
$P_{0,j,k}$ has received an index pair $(x, y)$ in Step 4, then processor $P_{0,j,k}$ transmits
the content of $B_q$ to processor $P_{x,y,0}$, where $q = j + kN^{1/2}$ (see Fig. 4(d)).
By reversing the execution of the data routing in Step 4, this step can be done
in O(1) time. Note that after Step 5, for each $s$, $0 \le s \le N - 1$ and $f_s = 1$, the
processor holding $QR_s$ has one copy of $B_q$, where $q = DES_s$.
Step 6: For each $s$, $0 \le s \le N - 1$ and $f_s = 1$, the processor holding $QR_s$ broadcasts the
content of $B_q$, where $q = DES_s$, to all other processors that hold $QR_t$'s with
$DES_t = DES_s$, where $s < t \le N - 1$ and $f_t = 0$.
6.1. Establish a snake-like bus to connect all processors of the 2-D
sub-PARBS $P_{*,*,0}$ together. Then, split the snake-like bus into several
sub-buses by letting each processor with $f_s = 1$ disconnect its
local connection (see Fig. 4(e)). Note that after the splitting, processors
holding quest-records with the same DES are connected to the same
sub-bus.
6.2. Through the sub-buses, each processor with $f_s = 1$ broadcasts the content
of $B_q$, where $q = DES_s$, to all other processors that hold $QR_t$'s with
$DES_t = DES_s$.
After this step, for each $s$, $0 \le s \le N - 1$, the processor holding $QR_s$ has one
copy of $B_q$, where $q = DES_s$ (see Fig. 4(f)).
Step 7: For each $s$, $0 \le s \le N - 1$, the processor holding $QR_s$ generates a two-item
response-record [R_DES, VAL], in which R_DES = $SRC_s$ and VAL = (the
content of $B_q$), where $q = DES_s$.
Step 8: Sort the $N$ response-records in increasing order of their R_DES values.
After sorting, each processor $P_{i,j,0}$, $0 \le i, j \le N^{1/2} - 1$, has one copy of $B_{h(p)}$, which
is contained in the response-record stored in it, where $p = i + jN^{1/2}$.
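The logic of the read simulation (sort the requests, let one representative per memory cell fetch the value, spread it along the run of equal requests, and sort the answers back) can be checked with a sequential sketch. The sketch deliberately ignores the geometry (snake layout, bus splitting, and the routing of Lemma 3.9) and only follows the records; the function name `simulate_read` and the list-based interface are assumptions made for this example.

```python
def simulate_read(memory, h):
    """Return the value each PRAM processor p reads from memory[h[p]]."""
    n = len(h)
    # Steps 1-2: quest-records [DES, SRC], sorted by DES and then SRC.
    quests = sorted((h[p], p) for p in range(n))
    # Step 3: f[s] is True for the first record of each run of equal DES.
    f = [s == 0 or quests[s - 1][0] != quests[s][0] for s in range(n)]
    # Steps 4-5: only flagged records fetch the memory cell, so each
    # distinct cell is accessed once.
    got = [memory[des] if f[s] else None for s, (des, _) in enumerate(quests)]
    # Step 6: the fetched value is passed along the run (the sub-bus).
    for s in range(1, n):
        if not f[s]:
            got[s] = got[s - 1]
    # Steps 7-8: response-records [R_DES, VAL], sorted by R_DES.
    responses = sorted(((src, got[s]) for s, (_, src) in enumerate(quests)),
                       key=lambda rec: rec[0])
    return [val for _, val in responses]
```

For any address list `h`, the result should equal `[memory[a] for a in h]`, which is what a direct concurrent read would return.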
Case 2: Write operation. Suppose $A_p$, $0 \le p \le N - 1$, attempts to write a value $d_p$
into the memory location $B_{h(p)}$, $0 \le h(p) \le M - 1$, and the priority of $A_p$ is $g_p$. For
convenience of description, we assume that the priority of $A_p$ is higher than that of
$A_{p'}$ if $g_p < g_{p'}$.
Step 1: Each processor $P_{i,j,0}$, $0 \le i, j \le N^{1/2} - 1$, generates a three-item record [DES,
PRI, VAL], called an update-record, where DES = $h(p)$, PRI = $g_p$, VAL = $d_p$,
and $p = i + jN^{1/2}$.
Step 2: Sort the $N$ update-records in increasing order. The major and secondary keys for
sorting are DES and PRI, respectively. Then, rearrange the sorted update-records
such that they are stored in snake-like row-major order in the 2-D
sub-PARBS $P_{*,*,0}$.
Step 3: Let $UR_s$ (= $[DES_s, PRI_s, VAL_s]$), $0 \le s \le N - 1$, denote the $s$th smallest
update-record. For each $s$, $0 \le s \le N - 2$, the processor holding $UR_s$ transmits
one copy of $DES_s$ to the processor holding $UR_{s+1}$. Then, the processor
holding $UR_{s+1}$ sets a flag $f_{s+1}$ to 1 if $DES_s \ne DES_{s+1}$, and to 0 otherwise. The
processor holding $UR_0$ sets its flag $f_0$ to 1.
Note that $f_s = 1$, $0 \le s \le N - 1$, iff $UR_s$ was generated by the processor with
the highest priority among those processors that attempt to write values into $B_u$,
where $u = DES_s$, simultaneously.
Step 4: For each $s$, $0 \le s \le N - 1$, the processor holding $UR_s$ transmits $VAL_s$
to processor $P_{0,u,v}$ if $f_s = 1$, where $u = (DES_s \bmod N^{1/2})$ and
$v = (DES_s \operatorname{div} N^{1/2})$.
Step 5: For each pair $(j, k)$, if processor $P_{0,j,k}$ has received a value in Step 4, then the
received value is taken as the new content of $B_q$, where $q = j + kN^{1/2}$. □
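The write simulation follows the same sort-and-flag pattern, which the following sequential sketch makes explicit. As before, it models only the flow of records and not the buses; the function name `simulate_write` and the list-based interface are assumptions made for this example, and priorities are assumed to be distinct, with a smaller number meaning a higher priority.

```python
def simulate_write(memory, h, g, d):
    """Apply one concurrent write: processor p writes d[p] to memory[h[p]]
    with priority g[p] (smaller number = higher priority)."""
    n = len(h)
    # Steps 1-2: update-records [DES, PRI, VAL], sorted by DES and then PRI.
    updates = sorted(((h[p], g[p], d[p]) for p in range(n)),
                     key=lambda rec: (rec[0], rec[1]))
    # Step 3: f[s] is True for the first record of each run of equal DES,
    # i.e. the highest-priority writer for that memory location.
    f = [s == 0 or updates[s - 1][0] != updates[s][0] for s in range(n)]
    # Steps 4-5: only flagged records deliver their value to the cell.
    for s, (des, _, val) in enumerate(updates):
        if f[s]:
            memory[des] = val
    return memory
```

Because the records are sorted by priority within each destination, the flagged record of a run is exactly the one the dynamic priority rule would retain.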
Since sorting $N$ data items can be done in O(1) time on an $r$-D
$N^{1/(r-1)} \times N^{1/(r-1)} \times \cdots \times N^{1/(r-1)}$ PARBS [4], where $r \ge 2$ is a constant, we can
further extend the algorithms proposed in Theorem 3.11 to algorithms on PARBSs of
higher dimensions. We leave the extension as an exercise for interested readers and
summarize the result in the following theorem.
Theorem 3.12. If $N \le M$, any operation performed on a CRCW PRAM that obeys the
dynamic priority conflict-resolution rule can be simulated in O(1) time on an $r$-D
$N^{1/(r-1)} \times \cdots \times N^{1/(r-1)} \times (M/N^{(r-2)/(r-1)})$ PARBS, where $N$ and $M$ are the numbers of
processors and memory locations contained in the PRAM, and $r \ge 2$ is a constant.
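As a quick check on the size of this network, the number of processors can be worked out directly from the stated dimensions, assuming (as the statement suggests) that the first $r-1$ sides all have length $N^{1/(r-1)}$:

```latex
\[
\underbrace{N^{1/(r-1)} \times \cdots \times N^{1/(r-1)}}_{r-1\ \text{factors}}
\times \frac{M}{N^{(r-2)/(r-1)}}
\;=\; N \cdot \frac{M}{N^{(r-2)/(r-1)}}
\;=\; M\,N^{1/(r-1)} .
\]
```

For $r = 2$ this gives $MN$ processors, matching the $N \times M$ PARBS of Theorem 3.4, and for $r = 3$ it gives $MN^{1/2}$, matching the $N^{1/2} \times N^{1/2} \times (M/N^{1/2})$ PARBS of Theorem 3.11.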
4. Concluding remarks 
Let us consider the case in which $M$ is much smaller than $N$. A critical step of the simulation
algorithm proposed by Wang and Chen [20] is to determine all of the maxima of
$M$ sets, where each set consists of $N$ values. Since the maximum of $N$ values can be
found in $O(\log N)$ time on a 1-D PARBS of $N$ processors [15], using Wang and
Chen's result, an $O(\log N)$ time simulation algorithm on a 2-D $N \times M$ PARBS can be
obtained trivially. It has been shown that the maximum of $N$ values can be found in
$O(\log \log N)$ time on a 2-D $N^{1/2} \times N^{1/2}$ PARBS [15]. The question of whether each
operation performed on a CRCW PRAM with $N$ processors and $M$ memory locations,
where $M$ is much smaller than $N$, can be simulated in $O(\log \log N)$ time (or in less
than $O(\log N)$ time) on a 2-D $N \times M$ PARBS is still open.
References 
[1] S.G. Akl, The Design and Analysis of Parallel Algorithms, Prentice-Hall, Englewood Cliffs, NJ, 1989.
[2] H. Alt, T. Hagerup, K. Mehlhorn, F.P. Preparata, Simulation of idealized parallel computers on more
realistic ones, SIAM J. Comput. 16 (5) (1987) 808-835.
[3] D.M. Champion, J. Rothstein, Immediate parallel solution of the longest common subsequence
problem, in: Proc. Internat. Conf. on Parallel Processing, 1987, pp. 70-77.
[4] Y.C. Chen, W.T. Chen, Constant time sorting on reconfigurable meshes, IEEE Trans. Comput. C-43
(6) (1994) 749-751.
[5] K.T. Herley, Efficient simulations of small shared memories on bounded degree networks, in: Proc.
30th Ann. Symp. on Foundations of Computer Science, 1989, pp. 390-395.
[6] K.T. Herley, G. Bilardi, Deterministic simulations of PRAMs on bounded degree networks, in: Proc.
26th Ann. Allerton Conf. on Communication, Control and Computation, 1988, pp. 1084-1093.
[7] J. Ja'Ja', An Introduction to Parallel Algorithms, Addison-Wesley, Reading, MA, 1992.
[8] J.W. Jang, V.K. Prasanna, An optimal sorting algorithm on reconfigurable mesh, in: Proc. Internat.
Parallel Processing Symp., Beverly Hills, CA, 1992, pp. 130-137.
[9] J.F. Jeng, S. Sahni, Reconfigurable mesh algorithms for image shrinking, expanding, clustering,
and template matching, in: Proc. 5th Internat. Parallel Processing Symp., Anaheim, CA, 1991, pp.
208-275.
[10] A. Karlin, E. Upfal, Parallel hashing: an efficient implementation of shared memory, SIAM J.
Comput. 53(4) (1988) 876-892.
[11] L. Kucera, Parallel computation and conflicts in memory access, Inform. Process. Lett. 14 (1982)
93-96.
[12] H. Li, M. Maresca, Polymorphic-torus network, IEEE Trans. Comput. C-38 (9) (1989) 1345-1351.
[13] R. Lin, S. Olariu, Reconfigurable buses with shift switching: concepts and applications, IEEE Trans.
Parallel and Distributed Systems 6 (1995) 93-102.
[14] R. Lin, S. Olariu, J.L. Schwing, J. Zhang, Sorting in O(1) time on an n x n reconfigurable mesh, in:
Proc. European Workshop on Parallel Computing, 1992.
[15] R. Miller, V.K. Prasanna Kumar, D.I. Reisis, Q.F. Stout, Parallel computations on reconfigurable
meshes, IEEE Trans. Comput. 42 (6) (1993) 678-692.
[16] K. Nakano, A bibliography of published papers on dynamically reconfigurable architectures, Parallel
Process. Lett. 5 (1995) 111-124.
[17] K. Nakano, T. Masuzawa, N. Tokura, A sub-logarithmic time sorting algorithm on a reconfigurable
array, IEICE Trans. (Japan) E-74 (11) (1991) 3894-3901.
[18] A.G. Ranade, How to emulate shared-memory, in: Proc. 28th Ann. Sympos. on Foundations of
Computer Science, 1987, pp. 185-192.
[19] B.F. Wang, G.H. Chen, Constant time algorithms for the transitive closure and some related graph
problems on processor arrays with reconfigurable bus systems, IEEE Trans. Parallel and Distributed
Systems 1 (4) (1990) 500-507.
[20] B.F. Wang, G.H. Chen, Two-dimensional processor array with a reconfigurable bus system is at least
as powerful as CRCW model, Inform. Process. Lett. 36 (1) (1990) 31-36.
