Simulating Shared Memory in Real Time: On the Computation Power of Reconfigurable Architectures  by Czumaj, Artur et al.
Information and Computation 137, 103–120 (1997)
Simulating Shared Memory in Real Time:
On the Computation Power of
Reconfigurable Architectures*, †
Artur Czumaj and Friedhelm Meyer auf der Heide
Heinz Nixdorf Institute and Department of Computer Science, University of Paderborn,
D-33102 Paderborn, Germany
E-mail: {artur, fmadh}@uni-paderborn.de
and
Volker Stemann‡
International Computer Science Institute, Berkeley, California 94704-1198
E-mail: stemann@icsi.berkeley.edu
We consider randomized simulations of shared memory on a distributed
memory machine (DMM) where the n processors and the n memory
modules of the DMM are connected via a reconfigurable architecture. We
first present a randomized simulation of a CRCW PRAM on a reconfigurable
DMM having a complete reconfigurable interconnection. It
guarantees delay O(log* n), with high probability. Next we study a
reconfigurable mesh DMM (RM-DMM). Here the n processors and n
modules are connected via an $n \times n$ reconfigurable mesh. It was already
known that an $n \times m$ reconfigurable mesh can simulate in constant time
an n-processor CRCW PRAM with shared memory of size m. In this paper
we present a randomized step-by-step simulation of a CRCW PRAM with
arbitrarily large shared memory on an RM-DMM. It guarantees constant
delay with high probability, i.e., it simulates in real time. Finally we prove
a lower bound showing that size $\Omega(n^2)$ for the reconfigurable mesh is
necessary for real-time simulations. © 1997 Academic Press
article no. IC972642
103 0890-540197 25.00
Copyright © 1997 by Academic Press
All rights of reproduction in any form reserved.
* Supported by DFG-Graduiertenkolleg “Parallele Rechnernetzwerke in der Produktionstechnik,”
ME 872/4-1, by DFG-Sonderforschungsbereich 376 “Massive Parallelität: Algorithmen,
Entwurfsmethoden, Anwendungen,” and by DFG Leibniz Grant Me 872/6-1.
† This paper is a significantly revised version of results that appeared in (Czumaj et al., 1995a, 1995b).
‡ Work done at the University of Paderborn.
1. INTRODUCTION
The parallel random access machine (PRAM) is an idealized model for parallel
computation. It strips away problems that result from synchronization, latency,
memory contention, communication capacity, and reliability. This makes it very
convenient for designing parallel algorithms, because the programmer does not have
to deal with hardware limitations. On the other hand, it is very unrealistic from a
technological point of view. In this paper we deal with the issue of efficiently
realizing the shared memory of the PRAM. With current technology a parallel shared
memory can only be realized for a small number of processors. A more realistic
parallel computation model is the distributed memory machine (DMM). Here the
memory is distributed among a limited number of memory modules, and the
processors and memory modules are connected via a routing interconnection
network. In this paper we study DMMs with n processors and n modules.
In an effort to understand the relative power of the PRAM compared with other
parallel computation models several authors described simulations between them
(Upfal, 1984; Karlin and Upfal, 1986; Wang and Chen, 1990; Ranade, 1991;
Leighton, 1992a, 1992b; Karp et al., 1996; Dietzfelbinger and Meyer auf der Heide,
1993; Meyer auf der Heide et al., 1996; Czumaj et al., 1995d). For example, it is
known that an n-processor PRAM can be simulated (with high probability) with
O(log n) delay on butterfly networks (Ranade, 1991), with O(log log n) delay on
the optical communication parallel computer (Goldberg et al., 1994) and with
O(log log log n log* n) delay on the DMM with a complete interconnection network
between processors and modules (Czumaj et al., 1995c, 1995d).
In recent years interest in parallel computation models based on reconfigurable
architectures has rapidly grown. As very powerful and physically realizable models
of parallel computation (see (Li and Stout, 1991; ElGindy and Prasanna, 1995))
reconfigurable networks have become the object of extensive research (Wang and
Chen, 1990; Li and Stout, 1991; Olariu et al., 1993; Ben-Asher et al., 1995).
Many fundamental operations and problems on this model, especially on the
reconfigurable mesh, have been considered, such as data reduction, ranking, sorting, and
parity. Ben-Asher et al. (1995) studied the parallel complexity of reconfigurable
network models. They examined the computational power of such models by
focusing on the set of problems computable in constant time on some variants of
the model.
In this paper we investigate relations between the PRAM and DMMs with
reconfigurable networks as routing mechanisms. All results stated in the following
are randomized and hold with high probability (w.h.p.), i.e., with probability at least
$1 - n^{-\alpha}$ for any constant $\alpha > 1$. We focus on simulations that minimize the delay,
i.e., the time needed to simulate a parallel memory access of a PRAM on a DMM.
Furthermore, we are interested in optimal simulations. We say a simulation of a
p-processor PRAM on an n-processor DMM is time-processor optimal if the delay
is O(p/n).
The first, very powerful model we analyze is the reconfigurable DMM (R-DMM),
where the routing network is a complete reconfigurable network. This model can be
viewed as a DMM where the processors and modules are connected by a complete
graph (the so-called Standard-DMM) with the additional facility of combining links
into buses. In each step of the R-DMM each processor can combine two adjacent
links into one and then read from or write into this new link. This defines paths
and cycles that form buses, which can be used for broadcasting.
A further step to achieve a more realistic model is to assume, instead of a
complete network, a reconfigurable mesh as a routing network. (See Li and Stout
(1991) for more motivation behind this model of the reconfigurable mesh). The
model of the reconfigurable mesh DMM (RM-DMM) takes into account the issue
of memory contention and assumes a technologically feasible interconnection
network between processors and modules. The interconnection network of the
n-processor RM-DMM is formed by an $n \times n$ reconfigurable mesh. In a reconfigurable
mesh each node can, in each step, combine pairs of adjacent links such that the
combined links create buses. As in the R-DMM, each processor can read from or
write to adjacent buses and use them for broadcasting. Wang and Chen (1990)
presented a deterministic simulation, with constant delay, of a CRCW PRAM with n
processors and m shared memory cells on an $m \times n$ processor array with a
reconfigurable bus system. This result requires a large hardware overhead to
simulate a large shared memory. We note, however, that it makes it possible to
simulate an n-processor Standard-DMM deterministically on an n-processor
RM-DMM with constant delay.
Very recently, independently of our work, Matias and Schuster (1995) presented
a PRAM simulation on an n-processor variant of an RM-DMM with
O(log log log n) delay using a result from (Czumaj et al., 1995d). They assume a
weaker collision resolution rule for concurrent access of processors to a bus than
we do in this paper.
1.1. Outline of Results
Our PRAM simulations follow the idea of hashing the shared memory of the
PRAM into the modules of the DMM, as in (Karp et al., 1996; Dietzfelbinger and
Meyer auf der Heide, 1993; Goldberg et al., 1994; MacKenzie et al., 1994). One or
more copies of the shared memory cells are distributed among the memory modules
using a constant number of hash functions. The hash functions are chosen
uniformly at random from a high performance log2 n-universal class of hash
functions (see, e.g., (Siegel, 1989; Karp et al., 1996; Czumaj et al., 1995d)). To
achieve a consistent simulation, the majority technique due to Upfal and Wigderson
(1987) is used. It ensures that it suffices to always access a majority of the copies of
a key to get a consistent shared memory simulation.
Our first result is a simulation of an n-processor CRCW PRAM on an
n-processor reconfigurable DMM (R-DMM) with O(log* n) delay, w.h.p. This
result compares favorably with the best known PRAM simulation on the
Standard-DMM, which has delay O(log log log n log* n), w.h.p. (Czumaj et al.,
1995d). The simulation by an R-DMM can be made time-processor optimal for
EREW PRAMs.
Our second result shows that an n-processor RM-DMM is as powerful as an
n-processor CRCW PRAM. More precisely we present a step by step simulation
that performs in real time, i.e., guarantees constant delay for each simulated PRAM
step, w.h.p. Hence we combine the advantages of the simulations of Wang and
Chen (1990) and Czumaj et al. (1995d) and significantly improve the result of
Matias and Schuster (1995). The main idea of the simulation is to transfer the
O(log* n)-delay simulation on the R-DMM to the RM-DMM and then redesign all
non-constant-time steps.
The paper is organized as follows. In Section 2 we proceed with the precise
definitions of the computation models. Section 3 presents the general idea of
hashing based PRAM simulations and states two graph theoretic lemmas from
(Czumaj et al., 1995d) that are the basis for the analysis of our algorithms.
Section 4 gives two algorithmic tools used in the simulations. In Section 5 we
present a simulation of a CRCW PRAM on an R-DMM. Finally, Section 6
contains the real-time simulation of a CRCW PRAM on an RM-DMM and shows
the optimality of this result.
2. COMPUTATION MODELS
A parallel random access machine (PRAM) consists of p processors $P_1, \dots, P_p$ and
a shared memory with cells $U = \{1, \dots, m\}$. The processors work synchronously and
have random access to the shared memory cells, each of which can store an integer.
We consider two models of the PRAM, an exclusive read exclusive write (EREW)
PRAM, in which concurrent reads and writes are forbidden, and a concurrent read
concurrent write (CRCW) PRAM, which allows concurrent reads and writes.
Among many variants of the CRCW PRAM model, we only deal with two variants
for solving conflicts if several processors want to write to the same shared memory
cell simultaneously, the Priority CRCW PRAM, in which the processor with the
highest priority succeeds, and the weaker Arbitrary CRCW PRAM, where an
arbitrary processor succeeds.
A distributed memory machine (DMM) has n processors $Q_1, \dots, Q_n$ connected via
a routing network with a distributed memory consisting of n memory modules
$M_1, \dots, M_n$ (see Fig. 1). A module has a communication window and can read from
or write into its window. From the point of view of the processors, a window acts
like a shared memory cell, where concurrent accesses are allowed. If more than one
processor wants to access the same module simultaneously then an arbitrary one
succeeds. This is the same conflict resolution rule mentioned above for the
Arbitrary CRCW PRAM.
FIG. 1. Generic model of the distributed memory machine.
If we specify the routing network we can distinguish between several models. If
the routing network is a complete network we call the model the Standard-DMM
as introduced by Karp et al. (1996). Note that an n-processor Standard-DMM
can be simulated with constant delay on an n-processor Arbitrary CRCW PRAM
with O(n) shared memory cells and vice versa.
By adding the capability of reconfiguration to the complete network as routing
network, we get the reconfigurable distributed memory machine (R-DMM). Roughly
speaking, the capability of reconfiguration allows a processor to combine two
adjacent links to other processors into a bus. Because in the bipartite graph there
are no direct links between the processors we identify processor $Q_i$ with module $M_i$,
for $i = 1, \dots, n$. Hence, the complete bipartite graph between the processors and
modules can be viewed as a complete network connecting the processor-module
pairs. Thus, we can view each processor as having a link to all other processors.
Each processor $Q_l$ can combine a link to $Q_k$ with a link to $Q_m$ into a bus. These
combined links are viewed as (hardware) connected. Hence, such combined links
are building blocks for larger bus components. These buses are restricted to
node-disjoint cycles or paths (see Fig. 2).
The R-DMM dynamically reconfigures itself at each time step. Each processor of
the R-DMM acts locally in each step, combining two adjacent links into one. In each
step of the R-DMM one or more processors connected by a bus can try to transmit
a message on the bus. If more than one does, an arbitrary one succeeds. This is the
same Arbitrary write conflict resolution rule as described for the Standard-DMM.
All processors connected by the bus can read the message transmitted on
the bus. Clearly, this means that it is possible to broadcast information in one step
to more than one processor. The basic assumption concerning the behavior of the
reconfigurable model is that the time to transmit a message along any bus is
constant, regardless of the length of the bus.
FIG. 2. The R-DMM with a bus established between the first, the third, and the eighth processor.
FIG. 3. A 4-processor reconfigurable mesh DMM.
If we use an $n \times n$ reconfigurable mesh for the topology of the routing network,
we get the n-processor reconfigurable mesh DMM (RM-DMM). The reconfigurable
mesh (Li and Stout, 1991) consists of a two-dimensional mesh in an $n \times n$ square
grid with one switch per grid point and a reconfigurable bus system. Each switch is
connected to the reconfigurable bus system through four ports, denoted by N, S, W,
and E. The configuration of the bus system can be changed by connecting different
pairs of ports within each switch. Hence, the global reconfiguration is a partition
of the network into edge-disjoint paths and cycles. The computational power of a
switch is very limited. It can store one integer and can perform only very simple
computations: basic operations on two numbers, change of connections between
ports, read from or write to the bus it is connected with.
In the RM-DMM, the n processors are assigned to the first column and the n
memory modules to the first row of the mesh. An example of a 4-processor
RM-DMM is shown in Fig. 3. Each processor and each switch can communicate
with other switches and processors by broadcasting a message through the bus. All
processors and switches connected with the bus can read the message. If more than
one tries to send a message on a bus, an arbitrary one of them succeeds. Again, this
is the same Arbitrary write conflict resolution rule as described for the Standard-DMM.
3. SIMULATION TECHNIQUES
Our description is based on the approach presented by Czumaj et al. (1995d). We
consider shared memory simulations on a DMM that are based on hashing. In a
preprocessing phase each processor $P_i$ of the PRAM is mapped to processor $Q_i$ of
the DMM. The memory of the PRAM is hashed using three hash functions
$h_1, h_2, h_3 : U \to \{1, \dots, n\}$. Each memory cell $u \in U$ of the PRAM (we say key for
short) will be stored in the modules $M_{h_1(u)}$, $M_{h_2(u)}$, and $M_{h_3(u)}$ of the DMM. We
will call the representations of u in the $M_{h_i(u)}$'s the copies of u.
A class of hash functions $H$ mapping $U$ into $\{1, \dots, n\}$ is k-universal (Carter and
Wegman, 1979) if, for each $u_1 < \cdots < u_j \in U$, $l_1, \dots, l_j \in [n]$, $j \le k$, and a hash
function h drawn with uniform probability from $H_{m,n}$,
$$\Pr(h(u_1) = l_1, \dots, h(u_j) = l_j) \le \frac{2}{n^j}.$$
For our purposes we require a log2 n-universal class of hash functions H, such that
a random $h \in H$ can be constructed fast, stored using little space, and evaluated in
constant time. For example, we can use a $\sqrt{n}$-universal class of hash functions
described by Siegel (1989), or a class developed by Karp et al. (1996).
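For intuition only, a k-universal-style family can be sketched with the classical polynomial construction: a random polynomial of degree less than k over a prime field, reduced to the range {1, ..., n}. This is not the constant-time class of Siegel (1989) required above (evaluation here takes O(k) time), and the names `PRIME` and `make_poly_hash` are assumptions made for this illustration. The sketch also assumes every key is smaller than `PRIME` and that `PRIME` is much larger than k times n.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime 2^61 - 1; keys are assumed to be smaller


def make_poly_hash(k, n, rng=random):
    """Return h mapping integer keys to {1, ..., n}.

    h is a random polynomial of degree < k over Z_PRIME, reduced modulo n.
    """
    coeffs = [rng.randrange(PRIME) for _ in range(k)]

    def h(u):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * u + c) % PRIME
        return acc % n + 1          # shift from {0..n-1} to {1..n}

    return h
```

Three independent calls to `make_poly_hash` would play the role of the three hash functions used for the copies of a key.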
For the simulation of a PRAM step we use the majority technique due to Upfal
and Wigderson (1987). It ensures that it suffices to access an arbitrary two of the
three copies of a shared memory cell to guarantee a correct simulation. Each copy of
a key contains a time stamp indicating the update time. To write to a memory cell
a processor of the DMM accesses at least two of the copies, updates them and adds
a time stamp to them indicating the (PRAM-) time of the update. To read a
memory cell a processor has to access two of the copies. This guarantees that at
least one up-to-date copy is accessed. It can be recognized by its time stamp.
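A minimal sequential sketch of the majority technique may help here. It is an illustration only; the names `Copy`, `write`, and `read` are assumptions and do not come from the paper. Each cell keeps three time-stamped copies, a write updates any two of them, and a read inspects any two and trusts the newer stamp.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    value: int = 0
    stamp: int = -1  # PRAM time of the last update; -1 means never written


def write(copies, which, value, now):
    """Update the copies indexed by `which` (any two of the three)."""
    assert len(set(which)) >= 2
    for i in which:
        copies[i] = Copy(value, now)


def read(copies, which):
    """Inspect any two copies and return the value with the newest stamp."""
    assert len(set(which)) >= 2
    return max((copies[i] for i in which), key=lambda c: c.stamp).value


cell = [Copy(), Copy(), Copy()]
write(cell, (0, 1), 7, now=1)   # first update reaches copies 0 and 1
write(cell, (1, 2), 9, now=2)   # second update reaches copies 1 and 2
print(read(cell, (0, 2)))       # copy 2 carries stamp 2, so this prints 9
```

Any two index sets of size two out of three intersect, which is why every read sees at least one copy from the most recent write.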
We modify this two out of three idea and split the schedule into three steps of
trying to access one out of two copies with a different pair of hash functions in each
step. In this way we always access at least two different copies out of the three
possible. Using the majority technique we get a consistent simulation. Therefore in
the following we will focus on the analysis of accessing one out of two possible
copies of a shared memory cell, i.e., an access schedule that uses two hash functions
h1 and h2 . Let us call such a schedule a one-out-of-two-schedule.
For technical reasons, we do not perform all n accesses to the shared memory
simultaneously but split the requests into batches of size $n/2^{2c+6}$, for some constant
$c \ge 1$ to be specified later. Since we only have a constant number of batches, this
will slow down our algorithm only by a constant factor. We will focus in this
section only on the requests to the memory of an EREW PRAM, so that all
requested PRAM memory cells are pairwise distinct. In Section 4.2 we describe a
result that enables us to generalize our simulations to the CRCW PRAM.
Let S denote a batch of $n/2^{2c+6}$ requests to the memory of the PRAM and let
$h_1$ and $h_2$ be chosen uniformly at random from the log2 n-universal class of hash
functions H. Let $H = (\{1, \dots, n\}, E)$ be the labeled undirected graph defined by
$h_1$, $h_2$ and the set of requests S. The nodes are the memory modules of the DMM.
For each $u \in S$ there is an edge $(M_{h_1(u)}, M_{h_2(u)})$ labeled u in H. Note that parallel
edges and self-loops are allowed in H; however, all labels are distinct.
One can view the one-out-of-two-schedule as the following process on the
graph H. Each processor that wants to access a shared memory cell $u \in U$ asks in
each step either $M_{h_1(u)}$ or $M_{h_2(u)}$. This corresponds to directing the edge labeled u
in H to $M_{h_1(u)}$ or $M_{h_2(u)}$, respectively. Then, if a module $M_j$ answers the request to
cell u, the edge labeled u is removed from H. Summarizing, we direct in each step
every edge in H and then every node removes one edge (if any) that points to it.
Before the next step starts, the orientations from the remaining edges are erased.
The simulation ends when all the edges from H are removed.
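The process on the graph can be sketched sequentially as follows. Directing every remaining edge uniformly at random is an assumption made only for illustration; it is not the strategy analyzed in this paper, which computes a good orientation first. The function name `one_out_of_two_rounds` is likewise hypothetical.

```python
import random


def one_out_of_two_rounds(edges, rng=random):
    """Run the edge-removal process and return the number of steps used.

    edges: dict mapping a request label u to its pair of modules
           (h1(u), h2(u)).
    """
    remaining = dict(edges)
    rounds = 0
    while remaining:
        rounds += 1
        pointed = {}  # module -> labels of the edges directed to it this step
        for label, (a, b) in remaining.items():
            target = a if rng.random() < 0.5 else b
            pointed.setdefault(target, []).append(label)
        for labels in pointed.values():
            # Each module answers exactly one of the requests pointing to it.
            del remaining[rng.choice(labels)]
        # Orientations are forgotten before the next step starts.
    return rounds
```

Every step removes at least one edge, so the loop terminates; the interesting question, addressed by the lemmas below, is how few steps suffice when the orientation is chosen well.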
Note that, initially, each processor only knows one edge of H, namely the edge
labeled with its request. Because the hash functions are chosen uniformly at random
from a log2 n-universal class of hash functions, H has properties similar to those of a
random labeled multigraph with n nodes and $n/2^{2c+6}$ edges, where the edges are
chosen with repetitions and independently at random. Here we use the assumption
that all elements from S are pairwise distinct. The simulation we present relies on
these properties of the graph H.
Define the size of a connected component C, denoted by |C|, to be the number
of nodes it contains. We restate the following two lemmas proved by Czumaj et al.
(1995d).
Lemma 3.1. Let l and c be arbitrary positive constants, and let H be the access
graph with n nodes and $n/2^{2c+6}$ edges. There exists a constant $w \ge 1$ such that for
sufficiently large n
(a) $\Pr(H \text{ has a connected component of size at least } (l/c) \log n) \le n^{-l}$,
(b) $\Pr(H \text{ has a connected component } C \text{ with at least } |C| + w - 1 \text{ edges}) \le n^{-l}$.
This lemma states that, with high probability, each connected component in H
is a tree of size O(log n) with a constant number of additional edges.
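The structure described by the lemma can be observed empirically with a small experiment. The sketch below is only illustrative: it draws edge endpoints with Python's pseudo-random generator instead of hash functions from a universal class, and the helper name `component_stats` is an assumption. For each component that contains an edge it reports the number of nodes and the excess, i.e., the number of edges beyond those of a spanning tree.

```python
import random
from collections import Counter


def component_stats(n, edges):
    """Return [(size, excess)] for every component containing an edge.

    excess = (number of edges) - (size - 1); a tree has excess 0.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    size = Counter(find(v) for v in range(n))
    edge_count = Counter(find(a) for a, _ in edges)
    return [(size[r], edge_count[r] - (size[r] - 1))
            for r in size if edge_count[r] > 0]


n, c = 1 << 16, 1
rng = random.Random(0)
edges = [(rng.randrange(n), rng.randrange(n)) for _ in range(n >> (2 * c + 6))]
stats = component_stats(n, edges)
print("largest component:", max(s for s, _ in stats),
      "largest excess:", max(x for _, x in stats))
```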
We note that Lemma 3.1 implies the existence of a constant-time algorithm for
removing all edges in H. Let C be a connected component in H and let T be an
arbitrary spanning tree of C. By Lemma 3.1, with high probability only a constant
number of edges in C do not belong to T. Fix one node r in T and make it the
root of T. Direct all edges in T towards the leaves and all other edges in C in an
arbitrary way. Each node then has at most one incoming tree edge and removes one
incoming edge, so the number of edges that survive the step is at most the number
of edges of C outside T, which is constant. Thus a constant number of steps is
needed to remove all the edges in C. For future reference we will call such a
schedule an off-line schedule.
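A sequential sketch of the off-line schedule for one component is given below, under the assumption that the component is handed over as a dictionary of labeled edges; the names `offline_schedule` and `rounds_needed` are hypothetical. Tree edges found by breadth-first search point away from the root, and every other edge points to its second endpoint, which is one arbitrary choice among many.

```python
from collections import defaultdict, deque


def offline_schedule(edges, root):
    """edges: dict label -> (a, b) describing one connected component.

    Returns dict label -> node the edge is directed to.
    """
    adj = defaultdict(list)
    for label, (a, b) in edges.items():
        adj[a].append((label, b))
        adj[b].append((label, a))
    target, seen, queue = {}, {root}, deque([root])
    while queue:
        v = queue.popleft()
        for label, w in adj[v]:
            if w not in seen:        # tree edge, directed from v to w
                seen.add(w)
                target[label] = w
                queue.append(w)
    for label, (a, b) in edges.items():
        target.setdefault(label, b)  # non-tree edge (parallel edge, loop, ...)
    return target


def rounds_needed(target):
    """A node serves one incoming edge per step, so steps = max in-degree."""
    indeg = defaultdict(int)
    for node in target.values():
        indeg[node] += 1
    return max(indeg.values(), default=0)
```

With this fixed orientation the number of steps is at most one plus the number of non-tree edges of the component.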
In this paper we develop strategies that start with each processor exploring the
whole connected component it belongs to. This will be used to compute the off-line
schedule mentioned above. In order to explore the components efficiently (i.e., in
constant time or O(log* n) time on our DMM models) we need to be able to assign
to each node or edge of a connected component C of the access graph a number
of processors exponential in the size of C. The next lemma shows that we have
sufficiently many processors to do so.
Lemma 3.2. Let H be the access graph with n nodes and $n/2^{2c+6}$ edges for
some constant $c \ge 1$ defined as in Lemma 3.1. For all constants $\beta$ and l such that
$c \ge 2l(\beta + 2)$, with probability at least $1 - 2(1/n)^{l-1}$,
$$\sum_{\substack{\text{connected components } C \\ |C| > 1}} |C| \cdot 2^{|C| \beta} \le n.$$
4. ALGORITHMIC TOOLS
4.1. Log-Star- and Constant-Time Algorithms
In this section we outline the main algorithmic tools used by our algorithms.
If an array of size 2n contains at least n objects, we will call the array
padded-consecutive.
Given n integers x1 , x2 , ..., xn , the strong semisorting problem (Bast and
Hagerup, 1995) is to store them in a padded-consecutive array so that all variables
with the same value occur in a padded-consecutive subarray.
Given n bits x1 , x2 , ..., xn , the chaining problem (Berkman and Vishkin, 1993;
Ragde, 1993) is to find for each xi the nearest 1’s both to its left and to its right.
Given m tasks distributed among n processors, the processor allocation problem
is to redistribute the tasks so that each processor gets $O(\lceil m/n \rceil)$ tasks.
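To make the problem statements concrete, here is a sequential reference implementation of the chaining problem. It only fixes the input/output behavior and says nothing about the parallel algorithms cited above. Reading "nearest" as strictly to the left and strictly to the right is an assumption, as is the function name `chaining`.

```python
def chaining(bits):
    """For each position i return (left, right): the indices of the nearest 1
    strictly to the left and strictly to the right, or None if none exists."""
    n = len(bits)
    left, right = [None] * n, [None] * n
    last = None
    for i in range(n):
        left[i] = last
        if bits[i]:
            last = i
    last = None
    for i in reversed(range(n)):
        right[i] = last
        if bits[i]:
            last = i
    return list(zip(left, right))


print(chaining([0, 1, 0, 0, 1]))
# [(None, 1), (None, 4), (1, 4), (1, 4), (1, None)]
```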
As we mentioned in the last section, the n-processor DMM is essentially
equivalent to the n-processor Arbitrary CRCW PRAM with O(n) shared memory.
Hence we can use algorithms designed for the CRCW PRAM to obtain the
following lemma (for the proof see (Czumaj et al., 1995d)).
Lemma 4.1. The following problems can be solved on the n-processor
Standard-DMM (and therefore also on the R-DMM) in O(log* n) time with
probability at least $1 - 2^{-n^{\epsilon}}$ for some constant $\epsilon > 0$:
(1) strong semisorting
(2) chaining
(3) processor allocation.
Sorting is a very important and convenient subroutine that we use in many
places in our algorithms for the RM-DMM. Olariu et al. (1993) obtained the
following result for integer sorting.
Lemma 4.2. A sequence of n integers in the range from 0 to $n^c$ for a constant c
can be sorted deterministically in constant time on an n-processor RM-DMM.
We will also use the following lemma given in (Wang and Chen, 1990; Ben-Asher
et al., 1991).
Lemma 4.3. Each step of a Priority CRCW PRAM with n processors and n
memory cells can be deterministically simulated in constant time by an n-processor
RM-DMM.
4.2. Reduction from the CRCW PRAM to the EREW PRAM
In Section 3 we assumed that an EREW PRAM is to be simulated and thus the
elements in S are pairwise distinct. In this subsection we present a reduction that
allows us to focus only on an EREW PRAM to be simulated. Czumaj et al. (1995d)
showed the following.
Lemma 4.4. If an n-processor EREW PRAM can be simulated on an n-processor
Standard-DMM with delay $\tau$, w.h.p., then an n-processor Priority CRCW PRAM
can be simulated on an n-processor Standard-DMM with delay $O(\tau + \log^* n)$, w.h.p.
This result suffices for our simulation on an R-DMM. To achieve a constant-time
simulation on an RM-DMM we develop a stronger reduction with constant delay
for the RM-DMM.
Suppose that each processor $P_i$ of the CRCW PRAM wants to access memory
cell $\mu_i \in U$. Let q be a prime, $q \ge m$, and $s \ge 2$. Choose randomly two integers
$a, b \in \{0, \dots, q-1\}$ and define a function $h(x) = ((a + bx) \bmod q) \bmod s$, for $x \in U$.
The following lemma is well known (see, e.g., (Dietzfelbinger et al., 1994)).
Lemma 4.5. Let $h : U \to \{0, \dots, 2n^{\gamma+2} - 1\}$ be a random hash function defined
above for $s = 2n^{\gamma+2}$; then for any $X \subseteq U$, $|X| \le n$,
$$\Pr\Bigl(\max_{0 \le i < s} \{ |h^{-1}(i) \cap X| \} = 1\Bigr) \ge 1 - n^{-\gamma}.$$
Observe that a function h can be stored in O(1) cells and can be generated and
evaluated in constant time by one processor. After choosing a and b by one
processor, h can be distributed to all other processors using two broadcasting steps
on the RM-DMM. Then each processor can evaluate h in constant time.
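The hash function is simple enough to write down directly. In the sketch below the helper name `make_linear_hash`, the concrete prime `q`, and the toy values of n and of the exponent constant are assumptions chosen only for illustration.

```python
import random


def make_linear_hash(q, s, rng=random):
    """Return (a, b, h) with h(x) = ((a + b*x) mod q) mod s for random a, b."""
    a, b = rng.randrange(q), rng.randrange(q)
    return a, b, (lambda x: ((a + b * x) % q) % s)


n, gamma = 8, 1
s = 2 * n ** (gamma + 2)   # range size used in Lemma 4.5
q = 1_000_003              # a prime, assumed to be at least the memory size m
a, b, h = make_linear_hash(q, s, random.Random(0))
```

Only the two integers `a` and `b` have to be broadcast; every processor can then rebuild and evaluate `h` locally in constant time.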
In order to extend our simulations to CRCW PRAMs we have to show how we
will deal with duplicate requests to the same memory cell. The idea is first to sort
the requests and then to remove duplicates. Because we want to proceed in constant
time, we cannot use general sorting and instead use the constant-time integer
sorting algorithm from Lemma 4.2. In order to apply this algorithm, we map all
requests (using a random hash function h from Lemma 4.5) into an interval of
integers from 0 to $2n^{\gamma+2} - 1$. Fix $\gamma$ to be a constant in Lemma 4.5 so that the
required success probability $1 - n^{-\gamma}$ is large enough. We perform integer
sorting on pairs $(h(\mu_1), 1), \dots, (h(\mu_n), n)$ in constant time on the n-processor
RM-DMM, by Lemma 4.2. Then, by Lemma 4.5, with high probability two values
$h(\mu_i)$ and $h(\mu_j)$ are equal only if $\mu_i = \mu_j$. Hence, with high probability the addresses
with the same value are stored in a contiguous subsequence of the sorted sequence.
We choose the first element from each such contiguous subsequence, call it the
leader of this subsequence, and then proceed only with it as in the EREW PRAM
case. Finally the leader has to broadcast the answer to read requests to the
duplicates. The broadcasting can also be performed in constant time on an
RM-DMM. Hence we obtain the following lemma.
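The duplicate-resolution step can be sketched sequentially as follows. Python's `sorted` merely stands in for the constant-time integer sorting of Lemma 4.2, the function name `resolve_duplicates` is an assumption, and the sketch assumes that `h` has no collisions on the requested cells (which Lemma 4.5 guarantees only with high probability).

```python
def resolve_duplicates(requests, h):
    """requests[i] is the memory cell requested by processor i.

    Returns (leaders, leader_of): the processors that proceed with the
    simulation, and for every processor the leader that represents it.
    """
    # Sort the pairs (h(request_i), i), as in the text.
    order = sorted(range(len(requests)), key=lambda i: (h(requests[i]), i))
    leaders, leader_of = [], {}
    prev_key, current = None, None
    for i in order:
        key = h(requests[i])
        if key != prev_key:      # first element of a contiguous run
            prev_key, current = key, i
            leaders.append(i)
        leader_of[i] = current
    return leaders, leader_of
```

After the leaders have completed their accesses as in the EREW case, each leader passes the answer on to the processors mapped to it in `leader_of`.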
Lemma 4.6. If one can simulate an n-processor EREW PRAM on an n-processor
RM-DMM with delay $\tau$, then one can also simulate an n-processor Priority CRCW
PRAM on an n-processor RM-DMM with delay $O(\tau)$, w.h.p.
5. AN O(log* n)-DELAY SIMULATION ON AN R-DMM
In this section we show a nearly constant-time simulation of a PRAM by adding
the power of reconfiguration to the Standard-DMM model. We achieve a delay
of O(log* n) for a simulation of an n-processor EREW PRAM on an n-processor
R-DMM. Using Lemma 4.4, this result extends to CRCW PRAMs.
Given an access graph H, assume the properties of Lemmas 3.1 and 3.2. Our
algorithm first finds the decomposition of H into its connected components, and
then works on each of them independently. For each component we perform in
parallel a lot of virtual access experiments to find the best schedule, which is in fact
a constant time off-line schedule. Finally, we execute this schedule.
A high-level description of the algorithm is as follows:
Simulation R-DMM.
(1) All vertices in each connected component C agree on one representative,
called the leader.
(2) In each connected component C, allocate an exponential number (in the
size of C) of processors to C.
(3) Perform parallel experiments to find a constant time off-line schedule.
(4) Perform this off-line schedule.
We now describe the steps and their implementations in detail.
Step 1. Our first goal is to achieve that the processors of each connected
component of H agree on a leader. We divide this step into four substeps. Step (1.1)
finds a decomposition of each connected component into a constant number of
‘‘Euler cycles.’’ In Step (1.2) we reconfigure the R-DMM according to the Euler
cycles and in Step (1.3) all edges from each cycle agree on one leader. In Step (1.4)
in each connected component all Euler cycles are combined into one Euler cycle
and a leader for the connected component is found.
Step 1.1. We replace each undirected edge (i, j) of H by two directed edges, in
opposite directions, [i, j] and [ j, i]. This guarantees that each component contains
an Euler cycle. We assign a processor to each directed edge (also called arc). Hence,
in the following we identify an arc with its processor.
The capability of reconfiguration is only used for finding leaders in the Euler
cycles. We order the arcs by the first coordinates, i.e., the nodes they want to access.
We use here the O(log* n)-time strong semisorting algorithm (Lemma 4.1). Then we
use the chaining algorithm (Lemma 4.1) to find, for each node v, its adjacency list.
Now all processors that want to access a node v are standing in a consecutive
adjacency array of v.
Next we perform the standard construction of Tarjan and Vishkin for computing
Euler cycles (see, e.g., (JáJá, 1992)). If [v, x] and [v, y] are two consecutive arcs in
the adjacency list of v (or if [v, x] is the last and [v, y] is the first one), then arc
[v, y] follows arc [x, v] in the Euler tour decomposition.
If a given connected component is a tree then this construction defines its Euler
cycle. Otherwise, we obtain a decomposition of the arcs into at most s+1 Euler
cycles if a given connected component has k nodes and k+s&1 edges.
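The successor rule can be checked with a short sequential sketch. The encoding of arcs as pairs (edge index, direction) and the function name `euler_cycles` are assumptions made here so that parallel edges and self-loops stay distinguishable; the adjacency lists are simply taken in input order.

```python
from collections import defaultdict


def euler_cycles(edges):
    """edges: list of (a, b). Arc (e, 0) is [a, b] and arc (e, 1) is [b, a].

    Returns the cycles of the successor map, each as a list of arcs.
    """
    out = defaultdict(list)  # v -> arcs leaving v, i.e. the adjacency list of v
    for e, (a, b) in enumerate(edges):
        out[a].append((e, 0))
        out[b].append((e, 1))
    succ = {}
    for arcs in out.values():
        for pos, (e, d) in enumerate(arcs):
            nxt = arcs[(pos + 1) % len(arcs)]  # cyclically next arc [v, y]
            succ[(e, 1 - d)] = nxt             # it follows the reverse arc [x, v]
    cycles, seen = [], set()
    for start in succ:
        if start in seen:
            continue
        cycle, arc = [], start
        while arc not in seen:
            seen.add(arc)
            cycle.append(arc)
            arc = succ[arc]
        cycles.append(cycle)
    return cycles
```

On a path with three nodes the sketch returns a single cycle through all four arcs; on a triangle (three nodes and three edges) it returns two cycles of three arcs each, matching the bound stated above.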
Step 1.2. Now we reconfigure the links between processors and modules of the
R-DMM according to the cycles. That is, if arc e1 precedes arc e2 and arc e2
precedes arc e3 in a cycle, then the processor assigned to arc e2 combines the link
to the processor assigned to arc e3 and the processor assigned to arc e1 . Hence, the
processors assigned to the arcs e1 , e2 , and e3 are connected via one bus. If we
look at the connected component, we have established the Euler cycle as a bus
connecting the processors assigned to the arcs of this Euler cycle.
Step 1.3. Now each arc sends its identifier through the assigned bus. Because of
the Arbitrary rule for conflict resolution on the bus, all arcs on each cycle get one
identifier, which is called the leader of the cycle.
Step 1.4. Next we combine all the cycles within each connected component.
Each edge of H whose two directed arcs belong to different cycles chooses the
smaller of their two leaders and sends the identifier of the assigned processor on the
cycle of the larger leader. If it succeeds (according to the Arbitrary rule on a bus of
the R-DMM), it combines these two cycles by swapping the successors of the two arcs
belonging to the edge.
Note that, by Lemma 3.1, there is no connected component C with more than
|C|+O(1) edges, w.h.p. Therefore, we only have to perform a constant number of
combinings of cycles to join all of them into one Euler cycle for each connected
component. Thus we can perform this step in constant time, w.h.p., and we have
finally found a leader for each connected component.
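The combining operation of Step 1.4 can be sketched as follows, under the assumption that the cycles are stored as a successor map on arcs. The names `cycle_of` and `merge_at_edge` are hypothetical. The sketch only tracks the cyclic order of the arcs, which is what the bus needs; it does not check that consecutive arcs share an endpoint.

```python
def cycle_of(succ, start):
    """Return the set of arcs on the cycle that contains `start`."""
    seen, arc = set(), start
    while arc not in seen:
        seen.add(arc)
        arc = succ[arc]
    return seen


def merge_at_edge(succ, u, v):
    """Join the cycles of arcs (u, v) and (v, u) by swapping their successors.

    Returns False and leaves `succ` unchanged if both arcs already lie on
    the same cycle, since swapping would then split that cycle.
    """
    a, b = (u, v), (v, u)
    if b in cycle_of(succ, a):
        return False
    succ[a], succ[b] = succ[b], succ[a]
    return True
```

Swapping the successors of two elements that lie on different cycles of a permutation always joins the two cycles into one, which is why a constant number of such operations suffices when only a constant number of cycles has to be joined.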
Step 2. Let C be a connected component and let S(C) denote the number of
edges in C. Note that S(C)=O(|C|) by Lemma 3.1. We use the strong semisorting
algorithm (Lemma 4.1) to group all edges of C in a subarray $B_C$ of size $\bar{S}(C)$,
$S(C) \le \bar{S}(C) \le 2S(C)$, and then compute the approximate size using the chaining
algorithm (Lemma 4.1). Then we allocate in O(log* n) time for each edge in C
exactly $2^{\bar{S}(C)}$ processors, w.h.p., using Lemma 4.1. Hence, altogether we allocate
$$\sum_{\substack{\text{connected components } C,\\ |C|>1}} |C| \cdot 2^{\bar{S}(C)} \;\le\; \sum_{\substack{\text{connected components } C,\\ |C|>1}} |C| \cdot 2^{|C|\,b}$$
processors, for some constant b. We can do this within our resources because of
Lemma 3.1 and Lemma 3.2, which ensure that the total number of allocated
processors is linear. We can view these assignments as having given $2^{\bar{S}(C)}$ DMMs
of size $\bar{S}(C)$ for each connected component C.
Step 3. We systematically test all $2^{\bar{S}(C)} = 2^{|C|+O(1)}$ orientations of the edges of
C in parallel. Thus, in O(log* n) time, using strong semisorting, we can compute
one with indegree at most s+1, if C contains not more than |C|+s edges. (Note
that by the discussion at the end of Section 3 such an orientation does exist.)
Step 4. We apply the access protocol indicated by the orientations of the edges
computed above; all accesses are completed after s+1=O(1) iterations.
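The exhaustive test of Step 3 can be illustrated by the following sequential sketch, which tries the orientations one after the other instead of in parallel. The name `find_orientation` is hypothetical, and the indegree of a node is taken to be the number of edges directed towards it.

```python
from itertools import product


def find_orientation(edges, max_indegree):
    """Try all 2^len(edges) orientations and return the first one in which
    every node has indegree at most `max_indegree`, or None if none exists."""
    for bits in product((0, 1), repeat=len(edges)):
        indeg = {}
        oriented = []
        for (u, v), flip in zip(edges, bits):
            tail, head = (v, u) if flip else (u, v)
            oriented.append((tail, head))
            indeg[head] = indeg.get(head, 0) + 1
        if max(indeg.values(), default=0) <= max_indegree:
            return oriented
    return None
```

For a triangle (|C|=3 nodes and |C|+s=3 edges, hence s=0) the sketch finds an orientation with indegree at most s+1=1, namely a directed cycle. The search space has size 2 to the number of edges, which is why the simulation needs an exponential number of processors per component to test all orientations at once.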
Summarizing, only Step 1.1, Step 2, and Step 3 of Simulation R-DMM need
O(log* n) time, w.h.p. All other steps can be done in constant time. Hence, we get
the following theorem.
Theorem 5.1. Simulation R-DMM simulates an n-processor EREW PRAM on
an n-processor R-DMM with O(log* n) delay, w.h.p.
This result can be extended in two directions. First, using Lemma 4.4 we can
reformulate the result for a CRCW PRAM:
Theorem 5.2. An n-processor CRCW PRAM can be simulated on an n-processor
R-DMM with O(log* n) delay, w.h.p.
Second, we can transform our simulation into a time-processor optimal
simulation. We use a result from (Czumaj et al., 1995d):
Lemma 5.3. If there exists a simulation of an n-processor EREW PRAM on an
n-processor DMM based on a “1 out of 2” protocol with delay bounded by $\tau$, w.h.p.,
then, using a constant number of hash functions, a $(\tau n)$-processor EREW PRAM
can be simulated on an n-processor DMM with delay $O(\tau + \log^* n)$, w.h.p.
As Simulation R-DMM fulfills the assumptions of this lemma we achieve a
time-processor optimal simulation.
Theorem 5.4. An n log* n-processor EREW PRAM can be simulated on an
n-processor R-DMM with O(log* n) delay, w.h.p.
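Since the delay bounds of this section are stated in terms of log* n, a short sketch of the iterated logarithm may help to see how slowly it grows. The function name `log_star` and the choice of base 2 are assumptions made for illustration.

```python
import math


def log_star(n):
    """Number of times log2 has to be applied to n until the value is at most 1."""
    count = 0
    x = n
    while x > 1:
        x = math.log2(x)
        count += 1
    return count
```

For example, `log_star(16)` is 3 and `log_star(65536)` is 4, so for every practically relevant n the value of log* n is a very small number.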
6. SIMULATION WITH CONSTANT DELAY ON A
RECONFIGURABLE MESH DMM
The reconfigurable model of parallel computation most widely studied in the
literature is the reconfigurable mesh. In this section we want to transfer the simulation
on an R-DMM with O(log* n) delay to a simulation with constant delay on
an n-processor RM-DMM.
For the simulation we use the algorithm Simulation R-DMM. We show that
each step of Simulation R-DMM can be performed in constant time on an
n-processor RM-DMM.
We proceed in two steps. First, we simulate the reconfigurable DMM with n
processors on an n-processor RM-DMM with constant delay. This yields that we
can perform any step of Simulation R-DMM that takes constant time on the
n-processor R-DMM (that is, Steps 1.2, 1.3, 1.4, and 4) in constant time on the
n-processor RM-DMM. Then we show how to perform steps of Simulation
R-DMM that take O(log* n) time on the R-DMM in constant time on the
RM-DMM. As a main tool we use an algorithm for sorting n integers on an $n \times n$
reconfigurable mesh in constant time (Lemma 4.2).
6.1. Simulation Between Reconfigurable Architectures
The relationship between an n-processor R-DMM and an n-processor RM-DMM
is stated in the following lemma.
Lemma 6.1. Each step of an n-processor R-DMM can be simulated deterministically
with constant delay on an n-processor RM-DMM.
Proof. First we show how to simulate the communication of the R-DMM that
does not use the reconfiguration on a reconfigurable mesh; i.e., we simulate read
and write steps of a Standard DMM. Then we extend this simulation with respect
to the capability of reconfiguration.
The simulation of a Standard DMM (that is, the simulation of a step of the
R-DMM which does not use reconfiguration of the links) is equivalent to the
simulation of a CRCW PRAM with n memory cells, as mentioned in Section 2.
Therefore, we can use here the constant-time simulation of the n-processor CRCW
PRAM with n memory cells on the $n \times n$ reconfigurable mesh from Lemma 4.3.
FIG. 4. Reconfiguration w.r.t. the buses 3-1-4-7-3 and 5-2-6-5.
It remains to show how to simulate the capability of reconfiguration of the
R-DMM, that is, the capability of each processor to combine two links into a bus.
Assume that a processor $Q_l$ of the R-DMM wants to combine its links to the
processors $Q_m$ and $Q_k$ of the R-DMM. Let us denote the processors of
the RM-DMM by $R_1, \ldots, R_n$. Processor $Q_l$ of the R-DMM will be simulated by
processor $R_l$ of the reconfigurable mesh DMM, $1 \le l \le n$.
At the beginning the ports of all switches are connected (NS) and (EW). Each
processor Rl performs the following steps:
• $R_l$ configures the (l, m)-switch to (WS) and (NE).
• If k<l<m or k>l>m, then $R_l$ connects in the (l, l)-switch the ports (NE) and (SW);
  otherwise $R_l$ connects in the (l, l)-switch the ports (NW) and (SE).
An example of a bus connecting the processors 3-1-4-7-3 and another bus
connecting the processors 5-2-6-5 is given in Fig. 4.
Because the buses in the R-DMM are edge- and processor-disjoint, each link of
the reconfigurable mesh is used only once. A communication step on a bus consists
of two parts. First, the processor $R_l$ simulating processor $Q_l$ of the R-DMM sends
the request to the (l, l)-switch. This can be done if all switches are connected
(WE). Then all switches are reconfigured as described above, according to the bus
structure of the R-DMM to be simulated. Now, the (l, l)-switches can
read from or write on the bus and send back the result to the processor $R_l$. For the
last step all switches again have to be configured (WE). □
6.2. Analysis of the Simulation on an RM-DMM
As the main result of this section, we obtain the following theorem.
Theorem 6.2. An n-processor Priority CRCW PRAM can be simulated on an
n-processor RM-DMM with constant delay, w.h.p.
Proof. By Lemmas 4.6 and 6.1 it remains to show that we can perform each step
of Simulation R-DMM that needs non-constant time on the R-DMM in constant
time on the RM-DMM, that is, Step 1.1, Step 2, and Step 3.
Step 1.1. The algorithm for Step 1.1 is essentially the same as the one described
in Section 5. To find an Euler tour for each connected component we replace each
undirected edge (i, j) by two directed edges [i, j] and [j, i] and assign a processor to
each directed edge. Now we use Lemma 4.2 to sort the directed edges with respect
to the first coordinate. This gives the adjacency list for each node. Next we can
perform the standard Euler Tour construction as described in Simulation
R-DMM. As a result every processor knows its successor in the Euler tour. If every
processor sends its ID to its successor, then every processor also knows its
predecessor.
Step 2. In the analysis of this step we use the knowledge about the structure of
the access graph, especially Lemma 3.2. First, we sort in constant time (Lemma 4.2)
the processors with respect to the identifier of the Euler cycle they belong to.
Because we also specified a leader, we can compute the size of each connected
component in constant time. Then we broadcast it to all members of a connected
component, using the bus in each connected component. We have to allocate to
each processor an exponential number (in the size of the connected component it
belongs to) of processors. Because we cannot use the O(log* n)-time allocation
algorithm as in the last section, we sort the processors by the number of processors
they request. Lemma 3.1 ensures that the numbers to be sorted are in the range of
O(n).
Lemma 3.1 gives an upper bound on the number of processors requesting at least
$2^k$ processors. With high probability, at most $4^{c+1}(nk^2/c^k)$ processors are in connected
components of size at least k, and hence request at least $2^k$ processors. We
allocate processors with respect to the distribution stated by Lemma 3.2. More
precisely, this means that we allocate $2^k$ processors, $1 \le k \le \log n$, to the requesting
processors that are in the interval
$$\left[\, \sum_{i=k+1}^{\log n} 4^{c+1}\,(n i^2/c^i),\ \Bigl(\sum_{i=k}^{\log n} 4^{c+1}\, n i^2/c^i\Bigr) - 1 \,\right]$$
of the sorted array. This deterministic allocation ensures that with high probability we
always allocate enough processors, and Lemma 3.2 also ensures that the total
number of allocated processors is linear. Altogether this step needs constant time.
Step 3. We perform the virtual access experiments on the copies of each
connected component in constant time using the sorting algorithm from
Lemma 4.2. It allows us to sort in parallel the accesses in each copy of a connected
component in constant time and to determine a constant-time off-line schedule as
described in the previous section. □
Remark. In the simulation it seems that at some places we need an O(n)-processor
RM-DMM. There are two ways to circumvent this problem. The first
way is to use a self-simulation as described in (Ben-Asher et al., 1993). It allows us to
simulate an O(n)-processor RM-DMM on an n-processor RM-DMM with constant
delay. The second (and simpler) way is to set the value of c in Lemmas 3.1 and 3.2
such that each batch of size $n/2^{2c+6}$ requires no more than n processors
and n modules. Then, we access the batches one after the other and need only an
n-processor RM-DMM for each batch. Therefore, we can assume a linear size of
the model without loss of generality.
6.3. Lower Bound for PRAM Simulations
In this section we provide necessary properties of the routing network of a DMM
that can simulate a PRAM with constant delay. In fact, we do not even consider
simulations of memory accesses. Rather we proceed as follows. Consider the
permutation problem: every processor $P_i$ has a data item $x_i$ and a destination $\pi(i)$,
such that $\pi : [n] \to [n]$ is a permutation. The goal is to send $x_i$ to $P_{\pi(i)}$.
Clearly, an EREW PRAM can solve this problem in constant time. Thus, every
DMM that can simulate an EREW PRAM in constant time needs a routing
network that can route any arbitrary permutation in constant time.
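To make the first claim concrete, the following sketch emulates the two EREW steps sequentially: one exclusive write followed by one exclusive read. The name `erew_permute` is hypothetical, and indices run from 0 to n-1 rather than from 1 to n.

```python
def erew_permute(x, pi):
    """Route x[i] to position pi[i] with one write step and one read step."""
    n = len(x)
    assert sorted(pi) == list(range(n)), "pi must be a permutation of 0..n-1"
    memory = [None] * n
    # Write step: processor i writes x[i] into cell pi[i]. All target cells
    # are distinct because pi is a permutation, so the writes are exclusive.
    for i in range(n):
        memory[pi[i]] = x[i]
    # Read step: processor j reads cell j, again without any conflict.
    return [memory[j] for j in range(n)]
```

For instance, `erew_permute(["a", "b", "c"], [2, 0, 1])` returns `["b", "c", "a"]`.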
Consider a (possibly reconfigurable) routing network N with m nodes. Of these nodes,
n are the processors and the others are the switches of the router. Let B(N)
be the bisection width of N, that is, the minimum number of links that have to be
removed in order to split the set of processors (not the set of nodes!) of N into two
parts of size $\lfloor n/2 \rfloor$ and $\lceil n/2 \rceil$, respectively.
We assume that each link can transmit one data item per time step (higher
bandwidth can be modeled using parallel edges).
We prove the following lower bound.
Theorem 6.3. Let N be an arbitrary (possibly reconfigurable) network consisting
of n processors. Any (even randomized) permutation routing algorithm on N
requires time $\Omega(\lceil n/B(N) \rceil)$.
Proof. Let the removal of certain B(N) links in N split the set of processors
into two parts $P_1$ and $P_2$ of size $\lfloor n/2 \rfloor$ and $\lceil n/2 \rceil$, respectively. Let all the packets
from the processors in $P_1$ have destinations in $P_2$ and vice versa. Therefore
$\Omega(n)$ data items have to cross the bisection line in N. On the other hand, the
line is cut by only B(N) links. Hence, to ensure that all n data items will be
delivered, any algorithm that solves the permutation problem on N requires time
$\Omega(\lceil n/B(N) \rceil)$. □
Observe the following consequence of Theorem 6.3.
Corollary 6.4. Let N be an $a \times b$ reconfigurable mesh with $ab \ge n$ and $a \ge b$.
Then any randomized simulation of an n-processor EREW PRAM on an n-processor
DMM with N as the routing network requires time $\Omega(\lceil n/b \rceil)$.
Proof. By Theorem 6.3 and the fact that an EREW PRAM can solve the
permutation problem in constant time, it is enough to show that B(N)=O(b). Let
j be the first row such that there are at least n/2 processors assigned to the nodes
of N in rows 1, ..., j. We can partition the nodes of N into two parts X and Y, X
containing all the nodes at rows 1, ..., j-1 and some of the left-most nodes at
row j, and Y containing the remaining nodes, so that the number of processors
assigned to the nodes of X is the same (within one) as the number of processors
assigned to the nodes of Y. Because it is enough to remove O(b) links to disconnect
X and Y, B(N)=O(b) and the corollary follows. □
This result implies that, in order to obtain a real-time PRAM simulation on a
DMM(N), with N being an $a \times b$ reconfigurable mesh, one must have $a \cdot b = \Omega(n^2)$.
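The step from Corollary 6.4 to this bound can be spelled out as follows. This is a sketch that reads the hypothesis of the corollary as $a \ge b$ and writes $T$ for the delay of the simulation (the symbol $T$ is introduced here only for illustration).

```latex
T = \Omega\!\left(\left\lceil \frac{n}{b} \right\rceil\right)
\ \text{and}\ T = O(1)
\;\Longrightarrow\; \frac{n}{b} = O(1)
\;\Longrightarrow\; b = \Omega(n)
\;\Longrightarrow\; a \cdot b \;\ge\; b^{2} = \Omega(n^{2}).
```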
ACKNOWLEDGMENT
We are grateful to Assaf Schuster for pointing out the problem of simulations on a reconfigurable
mesh.
Received February 22, 1996; final manuscript received March 25, 1997
REFERENCES
Bast, H., and Hagerup, T. (1995), Fast parallel space allocation, estimation and integer sorting, Inform.
and Comput. 123, 72-110.
Ben-Asher, Y., Gordon, D., and Schuster, A. (1993), Efficient self simulation algorithms for
reconfigurable arrays, in “Proceedings of the 1st Annual European Symposium on Algorithms,”
pp. 25-36.
Ben-Asher, Y., Lange, K.-J., Peleg, D., and Schuster, A. (1995), The complexity of reconfiguring network
models, Inform. and Comput. 121, 41-58.
Ben-Asher, Y., Peleg, D., Ramaswami, R., and Schuster, A. (1991), The power of reconfiguration,
J. Parallel Distrib. Comput. 13, 139-153.
Berkman, O., and Vishkin, U. (1993), Recursive star-tree parallel data structure, SIAM J. Comput.
22(2), 221-242. [A preliminary version appeared in “Proceedings of the 30th IEEE Symposium on
Foundations of Computer Science, 1989,” pp. 196-202]
Carter, J. L., and Wegman, M. N. (1979), Universal classes of hash functions, J. Comput. System Sci.
18, 143-154.
Czumaj, A., Meyer auf der Heide, F., and Stemann, V. (1995a), Improved optimal shared memory
simulations, and the power of reconfiguration, in “Proceedings of the 3rd Israel Symposium on
Theory of Computing and Systems,” pp. 11-19.
Czumaj, A., Meyer auf der Heide, F., and Stemann, V. (1995b), Simulating shared memory in real time:
On the computation power of reconfigurable meshes, in “Proceedings of the 2nd IEEE Workshop on
Reconfigurable Architectures, Santa Barbara.”
Czumaj, A., Meyer auf der Heide, F., and Stemann, V. (1995c), Shared memory simulations with
triple-logarithmic delay, in “Proceedings of the 3rd Annual European Symposium on Algorithms,”
pp. 46-59.
Czumaj, A., Meyer auf der Heide, F., and Stemann, V. (1995d), “Contention Resolution in Hashing
Based Shared Memory Simulations,” Technical Report TR-RSFB-96-005, SFB 376, University of
Paderborn, Germany.
Dietzfelbinger, M., Karlin, A., Mehlhorn, K., Meyer auf der Heide, F., Rohnert, H., and Tarjan,
R. E. (1994), Dynamic perfect hashing: Upper and lower bounds, SIAM J. Comput. 23(4), 738-761.
Dietzfelbinger, M., and Meyer auf der Heide, F. (1993), Simple, efficient shared memory simulations,
in “Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures,”
pp. 110-119.
ElGindy, H., and Prasanna, V. K., Eds. (1995), “Proceedings of the 2nd Workshop on Reconfigurable
Architectures, IEEE, Santa Barbara, California.”
Goldberg, L. A., Matias, Y., and Rao, S. (1994), An optical simulation of shared memory, in
“Proceedings of the 6th Annual ACM Symposium on Parallel Algorithms and Architectures,”
pp. 257-267.
JáJá, J. (1992), “An Introduction to Parallel Algorithms,” Addison-Wesley, Reading, MA.
Karlin, A., and Upfal, E. (1986), Parallel hashing: An efficient implementation of shared memory, in
“Proceedings of the 18th Annual ACM Symposium on Theory of Computing,” pp. 160-168.
Karp, R. M., Luby, M., and Meyer auf der Heide, F. (1996), Efficient PRAM simulation on a distributed
memory machine, Algorithmica 16(4/5), 517-542. [A preliminary version appeared in “Proceedings of
the 24th Annual ACM Symposium on Theory of Computing, 1992,” pp. 318-326]
Leighton, F. T. (1992a), “Introduction to Parallel Algorithms and Architectures: Arrays, Trees,
Hypercubes,” Morgan Kaufmann, San Mateo, CA.
Leighton, T. (1992b), Methods for message routing in parallel machines, in “Proceedings of the 24th
Annual ACM Symposium on Theory of Computing,” pp. 77-96.
Li, H., and Stout, Q. F., Eds. (1991), “Reconfigurable Massively Parallel Computers,” Prentice-Hall,
Englewood Cliffs, NJ.
MacKenzie, P. D., Plaxton, C. G., and Rajaraman, R. (1994), On contention resolution protocols and
associated probabilistic phenomena, in “Proceedings of the 26th Annual ACM Symposium on Theory
of Computing,” pp. 153-162.
Matias, Y., and Schuster, A. (1995), Fast, efficient mutual and self simulations for shared memory and
reconfigurable mesh, in “Proceedings of the 7th IEEE Symposium on Parallel and Distributed
Processing,” pp. 238-246.
Meyer auf der Heide, F., Scheideler, C., and Stemann, V. (1996), Exploiting storage redundancy to speed
up randomized shared memory simulations, Theoret. Comput. Sci. 162(2), 245-281. [A preliminary
version appeared in “Proceedings of the 12th Annual Symposium on Theoretical Aspects of Computer
Science, 1996,” pp. 267-278]
Olariu, S., Schwing, J. L., and Zhang, J. (1993), Applications of reconfigurable meshes to constant-time
computations, Parallel Comput. 19, 229-237.
Ragde, P. (1993), The parallel simplicity of compaction and chaining, J. Algorithms 14, 371-380.
Ranade, A. G. (1991), How to emulate shared memory, J. Comput. System Sci. 42, 307-326.
Siegel, A. (1989), On universal classes of fast high performance hash functions, their time-space tradeoff,
and their applications, in “Proceedings of the 30th IEEE Symposium on Foundations of Computer
Science,” pp. 20-25.
Upfal, E. (1984), Efficient schemes for parallel communication, J. Assoc. Comput. Mach. 31, 507-517.
Upfal, E., and Wigderson, A. (1987), How to share memory in a distributed system, J. Assoc. Comput.
Mach. 34, 116-127.
Wang, B.-F., and Chen, G.-H. (1990), Two-dimensional processor array with reconfigurable bus system
is at least as powerful as CRCW model, Inform. Process. Lett. 36, 31-36.
