Deterministic P-RAM simulation with constant redundancy  by Hornick, Scot W. & Preparata, Franco P.
INFORMATION AND COMPUTATION 92, 81-96 (1991)
Deterministic P-RAM Simulation 
with Constant Redundancy* 
SCOT W. HORNICK 
Andersen Consulting, Center for Strategic Technology Research,
100 S. Wacker Dr., Chicago, Illinois 60606 
AND 
FRANCO P. PREPARATA 
Department of Computer Science, Brown University, 
Box 1910, Providence, Rhode Island 02912 
In this paper, we show that distributing the memory of a parallel computer and, 
thereby, decreasing its granularity allows a reduction in the redundancy required to 
achieve polylog simulation time for each P-RAM step. Previously, realistic models 
of parallel computation assigned one memory module to each processor and, as a 
result, insisted on relatively coarse-grain memory. We propose, on the other hand, 
a more flexible, but equally valid model of computation, the distributed-memory, 
bounded-degree network (DMBDN) model. This model allows the use of fine-grain
memory while maintaining the realism of a bounded-degree interconnection 
network. We describe a P-RAM simulation scheme, which is admitted under the 
DMBDN model, that exploits the increased memory bandwidth provided by a two- 
dimensional mesh of trees (2DMOT) network to achieve an overhead in memory 
redundancy lower than that required by other fast, deterministic P-RAM simula- 
tions. Specifically, for a deterministic simulation of an n-processor P-RAM on a 
bounded-degree network, we are able to reduce the number of copies of each 
variable from O(log n/log log n) to Θ(1) and still simulate each P-RAM step in
polylog time. © 1991 Academic Press, Inc.
1. INTRODUCTION 
Considerable research has been devoted to developing general-purpose 
architectures that exploit the parallelism offered by modern integration
technology. A popular theoretical approach to this problem has been the 
design of processor networks for the simulation of abstract models of
computation, such as the parallel, random-access machine (P-RAM) model
(Upfal, 1984; Upfal and Wigderson, 1987; Mehlhorn and Vishkin, 1984; 
Karlin and Upfal, 1986; Alt et al., 1987; Ranade, 1987; Luccio et al., 1988,
1990). The P-RAM model of computation, formalized by Fortune and 
Wyllie (1978) and used even earlier by Hirschberg (1977) and Preparata
(1977), has been a valuable tool for theoretical computer scientists studying
the power and fundamental limitations of parallelism (see (Karp and 
Ramachandran, 1988) for a survey of results). By assuming the existence of 
a shared memory accessible to all processors in O(1) time, the P-RAM 
model trivializes the problem of inter-processor communication to reveal 
the inherent parallelism in a problem and facilitate the development of 
parallel algorithms. 
Formally, a P-RAM consists of n sequential processors (RAMs) and m
shared memory cells (Fig. 1). These processors operate synchronously and, 
at each step, each one fetches an instruction from a private RAM and 
executes it. Executing these instructions may require accesses to the shared 
memory, and, in particular, the processors may all simultaneously read 
from or write to the shared memory at any given step. Several variants of 
the P-RAM have been defined, each differing in the convention applied to 
handle read/write conflicts, i.e., attempts by more than one processor to 
access the same memory cell in the same step. P-RAMs are either exclusive-read
(ER) or concurrent-read (CR) and either exclusive-write (EW) or
concurrent-write (CW). The most restrictive of these is the EREW P-RAM, in
which no memory cell may be accessed by more than one processor in 
a given step. The least restrictive is the CRCW P-RAM, in which 
simultaneous reading and writing of memory cells is allowed, with some 
rule defining the exact semantics of simultaneous writes. 
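The four variants can be told apart mechanically from the accesses issued in a
single step. The following Python fragment is a minimal sketch of that
classification and is not part of the original formulation: the function name
`classify_step` and the tuple encoding of accesses are illustrative
assumptions, and the sketch assumes that, within one step, a given cell is
either only read or only written.

```python
from collections import defaultdict


def classify_step(accesses):
    """Return the most restrictive P-RAM variant that permits this step.

    `accesses` is a list of (processor, op, cell) tuples with op in {"R", "W"}.
    Assumption of this sketch: no cell is both read and written in one step.
    """
    readers = defaultdict(set)   # cell -> processors reading it
    writers = defaultdict(set)   # cell -> processors writing it
    for proc, op, cell in accesses:
        if op not in ("R", "W"):
            raise ValueError("op must be 'R' or 'W'")
        (readers if op == "R" else writers)[cell].add(proc)

    # A conflict is more than one processor touching the same cell.
    concurrent_read = any(len(procs) > 1 for procs in readers.values())
    concurrent_write = any(len(procs) > 1 for procs in writers.values())
    return ("CR" if concurrent_read else "ER") + ("CW" if concurrent_write else "EW")
```

For a CRCW step, the rule that resolves simultaneous writes is left out of the
sketch, since the text above does not fix one.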
Of course, the P-RAM model is not technologically feasible for a large 
number of processors. Therefore, research has been directed toward 
simulating the P-RAM model on more realistic models of computation, 
models that account for communication costs. The two most common 
among these are the module parallel computer (MPC) model (Mehlhorn 
and Vishkin, 1984) and the bounded-degree network (BDN) model (Alt et 
al., 1987). The MPC model takes into account the fact that it is not feasible
for any more than a constant number of processors to simultaneously
access the same memory module. It consists of n RAM processors, each
equipped with a memory module containing m/n memory cells, and all
interconnected by the complete graph K_n (Fig. 2). However, this model
itself is not feasible because the complete graph interconnecting the
processors cannot be realized without unbounded fan-in or fan-out. This
led to the consideration of the BDN model, in which each processor is
linked directly to only a constant number of other processors (Fig. 3).

FIG. 1. The P-RAM model of computation.

FIG. 2. The MPC model of computation.

Mehlhorn and Vishkin (1984) showed that the MPC model can
probabilistically simulate T steps of a P-RAM in O(T log n) time by using
universal hashing and by increasing the capacity of each memory module
to O(m/n log n). Upfal (1984) proved a similar result, and gave an
O(T log² n) time probabilistic simulation on a BDN. This result was
subsequently improved by Karlin and Upfal (1986), who described an O(T log n)
time probabilistic simulation on a BDN (which is optimal with respect to 
time), and by Ranade (1987), who reduced the size of the queues used in 
the simulation from O(log n) to O(1). 
The first reasonable deterministic P-RAM simulation on an MPC
was that of Upfal and Wigderson (1987), which uses O(log n) time-stamped
copies of each variable to simulate each P-RAM step in
O(log n (log log n)²) time (assuming m is polynomial in n). Alt et al. (1987)
improved this upper bound to O(log m) time and used this simulation
along with a sorting network to give an O(log n log m) time simulation
on a BDN. They also proved a lower bound of Ω(min{,/&&
log n log m/log log m}) on the time required to simulate a P-RAM step on
a BDN if all communication is required to be point-to-point, i.e., if a
processor has to send a separate message to update each copy of a variable.
(The same result was obtained independently in (Karlin and Upfal, 1986).)
Recently, Herley and Bilardi (1988) achieved this time lower bound and
reduced the redundancy r, i.e., the required number of copies of each
variable, to r = Θ(log m/log log m) by using bounded-degree networks
based on certain expander graphs.

FIG. 3. The BDN model of computation.

Luccio et al. have recently suggested the two-dimensional mesh of trees 
(2DMOT) as a practical bounded-degree network for the simulation of 
P-RAMs with m polynomial in n. In (Luccio et al., 1988), they proposed
this network for the probabilistic simulation of P-RAMs, and in (Luccio et
al., 1990) they proposed it for deterministic P-RAM simulation. The 
2DMOT was originally proposed by Nath et al. (1983) (where it was 
referred to as the “orthogonal trees” network) as an appropriate VLSI 
architecture for computing matrix-vector products and for a variety of 
other related matrix and graph problems. For n a power of 2, an
(n × n)-2DMOT network consists of n² processors P(i, j), for integral
i, j ∈ [1, n], and two families of n fully balanced binary trees connecting
them as follows:
(1) Row trees: RT(i) with the processors P(i, j) for j ∈ [1, n] as their
leaves, and
(2) Column trees: CT(j) with the processors P(i, j) for i ∈ [1, n] as
their leaves (Fig. 4).
The VLSI area occupied by obvious layouts of the (n × n)-2DMOT (e.g.,
that in Fig. 4) is O(n²(log² n + A)), where A is the area of a leaf processor
(assuming that leaf processors are as large or larger than the processors 
within the trees). Leighton (1984), who coined the term “mesh of trees,” 
proved the optimality of this upper bound. 
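The size of the network follows directly from this definition: n² leaves plus
2n trees, each a full binary tree over n leaves and hence with n - 1 internal
nodes. The Python sketch below tallies these counts; the function name
`mesh_of_trees_counts` and the returned dictionary keys are illustrative
assumptions, not notation from the cited papers.

```python
def mesh_of_trees_counts(n):
    """Node counts for an (n x n)-2DMOT, with n a power of 2."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    leaves = n * n                  # the processors P(i, j)
    trees = 2 * n                   # n row trees plus n column trees
    internal_per_tree = n - 1       # full binary tree over n leaves
    internal = trees * internal_per_tree
    return {
        "leaves": leaves,
        "internal": internal,
        "total": leaves + internal,
        "tree_depth": n.bit_length() - 1,   # log2(n) edges from root to leaf
    }
```

The tree depth of log2(n) is what gives the network its logarithmic latency
between any root and any leaf.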
Since our simulation scheme depends on the results of (Luccio et al., 
1990), which in turn depend on the results in (Upfal and Wigderson, 1987),
it is worthwhile to review those results in a little more detail. In their deter- 
ministic MPC simulation of a P-RAM, Upfal and Wigderson adopted a 
strategy, sometimes referred to as majority rule, first proposed by Thomas 
(1979) and Gifford (1979) in the context of distributed database theory. 
Their scheme distributes r = 2c - 1 copies of each P-RAM variable u
among the n memory modules of the MPC. Stored along with each copy 
is a time stamp indicating the time at which the copy was last written. The 
scheme simulates each P-RAM step in turn. When a step in which u is 
written is simulated, at least c copies of u are updated. When a step in 
which u is read is simulated, at least c copies of u are retrieved, with the
correct value given by the copy having the most recent time stamp. 
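The reason the rule works is that any two sets of c copies drawn from 2c - 1
must share at least one copy, so every read set contains a copy touched by the
latest write. The following Python sketch illustrates this for a single
variable; the class and method names are hypothetical, and the choice of which
copies are reached is passed in explicitly rather than decided by any routing
scheme.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    value: object = None
    stamp: int = -1          # time of last write; -1 means never written


class ReplicatedVariable:
    """One P-RAM variable stored as 2c - 1 time-stamped copies."""

    def __init__(self, c):
        self.c = c
        self.copies = [Copy() for _ in range(2 * c - 1)]

    def write(self, value, now, reached):
        """Update the copies whose indices are in `reached` (at least c)."""
        reached = set(reached)
        if len(reached) < self.c:
            raise ValueError("a write must update at least c copies")
        for i in reached:
            self.copies[i] = Copy(value, now)

    def read(self, reached):
        """Retrieve at least c copies and return the most recently written value."""
        reached = set(reached)
        if len(reached) < self.c:
            raise ValueError("a read must retrieve at least c copies")
        newest = max((self.copies[i] for i in reached), key=lambda cp: cp.stamp)
        return newest.value
```

With c = 2 (three copies), a write to copies {1, 2} is seen by a later read of
copies {0, 1} or {0, 2}, because each read set overlaps the write set.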
Because of the symmetry of the read and write operations, they can be 
thought of and described jointly as access operations. At any point during 
the simulation of a given P-RAM step, a variable being accessed in that 
step is referred to either as live, if fewer than c copies of the variable have 
been accessed thus far, or as dead, if c or more copies of the variable have 
already been accessed. The live copies of a live variable are those that have 
not yet been accessed. Whenever a live variable “dies,” no further attempts 
are made to retrieve those copies that remain unaccessed. In this way, dead 
variables cannot contend for memory. 
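The bookkeeping behind the live/dead distinction is small. The Python sketch
below tracks it for one variable during one simulated step; the class name
`AccessState` and its methods are illustrative assumptions rather than part of
the scheme as published.

```python
class AccessState:
    """Live/dead status of one variable during a single simulated step."""

    def __init__(self, c):
        self.c = c
        self.accessed = set()    # indices of copies accessed so far

    @property
    def live(self):
        # Live while fewer than c copies have been accessed.
        return len(self.accessed) < self.c

    def live_copies(self):
        """Copies still worth requesting: the unaccessed ones, only while live."""
        if not self.live:
            return set()
        return set(range(2 * self.c - 1)) - self.accessed

    def record(self, copy_index):
        """Note a successful access; accesses to a dead variable are ignored."""
        if self.live:
            self.accessed.add(copy_index)
```

Once `live` turns false, `live_copies()` is empty, which models the fact that
a dead variable issues no further requests and so cannot contend for memory.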
The success of this scheme depends on a lemma showing that it is 
possible to map the copies of the variables to the processors such that, for 
any sufficiently small set of live variables, a significant fraction of the live 
copies reside in distinct processors. 
LEMMA 1 (Upfal and Wigderson, 1987). For constant b > 4 and n
sufficiently large, there is a c = O(log m/log b) such that there is a way to
distribute the 2c - 1 copies of each variable among the processors and ensure
that, for any set of q ≤ n/(2c - 1) live variables, the live copies reside in at
least (2c - 1)q/b distinct processors.

[Figure 4: column trees CT(1), CT(2), CT(3), CT(4)]

This same memory map is employed in the deterministic simulation 
scheme of (Luccio et al., 1990). The roots of the row and column trees of 
the 2DMOT are identified (i.e., coalesced), and the processors are located 
at the roots, with the rest of the 2DMOT acting as a switching network to 
route communication packets between the processors. As in (Upfal and
Wigderson, 1987), the processors are organized into n/(2c - 1) clusters of
2c - 1 processors each. The simulation proceeds in two stages: In the first 
stage, the processors of a cluster cooperate to access the copies of the 
variable requested by each processor in the cluster in succession; each 
request is processed in O(log log n) phases, with the requests for the 2c - 1 
cluster variables interleaved in time. It is shown that the first stage leaves 
at most n/(2c - 1) unsatisfied requests (i.e., live variables). The second stage 
is devoted to accessing these variables. Again, the processors of a cluster 
cooperate in requesting the same variable in successive phases, but now 
there is only one variable per cluster, and the copy access requests are 
queued, with O(log n) requests satisfied per phase to match the O(log n)
latency of the 2DMOT. It is shown that O(log n/log log n) phases suffice to
complete this stage. Since each phase (of either stage) takes O(log n) time,
their scheme simulates a P-RAM step in O(log² n/log log n) time.
This matches the time performance of (Herley and Bilardi, 1988) and is 
an improvement in the sense that the 2DMOT is not plagued by the large 
constants of constructive expander graphs. On the other hand, the (Luccio 
et al., 1990) simulation has O(log n) redundancy, as opposed to the 
Θ(log n/log log n) redundancy of (Herley and Bilardi, 1988), and the
objection can be raised that it introduces additional processors (albeit mere
switches) in the interconnection network. Indeed, this raises the question of 
how much can be gained by relaxing some of the restrictions of the BDN 
model. 
The main contribution of this paper is the elucidation of the crucial role 
played by memory granularity on the redundancy required for deter- 
ministic P-RAM simulation. In particular, in Section 2 we will show that a 
variant of the MPC with n processors and M memory modules can
simulate a P-RAM in polylog time with constant redundancy provided that
M = n^(1+ε), for ε > 0; here ε > 0 is the characteristic condition of fine
granularity. Then, in Section 3, we propose a distributed-memory,
bounded-degree network model of computation that allows the use of more memory
modules and the introduction of switches in the interconnection network. 
We show how the 2DMOT architecture can be used in conjunction with 
fine-grain memories to obtain a fast deterministic P-RAM simulation 
scheme with constant redundancy. Our scheme places the processors at the 
roots of the trees, but, in contrast with that of (Luccio et al., 1990),
separates the memory cells from the processors and distributes them 
among the leaves of the 2DMOT. In a P-RAM with a large memory, this 
exploits the 2DMOT in a much more powerful manner by increasing the 
bandwidth to the memory. As a result, by ensuring that the memory 
“granule” is not exceedingly small, the VLSI area occupied by the memory 
of the simulating network (excluding the memory map) is on the same 
order as that occupied by the memory of the P-RAM itself. 
An alternative scheme to achieve constant redundancy has been recently 
proposed by Schuster (1987). This scheme uses the information dispersal- 
recovery method suggested by Rabin (1989), whereby a file of b elements 
of a finite field is recoded into a file of d > b elements from the same field,
with the property that any b of the elements of the latter permit the 
recovery of the original file. The shared memory is subdivided into m/b 
blocks of size b, and data are stored in recoded form (i.e., each stored block 
has size d). A variable belongs to a block; to access a variable it is sufficient 
to access (d + b)/2 terms of its block. By choosing b and d both Θ(log n),
memory size increases only by a constant factor, although as many as
Θ(log n) variables may have to be processed per variable accessed.
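The "any b of d" recovery property can be illustrated with polynomial
evaluation over a prime field. This is one standard way to obtain such a
recoding and is offered only as a sketch; it is not claimed to be the coding
used by Rabin (1989) or by Schuster (1987). The prime P = 257 and the function
names `disperse` and `recover` are assumptions made for the example, and the
sketch does not model the (d + b)/2 access rule.

```python
P = 257  # a small prime; field elements are integers mod P (sketch assumption)


def disperse(block, d):
    """Recode b field elements into d > b elements.

    The block is read as the coefficients of a polynomial of degree b - 1,
    which is evaluated at the points x = 1, ..., d (requires d < P).
    """
    if not len(block) < d < P:
        raise ValueError("need len(block) < d < P")
    return [(x, sum(a * pow(x, k, P) for k, a in enumerate(block)) % P)
            for x in range(1, d + 1)]


def recover(shares, b):
    """Recover the b original elements from any b distinct (x, y) shares."""
    if len(shares) < b:
        raise ValueError("need at least b shares")
    pts = shares[:b]
    coeffs = [0] * b
    for i, (xi, yi) in enumerate(pts):
        basis = [1]      # coefficients (low to high) of prod_{j != i} (x - xj)
        denom = 1        # prod_{j != i} (xi - xj) mod P
        for j, (xj, _) in enumerate(pts):
            if j == i:
                continue
            nxt = [0] * (len(basis) + 1)
            for k, a in enumerate(basis):      # multiply basis by (x - xj)
                nxt[k] = (nxt[k] - a * xj) % P
                nxt[k + 1] = (nxt[k + 1] + a) % P
            basis = nxt
            denom = denom * (xi - xj) % P
        scale = yi * pow(denom, -1, P) % P     # modular inverse, Python 3.8+
        for k, a in enumerate(basis):
            coeffs[k] = (coeffs[k] + a * scale) % P
    return coeffs
```

With b = 3 and d = 6, any three of the six stored terms reproduce the block by
Lagrange interpolation, while storage grows only by the factor d/b = 2.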
2. THE EFFECT OF MEMORY GRANULARITY ON THE REDUNDANCY OF 
DETERMINISTIC P-RAM SIMULATION 
We begin by noting that the MPC and BDN models impose limitations 
on the simulation which may or may not actually correspond to the 
economic/physical constraints of an implementation. In particular, in a 
P-RAM with a large memory, say m = n^(2+δ) for δ > 0, these models force
the module size to be very large, at least m/n = n^(1+δ). Since the only access
to a module is through the associated processor, a great deal of contention
can occur. In other words, we have essentially imported the “von Neumann
bottleneck” from conventional serial computation to the P-RAM simula- 
tion problem. Now, in parallel systems that must be built by inter- 
connecting several conventional von Neumann machines, this may be a 
reasonable constraint. However, in systems that can be built “from 
scratch,” considerable advantage can be gained by distributing the memory 
as an entity separate from the processors. 
This consideration motivates the definition of an alternative model, the 
distributed-memory, module parallel computer (DMMPC) model. In this 
model, the n processors are interconnected to M = rm/g memory modules
by the complete bipartite graph K_{n,M} (Fig. 5). The quantity g is called the
granularity, i.e., the number of memory cells in each module. 
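A small numerical example makes the contrast with the MPC concrete. The Python
sketch below computes the number of modules M = rm/g for a given granularity
and, for comparison, the module size m/n forced by the MPC. The function names
and the rounding up to whole modules are assumptions of the sketch.

```python
def module_count(m, r, g):
    """Number of modules M = r*m/g: r copies of m variables, g cells per module."""
    if g < 1:
        raise ValueError("granularity must be at least 1")
    return -(-(r * m) // g)      # ceiling division (rounding is an assumption)


def mpc_module_size(m, n):
    """Cells per module when the m cells are split among n processors (MPC)."""
    return -(-m // n)


if __name__ == "__main__":
    n = 2 ** 10
    m = n ** 3                   # memory polynomial in n
    print("MPC module size:", mpc_module_size(m, n))
    print("DMMPC modules with r = 3, g = 2**10:", module_count(m, 3, 2 ** 10))
```

Shrinking g raises M without changing the total amount of storage rm, which is
the degree of freedom the DMMPC adds to the MPC.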
The original definition of the MPC model in (Mehlhorn and Vishkin,
1984) actually allows the flexibility of having more memory modules than
processors, but subsequent usage (Alt et al., 1987) restricted the model so
that each memory module is associated with a unique processor, as
described in Section 1. By discriminating between the MPC model and the
DMMPC model, we will avoid any possible confusion.

FIG. 5. The DMMPC model of computation.

In (Mehlhorn and Vishkin, 1984), the effect of memory granularity on 
probabilistic P-RAM simulations on a DMMPC was studied. Mehlhorn 
and Vishkin showed that increasing M simplified the class of hash func- 
tions required to ensure polylog expected-time performance. In this section, 
we undertake a similar study for deterministic P-RAM simulations on 
a DMMPC. We will show that increasing M reduces the redundancy 
required to ensure polylog worst-case-time performance. 
Upfal and Wigderson (1987) proved that the time taken to simulate a 
P-RAM step by any MPC simulation scheme which updates an average of 
p copies of each variable is Ω((m/n)^(1/(2p))). In other words, the redundancy
of a deterministic P-RAM simulation must be Ω(log m/log log m) to ensure
polylog time on an MPC, a bound later achieved by Herley and Bilardi 
(1988). Our claim may be somewhat surprising in view of this result; there- 
fore, we prove here an analogous result for the DMMPC. The following 
theorem demonstrates the critical role played by memory granularity in 
establishing lower bounds on redundancy. 
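The step from the MPC time bound to the redundancy bound can be spelled out as
follows. This is a sketch under two assumptions made here for concreteness:
the time per simulated step is at most (log m)^a for some constant a, and
m ≥ n^(1+γ) for some constant γ > 0, so that log(m/n) = Θ(log m). Constants
hidden by the Ω-notation are ignored.

```latex
\begin{align*}
(m/n)^{1/(2p)} \le (\log m)^{a}
  &\iff \frac{\log(m/n)}{2p} \le a \log\log m \\
  &\iff p \ge \frac{\log(m/n)}{2a \log\log m}
   = \Omega\!\left(\frac{\log m}{\log\log m}\right).
\end{align*}
```

Because the redundancy is at least the average number p of updated copies, the
same bound applies to the redundancy.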
THEOREM 1. Any P-RAM simulation scheme running on a DMMPC 
with n processors, M = n^(1+ε) memory modules, and m = n^k variables requires
redundancy r = Ω((k - 1) log n/(ε log n + log h)) to simulate an arbitrary
P-RAM step in time h, where h < o(n/log m). 
Proof. Call S the collection of all $\binom{M}{n/h-1}$ possible sets of
n/h - 1 memory modules (assuming, for simplicity, that h divides n). No such
set of memory modules contains all the updated copies of n variables; if one
did, then a P-RAM step updating those variables would require simulation
time n/(n/h - 1) > h (since each updated variable must have at least one
updated copy), which is a contradiction. Thus, each member of S can contain
all the updated copies of at most n - 1 variables. The situation can be
modeled by a $\binom{M}{n/h-1} \times m$ 0-1 matrix, the rows of which are
indexed by the sets of S, conventionally numbered from 1 to
$\binom{M}{n/h-1}$, and the columns of which are indexed by the m variables.
The (i, j) entry of the matrix is set equal to 1 if and only if all the
updated copies of the jth variable reside in set number i of S. So, in each
row of this matrix, there can be at most n - 1 ones, which gives a maximum of
$\binom{M}{n/h-1}(n-1)$ ones in the matrix.
Let p be the average number of copies of each variable that are updated.
Clearly, p < r, and the number of variables with 2p or fewer updated copies
is at least m/2. We wish to obtain a lower bound on the number of sets of
S containing the j ≤ 2p updated copies of one such variable. This lower
bound is $\binom{M-j}{n/h-1-j} \ge \binom{M-2p}{n/h-1-2p}$, and it is attained
when each updated copy belongs to a distinct memory module. Thus, in each
column of the matrix corresponding to the variables with 2p or fewer updated
copies, there are at least $\binom{M-2p}{n/h-1-2p}$ ones. Therefore, the
number of ones in the matrix is at least $\frac{m}{2}\binom{M-2p}{n/h-1-2p}$,
and comparing this with the maximum obtained above gives

\[ \frac{m}{2}\binom{M-2p}{n/h-1-2p} \le \binom{M}{n/h-1}(n-1). \]

Manipulating this equation and taking the logarithm yields

\[ p \ge \frac{\log m - \log n - 1}{2[\log(M-2p+1) - \log(n/h-2p)]}, \]

which is satisfied by some

\[ p = \Omega\!\left(\frac{\log m - \log n}{\log M - \log n + \log h}\right) \]

for h < o(n/log m). Since r > p, we obtain finally

\[ r = \Omega\!\left(\frac{(k-1)\log n}{\varepsilon \log n + \log h}\right). \]

When k > 1 and ε > 0 are constants and h is a polynomial in log n, this
generalization of Theorem 4.1 in (Upfal and Wigderson, 1987) only yields
a constant as a lower bound on the redundancy. Note that k = 1
corresponds to the trivial case of one variable per processor (so that no
contention arises) and ε = 0 corresponds to one memory module per
processor. This illustrates the crucial role played by granularity in achieving
constant redundancy. Indeed, constant redundancy is achievable in this 
case, as we will now show. 
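The different behavior for ε = 0 and ε > 0 is easy to see numerically. The
Python sketch below evaluates the expression inside the Ω of Theorem 1,
ignoring the hidden constant; the function name `redundancy_lower_bound` and
the sample choice h = log² n are assumptions made for illustration.

```python
import math


def redundancy_lower_bound(n, k, eps, h):
    """Evaluate (k - 1) log n / (eps log n + log h), base-2 logarithms.

    The constant hidden by the Omega-notation is ignored; h must exceed 1.
    """
    if h <= 1:
        raise ValueError("h must exceed 1")
    log_n = math.log2(n)
    return (k - 1) * log_n / (eps * log_n + math.log2(h))


if __name__ == "__main__":
    k = 3
    for exponent in (10, 20, 40, 80):
        n = 2 ** exponent
        h = math.log2(n) ** 2          # a polylogarithmic time budget
        coarse = redundancy_lower_bound(n, k, 0.0, h)   # one module per processor
        fine = redundancy_lower_bound(n, k, 1.0, h)     # M = n^2 modules
        print(f"n = 2^{exponent}: eps = 0 gives {coarse:.2f}, eps = 1 gives {fine:.2f}")
```

With ε = 0 the value grows like log n/log log n, whereas with ε > 0 it stays
below (k - 1)/ε for every n.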
The algorithm described in (Upfal and Wigderson, 1987) can be used 
with only a minor modification, namely, an improvement in the argument 
parameter c obtained by tightening Lemma 1 for the case M = n^(1+ε).
LEMMA 2. For constants b > 2 and c > (bk - ε)/(ε(b - 2)) and n
sufficiently large, there is a way to distribute the 2c - 1 copies of each
variable among the M memory modules such that, for any set of q < n/(2c - 1)
live variables, the live copies occupy at least (2c - 1)q/b distinct modules.
Proof. A memory map is “bad” if it does not satisfy the conditions of
the lemma, i.e., if there exists some choice of q < n/(2c - 1) variables and
some choice of c live copies of each of these variables such that the live
copies occupy fewer than (2c - 1)q/b memory modules. We show that,
asymptotically, the number of “bad” memory maps is a small fraction f of
the total number of possible memory maps.
There are $\binom{m}{q}$ ways to choose the q live variables, and the number
of ways in which a memory map can be “bad” for a particular set of q
variables is less than

\[ \binom{2c-1}{c}^{q} \binom{M}{(2c-1)q/b}
   \frac{[(2c-1)^2 qm/(bM)]!}{[(2c-1)^2 qm/(bM) - cq]!}\,[(2c-1)m - cq]!. \]

The first factor is the number of ways to choose the live copies of the q live
variables, the second is the number of ways to choose the set of congested
memory modules, the third is the number of ways to map the live copies
to the congested memory modules, and the last is the number of ways to
map the remaining copies of all variables (live or dead). Applying the
union bound and dividing by [(2c - 1)m]!, the total number of possible
memory maps, we obtain

\[ f < \sum_{q} \binom{m}{q} \binom{2c-1}{c}^{q} \binom{M}{(2c-1)q/b}
   \times \frac{[(2c-1)^2 qm/(bM)]!\,[(2c-1)m - cq]!}
               {[(2c-1)^2 qm/(bM) - cq]!\,[(2c-1)m]!}. \]

Using the inequalities $\binom{\alpha}{\beta} < (e\alpha/\beta)^{\beta}$ and (‘,Y) < ,“/A, we can write
Assuming c > (b - 1 )/(b - 2), 
Choosing constants b > 2 and c > (bk - ε)/(ε(b - 2)) > (b - 1)/(b - 2) yields
f<lZ-“flJ since ,I +(ZC- IJlbb(2~-- ~‘Ilh-r 22’ ( /A) is then merely another 
constant. 1 
Applying this lemma to the algorithm in (Upfal and Wigderson, 1987), 
we immediately obtain the following theorem. 
THEOREM 2. An arbitrary step of an n-processor P-RAM with memory
size m = n^k (k > 1) can be simulated on a DMMPC with n processors,
M = n^(1+ε) memory modules, and redundancy r = O((k - ε)/ε) = O(1) in time
O(log n). 
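For concrete parameters, the redundancy promised by Lemma 2 and Theorem 2 can
be computed directly. The Python sketch below returns the smallest integer c
exceeding the threshold of Lemma 2 together with r = 2c - 1; the function name
`copies_per_variable` is an assumption, and the sketch says nothing about how
large n must be for the lemma to apply.

```python
import math


def copies_per_variable(b, k, eps):
    """Smallest integer c > (b*k - eps)/(eps*(b - 2)), and r = 2c - 1.

    b > 2 is the spreading constant of Lemma 2, m = n^k, and M = n^(1 + eps).
    """
    if b <= 2 or eps <= 0 or k <= 1:
        raise ValueError("need b > 2, eps > 0 and k > 1")
    threshold = (b * k - eps) / (eps * (b - 2))
    c = math.floor(threshold) + 1     # strictly greater than the threshold
    return c, 2 * c - 1
```

For example, b = 4, k = 3 and ε = 1 give a threshold of 5.5, hence c = 6 and
r = 11 copies, independently of n.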
3. SIMULATING A P-RAM ON A 2DMOT 
The previous section discussed the effect of memory granularity on the 
redundancy of P-RAM simulations by DMMPCs. It is natural to wonder 
what effect memory granularity has on P-RAM simulations by bounded- 
degree networks. Strictly speaking, however, the BDN model permits only 
M = n memory modules, each associated with a processor. For this reason, 
we propose a distributed-memory, bounded-degree network (DMBDN) 
model of computation. In this model, the n RAM processors are interconnected
to each other and to M = rm/g memory modules by a bounded-degree
network (Fig. 6). Further departing from the restrictions of the
BDN model, we also allow the bounded-degree interconnection network to 
introduce O(m) additional processors, but these are only switches; they 
need not have any computational power. Note that the MPC model and 
the BDN model conceal the existence of m - n similar nodes in the address
decoding circuitry of the memory modules. 
As mentioned earlier, the 2DMOT simulation scheme of (Luccio et al., 
1990) introduced O(n²) additional switching processors. This resulted in a
constructive DMBDN with reasonable constants, but the redundancy
remained O(log n) because the memory granularity was unchanged. One
approach to reducing memory granularity with a DMBDN would be to
implement the algorithm of Section 2 using an n × M 2DMOT as a
crossbar switch between processors and memory modules (Fig. 7). A direct
implementation results in an O(log² n)-time algorithm with redundancy
r = O((k - ε)/ε), while using O(nM) additional switches. The pipelining
strategies of (Luccio et al., 1990) can reduce the time complexity to
O(log² n/log log n). As an extreme case, consider M = m. Obviously,
redundancy r = 1 suffices to achieve O(log n) time in this case. 
Another 2DMOT simulation scheme deploys the M modules at the 
leaves of the 2DMOT and the n processors at the roots of the first n row 
trees, provided M = Ω(n^(2+δ)) (for simplicity, we identify row and column
tree roots, Fig. 8). This simulation scheme, which is admitted by the 
DMBDN model, introduces only O(n + M) = O(M) additional switches, 
but still reduces memory granularity and can thereby achieve constant 
redundancy. 
FIG. 6. The DMBDN model of computation.

FIG. 7. The 2DMOT for constant redundancy P-RAM simulation with O(nM) additional
switches (shaded triangles represent balanced binary row and column trees).

THEOREM 3. If m is polynomial in n and M = n^(2+δ) for constant δ > 0,
then a √M × √M 2DMOT can deterministically simulate a P-RAM step in
O(log² n/log log n) time with redundancy r = O(1).
Proof. The simulation scheme works in essentially the same manner as
that in (Luccio et al., 1990), except for the routing of access requests. When
processor P_l must access a variable copy stored in the memory module M_{i,j},
located in row i and column j, it sends the request down the lth row tree
to the jth leaf. From there, it propagates up to the root of the jth column
tree (provided it does not collide with a conflicting request), whence it is
sent down to the ith leaf, i.e., M_{i,j}. The answered request returns to P_l
simply by reversing this path. Other than this, the simulation proceeds as
in (Luccio et al., 1990), with processors organized as clusters cooperating
to retrieve copies of the variables needed by one another. However, since
there are now effectively M′ = √M = n^(1+δ/2) memory modules (the √M
distinct columns), Lemma 2 can be applied to obtain O(1) redundancy. (In
fact, we can simultaneously access along both rows and columns, which
further reduces the redundancy by a factor of 2, as can be shown by a
modification of Lemma 2.) The time complexity remains O(log² n/log log n)
as in (Luccio et al., 1990). ∎

FIG. 8. The 2DMOT for constant redundancy P-RAM simulation with O(M) additional
switches.

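The three-leg route used in the proof can be written down explicitly. The
Python sketch below lists the waypoints of a request and counts one-way tree
edges; the function name `route`, the 0-based indices, and the tuple encoding
of waypoints are assumptions of the sketch, and collisions between requests
are not modeled.

```python
def route(l, i, j, side):
    """Waypoints and one-way hop count for processor l reaching module (i, j).

    `side` is the number of rows (and columns) of the 2DMOT, a power of 2.
    Indices are 0-based in this sketch.
    """
    if side < 1 or side & (side - 1):
        raise ValueError("side must be a power of 2")
    if not all(0 <= x < side for x in (l, i, j)):
        raise ValueError("indices must lie in [0, side)")
    depth = side.bit_length() - 1          # edges between a root and a leaf
    waypoints = [
        ("root", "row", l),    # the processor sits at the root of row tree l
        ("leaf", l, j),        # down row tree l to its jth leaf
        ("root", "col", j),    # up column tree j to its root
        ("leaf", i, j),        # down column tree j to its ith leaf, the module
    ]
    return waypoints, 3 * depth
```

Each leg traverses one tree from root to leaf or back, so a request covers
three times the tree depth in each direction, which is logarithmic in the
side of the mesh.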
As indicated above, the key property of the 2DMOT that we are 
exploiting here is the fact that a √M × √M 2DMOT provides us with
bandwidth O(√M) for memory access. In contrast, each memory module
in an MPC or BDN has bandwidth 1, despite the fact that they would 
require area O(m/n) and perimeter O(m) in VLSI. The 2DMOT simply 
makes better use of the available perimeter. Furthermore, if g = Ω(log² n),
then the 2DMOT P-RAM simulator can be laid out in O(m) area in VLSI, 
which is clearly optimal. It is also well suited to multi-chip implementations 
since the required interchip connections can all be made on the perimeters 
of the chips. 
4. CONCLUSIONS AND OPEN PROBLEMS 
We have proposed a feasible 2DMOT architecture that performs 
general-purpose computations by simulating a P-RAM. By introducing the 
DMBDN model, we eliminated an unnecessarily severe restriction on the 
memory bandwidth of parallel computers and, thus, reduced the memory 
redundancy required for deterministic P-RAM simulation. Although we 
have removed these restrictions, our 2DMOT architecture still appears to 
be well-suited to VLSI or multichip implementation. 
The DMBDN model gives the designer of general-purpose parallel com- 
puters freedom that might be useful in other ways. For instance, the 
increased memory bandwidth may make it possible to rid deterministic 
P-RAM simulation schemes of the nonconstructive memory map that must 
be stored in each processor. A memory map that could be constructed by 
simple computations within a processor would eliminate the large 
(O(m log ml) bits) address look-up table that each processor must store. 
Failing in this, it may still be possible to simulate a P-ROM, a parallel, 
read-only memory, that would support simultaneous address look-up for 
all processors, and, thus reduce the total look-up table size from 
O(mn log rm) to O(mz log rm) bits. 
The derivation of lower bounds in this new model also poses some 
interesting questions. For instance, the arguments used in (Alt et al., 1987; 
Karlin and Upfal, 1986) to prove the Ω(log² n/log log n) deterministic time
lower bound no longer apply in the DMBDN model. Therefore, it may be 
possible to speed up these DMBDN simulations. It would be interesting to 
derive a corresponding nontrivial lower bound on the time complexity of a
DMBDN simulation of a P-RAM, especially if the point-to-point communication
restriction can be removed.
RECEIVED December 23, 1988; FINAL MANUSCRIPT RECEIVED December 11, 1989
REFERENCES 
ALT, H., HAGERUP, T., MEHLHORN, K., AND PREPARATA, F. P. (1987), Deterministic
simulation of idealized parallel computers on more realistic ones, SIAM J.
Comput. 16, 808.
FORTUNE, S., AND WYLLIE, J. (1978), Parallelism in random access machines, in
“Proceedings of the 10th Annual ACM Symposium on the Theory of Computing, San
Diego, CA, May 1978,” pp. 114-118.
GIFFORD, D. K. (1979), Weighted voting for replicated data, in “Proceedings of
the 7th Annual ACM Symposium on Operating System Principles, Pacific Grove, CA,
Dec. 1979,” pp. 150-159.
HERLEY, K. T., AND BILARDI, G. (1988), Deterministic simulations of PRAMs on
bounded degree networks, in “Proceedings of the 26th Annual Allerton Conference
on Communication, Control and Computing, Monticello, IL, Oct. 1988,”
pp. 1084-1093.
HIRSCHBERG, D. S. (1977), “Fast Parallel Sorting Algorithms,” Technical Report,
Department of Electrical Engineering, Rice University, Houston, TX.
KARLIN, A. R., AND UPFAL, E. (1986), Parallel hashing-An efficient
implementation of shared memory, in “Proceedings of the 18th Annual ACM
Symposium on the Theory of Computing, Berkeley, CA, May 1986,” pp. 160-168.
KARP, R. M., AND RAMACHANDRAN, V. (1988), “A Survey of Parallel Algorithms for
Shared-Memory Machines,” Technical Report UCB/CSD 88/408, Computer Science
Department, University of California at Berkeley, Berkeley, CA.
LEIGHTON, F. T. (1984), New lower bound techniques for VLSI, Math. Systems
Theory 17, 47.
LUCCIO, F., PIETRACAPRINA, A., AND PUCCI, G. (1988), A probabilistic simulation
of PRAMs on a bounded-degree network, Inform. Process. Lett. 28, 141.
LUCCIO, F., PIETRACAPRINA, A., AND PUCCI, G. (1990), A new scheme for the
deterministic simulation of PRAMs in VLSI, Algorithmica 5, 529.
MEHLHORN, K., AND VISHKIN, U. (1984), Randomized and deterministic simulations
of PRAMs by parallel machines with restricted granularity of parallel memories,
Acta Informat. 21, 339.
NATH, D., MAHESHWARI, S. N., AND BHATT, P. C. P. (1983), Efficient VLSI networks
for parallel processing based on orthogonal trees, IEEE Trans. Comput. C-32,
569.
PREPARATA, F. P. (1977), Parallelism in sorting, in “Proceedings of the 1977
International Conference on Parallel Processing, St. Charles, IL, Aug. 1977,”
pp. 202-206.
RABIN, M. O. (1989), Efficient dispersal of information for security, load
balancing, and fault tolerance, J. Assoc. Comput. Mach. 36, 335.
RANADE, A. G. (1987), How to emulate shared memory, in “Proceedings of the 28th
Annual Symposium on the Foundations of Computer Science, Los Angeles, CA, Oct.
1987,” pp. 185-194.
SCHUSTER, A. (1987), How to share memory in a distributed system using a small
extra space, unpublished manuscript, Computer Science Department, Hebrew
University, Jerusalem.
THOMAS, R. H. (1979), A majority consensus approach to concurrency control for
multiple copy databases, ACM Trans. Database Systems, Jun. 1979, 180.
UPFAL, E. (1984), A probabilistic relation between desirable and feasible models
of parallel computation, in “Proceedings of the 16th Annual ACM Symposium on the
Theory of Computing, Washington, D.C., May 1984,” pp. 258-265.
UPFAL, E., AND WIGDERSON, A. (1987), How to share memory in a distributed system,
J. Assoc. Comput. Mach. 34, 116.
