Deterministic P-RAM Simulation with Constant Redundancy * (Preliminary Version) by Scot W. Hornick & Franco P. Preparata
Scot W. Hornick
Andersen Consulting 
Center for Strategic Technology Research 
100 S. Wacker 
Chicago, IL  60606 
Franco P. Preparata 
Coordinated Science Laboratory 
University of Illinois 
1101 W. Springfield
Urbana,  IL  61801 
Abstract:  In  this  paper,  we  show  that distributing  the 
memory of a  parallel computer and, thereby,  decreasing 
its  granularity  allows  a  reduction  in  the  redundancy 
required  to  achieve polylog simulation time  for each P- 
RAM step.  Previously, realistic  models of parallel com- 
putation assigned one memory module to each processor 
and,  as  a  result,  insisted  on  relatively  coarse-grain 
memory.  We propose,  on the other hand, a  more flexi- 
ble,  but  equally  valid  model  of  computation,  the 
distributed-memory, bounded-degree network  (DMBDN) 
model.  This model allows the use of fine-grain  memory 
while maintaining the realism of a bounded-degree  inter- 
connection  network.  We describe a  P-RAM simulation 
scheme,  which  is  admitted  under  the  DMBDN  model, 
that exploits  the  increased  memory bandwidth  provided 
by  a  two-dimensional mesh  of trees  (2DMOT)  network 
to  achieve  an  overhead  in  memory  redundancy  lower 
than  that  required  by  other  fast,  deterministic P-RAM 
simulations.  Specifically, for a  deterministic  simulation 
of an n-processor P-RAM on a bounded-degree network, 
we are able to reduce the number of copies of each variable
from $O(\log n/\log\log n)$ to $\Theta(1)$ and still simulate
each P-RAM step in polylog time.
1.  Introduction 
Considerable research has been devoted to developing
general-purpose architectures that exploit the parallelism
offered by modern integration technology. A popular
theoretical approach to this problem has been the design of
processor networks for the simulation of abstract models of
computation like the parallel random-access machine (P-RAM)
model [U84, UW84, MV84, KU86, AHMP87, R87, LPP88, LPP89].
The P-RAM model of computation, formalized by Fortune and
Wyllie [FW78] and used even earlier by Hirschberg [H77] and
Preparata [P77], has been a valuable tool for theoretical
computer scientists studying the power and fundamental
limitations of parallelism (see [KR88] for a survey of
results). By assuming the existence of a shared memory
accessible to all processors in $O(1)$ time, the P-RAM model
trivializes the problem of inter-processor communication to
reveal the inherent parallelism in a problem and facilitate
the development of parallel algorithms.

* Supported in part by the Semiconductor Research Corporation
under contract 87-DP-109.
Formally, a  P-RAM consists  of n  sequential  pro- 
cessors  (RAMs)  and m  shared  memory cells  (Figure  1). 
These  processors  operate  synchronously  and,  at  each 
step,  fetch an instruction  from a  private  RAM and exe- 
cute  it.  Executing  these  instructions  may  require 
accesses  to  the  shared  memory, and,  in  particular,  the 
processors  may all simultaneously read from or write  to 
the shared  memory at any given step.  Several  variants 
of the P-RAM have been  defined, each differing in the 
convention applied to handle read/write conflicts, i.e.,
attempts by more than one processor  to access  the same 
memory  cell  in  the  same  step.  P-RAMs  are  either 
exclusive-read (ER)  or concurrent-read (CR)  and either 
exclusive-write  (EW) or  concurrent-write  (CW).  The 
most restrictive  of these is the EREW P-RAM in which 
no memory cell may be accessed  by more than one pro- 
cessor  in  a  given  step.  The  least  restrictive  is  the 
CRCW P-RAM, in which simultaneous  reading and writ- 
ing of memory cells is allowed, with some rule defining 
the exact semantics  of simultaneous writes. 
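The access conventions above can be illustrated with a minimal Python sketch. It is not part of the original paper: the function names, the list-based encoding of a step (entry i is the cell touched by processor i, or None), and the "lowest index wins" write rule are all assumptions chosen for illustration. The sketch also assumes that each step has a read phase followed by a write phase, so a CREW step only forbids two writers on the same cell.

```python
from collections import defaultdict


def processors_per_cell(*access_lists):
    """Map each touched cell to the set of processor indices touching it."""
    touched = defaultdict(set)
    for accesses in access_lists:
        for proc, cell in enumerate(accesses):
            if cell is not None:
                touched[cell].add(proc)
    return touched


def step_is_legal(reads, writes, variant):
    """Check one synchronous step against a P-RAM variant.

    reads[i] / writes[i] is the cell processor i reads / writes, or None.
    """
    if variant == "CRCW":
        return True  # any pattern is allowed; a rule resolves the writes
    if variant == "CREW":
        groups = processors_per_cell(writes)  # only writes must be exclusive
    elif variant == "EREW":
        groups = processors_per_cell(reads, writes)  # every access exclusive
    else:
        raise ValueError(f"unknown variant: {variant}")
    return all(len(procs) <= 1 for procs in groups.values())


def crcw_write(memory, writes, values):
    """Resolve concurrent writes with one possible rule: lowest index wins."""
    for proc in reversed(range(len(writes))):
        if writes[proc] is not None:
            memory[writes[proc]] = values[proc]
    return memory
```

For example, two processors reading cell 0 in the same step is rejected by `step_is_legal(..., "EREW")` but accepted under `"CREW"`.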
Figure 1. The P-RAM model of computation (diagram label: shared memory)
©  1989 ACM  0-89791-323-X/89/0006/0103  $1.50 
Of course, the P-RAM model itself is not technologically
feasible for a large number of processors.
Therefore,  research  has  been  directed  toward  simulating 
the  P-RAM model on more realistic models of computa- 
tion,  models that account for communication costs.  The 
two  most common among these are  the module parallel 
computer (MPC) model [MV84] and the bounded-degree
network  (BDN)  model  [AHMP87].  The  MPC  model 
takes  into account the fact that it is not feasible for any 
more  than  a constant number of processors  to simultane- 
ously access  the same memory module.  It consists of n 
RAM processors, each equipped with a  memory module 
containing m/n  memory cells,  and all interconnected by 
the complete graph $K_n$ (Figure 2). However, this model
itself is not feasible because the complete graph interconnecting
the processors cannot be realized for large n
without  unbounded  fan-in  or  fan-out.  This  led  to  the 
consideration of the  BDN model, in which each proces- 
sor is  linked directly to only a  constant number of other 
processors  (Figure 3). 
Figure 2. The MPC model of computation (diagram label: complete network)
Figure 3. The BDN model of computation (diagram label: bounded-degree network)
In [MV84], Mehlhorn and Vishkin showed that the MPC model
can probabilistically simulate T steps of a P-RAM in
$O(T \log n)$ time by using universal hashing and by
increasing the capacity of each memory module to
$O(m/n \log n)$. Upfal proved a similar result in [U84],
where he also gave an $O(T \log^2 n)$ time probabilistic
simulation on a BDN. This result was subsequently improved
by Karlin and Upfal, who described a $\Theta(T \log n)$ time
probabilistic simulation on a BDN [KU86] (which is optimal
with respect to time), and by Ranade, who reduced the size
of the queues used in the simulation from $O(\log n)$ to
$O(1)$ [R87].
The  first  reasonable deterministic  P-RAM  simula- 
tion  on  an  MPC  was  that  of  Upfal  and  Wigderson 
[UW84],  which  uses  O(logn)  time-stamped  copies  of 
each variable to simulate each P-RAM step in
$O(\log n (\log\log n)^2)$ time (assuming m is polynomial in
n). Alt et al. improved this upper bound to $O(\log m)$
time and used this simulation along with a sorting network
to give an $O(\log n \log m)$ time simulation on a BDN
[AHMP87]. They also proved a lower bound of
$\Omega(\min\{\sqrt{n}\log n, \log n \log m/\log\log m\})$ on the time
required to simulate a P-RAM step on a BDN if all communication
is required to be point-to-point, i.e., if a processor
has to send a separate message to update each
copy  of  a  variable.  (The  same  result  was  obtained 
independently in [KU86].)  Recently, Herley and Bilardi 
achieved this time lower bound and reduced the redundancy
r, i.e., the required number of copies of each variable,
to $r = O(\log m/\log\log m)$ by using bounded-degree
networks based on certain expander graphs [HB88]. 
Luccio et al. have recently suggested the two-dimensional
mesh of trees (2DMOT) [NMB83] as a practical bounded-degree
network for the simulation of P-RAMs with m polynomial in n.
In [LPP88], they proposed this network for the probabilistic
simulation of P-RAMs, and in [LPP89], they proposed it for
deterministic simulation, describing a scheme that simulates
a P-RAM step in $O(\log^2 n/\log\log n)$ time on a 2DMOT.
This matches the time performance of [HB88] and is an
improvement in the sense that the 2DMOT is not plagued by
the large constants of constructive expander graphs. On the
other hand, the [LPP89] simulation has $O(\log n)$
redundancy, as opposed to the $\Theta(\log n/\log\log n)$
redundancy of [HB88], and the objection can be raised that
it introduces additional processors (albeit mere switches)
in the interconnection network. Indeed, this raises the
question of how much can be gained by relaxing some of the
restrictions of the BDN model.
The main contribution of this paper is the elucida- 
tion of the crucial role played by memory granularity on 
the redundancy required for deterministic P-RAM simu- 
lation.  In particular, in  Section  2  we  will  show  that  a 
variant of the  MPC  with n  processors  and M  memory 
modules  can  simulate  a  P-RAM  in  polylog  time  with 
constant redundancy provided that $M = n^{1+\epsilon}$ for $\epsilon > 0$;
here $\epsilon > 0$ is the characteristic condition of fine granularity.
Then, in Section 3, we propose the distributed-
memory, bounded-degree network model of computation 
that  allows  the  use  of more  memory  modules  and  the 
introduction of switches in  the  interconnection network. 
We show how  the 2DMOT architecture can  be used in 
conjunction  with  fine-grain  memories  to  obtain  a  fast 
deterministic  P-RAM  simulation  scheme  with  constant 
redundancy.  Our  scheme  places  the  processors  at  the 
roots of the trees, but,  in contrast with that of [LPP89], 
separates  the memory cells from the processors and dis- 
tributes  them among the leaves of the 2DMOT.  In a P- 
RAM with a large memory, this exploits the 2DMOT in 
a  much  more  powerful  manner  by  increasing  the 
bandwidth to the memory.  As a  result,  by ensuring that 
the  memory  "granule"  is  not  exceedingly  small,  the 
VLSI  area  occupied  by  the  memory of the  simulating 
network  (excluding  the  memory  map)  is  on  the  same 
order as that occupied by the memory of the P-RAM
itself. 
2.  The Effect of Memory Granularity on the Redundancy of Deterministic P-RAM Simulation
We begin by noting that the MPC and BDN models impose
limitations on the simulation which may or may not actually
correspond to the economic/physical constraints of an
implementation. In particular, in a P-RAM with a large
memory, say $m = n^{2+\delta}$ for $\delta > 0$, these models
force the module size to be very large, at least
$m/n = n^{1+\delta}$. Since the only access to a module is
through the associated processor, a great deal of contention
can occur. In other words, we have essentially imported the
"von Neumann bottleneck" from conventional serial computation
to the P-RAM simulation problem. Now, in parallel systems
that must be built by interconnecting several conventional
von Neumann machines, this may be a reasonable constraint.
However, in systems that can be built "from scratch,"
considerable advantage can be gained by distributing the
memory as an entity separate from the processors.
This  consideration  motivates  the  definition  of an 
alternative  model,  the distributed-memory,  module paral- 
lel  computer  (DMMPC)  model.  In  this  model,  the  n 
processors are interconnected to $M = rm/g$ memory
modules by the complete bipartite graph $K_{n,M}$ (Figure 4).
The quantity g  is called the granularity,  i.e.,  the number 
of memory cells in each module.  The original  definition 
of the  MPC model in  [MV84]  actually  allows the flexi- 
bility  of having  more memory modules  than  processors, 
but subsequent usage  [AHMP87] restricted the model so 
that  each  memory  module  is  associated  with  a  unique 
processor, as  described  in  Section  1.  By discriminating 
between  the  MPC  model  and  the  DMMPC  model,  we 
will avoid any possible confusion. 
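To make the role of the granularity g concrete, the following Python sketch compares the module size forced by the MPC model with the module count of a DMMPC. The class name `MemoryLayout`, its method names, and the sample parameters are hypothetical and chosen only for illustration; the sketch assumes that all r copies of every variable are stored, so an MPC module holds about rm/n cells.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryLayout:
    n: int  # number of processors
    m: int  # number of P-RAM shared memory cells
    r: int  # redundancy (copies kept of each variable)

    def mpc_module_size(self) -> int:
        """Cells per module when each of the n processors owns one module."""
        return (self.r * self.m + self.n - 1) // self.n

    def dmmpc_modules(self, g: int) -> int:
        """Number of modules M = r*m/g for granularity g (rounded up)."""
        return (self.r * self.m + g - 1) // g


layout = MemoryLayout(n=2**10, m=2**30, r=3)   # m = n^3
print(layout.mpc_module_size())                # cells behind each processor
print(layout.dmmpc_modules(g=2**12))           # modules of 4096 cells each
```

With these sample values every MPC processor fronts a module of several million cells, whereas the DMMPC spreads the same cells over many more modules than processors.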
Figure 4. The DMMPC model of computation
In  [MV84],  the  effect  of  memory  granularity  on 
probabilistic P-RAM simulations on a DMMPC was studied.
Mehlhorn and Vishkin showed that increasing M
simplified the  class  of hash  functions required  to ensure 
polylog expected-time  performance.  In  this  section,  we 
undertake a  similar study for deterministic P-RAM simu- 
lations  on a  DMMPC.  We will  show that  increasing  M 
reduces  the  redundancy  required  to  ensure  polylog 
worst-case-time performance. 
Upfal and Wigderson proved that the time taken to
simulate a P-RAM step by any MPC simulation scheme
which updates an average of $\rho$ copies of each variable is
$\Omega((m/n)^{1/(2\rho)})$ [UW84]. In other words, the redundancy
of a deterministic P-RAM simulation must be
$\Omega(\log m/\log\log m)$ to ensure polylog time on an MPC, a
bound later achieved by Herley and Bilardi [HB88]. Our
claim may be somewhat surprising  in view of this result; 
therefore,  we  give  here  an  analogous  result  for  the 
DMMPC.  The following theorem demonstrates the criti- 
cal  role  played  by  memory  granularity  in  establishing 
lower bounds on redundancy. 
Theorem 1: Any P-RAM simulation scheme running
on a DMMPC with n processors, $M = n^{1+\epsilon}$ memory
modules, and $m = n^k$ variables requires redundancy

$$r = \Omega\left(\frac{(k-1)\log n}{\epsilon\log n + \log h}\right)$$

to simulate an arbitrary P-RAM step in time h, where $h \le o(n/\log m)$.
Proof: Call S the collection of all $\binom{M}{n/h-1}$ possible
sets of $n/h - 1$ memory modules (assuming, for simplicity,
that h divides n). No such set of memory modules contains all
the updated copies of n variables; if one did, then a P-RAM
step updating those variables would require simulation time
$\frac{n}{n/h-1} > h$ (since each updated variable must have
at least one updated copy), which is a contradiction. Thus,
each member of S can contain all the updated copies of at
most $n - 1$ variables. The situation can be modeled by a
$\binom{M}{n/h-1} \times m$ 0-1 matrix, the rows of which are
indexed by the sets of S, conventionally numbered from 1 to
$\binom{M}{n/h-1}$, and the columns of which are indexed by
the m variables. The (i, j) entry of the matrix is set equal
to 1 if and only if all the updated copies of the jth
variable reside in set number i of S. So, in each row of this
matrix, there can be at most $n - 1$ ones, which gives a
maximum of $\binom{M}{n/h-1}(n-1)$ ones in the matrix.

Let $\rho$ be the average number of copies of each variable
that are updated. Clearly, $\rho \le r$, and the number of
variables with $2\rho$ or fewer updated copies is at least
m/2. We wish to obtain a lower bound on the number of sets of
S containing the $j \le 2\rho$ updated copies of one such
variable. This lower bound is

$$\binom{M-j}{n/h-1-j} \ge \binom{M-2\rho}{n/h-1-2\rho},$$

and it is attained when each updated copy belongs to a
distinct memory module. Thus, in each column of the matrix
corresponding to the variables with $2\rho$ or fewer updated
copies, there are at least $\binom{M-2\rho}{n/h-1-2\rho}$
ones. Therefore, the number of ones in the matrix is at least
$\frac{m}{2}\binom{M-2\rho}{n/h-1-2\rho}$. Comparing with the
maximum number of ones gives

$$\binom{M}{n/h-1}(n-1) \ge \frac{m}{2}\binom{M-2\rho}{n/h-1-2\rho},$$

which implies

$$\frac{\binom{M}{n/h-1}}{\binom{M-2\rho}{n/h-1-2\rho}} \ge \frac{m}{2(n-1)}.$$

The ratio on the left is a product of $2\rho$ factors, each at
most $(M-2\rho+1)/(n/h-2\rho)$. Manipulating this inequality
and taking the logarithm yields

$$\rho > \frac{\log m - \log n - 1}{2[\log(M-2\rho+1) - \log(n/h-2\rho)]},$$

which is satisfied by some

$$\rho = \Omega\left(\frac{\log m - \log n}{\log M - \log n + \log h}\right)$$

for $h \le o(n/\log m)$. Since $r \ge \rho$, we obtain finally

$$r = \Omega\left(\frac{(k-1)\log n}{\epsilon\log n + \log h}\right). \qquad \Box$$
When $k > 1$ and $\epsilon > 0$ are constants and h is a
polynomial in $\log n$, this generalization of Theorem 4.1 in
[UW84] only yields a constant as a lower bound on the
redundancy. Note that $k = 1$ corresponds to the trivial
case of one variable per processor (so that no contention
arises) and $\epsilon = 0$ corresponds to one memory module per
processor. This illustrates the central importance of
memory granularity in establishing lower bounds on
redundancy. Indeed, constant redundancy is achievable
in this case, as we will now show.
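As a rough numerical illustration of this remark, the sketch below evaluates the expression $(k-1)\log n/(\epsilon\log n + \log h)$ from Theorem 1 with the hidden constant of the $\Omega$ ignored. The function names and the choice $h = \log^2 n$ as a sample polylog time budget are assumptions made only for this example.

```python
import math


def redundancy_lower_bound(n: int, k: float, eps: float, h: float) -> float:
    """Evaluate (k-1) log n / (eps log n + log h); Omega's constant is ignored."""
    log_n = math.log2(n)
    return (k - 1) * log_n / (eps * log_n + math.log2(h))


def polylog_time(n: int) -> float:
    """A hypothetical polylog time budget, h = (log n)^2."""
    return math.log2(n) ** 2


for n in (2**16, 2**32, 2**64):
    h = polylog_time(n)
    coarse = redundancy_lower_bound(n, k=3, eps=0.0, h=h)  # one module per processor
    fine = redundancy_lower_bound(n, k=3, eps=1.0, h=h)    # M = n^2 modules
    print(f"log n = {math.log2(n):.0f}: eps=0 -> {coarse:.2f}, eps=1 -> {fine:.2f}")
```

With $\epsilon = 0$ the value keeps growing with n (like $\log n/\log\log n$), while for any fixed $\epsilon > 0$ it stays below $(k-1)/\epsilon$.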
Following the strategy first proposed by Thomas [T79] and
Gifford [G79] in the context of distributed database theory
and later adopted in [UW84] in the context of P-RAM
simulation, our simulation scheme distributes $r = 2c - 1$
copies of each P-RAM variable u among the M memory modules
of the DMMPC. Stored along with each copy is a time stamp
indicating the time at which the copy was last written.
Whenever u is written, at least c copies of it are updated.
Whenever u is read, at least c copies of it are retrieved,
with the correct value given by the copy having the most
recent time stamp. Since any two sets of c copies out of
$2c - 1$ must intersect, every read retrieves at least one
copy updated by the most recent write. However, whereas
$c = O(\log n)$ in [UW84], we will show that $c = O(1)$
suffices here.
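The time-stamped majority rule can be sketched in a few lines of Python. This is only an illustration of the read/write discipline for a single variable, not the simulation algorithm of [UW84]: the class name `ReplicatedVariable`, its methods, and the explicit `indices` argument (which copies an operation manages to reach) are hypothetical.

```python
class ReplicatedVariable:
    """One P-RAM variable kept as 2c - 1 time-stamped copies."""

    def __init__(self, c, initial=None):
        self.c = c
        # Each copy is a (time stamp, value) pair; stamp 0 marks the initial value.
        self.copies = [(0, initial)] * (2 * c - 1)

    def write(self, value, time, indices):
        """Update the copies at `indices`; at least c distinct copies are required."""
        indices = set(indices)
        if len(indices) < self.c:
            raise ValueError("a write must update at least c copies")
        for i in indices:
            self.copies[i] = (time, value)

    def read(self, indices):
        """Inspect at least c distinct copies and return the most recent value."""
        indices = set(indices)
        if len(indices) < self.c:
            raise ValueError("a read must retrieve at least c copies")
        stamp, value = max((self.copies[i] for i in indices), key=lambda tv: tv[0])
        return value


u = ReplicatedVariable(c=2)        # 3 copies, any 2 of them form a majority
u.write("a", time=1, indices=[0, 1])
u.write("b", time=2, indices=[1, 2])
print(u.read([0, 1]))              # copy 0 is stale, copy 1 carries the newer stamp
```

Because a read set and the latest write set each contain c of the 2c - 1 copies, they share at least one copy, which is why the newest time stamp always identifies the current value.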
In fact, the algorithm described in [UW84] can be
used with only a minor modification, namely, an
improvement in the argument parameter c obtained by
tightening Lemma 3.2 of [UW84] for the case $M = n^{1+\epsilon}$.

Lemma 1: For constants $b > 2$ and
$c > \frac{k - 1 - \epsilon/b}{\epsilon(1 - 2/b)}$
and n sufficiently large, there is a way to distribute the
$2c - 1$ copies of each variable among the M memory
modules such that, for any set of $q \le n/(2c-1)$ live
variables, the live copies occupy at least $(2c-1)q/b$ distinct
modules.
Proof: A memory map is "bad" if it does not satisfy the
conditions of the lemma, i.e., if there exists some
choice of $q \le n/(2c-1)$ variables and some choice of c
live copies of each of these variables such that the live
copies occupy fewer than $(2c-1)q/b$ memory modules.
We show that, asymptotically, the number of "bad"
memory maps is a small fraction f of the total number
of possible memory maps.

There are $\binom{m}{q}$ ways to choose the q live variables, and
the number of ways in which a memory map can be
"bad" for a particular set of q variables is less than

$$\binom{2c-1}{c}^{q} \binom{M}{(2c-1)q/b} \frac{[(2c-1)^2 qm/(bM)]!}{[(2c-1)^2 qm/(bM) - cq]!} \, [(2c-1)m - cq]!.$$

The first factor is the number of ways to choose the live
copies of the q live variables, the second is the number
of ways to choose the set of congested memory modules,
the third is the number of ways to map the live copies to
the congested memory modules, and the last is the
number of ways to map the remaining copies of all
variables (live or dead). Applying the union bound and
dividing by $[(2c-1)m]!$, the total number of possible
memory maps, we obtain

$$f < \sum_{q \le n/(2c-1)} \binom{m}{q} \binom{2c-1}{c}^{q} \binom{M}{(2c-1)q/b} \frac{[(2c-1)^2 qm/(bM)]!}{[(2c-1)^2 qm/(bM) - cq]!} \cdot \frac{[(2c-1)m - cq]!}{[(2c-1)m]!}.$$

Using the inequalities $\binom{x}{y} < (ex/y)^y$ and
$\binom{2c-1}{c} < 2^{2c}$, we can write

$$f < \sum_{q \le n/(2c-1)} \left(\frac{em}{q}\right)^{q} 2^{2cq} \left(\frac{ebM}{(2c-1)q}\right)^{(2c-1)q/b} \left(\frac{(2c-1)q}{bM}\right)^{cq}.$$

The maximum term for this summation occurs when
$q = n/(2c-1)$, so

$$f < \frac{n}{2c-1} \left[ \frac{e(2c-1)m}{n} \, 2^{2c} \left(\frac{ebM}{n}\right)^{(2c-1)/b} \left(\frac{n}{bM}\right)^{c} \right]^{n/(2c-1)}
= \frac{n}{2c-1} \left[ \frac{e(2c-1)2^{2c}(eb)^{(2c-1)/b}}{b^{c}} \, n^{k-1+\epsilon(2c-1)/b-\epsilon c} \right]^{n/(2c-1)}.$$

Choosing constants $b > 2$ and
$c > \frac{k - 1 - \epsilon/b}{\epsilon(1 - 2/b)}$ yields
$f < n^{-\Omega(n)}$, since
$\frac{e(2c-1)2^{2c}(eb)^{(2c-1)/b}}{b^{c}}$ is then merely
another constant. □
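The condition on c in Lemma 1 can be checked numerically. The sketch below assumes our reading of the final bound in the proof, in which the power of n inside the bracket is $k - 1 + \epsilon(2c-1)/b - \epsilon c$ after substituting $m = n^k$ and $M = n^{1+\epsilon}$; the function names are hypothetical. Exact rational arithmetic is used so that the strict inequality on c is not blurred by rounding.

```python
import math
from fractions import Fraction


def n_exponent(k, eps, b, c):
    """Power of n in the bracketed term: k - 1 + eps*(2c-1)/b - eps*c."""
    k, eps, b = Fraction(k), Fraction(eps), Fraction(b)
    return k - 1 + eps * (2 * c - 1) / b - eps * c


def smallest_c(k, eps, b):
    """Smallest integer c with c > (k - 1 - eps/b) / (eps * (1 - 2/b))."""
    k, eps, b = Fraction(k), Fraction(eps), Fraction(b)
    threshold = (k - 1 - eps / b) / (eps * (1 - 2 / b))
    return max(1, math.floor(threshold) + 1)


# Sample parameters (assumed for illustration): m = n^3, M = n^2, b = 4.
c = smallest_c(k=3, eps=1, b=4)
print(c, 2 * c - 1, n_exponent(3, 1, 4, c))
```

A negative exponent is what drives the fraction of bad memory maps to zero; with one fewer copy group (c - 1) the exponent in this example becomes positive, so the bound no longer vanishes.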
Applying this lemma to the algorithm in [UW84], we
immediately obtain the following theorem.
Theorem 2: An arbitrary step of an n-processor P-RAM
with memory size $m = n^k$ ($k > 1$) can be simulated
on a DMMPC with n processors, $M = n^{1+\epsilon}$
memory modules, and redundancy $r = O((k-1)/\epsilon) = O(1)$
in time $O(\log n)$.
3.  Simulating  a  P-RAM  on  a  2DMOT 
The  previous  section  discussed  the  effect  of 
memory granularity on the redundancy of P-RAM simu- 
lations by DMMPCs.  It is natural to wonder what effect 
memory  granularity  has  on  P-RAM  simulations  by 
bounded-degree  networks.  Strictly  speaking,  however, 
the  BDN  model permits  only M  = n  memory modules, 
each  associated  with a  processor.  For  this  reason,  we 
propose  a  distributed-memory,  bounded-degree  network 
(DMBDN)  model of computation.  In this model, the n 
RAM processors  are  interconnected to each other and to 
M  = rm/g  memory  modules  by  a  bounded-degree 
network (Figure  5).  Further departing  from the restric- 
tions  of the  BDN  model,  we  also  allow  the  bounded- 
degree  interconnection  network to introduce  O (m) addi- 
tional processors,  but these are  only switches;  they need 
not have  any computational power.  Note  that  both  the 
MPC  model and  the BDN model conceal  the  existence 
of m  -  n  similar nodes in the address  decoding circuitry 
of the memory modules. 
Figure 5. The DMBDN model of computation (diagram label: bounded-degree network, possibly with additional switches)
As mentioned earlier, the 2DMOT simulation scheme of
[LPP89] introduced $O(n^2)$ additional switching processors.
This resulted in a constructive DMBDN with reasonable
constants, but the redundancy remained $O(\log n)$ because
the memory granularity was unchanged. One approach to
reducing memory granularity with a DMBDN would be to
implement the algorithm of Section 2 using an $n \times M$
2DMOT as a crossbar switch between processors and memory
modules (Figure 6). A direct implementation results in an
$O(\log^2 n)$-time algorithm with redundancy
$r = O((k-1)/\epsilon)$, while using $O(nM)$ additional
switches. By applying the pipelining strategies of [LPP89],
the time complexity can be reduced to
$O(\log^2 n/\log\log n)$. As an extreme case, consider
$M = m$. Obviously, redundancy $r = 1$ suffices to




Figure 6. The 2DMOT for constant redundancy P-RAM
simulation with $O(nM)$ additional switches (shaded triangles
represent balanced binary row and column trees)

Another 2DMOT simulation scheme deploys the M modules at
the leaves of the 2DMOT and the n processors at the roots of
the first n row trees, provided $M = \Omega(n^{2+\delta})$
(for simplicity, we identify row and column tree roots; see
Figure 7). This simulation scheme, which is admitted by the
DMBDN model, introduces only $O(n + M) = O(M)$ additional
switches, but still reduces memory granularity and can
thereby achieve constant redundancy.
Theorem 3: If m is polynomial in n and $M = n^{2+\delta}$ for
constant $\delta > 0$, then a $\sqrt{M} \times \sqrt{M}$ 2DMOT can
deterministically simulate a P-RAM step in
$O(\log^2 n/\log\log n)$ time with redundancy $r = O(1)$.
Proof: The simulation scheme works in essentially the
same manner as that in [LPP89], except for the routing
of access requests. When processor $P_l$ must access a
variable copy stored in the memory module $M_{i,j}$ located
in row i and column j, it sends the request down the lth
row tree to the jth leaf. From there, it propagates up to
the root of the jth column tree (provided it does not
collide with a conflicting request), whence it is sent down
to the ith leaf, i.e., $M_{i,j}$. The answered request
returns to $P_l$ simply by reversing this path. Other than
this, the simulation proceeds as in [LPP89], and their
analysis follows. Lemma 1 can be applied to obtain $O(1)$
redundancy. (In fact, we can simultaneously access along
both rows and columns, which further reduces the redundancy
by a factor of 2, as can be shown by a modification of
Lemma 1.) The time complexity remains
$O(\log^2 n/\log\log n)$ as in [LPP89]. □
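The three-leg route described in the proof can be traced with a short Python sketch. The node labelling is an assumption made for illustration: a tree node is written (kind, tree index, level, position), with level 0 the root and level `depth` the leaves of a $2^{depth} \times 2^{depth}$ mesh of trees, and the leaf in row l and column j is treated as shared by row tree l and column tree j. Collisions and pipelining are not modelled.

```python
def tree_path_down(depth, leaf):
    """(level, position) pairs from the root of a complete binary tree to `leaf`."""
    return [(level, leaf >> (depth - level)) for level in range(depth + 1)]


def route(l, i, j, depth):
    """Nodes visited by a request from processor P_l to module M[i][j]."""
    # Leg 1: down row tree l, from its root to the leaf in column j.
    down_row = [("row", l, lvl, pos) for lvl, pos in tree_path_down(depth, j)]
    # Leg 2: up column tree j, from the leaf in row l to the column root.
    up_col = [("col", j, lvl, pos) for lvl, pos in reversed(tree_path_down(depth, l))]
    # Leg 3: down column tree j, from its root to the leaf in row i.
    down_col = [("col", j, lvl, pos) for lvl, pos in tree_path_down(depth, i)]
    # The shared leaf and the column root are each counted once.
    return down_row + up_col[1:] + down_col[1:]


path = route(l=2, i=5, j=6, depth=3)   # an 8 x 8 mesh of trees
print(len(path) - 1)                   # number of hops, 3 * depth
```

Each request therefore travels 3 log √M hops, which is $O(\log n)$ when M is polynomial in n; the reply retraces the same nodes in reverse order.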
As indicated above, the key property of the 2DMOT that we
are exploiting here is the fact that a
$\sqrt{M} \times \sqrt{M}$ 2DMOT provides us with bandwidth
$O(\sqrt{M})$ for memory access. In contrast, each memory
module in an MPC or BDN has bandwidth 1, despite the fact
that they would require area $O(m/n)$ and perimeter
$O(\sqrt{m/n})$ in VLSI. The 2DMOT simply makes better use
of the available perimeter. Furthermore, if
$g = \Omega(\log^2 n)$, then the 2DMOT P-RAM simulator can
be laid out in $O(m)$ area in VLSI, which is clearly
optimal. It is also well-suited to multi-chip
implementations since the required interchip connections can
all be made on the perimeters of the chips.
4.  Conclusions and Open Problems 
We  have proposed a  feasible 2DMOT architecture 
that performs  general-purpose  computations  by  simulat- 
ing a  P-RAM.  By introducing the DMBDN  model,  we 
eliminated  an  unnecessarily  severe  restriction  on  the 
memory  bandwidth  of  parallel  computers  and,  thus, 
reduced the memory redundancy required for determinis- 
tic P-RAM simulation.  Although we have removed these 
restrictions, our 2DMOT  architecture still appears  to be 
well-suited to VLSI or multichip implementation. 
The DMBDN model gives the designer of general-purpose
parallel computers freedom that might be useful in other
ways. For instance, the increased memory bandwidth may make
it possible to rid deterministic P-RAM simulation schemes of
the nonconstructive memory map. A memory map that could be
constructed by simple computations within a processor would
eliminate the large ($O(m \log rm)$ bits) address look-up
table that each processor must store. Failing in this, it
may still be possible to simulate a P-ROM, a parallel,
read-only memory, that would support the simultaneous
address look-up for all processors, and, thus, reduce the
total look-up table size from $O(mn \log rm)$ to
$O(rm \log rm)$ bits.

Figure 7. The 2DMOT for constant redundancy P-RAM simulation with $O(M)$ additional switches
The  derivation  of lower bounds in this  new model 
also poses some interesting  questions.  For instance,  the 
arguments  used in  [AHMP87] and  [KU86]  to prove the 
$\Omega(\log^2 n/\log\log n)$ deterministic time lower bound no
longer  apply  in  the DMBDN model.  Therefore,  it may 
be  possible to  speed up  these  DMBDN simulations.  It 
would be interesting  to derive a corresponding nontrivial 
lower bound on the time complexity of a DMBDN simu- 
lation of a  P-RAM, especially if the point-to-point com- 









References

[AHMP87] H. Alt, T. Hagerup, K. Mehlhorn, and F. P.
Preparata, "Deterministic Simulation of
Idealized Parallel Computers on More Realistic
Ones," SIAM Journal on Computing,
vol. 16, no. 5, Oct. 1987, pp. 808-835.
[FW78] S. Fortune and J. Wyllie, "Parallelism in
Random Access Machines," Proceedings of
the 10th Annual ACM Symposium on the
Theory of Computing, San Diego, CA, May
1978, pp. 114-118.
[G79] D. K. Gifford, "Weighted Voting for Replicated
Data," Proceedings of the 7th Annual
ACM Symposium on Operating System Principles,
Pacific Grove, CA, Dec. 1979, pp.
150-159.
[HB88] K. T. Herley and G. Bilardi, "Deterministic
Simulations of PRAMs on Bounded Degree
Networks," Proceedings of the 26th Annual
Allerton Conference on Communication,
Control, and Computing, Monticello, IL,
Oct. 1988, pp. 1084-1093.
[H77] D. S. Hirschberg, "Fast Parallel Sorting
Algorithms," technical report, Department of
Electrical Engineering, Rice University,
Houston, TX, Jan. 1977.
[KU86] A. R. Karlin and E. Upfal, "Parallel Hashing
- an Efficient Implementation of Shared
Memory," Proceedings of the 18th Annual
ACM Symposium on the Theory of Computing,
Berkeley, CA, May 1986, pp. 160-168.
[KR88] R. M. Karp and V. Ramachandran, "A Survey
of Parallel Algorithms for Shared-
Memory Machines," technical report











Department,  University  of  California  at 
Berkeley, Berkeley, CA, Mar.  1988. 
[L84] F. T. Leighton, "New Lower Bound Techniques
for VLSI," Mathematical Systems
Theory, vol. 17, Jan. 1984, pp. 47-70.
[LPP88] F. Luccio, A. Pietracaprina, and G. Pucci, "A
Probabilistic Simulation of PRAMs on a
Bounded-Degree Network," Information Processing
Letters, vol. 28, Jul. 1988, pp. 141-147.
[LPP89] F. Luccio, A. Pietracaprina, and G. Pucci, "A
New Scheme for the Deterministic Simulation
of PRAMs in VLSI," to appear in SIAM
Journal on Computing, 1989.
[MV84] K. Mehlhorn and U. Vishkin, "Randomized
and Deterministic Simulations of PRAMs by
Parallel Machines with Restricted Granularity
of Parallel Memories," Acta Informatica,
vol. 21, no. 4, Nov. 1984, pp. 339-374.
[NMB83] D. Nath, S. N. Maheshwari, P. C. P. Bhatt,
"Efficient VLSI Networks for Parallel Processing
Based on Orthogonal Trees," IEEE
Transactions on Computers, vol. C-32, no. 6,
Jun. 1983, pp. 569-581.
[P77] F. P. Preparata, "Parallelism in Sorting,"
Proceedings of the 1977 International
Conference on Parallel Processing, St.
Charles, IL, Aug. 1977, pp. 202-206.
[R87] A. G. Ranade, "How to Emulate Shared
Memory," Proceedings of the 28th Annual
Symposium on the Foundations of Computer
Science, Los Angeles, CA, Oct. 1987, pp.
185-194.
[T79] R. H. Thomas, "A Majority Consensus
Approach to Concurrency Control for Multiple
Copy Databases," ACM Transactions on
Database Systems, Jun. 1979, pp. 180-209.
[U84] E. Upfal, "A Probabilistic Relation Between
Desirable and Feasible Models of Parallel
Computation," Proceedings of the 16th
Annual ACM Symposium on the Theory of
Computing, Washington, D.C., May 1984,
pp. 258-265.
[UW84] E. Upfal and A. Wigderson, "How to Share
Memory in a Distributed System," Proceedings
of the 25th Annual Symposium on the
Foundations of Computer Science, Singer
Island, FL, Oct. 1984, pp. 171-180.