Parallel random access machines with bounded memory wordsize  by Bellantoni, Stephen J.
INFORMATION AND COMPUTATION 91, 259-273 (1991)
Parallel Random Access Machines with 
Bounded Memory Wordsize 
STEPHEN J. BELLANTONI* 
Department of Computer Science, University of Toronto, 
Toronto, Ontario, Canada MSS 3BS 
The PRAM model of parallel computation is examined with respect to wordsize, 
the number of bits which can be held in each global memory cell. First, adversary 
arguments are used to show the incomparability of certain machines which store 
the same amount of global information but which differ in wordsize. Next, for 
machines with infinitely many memory cells, a counting argument is used to show 
a large lower bound and to separate a hierarchy of machine classes based on 
wordsize. Finally, an efficient simulation by boolean circuits is used to give a simple 
new proof of the tight $\Omega((\log n)/(\log \log n))$ time bound for PARITY on
small-wordsize machines. Overall the results suggest that, in some circumstances, the
memory wordsize is a more significant resource than the write resolution rule,
number of memory cells, or number of processors. © 1991 Academic Press, Inc.
1. INTRODUCTION 
A parallel random access machine (PRAM) consists of many processors 
which communicate by reading and writing globally shared memory cells. 
Each processor is given a small piece of the input, and the output appears 
in the local memory of one or more processors. No restrictions are placed 
on local storage or local computing power; rather, the machine’s resources 
are measured by the number of processors and the number of global 
memory cells. This powerful and non-uniform model is useful for examin- 
ing the inherent parallel communication requirements of problems. 
The PRAM model has been extensively studied under a variety of 
assumptions on the input format and resolution of memory access conflicts; 
see for example [Beame and Hastad, 1987; Cook, Dwork, and Reischuk,
1986; Fich, Meyer auf der Heide, Ragde, and Wigderson, 1985; Kucera,
1982; Li and Yesha, 1986; Meyer auf der Heide and Reischuk, 1984; Snir,
1985; Lev, Pippenger, and Valiant, 1981]. All of these authors, with the
exceptions of Snir, and of Lev, Pippenger, and Valiant, have assumed that 
memory cells can hold arbitrarily large values. In this paper, we introduce 
* This research supported by a University of Toronto Connaught Scholarship. 
a model in which only a restricted number of bits can be held in each 
memory cell. This wordsize of the memory cells is a resource which might
distinguish between functions requiring various types of communication. 
Although it is standard practice to mention the wordsize used in a given 
upper bound, this paper is the first to systematically treat wordsize in a 
lower-bounds context on the general PRAM model. Our model differs 
substantially from that of Snir [1985], where values are classified as either
control values, which are restricted, or input values, which are not. The
model introduced by Lev, Pippenger, and Valiant [1981] is a special case
of our model using exclusive writes and cell size $O(\log n)$. They suggest that
the model is a good one for formalizing upper bounds, and briefly state
that it is equivalent within a $\log^{O(1)} n$ time factor to several switching
network models. More recently, Ho and Snyder [1989] have considered
the effect of wordsize on the time dilation of scheduling $T$ input-oblivious
threads on $P \le T$ processors.
A strong motivation for examining wordsize is the possibility of super- 
logarithmic lower bounds on small-wordsize machines; this contrasts with 
the $\lceil \log_2 n \rceil$ upper bound for all problems on the unrestricted model. From
a complexity viewpoint, this possibility is significant because restricting 
wordsize does not prevent the PRAM from being a general-purpose model 
of computation. Simulations of boolean circuits by PRAMS, for example, 
use wordsize 1 [Stockmeyer and Vishkin, 1984].
The restricted wordsize model may also help us understand differences in 
the parallel communication required to solve different problems. 
Intuitively, we would like to distinguish between problems which require 
sending a lot of information during each communication event; and 
problems which, while needing the same number of events, only require a 
few bits to be sent each time. 
A restriction on the wordsize can also be justified by implementation or 
hardware limits. For example, many synchronous implementations of
shared memory have constant bandwidth and a global clock which at con- 
stant intervals informs processors that they must begin writing or finish 
writing. Since the resources available for writing are fixed, processors can 
only write a fixed amount of information. In other words, a limit on 
wordsize can be construed as a limit on the bandwidth and time available 
for writing. 
In this paper we first show that (within a certain range of memory sizes) 
two machines that have the same total number of bits of memory, but 
different wordsizes, are incomparable. For example, a decrease in the 
wordsize cannot always be compensated for by an increase in the number 
of memory cells. Next we consider infinitely many memory cells, and show 
that when the wordsize is restricted almost all problems become harder to 
compute, even in the presence of a much more powerful write resolution 
rule, extra memory cells, and extra processors. This leads to a hierarchy of 
machine classes separated by wordsize. Finally, using an efficient simulation
by circuits we give a simple proof of the tight $\Omega((\log n)/(\log \log n))$
time bound for PARITY, for poly-logarithmic wordsizes and
super-polynomially many processors.
2. DEFINITIONS 
A PRAM machine for a problem of size $n$ consists of $p$ processors
$P_0, \ldots, P_{p-1}$ and $m$ memory cells $C_0, \ldots, C_{m-1}$; each memory cell can hold
a value from $[2^\mu] = \{0, \ldots, 2^\mu - 1\}$. The number of cells, $m$, is called the
memory size; and the number of bits needed to represent a value stored in
one cell, $\mu$, is called the wordsize or memory wordsize. The resource
measures $p$, $m$, and $\mu$ are regarded as functions of $n$. We assume there is
at least one bit of memory, i.e., $m\mu \ge 1$. The model is non-uniform in $n$.
Each step of computation consists of a parallel write to memory, a parallel 
read from memory, and arbitrary local computation. Any number of 
processors may read from a given cell at a given time. The outcome of a 
concurrent write attempt on a memory cell is determined by one of the 
well-known resolution rules PRIORITY, ARBITRARY, COMMON, or
CREW. We denote the class of PRIORITY machines with $m$ memory cells,
wordsize $\mu$, and $p$ processors by “PRIORITY($m$, $\mu$, $p$).”
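As a minimal illustration of one write phase on such a machine, the following Python sketch resolves concurrent writes under the PRIORITY rule (the lowest-numbered writer to a cell wins) and rejects values that do not fit in the wordsize. The function name `priority_write_step` and the dictionary-based data layout are assumptions made for this sketch; they are not notation from the paper.

```python
def priority_write_step(memory, writes, mu):
    """Apply one parallel write under the PRIORITY rule.

    memory: dict mapping cell index -> current value
    writes: dict mapping processor index -> (cell index, value), writers only
    mu:     wordsize in bits; every written value must lie in [0, 2**mu)
    """
    new_memory = dict(memory)
    claimed = set()
    # Visit writers in increasing processor number: the first writer to
    # reach a cell is the lowest-numbered one, so it wins the conflict.
    for proc in sorted(writes):
        cell, value = writes[proc]
        if not 0 <= value < 2 ** mu:
            raise ValueError(f"value {value} does not fit in {mu} bits")
        if cell not in claimed:
            new_memory[cell] = value
            claimed.add(cell)
    return new_memory
```

For instance, if processors 1 and 3 both write to cell 0, only the value written by processor 1 is stored.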
A problem is a (possibly multi-valued) mapping from $[2^v]^n$ to $[2^\rho]^r$, i.e.,
a mapping from input strings $x$ of length $n$ whose input words $x_i$ can be
represented in $v$ bits, to output strings $y$ of length $r$ whose output words
$y_i$ can be represented in $\rho$ bits. Input $x \in [2^v]^n$ is input to the machine if
$x_i$ is stored in the local memory of processor $P_i$, for $i \in [n]$; when the
machine has halted, $y \in [2^\rho]^r$ is the output if $y_i$ appears in the local
memory of processor $P_i$, for $i \in [r]$. Thus we require at least as many
processors as input or output words: $p \ge n, r$. A problem with size parameters
$\langle n, v, r, \rho \rangle$ is said to be a small-IO problem for wordsize $\mu$ if there is one
output ($r = 1$), and the input and output wordsizes ($v$ and $\rho$) are no longer
than the memory wordsize $\mu$. Small-IO problems are of special interest
because for other problems, processors are unnaturally overwhelmed with 
input or output. 
We are charging one unit of cost for each word access. Some of the 
results apply to a model in which a unit of cost is charged for each bit in 
the word: under each of the resolution rules above, a bit-cost model with
wordsize $\mu$ can be simulated (within a cost factor of 2) in a unit-cost model
with wordsize 1. See Aggarwal, Chandra, and Snir [1989] for a model in
which the access cost for a word depends on whether the word is the first 
one to be accessed in a block of contiguous words. 
All logarithms are base 2. 
3. INCOMPARABILITY 
Can a decrease in the wordsize of a machine be compensated by increas- 
ing the number of memory cells? At least when the memory size is small, 
the answer is no. 
Consider two machines, the first having one cell holding $l$ bits, and the
second having $l$ cells of 1 bit each. Although the two machines can store
the same amount of global information, it seems that the two different 
memory configurations will be useful for different types of problems. In 
this section we find two problems which have different inherent com- 
munication patterns in the following sense: the time complexity of one 
problem is primarily sensitive to changes in the wordsize, whereas the time 
complexity of the other problem is sensitive only to changes in the number 
of memory cells. A small number of large cells is “good” for one problem 
but not the other, and vice-versa. The approach of these proofs is along the 
lines of the “small-memory” proofs of [Fich, Ragde, and Wigderson,
1984]; however, the restricted wordsize enters into the lower bound in a
crucial way. 
The first problem is AND OF ORS: the input positions are divided into
$\lfloor n/2 \rfloor$ almost equal blocks, and on input $x \in \{0, 1\}^n$ the boolean output is
1 if and only if every block contains some input position $i$ with $x_i = 1$. That
is, the output is the AND of the OR of the input words in each block.
The second problem is REPRESENTATIVE: output $y \in \{0, 1\}^n$ is correct for
input $x \in \{0, 1\}^n$ if $y$ is zero everywhere except for exactly one position in
$\{i \mid x_i = 1\}$; or if $x$ and $y$ are both zero everywhere. This problem can be
thought of as selecting a representative from the set of input words equal
to “1.”
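The two problem definitions can be made concrete with a short Python sketch. The function names `and_of_ors` and `is_representative` are illustrative assumptions, and the number of blocks is passed in as a parameter (assumed to be at most the input length) rather than fixed by the sketch.

```python
def and_of_ors(x, num_blocks):
    """Return 1 iff every block of x contains at least one 1.

    x is a list of bits; positions are split into num_blocks contiguous,
    almost equal blocks.
    """
    n = len(x)
    bounds = [(b * n) // num_blocks for b in range(num_blocks + 1)]
    return int(all(any(x[lo:hi]) for lo, hi in zip(bounds, bounds[1:])))


def is_representative(x, y):
    """Check whether y is a correct REPRESENTATIVE output for input x."""
    if len(x) != len(y) or any(bit not in (0, 1) for bit in y):
        return False
    ones = [i for i, bit in enumerate(y) if bit == 1]
    if not any(x):
        # All-zero input: the only correct output is all zero.
        return not ones
    # Otherwise y marks exactly one position, and that position holds a 1 in x.
    return len(ones) == 1 and x[ones[0]] == 1
```

Note that REPRESENTATIVE is multi-valued: for an input with several 1s, `is_representative` accepts any output that marks exactly one of them.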
Lower bounds for these two problems are shown in the following 
lemmas. 
LEMMA 3.1. The time complexity of AND OF ORS on PRIORITY($m$, $\mu$, $p$)
is $\Theta((n/m) + 1)$.
LEMMA 3.2. The time complexity of REPRESENTATIVE on PRIORITY($m$, $\mu$, $p$)
is $\Theta((\log n)/(\mu + \log m) + 1)$.
Within a certain range of m and p, AND OF ORS becomes much easier to 
compute as the number of memory cells is increased, and is completely 
insensitive to changes in the wordsize. On the other hand, in the same 
range REPRESENTATIVE is quite sensitive to changes in the wordsize and 
relatively insensitive to changes in the number of memory cells. For example,
REPRESENTATIVE can be solved $\Omega((\log n)/(\log \log n))$ times faster on a
PRIORITY($1$, $\log n$, $p$) machine than on any PRIORITY($\log n$, $1$, $p$)
machine; yet AND OF ORS can be solved $\Omega(\log n)$ times faster on a
PRIORITY($\log n$, $1$, $p$) machine than on any PRIORITY($1$, $\log n$, $p$)
machine. This shows that these two machine classes, although capable of
storing the same amount of shared global information and having the same
number of processors, are incomparable. This result also holds on the
weaker ARBITRARY model.
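To see the two separations numerically, the following Python sketch evaluates the expressions inside the $\Theta(\cdot)$ bounds of Lemmas 3.1 and 3.2 for the two machine shapes discussed above. The function names are assumptions of this sketch, and the values are only the bound expressions with hidden constants ignored, not actual running times.

```python
from math import log2


def and_of_ors_bound(n, m):
    """Expression inside the Theta bound of Lemma 3.1: n/m + 1."""
    return n / m + 1


def representative_bound(n, m, mu):
    """Expression inside the Theta bound of Lemma 3.2."""
    return log2(n) / (mu + log2(m)) + 1


n = 2 ** 16
log_n = 16  # log2(n), used as both a cell count and a wordsize below

# Machine A: one cell of log n bits. Machine B: log n cells of one bit.
rep_a = representative_bound(n, m=1, mu=log_n)
rep_b = representative_bound(n, m=log_n, mu=1)
ors_a = and_of_ors_bound(n, m=1)
ors_b = and_of_ors_bound(n, m=log_n)
```

With these parameters `rep_a` is smaller than `rep_b`, while `ors_b` is smaller than `ors_a`, which mirrors the incomparability claim.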
THEOREM 3.3. PRIORITY($m$, $\mu$, $p$) and PRIORITY($m'$, $\mu'$, $p$) are
incomparable when $m\mu = m'\mu'$ but $\mu \in o(\mu')$, provided that $m \in O(n)$,
$\mu' \in O(\log n)$, and $\log m' \in o(\mu')$.
Proof. It is easy to check that, under the conditions stated in the
theorem, $m' \in o(m)$ and $\mu + \log m \in o(\mu' + \log m')$. Applying Lemmas 3.1
and 3.2 gives asymptotic separations in both directions. First note that (A)
$m' \in o(m)$ and (B) $\log m = \log m' + \log \mu' - \log \mu \in o(\mu')$. Now verify that
the lemmas apply (1-4 below) and that the separation is nontrivial (5-6
below):
1. $m \in O(n)$: direct.
2. $m' \in O(n)$: by (A), $m' \in o(m) \subseteq O(n)$.
3. $\mu' + \log m' \in O(\log n)$: $\log m' \in o(\mu') \subseteq O(\log n)$.
4. $\mu + \log m \in O(\log n)$: $\mu \in o(\mu')$; by (B), $\log m \in o(\mu')$; and use
$\mu' \in O(\log n)$.
5. $m' \in o(m)$: immediate from (A).
6. $\mu + \log m \in o(\mu' + \log m')$: $\mu \in o(\mu')$ is direct and $\log m \in o(\mu')$
is (B). □
The type of adversary argument seen in [Fich, Ragde, and Wigderson,
1984] is used to prove the lower bounds of Lemma 3.1. Let the history up
to time $t$, $H^t$, be a record of all the contents of all memory cells up to time
$t$ during computation on a given input $x$. The adversary selects a set of
allowable inputs such that $H^t$ is constant for all allowable inputs $x$. Since
at time $t$ the state of a processor depends only on its input word and $H^t$,
on allowable inputs processors “know only their private inputs”: the state
of processor $i$ after step $t$ is the same on all allowable inputs $x$ which have
the same value for $x_i$.
The allowable inputs at time $t$ are described by a partial assignment
$I^t \in \{0, 1, *\}^n$ to the input positions: $x$ is an allowable input if, for all $i$, either
$I^t_i = *$ or $x_i = I^t_i$. Index $i$ is a free position if $I^t_i = *$. The allowable inputs at
time $t + 1$ are always constructed as a subset of the allowable inputs at time
$t$, by fixing free positions in $I^t$ to 0 or 1; this is done in such a way that
if $H^t$ was determined, then $H^{t+1}$ is determined. If the PRAM halts at time
$t$ with the correct output, but a few positions remain unfixed, then a
contradiction can be found: there would be two allowable inputs $x$ and $x'$
for which the outputs differ at an output position $i$, but such that $x_i = x'_i$.
This outline follows that of [Fich, Ragde, and Wigderson, 1984].
Proof of Lemma 3.1. For the AND OF ORS problem, $H^{t+1}$ is determined
as follows: for each cell $C_j$ in turn, consider the least numbered processor,
$P_i$, which on some allowable input writes to $C_j$ at time $t + 1$. Fix position
$i$ in $I^t$ so that $P_i$ writes; then fix all other positions in the same block as
$i$ to 1. This fixes at most $O(m)$ positions for each time step, so it can be
repeated $\Omega(n/m)$ times while still leaving some input positions unfixed. The
lower bound follows by considering the two allowable inputs in which the
free positions all contain 1, or all contain 0; these have outputs 1 and 0,
respectively.
For the upper bound, simply use $\lceil n/m \rceil$ steps to compute the ORs, $m$ at
a time; then use one step to AND the results together. □
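The upper-bound procedure can be sketched sequentially in Python as follows. This is only a step-counting illustration under assumptions of the sketch: the name `and_of_ors_steps` is hypothetical, the number of blocks is a parameter, and each loop iteration stands for one parallel step in which up to `m` cells each receive the OR of one block.

```python
def and_of_ors_steps(x, num_blocks, m):
    """Simulate the upper-bound procedure for AND OF ORS with m cells.

    Returns (output, steps), where steps counts one parallel step per batch
    of up to m block ORs plus one final step for the AND.
    """
    n = len(x)
    bounds = [(b * n) // num_blocks for b in range(num_blocks + 1)]
    blocks = [x[lo:hi] for lo, hi in zip(bounds, bounds[1:])]
    ors = []
    steps = 0
    for start in range(0, num_blocks, m):
        # One parallel step: each of up to m cells gets the OR of one block
        # (any processor in the block holding a 1 writes 1 to that cell).
        ors.extend(int(any(block)) for block in blocks[start:start + m])
        steps += 1
    # One more step: any processor that saw a 0 result writes to a single
    # cell, so the AND of all block results can be read off.
    steps += 1
    return int(all(ors)), steps
```

The step count is `ceil(num_blocks / m) + 1`, which is at most $\lceil n/m \rceil + 1$ whenever the number of blocks is at most $n$.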
In order to show a lower bound for a problem sensitive to the wordsize, 
as in Lemma 3.2, this technique must be applied with respect to both the 
memory cell to be written and the value to be written. Interestingly, Fich, 
Ragde, and Wigderson [1984] show a lower bound of (log n)/(log(m + 1)) 
for REPRESENTATIVE on the COMMON model. The COMMON rule makes it
difficult to have a situation in which many processors could possibly write 
different values to a cell, because each writing processor must know that 
the other processors will not attempt to write a conflicting value. In con- 
trast, the ARBITRARY and PRIORITY rules provide conflict resolution. 
Lemma 3.2 suggests that this conflict resolution is much less useful when 
the wordsize is restricted: the algorithm cannot use the value appearing in 
a cell to select one of many writing processors. In other words, the 
PRIORITY rule makes the adversary’s task harder, a problem which can 
be overcome by taking advantage of the limited wordsize. 
Proof of Lemma 3.2. Allowable inputs are again specified by a partial
assignment $I^t$, except that the all-zero input is excluded from the set of
allowable inputs. Positions will only be fixed to zero, starting with $I^0_i = *$
for all $i$. The free positions at time $t$ may be scattered anywhere among the
input positions.
At each step of the construction, at least $1/(2(2^\mu m + 1)^2)$ of the free
positions in $I^t$ will remain free in $I^{t+1}$. The construction proceeds so long as
there are $2(2^\mu m + 1)^2$ free positions. In order to fix this many input
positions, at least $(\log n)/(\log(2(2^\mu m + 1)^2)) - 1 \in \Omega((\log n)/(\mu + \log m))$ steps
are required. On the other hand, when the construction halts there are at
least two free positions, $k$ and $l$.
Write $x^S$ for the input such that $x^S_i = 1$ if and only if $i \in S$. Suppose that
at time $t$ there are two free positions, $k$ and $l$. Since $H^t$ is constant, a
processor’s state only varies with its private input. E.g., $P_k$ is in the same state
at time $t$ on input $x^{\{k\}}$ as on input $x^{\{k,l\}}$; and $P_l$ is in the same state at time
$t$ on input $x^{\{l\}}$ as on input $x^{\{k,l\}}$. Therefore, it cannot be true that $P_k$ halts
by time $t$ on input $x^{\{k\}}$ (with the correct answer “1”) and $P_l$ halts by time
$t$ on input $x^{\{l\}}$ (with answer “1”); because then they would both halt by
time $t$ with answer “1” on input $x^{\{k,l\}}$. Hence a correct algorithm for
REPRESENTATIVE cannot halt while there are two or more free positions.
Consider a single step of the computation, at time $t$. Assuming that $H^t$
has been determined, we show how to determine $H^{t+1}$ by fixing positions
which are free in $I^t$. By the comments above, this will prove the lower
bound.
First we find a large set of processors all writing the same values to
memory. Define $Q = \{(j, u) \mid j \in [m] \text{ and } u \in [2^\mu]\} \cup \{\lambda\}$. Each element of
$Q$ defines, in a natural way, a write which can be performed by a processor.
More precisely, for a given processor $P_i$ and any $a \in Q$, we say that $P_i$
performs write $a$ if $P_i$ does not write, and $a = \lambda$; or if $P_i$ writes $u$ to cell $C_j$,
and $a = (j, u)$. We say that $a \in Q$ changes $C_j$ if $a = (j, u)$ for some $u$; we
say that $a$ and $b$ conflict if $a = (j, u)$, $b = (j, u')$ for some $j$, $u$, $u'$ with
$u \ne u'$.
For each $a, b \in Q$, define $W_a^b$ to be the set of positions $i$ such that $I^t_i = *$,
$P_i$ performs write $a$ at time $t$ on input word 0, and $P_i$ performs write $b$ at
time $t$ on input word 1. Since each free position is put into one of these
$(2^\mu m + 1)^2$ sets, some set $W_a^b$ has at least $1/(2^\mu m + 1)^2$ of the free positions.
Consider any cell $C_j$ not changed by $a$ or $b$. Fix all positions outside of
$W_a^b$ to zero. Any processor which writes to $C_j$ is in a position fixed to zero,
so $C_j$ is determined at time $t + 1$ on all allowable inputs.
Consider any cell $C_j$ changed by $a$. Fix position $l = \min(W_a^b)$ to zero.
(The PRIORITY rule selects $P_{\min(Z)}$ to win a write conflict between
$\{P_i \mid i \in Z\}$.) Cell $C_j$ is determined at time $t + 1$ by the fact that some
processor numbered not higher than $l$ writes to $C_j$ on input word 0, and
all such processors are in positions fixed to 0.
Finally, consider any cell $C_j$ changed by $b$ but not changed by $a$. On
every allowable input some processor in $W_a^b$ has input word 1, and therefore
some processor attempts to perform write $b$. Let $h$ be the least position
(if any) such that on some allowable input, $P_h$ attempts a write, $c$, conflicting
with $b$. The input at position $h$ has been fixed, since $c \ne b$ and $c \ne a$.
Now consider whether or not more than half of $W_a^b$ is less than $h$: let
$W' = \{i \in W_a^b \mid i < h\}$. If $\#W' > \#W_a^b/2$, then fix all positions outside of $W'$
to zero. On allowable inputs, some processor in $W'$ is thereby required to
have input word 1; so some processor in $W'$ attempts to perform write $b$.
By the minimality of $h$, write $b$ will always be performed. On the other
hand, if $\#W' \le \#W_a^b/2$, then fix every position in $W'$ to zero. Some
processor numbered not higher than $h$ writes to $C_j$, and all such processors
are in fixed position. In either case, $C_j$ is determined.
This completes the construction of $I^{t+1}$. Writing $f^t$ for the number of free
positions in $I^t$, the number of free positions remaining is
$f^{t+1} \ge f^t/(2(2^\mu m + 1)^2)$.
To prove the upper bound of Lemma 3.2, it is sufficient to find an upper
bound for all $m < \sqrt{n}$, because the lower bound for REPRESENTATIVE is
bounded by a constant for all $m \in \Omega(\sqrt{n})$. Repeat the following two phases.
First, divide the processors into $m$ almost equal blocks, and select the
leftmost block, $B$, in which some processor has input word 1. This can be
done in constant time using $m$ cells and $m^2$ processors [Kucera, 1982].
Next, divide $B$ into $2^\mu$ subblocks and select a subblock in which some
processor has input word 1; this can be done in one step by writing to a single
cell. Processors in the selected subblock continue in the next round; other
processors stop. The number of active processors is reduced by a factor of
$m 2^\mu$ on each round. □
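The two-phase rounds of this upper bound can be sketched sequentially in Python. The names `split`, `leftmost_with_one`, and `representative` are assumptions of this sketch. Each phase is modeled as a single selection, the selected subblock is taken to be the leftmost one containing a 1 (as a PRIORITY write to one cell would choose), and the all-zero input is handled by an early return for brevity.

```python
def split(lo, hi, parts):
    """Split the range [lo, hi) into `parts` contiguous, almost equal ranges."""
    size = hi - lo
    return [(lo + (k * size) // parts, lo + ((k + 1) * size) // parts)
            for k in range(parts)]


def leftmost_with_one(x, ranges):
    """Return the leftmost range whose positions contain a 1."""
    for lo, hi in ranges:
        if any(x[lo:hi]):
            return lo, hi
    raise ValueError("no range contains a 1")


def representative(x, m, mu):
    """Return (y, rounds): a REPRESENTATIVE output and the rounds used."""
    n = len(x)
    y = [0] * n
    if not any(x):
        return y, 0  # simplification: all-zero input answered immediately
    lo, hi, rounds = 0, n, 0
    # Invariant: the active range [lo, hi) always contains a 1.
    while hi - lo > 1:
        lo, hi = leftmost_with_one(x, split(lo, hi, m))        # phase 1
        lo, hi = leftmost_with_one(x, split(lo, hi, 2 ** mu))  # phase 2
        rounds += 1
    y[lo] = 1
    return y, rounds
```

Each round shrinks the active range by a factor of about $m 2^\mu$, so the number of rounds grows like $(\log n)/(\mu + \log m)$.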
4. INFINITE MEMORY 
The idea of the previous proof can be summarized as follows: when there 
are not very many cells, either a lot of processors are idle or a lot of 
processors are involved in a write conflict. If a lot of processors are idle 
then not much progress is made; but if there is a large write conflict then 
because of the limited wordsize, relatively little information is gained (even 
using the PRIORITY rule), and again not much progress is made. Unfor- 
tunately it is difficult to construct a proof such as that in Section 3 when 
the number of memory cells is infinite, because we do not know a priori 
that large write conflicts will occur. 
To get a bound for infinitely many memory cells, observe that for a given 
input word xi, after step t in a machine with wordsize p, processor Pi can 
be in one of at most $(2^\mu)^t$ different states. In the unrestricted model the
analogous bound [Beame, 19861 is 2 o(“) states, which suggests that pro- 
cessors in the restricted model have much less information than processors 
in the unrestricted model. We can use this fact in a counting argument to 
prove a lower bound for almost all problems, on machines with an infinite 
number of memory cells. (Here, “almost all” means a fraction of at least
$1 - 2^{-2^{nv-1}}$.)
THEOREM 4.1. Almost all small-IO problems require time at least
$nv/\mu - 2(\log p)/\mu - 8$ to solve on any PRIORITY machine with infinitely
many memory cells, wordsize $\mu$, and $p$ processors.
For example, almost all small-IO problems require super-logarithmic
time to solve on any PRIORITY machine with wordsize $\mu \in o(nv/(\log n))$,
even allowing an infinite number of memory cells and any polynomial
number of processors. This contrasts sharply with the $\lceil \log n \rceil$ upper bound
for an EREW machine with wordsize $\mu = nv$, and only $n$ processors and
memory cells.
Theorem 4.1 is proved by applying the following technical lemma, which 
shows a direct trade-off between the time, memory wordsize, and number 
of processors used to solve even a tiny fraction of the possible problems for 
a given size. 
LEMMA 4.2. Whenever
$$(T + 1)\mu + \log p < (n - 1)v - \log \max(nv, \log(r\rho)) + \log r - 5,$$
the fraction of problems with size parameters $\langle n, v, r, \rho \rangle$ solvable in time $T$
on PRIORITY machines with infinitely many memory cells, wordsize $\mu$, and
$p$ processors is at most $1/2^{r\rho 2^{nv-1}}$.
Proof. The number of problems is $2^{r\rho 2^{nv}}$. We show that any machine
described in the lemma can be encoded using at most $r\rho 2^{nv-1}$ bits. Hence
only $2^{r\rho 2^{nv-1}}/2^{r\rho 2^{nv}} = 1/2^{r\rho 2^{nv-1}}$ of the problems are solved by such
machines.
On each step, a processor reads $\mu$ bits of information and then changes
state. Thus, for a given input state, after $t$ steps a processor is in one of at
most $2^{t\mu}$ possible states. Since each of the $p$ processors has at most $2^v$ input
states, there are at most $p 2^v \sum_{t=0}^{T} 2^{t\mu} < p 2^{(T+1)\mu + v} = S$ states which are
ever entered by any processor on any input; this also bounds the number
of memory cells which can be accessed.
The encoding begins with the values of $p$, $\mu$, and $T$ in self-delimiting format.
This leading information requires at most $\log p + \log \mu + \log T < \log S$
bits, and determines the format of the remaining bits. Renumbering the
memory cells appropriately, for each possible state of each processor we
encode the index of the cell written to from that state, the value written, the
index of the cell read, and possibly an output value, in $(2 \log S + \mu + \rho) < 4\rho \log S$
bits per state. Altogether there are at most $4\rho S \log S + \log S < 5\rho S \log S$
bits in the encoding.
Using the estimates $\log \max(nv, \log(r\rho)) \ge \log(nv + \log(r\rho)) - 1$ and
$4 > \log 10$, expanding the hypothesis gives
$$\begin{aligned}
(T + 1)\mu + v + \log p + \log \rho &< nv - \log(nv + \log(r\rho)) + \log(r\rho) - \log 10 \\
&= \log(r\rho 2^{nv}/10) - \log(nv + \log(r\rho)) \\
&< \log(r\rho 2^{nv-1}/5) - \log \log(r\rho 2^{nv-1}/5).
\end{aligned}$$
The first term in this chain of inequalities is an upper bound on $\log \rho S$. The
result follows, using the fact that $\log A < \log B - \log \log B$ implies
$A \log A < B$. □
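As a numerical sanity check of the counting argument, the following Python sketch evaluates the hypothesis of Lemma 4.2 and compares the encoding length with the available budget for one choice of parameters. It assumes the reading of the lemma in which `rho` denotes the output wordsize, $S = p 2^{(T+1)\mu + v}$, and the encoding uses at most $5\rho S \log S$ bits; the function names and the example parameters are assumptions of this sketch.

```python
from math import log2


def hypothesis_holds(n, v, r, rho, mu, p, T):
    """Evaluate the hypothesis of Lemma 4.2 for the given parameters."""
    lhs = (T + 1) * mu + log2(p)
    rhs = (n - 1) * v - log2(max(n * v, log2(r * rho))) + log2(r) - 5
    return lhs < rhs


def encoding_bits_bound(v, rho, mu, p, T):
    """Upper bound 5 * rho * S * log S on the machine encoding length."""
    log_s = log2(p) + (T + 1) * mu + v
    return 5 * rho * 2 ** log_s * log_s


def budget_bits(n, v, r, rho):
    """Encoding budget r * rho * 2**(n*v - 1) from the proof."""
    return r * rho * 2 ** (n * v - 1)


# Example: n = 64 one-bit inputs, one one-bit output, wordsize 1, p = 64.
params = dict(n=64, v=1, r=1, rho=1, mu=1, p=64)
ok_at_44 = hypothesis_holds(T=44, **params)
ok_at_45 = hypothesis_holds(T=45, **params)
fits = encoding_bits_bound(v=1, rho=1, mu=1, p=64, T=44) <= budget_bits(64, 1, 1, 1)
```

For these parameters the hypothesis holds at `T = 44` but not at `T = 45`, and at `T = 44` the encoding bound stays within the budget.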
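The restricted wordsize is used in this lemma to obtain $S$, estimating the
number of possible states reached during a computation. The lower bound
$nv/\mu - 2(\log p)/\mu - 8$ of Theorem 4.1 follows easily: the condition
required by Lemma 4.2 for small-IO problems ($r = 1$), solved for $T$, is
$$T < \frac{(n-1)v}{\mu} - \frac{\log p}{\mu} - \frac{\log \max(nv, \log \rho)}{\mu} - \frac{5}{\mu} - 1,$$
and this holds whenever $T$ is below the bound of Theorem 4.1, because
$v \le \mu$, $\log(nv) \le \log p + \mu$, and $\log \log \rho \le \mu$.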
The lower bound of Theorem 4.1 is dominated by $nv/\mu$. This is the time
it would take for processors to publicize the entire $nv$ input bits sequentially,
assuming each processor had already gathered $\mu$ bits of the input.
This suggests the following upper bound, whose proof shows a smoothly
increasing use of the “fan-in” algorithm as $\mu$ increases toward $nv$.
LEMMA 4.3. For any small-IO problem, any $\mu \ge 1$, and any $p < 2^{nv/2}$,
there is a CREW machine with wordsize $\mu$, and $p$ processors and memory
cells, which solves the problem in at most $nv/\mu - (\log p)/\mu + \log(\mu/v) + 5$
steps.
Proof. Allocate the input into $\lceil nv/\mu \rceil$ blocks each containing $\mu$
consecutive bits; pad the last block with 0 bits as needed. Within each block
$i$, compress the input words into a single value expressible in $\mu$ bits and
store the result in processor $i$; this takes time at most $\lceil \log(\mu/v) \rceil$.
Let “left” denote the first $k = \lfloor (\log p - 1)/\mu \rfloor$ processors, and let “right”
denote the other processors. Note that there are at most $\lceil nv/\mu \rceil / 2$ left
processors, because $\log p < nv/2$. Sequentially, each right processor having
the value for a block writes that value, and other right processors read it.
Simultaneously the left processors also publicize their values among themselves.
Thus the right and left processors obtain values $R$ and $L$ corresponding
to all of their input bits, in $\lceil nv/\mu \rceil - k$ steps. All memory cells are
set to 0 and one of the left processors writes 1 to the $L$th bit of memory
in the cells $k + 1, \ldots, p$. This is possible because $L < 2^{k\mu} \le p/2 \le p - k$. Each
right processor $P_i$ reads from cell $C_i$; one of them reads a cell containing
the written bit and is able to calculate $L$ from the position of the bit. It
then reconstructs the original input from $L$ and $R$, calculates the answer,
and writes it to cell $C_0$ where it is read by the output processor $P_0$. The
stated time bound follows. □
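To see how close the two bounds are, the following Python sketch evaluates the lower-bound expression of Theorem 4.1 and the upper-bound expression of Lemma 4.3 for given parameters. The function names and the example parameters are assumptions of this sketch, and the lemma’s side condition $p < 2^{nv/2}$ is assumed to hold for the inputs used.

```python
from math import log2


def lower_bound_thm_4_1(n, v, mu, p):
    """Lower bound n*v/mu - 2*log(p)/mu - 8 from Theorem 4.1."""
    return n * v / mu - 2 * log2(p) / mu - 8


def upper_bound_lemma_4_3(n, v, mu, p):
    """Upper bound n*v/mu - log(p)/mu + log(mu/v) + 5 from Lemma 4.3."""
    return n * v / mu - log2(p) / mu + log2(mu / v) + 5


# Example: 2**20 one-bit inputs, wordsize 16, as many processors as inputs.
n, v, mu = 2 ** 20, 1, 16
low = lower_bound_thm_4_1(n, v, mu, p=n)
high = upper_bound_lemma_4_3(n, v, mu, p=n)
gap = high - low  # additive gap between the two bounds
```

In this example both expressions are dominated by $nv/\mu = 65536$, and the additive gap between them is small in comparison.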
This upper bound matches the lower bound of Theorem 4.1 closely 
enough to prove the following separation result: 
THEOREM 4.4. Let $\mu \le \mu' \in o(nv/\log(nv))$. Almost all small-IO problems
are solved $\Omega(\mu'/\mu)$ times faster on some CREW($m' = n$, $\mu'$, $p' = n$) machine
than on any PRIORITY($m = \infty$, $\mu$, $p \le 2^{nv/4}$) machine.
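Proof. We would like to show that, within a constant factor, the lower
bound of Theorem 4.1 exceeds $\mu'/\mu$ times the upper bound of Lemma 4.3.
Because $p \le 2^{nv/4}$ and $\mu' \ge 1$, we can find a constant $c$ satisfying
$$\frac{nv}{\mu} - \frac{2 \log p}{\mu} - 8 > c\,\frac{\mu'}{\mu} \left( \frac{nv}{\mu'} - \frac{\log p'}{\mu'} + \log \frac{\mu'}{v} + 5 \right)$$
for all $n$ sufficiently large by finding $c$ satisfying
$$\frac{nv}{2\mu} - 8 > c \left( \frac{nv}{\mu} + \frac{\mu'}{\mu} \left( \log \frac{\mu'}{v} + 5 \right) \right).$$
Let $k > 0$ be a constant such that $\mu' < k\,nv/\log(nv)$ for all sufficiently large
$n$. Then the latter inequality is implied by
$$\frac{nv}{2\mu} - 8 > c\,\frac{nv}{\mu} + \frac{ck\,nv}{\mu \log(nv)} \left( \log \frac{kn}{\log(nv)} + 5 \right).$$
This condition is satisfied by choosing $c < 1/(2(k + 1))$, because
$\log(kn/\log(nv)) + 5 \le \log(nv)$ for all $n$ sufficiently large; and because
$c(k + 1) < 1/2$ and $nv/\mu \in \omega(1)$ imply that $(1/2 - c(k + 1))\,nv/\mu > 8$
for all $n$ sufficiently large. □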
In other words, for almost all problems, a change in the memory
wordsize has a direct effect on the time needed to solve the
problem, regardless of write concurrency, extra memory, and extra
processors. This theorem holds for wordsizes which are $o(nv/\log(nv))$; for
larger wordsizes a similar theorem holds but the separation proved is much
smaller.
LEMMA 4.5. Let $\mu < \mu'$ and $\mu' \in \Omega(nv/\log(nv))$. Almost all small-IO
problems are solved $\Omega(nv/(\mu \log \mu'))$ times faster on some CREW($m' = n$, $\mu'$,
$p' = n$) machine than on any PRIORITY($m = \infty$, $\mu$, $p \le 2^{nv/4}$) machine.
Proof. Proceeding as before, we need
$$\frac{nv}{\mu} - \frac{2 \log p}{\mu} - 8 > c\,\frac{nv}{\mu \log \mu'} \left( \frac{nv}{\mu'} - \frac{\log p'}{\mu'} + \log \frac{\mu'}{v} + 5 \right).$$
Note that $nv/\mu' \in O(\log(nv)) \subseteq O(\log(nv) - \log \log(nv)) \subseteq O(\log \mu')$. Let
$k > 0$ be a constant such that
$$\frac{nv}{\mu'} + \log \frac{\mu'}{v} + 5 < k \log \mu'$$
for large enough $n$. Using $p \le 2^{nv/4}$ and $\mu' \ge 1$, the first inequality follows
directly from the fact
$$\frac{nv}{2\mu} > c\,\frac{nv}{\mu \log \mu'} (k \log \mu') + 8$$
for large enough $n$, by choosing $c < 1/2k$. □
5. SIMULATION BY CIRCUITS 
In this section we show how to simulate PRIORITY($\infty$, $\mu$, $p$) machines
which halt in time $T$ by unbounded fan-in circuits of depth $O(T)$ and size
roughly $p^{O(1)} 2^{T\mu}$. In contrast, the simulation for normal PRAM machines
uses size roughly p”. Using Hastad’s (log n)/(log log n) depth bound 
[Hastad, 1986] for circuits computing PARITY, our simulation gives a simple
proof of the $\Omega((\log n)/(\log \log n))$ time bound for computing PARITY on
PRAMs, for memory wordsizes up to $\log^{O(1)} n$, infinitely many memory
cells, and a super-polynomial number of processors. The matching upper
bound [Beame and Hastad, 1987] uses wordsize 1.
Although this result is implied by the sophisticated result of Beame and 
Hastad [1987], it is interesting to see that for a large class of machines the 
optimal time bound can be obtained by a very simple argument. The best 
result for PARITY which can be obtained by applying this simple approach
to the original (infinite wordsize) model seems to be the $\Omega(\sqrt{\log n})$ bound
of Li and Yesha [1986]. 
As a preliminary, define a memoryless machine to be a machine such that 
after each step consisting of a parallel write and parallel read, the entire 
memory is reset to zero. It is easy to see that any PRIORITY($\infty$, $\mu$, $p$)
machine $M$ halting in time $T$ can be simulated in time $T + 1$ by a
memoryless PRIORITY($\infty$, $\mu$, $pT$) machine $M'$. Using $p$ processors for
each of the $T$ time steps, processor $P_{p(T-t)+i}$ in $M'$ simulates processor $P_i$ in
$M$ up to step $t$, and thereafter repeatedly performs the same write which $P_i$
performed at time step $t$.
LEMMA 5.1. If small-IO problem $f$ is computed by a
PRIORITY($\infty$, $\mu$, $p$) machine halting in time $T$, then $f$ is computed by an
unbounded fan-in circuit of depth $O(T)$ and size $p^{O(1)} 2^{O(T\mu)}$.
Proof. We show that a memoryless machine with $pT$ processors can be
simulated in depth $O(T)$ and size $(pT)^{O(1)} S(T) \log S(T)$, where
$S(t) = p 2^{(t+2)\mu}$. Since this is dominated by $p^{O(1)} 2^{O(T\mu)}$, the lemma follows.
To perform the simulation, we construct a constant depth small size circuit
which simulates one step of the computation. Recall that up to step $t$,
there are at most $p 2^{(t+1)\mu + v} \le S(t)$ different states which processors could
have entered. This also bounds the number of cells which could be
accessed. Stage $t$ of the circuit requires as input $3 \log S(t) + \mu$ gates for each
processor, which code (in binary) the number of the memory cell written
to; the number of the memory cell read; the value written; and the
processor state.
To simulate one step of the computation for a given processor $P_i$, first
compare the number of the cell read by $P_i$ for equality with the number of
the cell written by $P_j$, $j \in [pT]$. Next we find the leftmost such comparison
which succeeded. Using the result of [Chandra, Fortune, and Lipton,
1983] for prefix-OR computation, this can be done in constant depth and
size linear in $(pT)$. This output is then used to select the value read by $P_i$
from among all values written by processors. Placing the gates representing
the value read next to the gates representing the current state gives
$\mu + \log S(t) = \log S(t + 1)$ gates representing the processor’s new state. This
new state can then be used to select one of at most $S(t + 1)$ constant
vectors representing the possible write/read access information for the next
stage (using $2 \log S(t + 1) + \mu + O(1)$ gates per vector). It is easy to see that
the size bound is met. □
THEOREM 5.2. PARITY requires time o((log n)/(log log n)) to solve on a 
PRIORITY ( CO, u, p) machine, where p E 210go”)fl and p E logo”’ n. 
Proof. Suppose that $M$ is a machine with $p \in 2^{\log^{O(1)} n}$ processors, solving
PARITY in time $T \in o((\log n)/(\log\log n))$ with wordsize $\mu \in \log^{O(1)} n$.
Using Hastad's $2^{(1/10) n^{1/T}}$ size lower bound for circuits computing PARITY
in depth $T$ [Hastad, 1986] and the lemma above, $T$ satisfies
$p^{c} 2^{cT\mu} > 2^{(1/10) n^{1/(cT)}}$, where $c$ is some constant greater than 1. Taking logarithms
twice, this implies $\log(T\mu + \log p) \in \Omega((1/T) \log n)$.
The conditions on $T$, $\mu$, and $p$ imply that $T\mu + \log p \in \log^{O(1)} n$, so
$\log(T\mu + \log p) \in O(\log\log n)$. On the other hand, $(1/T) \log n \in
\omega(\log\log n)$, a contradiction. ∎
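The step "taking logarithms twice" can be spelled out as follows. This is a sketch, not part of the original proof: it assumes the inequality reads $p^{c} 2^{cT\mu} > 2^{(1/10) n^{1/(cT)}}$ and that logarithms are base 2.

```latex
\begin{align*}
p^{c}\,2^{cT\mu} &> 2^{(1/10)\,n^{1/(cT)}} \\
c\,(T\mu + \log p) &> \tfrac{1}{10}\,n^{1/(cT)}
  && \text{(first logarithm)} \\
\log c + \log(T\mu + \log p) &> \tfrac{1}{cT}\,\log n - \log 10
  && \text{(second logarithm)} \\
\log(T\mu + \log p) &\in \Omega\!\left(\tfrac{1}{T}\,\log n\right)
  && \text{(constants absorbed)}
\end{align*}
```

The last line absorbs the additive constants $\log c$ and $\log 10$, which is legitimate here because $(1/T)\log n$ is unbounded under the assumption on $T$.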
6. CONCLUSION 
The restricted-wordsize PRAM with a variety of access resolution rules 
has been introduced as a general-purpose model of parallel computation 
for studying lower bounds. Memory wordsize is a resource which can be 
used to distinguish between functions having various communication
patterns: some functions seem to require large wordsize, while other functions
do not. Adversary arguments have been used to show the incomparability 
of machine models which store the same amount of global information but 
which differ in wordsize. A counting argument has been used to show a 
large lower bound for almost all problems, which implies separation results 
for the wordsize hierarchy of machine classes. The separation between two 
machine classes holds even when powers of write concurrency, extra 
memory, and extra processors are given solely to the machines having 
small wordsize. A simulation by circuits gives a simple new proof of the 
optimal time bound for PARITY, on machines with poly-logarithmic 
wordsize, a super-polynomial number of processors, and infinitely many 
memory cells. 
Many areas of investigation remain. For example, lower bounds for 
specific problems on small wordsize machines with $\Omega(n)$ cells should be
proved directly on the PRAM model, rather than by simulation. New 
techniques in this area would be especially interesting in view of the
difficulty of proving analogous results in the unrestricted model. The large
lower bound of Theorem 4.1 makes super-logarithmic bounds for specific 
problems seem an enticing possibility. Such a result would separate the NC 
hierarchy if the problem were in NC. 
The incomparability result of Theorem 3.3 rests on a lower bound for 
REPRESENTATIVE which is primarily sensitive to wordsize but also somewhat 
sensitive to the number of cells. It would be interesting to show a result 
which has no such residual sensitivity to the number of cells. Such a result 
would probably hold for $\Omega(n)$ memory cells, and therefore probably
requires new techniques. 
The wordsize limit of nv/log(nv) in the hierarchy separation result of 
Theorem 4.4 should be investigated. Is it a natural dividing point, or is its 
appearance an artifact of the proof? Is it coincidental that wordsize 
nv/log(nv) is the point at which the bound of Theorem 4.1 becomes super- 
logarithmic? 
On a more general level, unbounded fan-in remains an unrealistic 
assumption of the PRAM model. The overall relationship of PRAM 
wordsize to established parallel complexity classes such as $NC^k$ is open.
ACKNOWLEDGMENTS 
I thank Faith E. Fich for the many fruitful discussions and suggestions which made this 
work possible. I thank Pierre Kelsen for pointing out errors in an earlier draft of this paper. 
RECEIVED September 19, 1988; FINAL MANUSCRIPT RECEIVED December 18, 1989 
REFERENCES 
AGGARWAL, A., CHANDRA, A. K., AND SNIR, M. (1989), On communication latency in
PRAM computations, in “Proc. 1st Annual ACM Symposium on Parallel Algorithms and 
Architectures.” 
BEAME, P. (1986), Limits on the power of concurrent-write parallel machines, in “Proc. 18th 
ACM Symposium on Theory of Computing,” pp. 169-176. 
BEAME, P., AND HASTAD, J. (1987), Optimal bounds for decision problems on the CRCW
PRAM, in “Proc. 19th Annual ACM Symposium on Theory of Computing,” pp. 83-93. 
COOK, S., DWORK, C., AND REISCHUK, R. (1986), Upper and lower time bounds for parallel 
random access machines without simultaneous writes, SIAM J. Comput. 15, 87.
CHANDRA, A., FORTUNE, S., AND LIPTON, R. (1983), Unbounded fan-in circuits and
associative functions, in “Proc. 15th ACM Symposium on Theory of Computing,”
pp. 52-60.
FICH, F. E., MEYER AUF DER HEIDE, F., AND WIGDERSON, A. (1987), Lower bounds for
parallel random access machines with unbounded shared memory, Adv. Comput. Res.
4, 1.
FICH, F. E., MEYER AUF DER HEIDE, F., RAGDE, P., AND WIGDERSON, A. (1985), One, two, 
three... infinity: Lower bounds for parallel computation, in “Proc. 17th ACM Symposium 
on Theory of Computing,” pp. 48-58. 
FICH, F. E., RAGDE, P., AND WIGDERSON, A. (1984), Relations between concurrent-write
models of parallel computation, in “Proc. 3rd Annual ACM Symposium on Principles of 
Distributed Computing,” pp. 179-189. 
HASTAD, J. (1986), Almost optimal lower bounds for small depth circuits, in “Proc. 18th ACM 
Symposium on Theory of Computing,” pp. 6-20. 
Ho, S., AND SNYDER, L. (1989), Are bit serial or word parallel computers faster? (private 
communication). 
KUCERA, L. (1982), Parallel computation and conflicts in memory access, Inform. Process.
Lett. 14, 93.
LEV, G. E., PIPPENGER, N., AND VALIANT, L. G. (1981), A fast parallel algorithm for routing
on a permutation network, IEEE Trans. Comput. 30, 93.
LI, M., AND YESHA, Y. (1986), New lower bounds for parallel computation, in “Proc. 18th 
ACM Symposium on Theory of Computing,” pp. 177-187. 
MEYER AUF DER HEIDE, F., AND REISCHUK, R. (1984), On the limits to speed up parallel 
machines by large hardware and unbounded communication, in “Proc. 25th IEEE 
Symposium on Foundations of Computer Science,” pp. 56-64.
SNIR, M. (1985), On parallel searching, SIAM J. Comput. 14, 688. 
STOCKMEYER, L., AND VISHKIN, U. (1984), Simulation of parallel random access machines by 
circuits, SIAM J. Comput. 13, 409. 
