Limits on the power of concurrent-write parallel machines  by Beame, Paul
INFORMATION AND COMPUTATION 76, 13-28 (1988) 
Limits on the Power 
of Concurrent-Write Parallel Machines* 
PAUL BEAME+ 
Laboratory for Computer Science, Massachusetts Institute of Technology,
545 Technology Square, Cambridge, Massachusetts 02139 
The computation of specific functions using the most general form of concurrent- 
read-concurrent-write parallel RAM is considered. It is shown that such a machine 
can compute any function of Boolean inputs in time log n - log log n + O(1) given a
polynomial number of processors and memory cells and that this bound is tight for
integer addition. Despite this evidence of the power of the model we show that a
very simple function, namely parity, requires time $\Omega(\sqrt{\log n})$ to compute given a
polynomial bound on the number of processors, independent of the number of
memory cells. © 1988 Academic Press, Inc.
1. INTRODUCTION 
Parallel random access machines (PRAMs) are well accepted as good
models for parallel computation. The procedural nature of the way in
which they compute and the relatively natural way of enforcing uniformity
conditions on them have made them popular for the design of parallel
algorithms. They consist of many processors acting in concert and
communicating through some shared memory. They operate much like familiar
sequential RAMs except for the rules for concurrently accessing the shared
memory. There are three main classifications of these rules: exclusive
read-exclusive write (EREW), concurrent read-exclusive write (CREW),
and concurrent read-concurrent write (CRCW). The most powerful of
these machines (CRCW) can be simulated by the weakest (EREW) with a
delay per step proportional to the logarithm of the number of processors.
Cook, Dwork, and Reischuk (1984) have shown that both the EREW
and the CREW PRAMs require $\Theta(\log n)$ time to compute the OR of n bits.
Their lower bound holds independently of not only the number of 
processors or the size of the shared memory but also the way the program 
is specified and the instruction set of a processor. It really says something 
about the restrictive nature of the communication itself. 
* This work was done while the author was at the University of Toronto. 
+ Current address: Department of Computer Science, FR-35, University of Washington,
Seattle, Washington 98195. 
The OR of n bits can easily be computed in constant time on a CRCW 
PRAM and it is easy to see how with sufficiently many processors and cells 
any Boolean function can be computed in constant time. It is natural then 
to consider CRCW PRAMs which have resources bounded by a
polynomial in the size of the input and try to prove lower bounds similar
to those for PRAMs which have only exclusive writes. Previous lower
bounds for CRCW PRAMs have approached this in different ways. Some
have put extremely stringent restrictions on the number of processors or the
size of the shared memory, e.g., Vishkin and Wigderson (1985) or Fich,
Meyer auf der Heide, Ragde, and Wigderson (1985). Other bounds which
hold with a polynomial number of processors or cells but which put severe
restrictions on the instruction sets of the PRAMs are due to Stockmeyer
and Vishkin (1984) and Meyer auf der Heide and Reischuk (1984). Other
more powerful primitives than those they allow seem to be perfectly
reasonable since the cost of concurrent accesses is presumably significantly
greater than that of local computations. A symptom of the restricted nature
of these CRCW PRAMs is that most Boolean functions require time
$\Omega(2^{n/2})$ to compute given a polynomial number of processors and cells
whereas all Boolean functions can be computed in O(log n) time by the
EREW or CREW machines of Cook, Dwork, and Reischuk using only n
processors and cells.
We consider computation of specific functions for the most general form 
of CRCW machine which we call the Abstract CRCW PRAM. We first 
show that such a machine can compute any function of Boolean inputs in 
time log n - log log n + O(1) given a polynomial number of processors and
cells and that this bound is tight. Essentially the same machines are
considered by Meyer auf der Heide and Wigderson (1985), in which the
authors prove an $\Omega(\sqrt{\log n})$ time lower bound for sorting n integers and a
$\Theta(\log n)$ lower bound for adding n integers. Their results rely on Ramsey
theory and as a consequence their bounds only hold for integers which are
extremely large, so large that the problems in question are not polynomial
time computable given a sequential machine with an honest complexity 
measure like the log-cost RAM. 
We show that on the Abstract CRCW PRAM the sum of n numbers 
requires $\Theta(\log n)$ time even if for example the numbers have only n bits
each. Using a result of Hastad (1986) concerning unbounded fan-in
circuits, the main bound we prove is that on this extremely powerful model
some very simple functions on $\{0, 1\}^n$ require time $\Omega(\sqrt{\log n})$ to compute
given a polynomial bound on the number of cells and processors. These
functions include computing the parity of, sorting, or adding n bits as well
as multiplying two n/2-bit integers. These results are the first non-trivial
lower bounds for computing Boolean functions on CRCW PRAMs which
do not rely on restricted instruction sets of processors or resources smaller
than the size of the problem input. They show that there is something
inherent in the ways in which processors communicate with each other 
which limits their computational power. 
2. THE ABSTRACT CRCW PRAM 
DEFINITION. An (Abstract) CRCW PRAM is a shared memory machine
with processors $P_1, \ldots, P_{p(n)}$ which communicate through memory cells
$C_1, \ldots, C_{c(n)}$. The input is initially stored in the first n cells of memory,
$C_1, \ldots, C_n$. Initially all cells other than the input cells contain the value 0.
The output of the machine is the value in the cell $C_1$ at termination time
T(n).
Before each step t processor $P_i$ is in state $q_i^t$. At time step t, depending on
$q_i^t$, processor $P_i$ reads some cell $C_j$ of shared memory, then, depending on
the contents of $C_j$ and $q_i^t$, assumes a new state $q_i^{t+1}$, and depending on this
state, writes a value $v = v(q_i^{t+1})$ into some cell.
When several processors are attempting to write into a single cell at the 
same time step the one that succeeds will be the lowest numbered 
processor. 
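The write rule just stated can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: the names `resolve_writes` and `constant_time_or`, the dictionary used for shared memory, and the separate output cell 0 are hypothetical and are not part of the model definition above.

```python
def resolve_writes(memory, attempts):
    """Apply one concurrent-write step under the lowest-numbered-processor rule.

    memory   : dict mapping cell index -> current contents
    attempts : list of (processor_number, cell_index, value) write attempts
    Returns a new dict; cells that nobody writes to keep their old contents.
    """
    new_memory = dict(memory)
    winner = {}  # cell index -> lowest processor number seen so far
    for proc, cell, value in attempts:
        if cell not in winner or proc < winner[cell]:
            winner[cell] = proc
            new_memory[cell] = value
    return new_memory


def constant_time_or(bits):
    """OR of n bits in one write step: every processor that reads a 1 writes 1.

    Simplifying assumption: the result goes to a dedicated cell 0 that starts at 0.
    """
    memory = {0: 0}
    attempts = [(i + 1, 0, 1) for i, bit in enumerate(bits) if bit == 1]
    return resolve_writes(memory, attempts)[0]
```

The second function shows why the OR of n bits takes constant time on such a machine: all writers agree on the value, so the tie-breaking rule never matters.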
A processor may write anything into its cells when it writes, including a 
full description of the portion of the history of the computation which it 
knows, as well as its own processor number. When there is no danger of 
confusion we will assume that a CRCW PRAM is an Abstract CRCW 
PRAM. 
In the definition of an Abstract CRCW PRAM it becomes clear that, 
except for the final step of an algorithm in which the machine must output 
function values, the actual values of a processor’s state or the contents of a 
cell are not important. What is important is the set of inputs which lead to 
the cell contents or state. The computation then may be viewed as 
operating not on actual values as much as on the partitions associated with 
them. 
DEFINITION. Let M be a CRCW PRAM. For any processor $P_i$ the
processor partition, P(M, i, t), of the input set at time step t is defined so
that two inputs are in the same equivalence class of P(M, i, t) if and only if
they lead to the same state of processor $P_i$ at the end of time step t.
For any cell $C_j$ the cell partition, C(M, j, t), of the input set at time t is
defined so that two inputs are in the same equivalence class of C(M, j, t) if
and only if they lead to the same contents of cell $C_j$ at the end of time
step t.
The idea that these processor and cell partitions are the crucial aspects of 
a computation is implicit in much of the lower bound work in this area. 
Snir (1985), for example, has given a formal explanation of the actions of a 
most powerful EREW PRAM which is essentially a method of describing 
the ways that the processor and cell partitions of that machine may be 
combined during a computation. 
It will be useful to use the following restrictions of the Abstract CRCW 
PRAM in order to simplify our discussions of these machines. 
DEFINITION. We say that a CRCW PRAM is a common-write CRCW 
PRAM if whenever several processors are attempting to write into a single 
cell at the same time step, the values that they are attempting to write are 
the same. 
LEMMA 2.1 [Kucera (1982)]. Let M be a CRCW PRAM with p(n)
processors and c(n) memory cells, and taking T(n) time. Then a common-
write CRCW PRAM can simulate M by using $O(p(n)^2)$ processors, using
c(n) + p(n) memory cells, and taking 4T(n) time.
Finally, for notational convenience we will assume unless otherwise 
stated that all logarithms are to the base 2. 
3. THE COMPUTATION OF ARBITRARY BOOLEAN FUNCTIONS 
AND INTEGER ADDITION 
By an argument similar to that of the Shannon bounds on combinatorial
circuit complexity (see Savage (1976)), Ruzzo (1986) has shown that
almost all Boolean functions require depth $\Omega(2^{n/2})$ to be computed by
polynomial size unbounded fan-in circuits. Then by the simulations in
Stockmeyer and Vishkin (1984) we see that restricted CRCW PRAMs
require time $\Omega(2^{n/2})$ to compute most Boolean functions given a
polynomial number of processors. The standard trick used for the following
lemma shows that Abstract PRAMs are much more powerful than PRAMs
with restricted instruction sets.
LEMMA 3.1. For any function f of n inputs there is an Abstract (EREW)
PRAM M that computes f in time $\lceil \log n \rceil + 1$ using $\lceil n/2 \rceil$ processors and n
cells.
Proof. The following algorithm computes any function of n inputs. In
the first step processor $P_i$ reads cell $C_i$. In the second step processor $P_i$
reads cell $C_{i+1}$ and writes the pair $(x_i, x_{i+1})$ into cell $C_i$. At the end of step
t + 1, t ≥ 0, processor $P_i$ will know and cell $C_i$ will contain the $2^t$ inputs
$(x_i, \ldots, x_{i+2^t-1})$. During step t + 2, processor $P_i$ reads cell $C_{i+2^t}$, thus
reading inputs $(x_{i+2^t}, \ldots, x_{i+2^{t+1}-1})$, and writes the encoding of the inputs
$(x_i, \ldots, x_{i+2^{t+1}-1})$ which it knows into cell $C_i$. During step $\lceil \log n \rceil + 1$
processor $P_1$ knows the entire input so instead of writing the encoding of
the input into cell $C_1$ it writes the function value into that cell.
Only half the processors in a given step need to continue on to the next
step since the actions of the rest never affect the final output. In fact, from step t + 1 on,
only processors whose indices are congruent to 1 modulo $2^t$ ever need to be
active. ∎
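The doubling pattern in this proof can be simulated sequentially. The Python sketch below is an illustration only: `binary_fan_in_tree` is a hypothetical name, indices are 0-based (so processor 0 plays the role of the first processor), tuples stand in for the "encoding of the inputs", and n ≥ 1 is assumed.

```python
from math import ceil, log2

def binary_fan_in_tree(x, f):
    """Simulate the binary fan-in tree algorithm; return (f(x), number of steps)."""
    n = len(x)
    cells = [(v,) for v in x]       # cell i initially holds input x_i
    known = list(cells)             # step 1: processor i reads cell i
    steps, span = 1, 1              # each active processor knows `span` consecutive inputs
    while span < n:
        # One synchronous step: reads see the old cells, writes go to new_cells.
        new_cells = list(cells)
        for i in range(0, n, 2 * span):          # only these processors stay active
            right = cells[i + span] if i + span < n else ()
            known[i] = known[i] + right          # processor i now knows 2*span inputs
            new_cells[i] = known[i]              # exclusive write into its own cell
        cells = new_cells
        span *= 2
        steps += 1
    # In the final step processor 0 writes the function value instead of the encoding.
    return f(known[0]), steps
```

Each cell is read by at most one processor and written by at most one processor per step, which is why the exclusive-read, exclusive-write model suffices here.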
The above algorithm uses the power of the processors to remember large 
portions of the input. If, for example, it is necessary to compute symmetric 
functions of Boolean inputs then machines with much less local memory 
would suffice. We shall refer to the algorithm as the binary fan-in tree 
algorithm. We now see that an Abstract CRCW PRAM can even beat the 
binary fan-in tree algorithm for inputs in $\{0, 1\}^n$.
THEOREM 3.1. A common-write Abstract CRCW PRAM which has
$p(n) \ge 2n$ and $c(n) \ge p(n)/\log(p(n)/n)$ can compute any function of inputs in
$\{0, 1\}^n$ in time $\log n - \log\log(p(n)/n) + O(1)$.
Proof. Suppose there are $p(n) = n 2^k$ processors (k < n). We exhibit an
algorithm which computes the function in log n - log k + 2 steps using
p(n)/k memory cells. Break up the input into chunks of k bits each. Assign
$k 2^k$ processors to each chunk, and associate a label $(i, \alpha)$ with each
processor for each $\alpha \in \{0, 1\}^k$ and each i = 1, ..., k.
(1) In the first step each processor with label $(i, \alpha)$ reads the ith bit
within its chunk and writes a 1 into a cell labelled $\alpha$ for its chunk if and
only if the value it read disagreed with the ith bit of $\alpha$.
(2) In the second step one processor for each $\alpha$ within each chunk
reads cell $\alpha$ and if it reads a 0 it writes $\alpha$ into a single cell designated for
that chunk.
The input has now been effectively compressed from n bits in n cells to
n/k k-bit integers in n/k cells. The remainder of the algorithm now uses the
binary fan-in tree algorithm on the n/k cells which contain the description
of the input to compute the function. This takes $\lceil \log(n/k) \rceil$ steps since there
is already one processor which knows the contents of each cell. Since
log(n/k) = log n - log k the running time is as claimed. ∎
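The two compression steps of this proof can be sketched as follows. This is a sequential Python illustration under our own assumptions: `compress` is a hypothetical name, the input length is taken to be a multiple of k, and a dictionary keyed by the pattern α stands in for the 2^k cells reserved for each chunk.

```python
from itertools import product

def compress(bits, k):
    """Compress n bits into n/k k-bit patterns using the two common-write steps."""
    assert len(bits) % k == 0, "sketch assumes k divides n"
    compressed = []
    for start in range(0, len(bits), k):
        chunk = bits[start:start + k]
        alphas = list(product((0, 1), repeat=k))
        mismatch = {alpha: 0 for alpha in alphas}   # one cell per pattern, initially 0
        # Step 1: processor (i, alpha) writes 1 into cell alpha if bit i disagrees
        # with alpha. All writers to a cell write the same value (common write).
        for alpha in alphas:
            for i in range(k):
                if chunk[i] != alpha[i]:
                    mismatch[alpha] = 1
        # Step 2: the processor for alpha that still reads 0 writes alpha into the
        # chunk's designated cell. Exactly one pattern matches the chunk.
        matching = [alpha for alpha in alphas if mismatch[alpha] == 0]
        assert len(matching) == 1
        compressed.append(matching[0])
    return compressed
```

After this step each of the n/k designated cells holds a full description of its chunk, which is the starting point for the binary fan-in tree phase.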
For the CRCW PRAM, a problem which a number of authors have 
considered is that of integer addition. Namely, given n integers stored in n 
cells of memory compute their sum as an output in a single cell. 
The first to consider this problem for CRCW PRAMs were Meyer auf
der Heide and Reischuk (1984) who showed that if the instruction set of
the machines is limited to operations like addition, indirect addressing,
basic logical operations, and conditionals as well as integer multiplication 
and the input integers are sufficiently large then the sum requires time 
log n. Their results depend on the fact that all functions computed by such 
machines are polynomials of the original inputs of a certain form and so 
the instruction set comes into the proof in a crucial way. The integers for 
which they obtained the lower bound require O(np(n)) bits to be represen- 
ted where p(n) is the number of processors in the machine. 
Subsequently two papers, independent of each other, extended this result 
to CRCW PRAMs of the general form with which we are concerned in this
paper. Meyer auf der Heide and Wigderson (1985) used the Erdős-Rado
theorem to show that if the integers are sufficiently large then the sum
requires log n time to compute on a general CRCW PRAM. As with many
other Ramsey-theoretic arguments, in order to be sufficiently large, the
integers involved must be huge, certainly containing exponentially many
bits in the number of processors and memory cells. At about the same time
Parberry (1986) used a more direct technique to show a better result, that 
the log n lower bound holds even if the integers have a number of bits 
which is a small degree polynomial in n and the number of processors and 
cells. 
We are interested in obtaining lower bounds which hold for all machines 
with reasonable limitations on their resources in terms of the size of the 
problem instance. With this in mind, we observe that even the lower bound 
for integer addition which Parberry obtains requires the integers in the 
problem definition to have many more bits than the amount of resources 
allowed to solve the problem. This is not entirely satisfactory since it means 
that, in order to find a single family of instances which is hard for all 
machines with polynomially bounded resources, the integers must have a 
number of bits which grows faster than any polynomial in n. 
We consider a function, related to the integer addition problem, which
is, in a sense we describe, the hardest function to compute on $\{0, 1\}^n$. We
give tight lower bounds for this function which yield tight lower bounds of 
log n for the integer addition problem with input integers having relatively 
few bits. Independently, Li and Yesha (1986) have obtained essentially the 
same lower bounds using quite different techniques which are similar to 
those of Parberry but which also use Kolmogorov complexity. 
DEFINITION. Let A be a partition of a set $I \subseteq \{0, 1\}^n$. Define the size of
A, #(A), to be the number of equivalence classes in A.
Let $b \ge 2$ and $\Sigma = \{0, 1, \ldots, b - 1\}$. Let $\mathrm{Encode}_b$ be the function on $\Sigma^n$
which converts an input string into the integer which it represents in base
b. For inputs from $\Sigma^n$ this is the hardest function to compute on the
CRCW PRAM since its output provides a complete description of the
input and in one further step any processor can read the first cell and
compute any other function.
THEOREM 3.2. Any CRCW PRAM computing the function $\mathrm{Encode}_b$
described above requires $T(n) \ge \log n + 1 - \log[1 + \log_b 2p(n)]$.
Proof. The important fact about the definition of $\mathrm{Encode}_b$ is that when
it has been computed the contents of the first cell must induce a partition
of the inputs which has $b^n$ distinct classes. That is, any M that computes
$\mathrm{Encode}_b$ must satisfy $\#(C(M, 1, T(n))) = b^n$. Let $p_t$ and $c_t$ for $t \ge 0$ satisfy
the following recurrences:
$$p_0 = 1$$
$$c_0 = b$$
$$p_{t+1} = p_t c_t$$
$$c_{t+1} = p(n)\, p_{t+1} + c_t$$
CLAIM. $p_t$ (respectively $c_t$) is an upper bound over all processors (cells)
of the size of the processor (cell) partition induced on the input set at the
end of time step t, i.e.,
$$p_t \ge \max_i \#(P(M, i, t))$$
$$c_t \ge \max_j \#(C(M, j, t)).$$
Since the input is initially present in the n cells, $C_1, \ldots, C_n$, and since the
processors initially have no access to it, it is clear that
$$\#(P(M, i, 0)) = 1 = p_0$$
$$\#(C(M, j, 0)) \le b = c_0.$$
The cell which a processor reads during time step t + 1 can depend only on 
its state at time t so that on elements of one class in the partition only one 
cell may be read. Thus during time step t + 1 each class in the processor 
partition at the end of step t may be split into at most the number of 
classes in the partition of the cell from which it read. It follows that 
$$\#(P(M, i, t+1)) \le \#(P(M, i, t)) \cdot \max_j \#(C(M, j, t)) \le p_t c_t = p_{t+1}.$$
A cell may have all p(n) processors writing into it during step t + 1 and 
each processor may communicate the entire partition of the portion of the 
input on which it succeeds in writing into the cell. Also the cell may still 
maintain the partition of the portion of the input on which no processor 
writes. Thus the number of classes into which the contents of the cell may 
resolve the input satisfies 
$$\#(C(M, j, t+1)) \le \sum_{i=1}^{p(n)} \#(P(M, i, t+1)) + \#(C(M, j, t)) \le p(n)\, p_{t+1} + c_t = c_{t+1}.$$
Thus the claim follows. 
Easy calculation shows that for $t \ge 1$ we can bound $c_t$ above by
$(2p(n)b)^{2^{t-1}}$. In order to compute $\mathrm{Encode}_b$ in T steps, $c_T \ge b^n$ is needed.
Therefore $2^{T-1}[1 + \log_b 2p(n)] \ge n$ and so $T \ge \log n + 1 - \log[1 + \log_b 2p(n)]$. ∎
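The recurrences in this proof are easy to iterate numerically, which gives a quick sanity check of the claimed bound on the cell partition sizes and of the resulting time bound. The Python sketch below is illustrative only; the function names are hypothetical, and p stands for a fixed processor count p(n).

```python
from math import log, log2

def partition_size_bounds(b, p, steps):
    """Return [(t, p_t, c_t)] for t = 0..steps under the recurrences in the proof."""
    p_t, c_t = 1, b
    rows = [(0, p_t, c_t)]
    for t in range(1, steps + 1):
        p_t = p_t * c_t            # p_{t+1} = p_t * c_t
        c_t = p * p_t + c_t        # c_{t+1} = p(n) * p_{t+1} + c_t
        rows.append((t, p_t, c_t))
    return rows

def steps_needed(n, b, p):
    """Smallest T with c_T >= b**n according to the recurrences."""
    p_t, c_t, T = 1, b, 0
    while c_t < b ** n:
        p_t = p_t * c_t
        c_t = p * p_t + c_t
        T += 1
    return T

def closed_form_bound(n, b, p):
    """The bound log n + 1 - log[1 + log_b 2p(n)] from Theorem 3.2."""
    return log2(n) + 1 - log2(1 + log(2 * p, b))
```

For small parameters one can compare `steps_needed(n, b, p)` with `closed_form_bound(n, b, p)`; the former should never be smaller, since the closed form is derived from an upper bound on $c_t$.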
By choosing b = 2 in Theorem 3.2 we see that the bound in Theorem 3.1 
is asymptotically tight. Using Theorem 3.2 it is an easy step to prove a 
lower bound for the addition of integers. 
COROLLARY 3.1. The sum of n integers of n log b bits each requires time
at least $\log n + 1 - \log[1 + \log_b 2p(n)]$ to compute on a CRCW PRAM with
p(n) processors.
Proof. The function $\mathrm{Encode}_b$ is exactly the function computed by
considering the ith input $x_i$ as $x_i b^{i-1}$ and adding the resulting integers.
The corollary follows immediately since the largest such integer is bounded
above by $b^n - 1$ and so requires only n log b bits. ∎
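The reduction in this proof is short enough to write out. The Python sketch below is an illustration with a hypothetical function name; it treats the ith input digit as the integer $x_i b^{i-1}$ and adds the results, so that the sum is a complete description of the input.

```python
from itertools import product

def encode_as_sum(x, b):
    """Sum of the integers x_i * b**(i-1) for i = 1..n (digits x_i in 0..b-1)."""
    # enumerate is 0-based, so position j here corresponds to i - 1 in the text.
    summands = [x_j * b ** j for j, x_j in enumerate(x)]
    return sum(summands)
```

Because distinct inputs give distinct sums, any machine that adds these n integers also computes the encoding, and the lower bound for the encoding carries over to addition.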
COROLLARY 3.2. For n sufficiently large, the sum of n integers with
$\omega(n \log n)$ bits each requires time log n to compute on a CRCW PRAM
given a number of processors polynomial in n.
COROLLARY 3.3. The sum of n integers with n bits each requires time
$\Omega(\log n)$ to compute on a CRCW PRAM with as many as $2^{n^{\varepsilon}}$ processors for
any $\varepsilon < 1$.
There is something apparently contradictory about these results. The 
sum of n integers with polynomially many bits can be computed in 
O(log n) depth using combinatorial circuits. Thus, using a general result of
Chandra, Stockmeyer, and Vishkin (1984), the sum can be computed in 
O(log n/log log n) depth using polynomial size unbounded fan-in circuits. 
How then can we reconcile the log n lower bound with the simulation of 
Stockmeyer and Vishkin? 
The simulation of Stockmeyer and Vishkin shows how each output of an 
unbounded fan-in circuit can be computed on a CRCW PRAM in time 
equal to the depth of the circuit. Since circuits can only present each bit of 
an integer as a single output, each bit of the sum can in fact be computed 
in time O(log n/log log n) on a CRCW PRAM. It is merely the requirement 
that the entire output must appear in one cell which is responsible for the
additional complexity.
This illustrates a general phenomenon of function computation on
CRCW PRAMs. In the following sense it really consists of decision
problem computation plus computation of an appropriate version of the 
Encode function. 
THEOREM 3.3. For each n let $f_n$ be a function defined on $\{0, 1\}^n$ with
range R = R(n). Since $|R(n)| \le 2^n$ for each n there is a binary encoding of R
which uses at most n bits. Let $d_i : R(n) \to \{0, 1\}$ be the function which yields
the ith bit of this binary encoding function on R(n). Let p(n) and c(n) grow
as $n^{O(1)}$. Then
$$\log\log|R| - \log\log n + O(1) \le T_f \le \max_i T_{d_i \circ f} + \log\log|R| - \log\log n + O(1) \le T_f + \log\log|R| - \log\log n + O(1),$$
where $T_{d_i \circ f}$ is the time required to compute $d_i \circ f$ on a CRCW PRAM with
p(n) processors and c(n) memory cells, and $T_f$ is the time to compute f on a
CRCW PRAM with n p(n) processors and n c(n) cells.
Proof. The first inequality follows from the bound $c_t \le (4p(n))^{2^{t-1}}$
given in the proof of Theorem 3.2 and the third inequality follows trivially
since $d_i \circ f$ can be computed in one step given the value of f. The second
inequality follows from the algorithm which computes each bit of the
encoding of R separately and then encodes the result using the algorithm
given in Theorem 3.1. ∎
COROLLARY 3.4. Given f, R, $d_i$, $T_f$, and $T_{d_i \circ f}$ as defined in the theorem
above
$$T_f = \Omega(\max_i T_{d_i \circ f} + \log\log|R| - \log\log n).$$
This corollary really implies that the only thing which is interesting 
about function computation on the CRCW PRAM as opposed to decision 
problem computation is the Encode function. We consider decision 
problem computation on the CRCW PRAM in Section 5. 
4. PARTITIONS, DEGREES, AND RESTRICTIONS 
We need a few definitions which will allow us to define a measure of 
progress for CRCW PRAM computation. Since these have a somewhat 
independent interest we put these in a separate section. 
We may describe each equivalence class in a partition of the input set
$I = \{0, 1\}^n$ by a Boolean formula which expresses the characteristic
function of that equivalence class. When this formula is written in
disjunctive normal form (DNF), the subset of the inputs which it describes
may be written as the union of the sets of inputs which satisfy each clause.
DEFINITION. For any partition A of $J \subseteq I = \{0, 1\}^n$ let the degree of A,
$\delta(A)$, be the length of the longest clause in a minimal DNF formula for the
characteristic function of each equivalence class in A when considered as a
subset of J.
Remark. If a partition B is a refinement of partition A then
$\delta(A) \le \delta(B)$.
A restriction $\pi$ of the input set I is a map
$$\pi : \{1, 2, \ldots, n\} \to \{0, 1, *\},$$
where
$\pi(i) = 1$ means $x_i$ is set to 1,
$\pi(i) = 0$ means $x_i$ is set to 0,
$\pi(i) = *$ means $x_i$ is unset.
DEFINITION. For any partition A of I and any restriction $\pi$ let $A\lceil_\pi$ be
the restriction of A to the set $I\lceil_\pi = \{x \in I : \forall x_i \text{ set by } \pi,\ x_i = \pi(i)\}$. If F is a
Boolean formula then $F\lceil_\pi$ is the formula obtained by replacing each literal
corresponding to an input which $\pi$ sets by the truth value assigned by $\pi(i)$.
LEMMA 4.1. Let $\pi$ be a restriction and A a partition of I. If F is a DNF
formula for the characteristic function of some equivalence class $C \in A$ then
$F\lceil_\pi$ is a DNF formula for the characteristic function of the corresponding
equivalence class $C\lceil_\pi \in A\lceil_\pi$ considered as a subset of $I\lceil_\pi$.
DEFINITION. A random restriction $\rho$ chosen from $R_q$ is a function which
independently assigns 0, 1, or * to each $i \in \{1, 2, \ldots, n\}$, where
$$\Pr[\rho(i) = *] = q$$
$$\Pr[\rho(i) = 1] = \tfrac{1}{2}(1 - q)$$
$$\Pr[\rho(i) = 0] = \tfrac{1}{2}(1 - q).$$
Let $\pi$ be a restriction of the input set I. A random q-specialization $\pi'$ of $\pi$ is
a restriction which agrees with $\pi$ on all the inputs set by $\pi$ and which takes
the values set by a randomly chosen $\rho \in R_q$ on the remaining inputs.
An important measure of the progress in the computation of a CRCW
PRAM will be $\delta(A\lceil_\pi)$, where A is P(M, i, t) or C(M, j, t) and $\pi$ is an
appropriate restriction.
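To make the definitions of this section concrete, the following Python sketch draws a random restriction and applies a restriction to a DNF formula. The representation is our own assumption: a DNF is a list of clauses, each clause a dictionary from variable index to the bit it requires, and the function names are hypothetical. Note that `longest_clause` measures the given DNF, which only upper-bounds the length of the longest clause in a minimal DNF.

```python
import random

def random_restriction(n, q, rng):
    """Draw rho from R_q: each x_i is '*' with probability q, else 0 or 1 equally likely."""
    rho = {}
    for i in range(1, n + 1):
        if rng.random() < q:
            rho[i] = '*'
        else:
            rho[i] = rng.choice((0, 1))
    return rho

def restrict_dnf(clauses, rho):
    """Apply restriction rho to a DNF given as a list of {variable: required bit} dicts.

    Returns True if some clause is satisfied outright; otherwise the list of
    surviving clauses over the unset variables (an empty list means False).
    """
    survivors = []
    for clause in clauses:
        remaining = {}
        falsified = False
        for var, bit in clause.items():
            if rho[var] == '*':
                remaining[var] = bit      # literal on an unset variable survives
            elif rho[var] != bit:
                falsified = True          # a set variable contradicts this clause
                break
        if falsified:
            continue
        if not remaining:
            return True                   # every literal of the clause was satisfied
        survivors.append(remaining)
    return survivors

def longest_clause(clauses):
    """Length of the longest clause in this particular DNF (0 for an empty DNF)."""
    return max((len(c) for c in clauses), default=0)
```

A random q-specialization of a restriction can be obtained in the same spirit by keeping the already-set entries and drawing the remaining ones as in `random_restriction`.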
5. LOWER BOUNDS FOR SOME SIMPLE FUNCTIONS 
The parity function is the function on binary values $x_1, \ldots, x_n$ which
produces their sum modulo 2. Furst, Saxe, and Sipser (1984) gave an
$\Omega(\log^* n)$ lower bound on the depth of polynomial size unbounded fan-in
circuits computing parity. Ajtai, extending the results in Ajtai (1983), and 
Babai (1984), improved the depth lower bound for polynomial size parity 
circuits to Q(s). Independently, by applying and modifying the 
techniques of Furst, Saxe, and Sipser (1984), we were able to derive an 
intermediate lower bound (Beame (1985)) which states that any CRCW 
PRAM computing the parity function and having p(n) unbounded but 
$c(n) = n^{O(1)}$ requires time $T(n) = \Omega(\sqrt{\log\log n})$.
The results for unbounded fan-in circuits were based on efforts to achieve 
exponential lower bounds for constant depth circuits computing parity. 
Yao, in a breakthrough paper (Yao, 1985), was able to give exponential 
lower bounds for constant depth parity circuits but his results do not imply 
anything better than Q(fi) lower bounds for polynomial size parity 
circuits. Also, because of the notion of approximation, Yao’s proofs do 
not appear to translate well to the machine model with which we are 
concerned. 
Using some of the techniques in Yao (1985), Hastad (1986) has obtained 
improved lower bounds for parity circuits. His improved results yield 
Q(log n/log log n) lower bounds for polynomial size parity circuits which 
match the upper bounds for such circuits given by Chandra, Stockmeyer, 
and Vishkin (1984). He also eliminates the necessity for approximation as 
used by Yao. This makes his proofs amenable to modification and 
application to obtain the following result which has a much shorter proof 
than that in Beame (1985). 
THEOREM 5.1. If M is a CRCW PRAM which computes the parity
function with $p(n) = n^{O(1)}$ and c(n) unbounded then $T(n) = \Omega(\sqrt{\log n})$.
During the course of the computation of a CRCW PRAM its processor 
and cell partitions are becoming more and more complex. The basic idea of 
the proof of this theorem is that despite this if we set the values of some of 
the input variables then the processor and cell partitions have small degree 
on the remaining variables. Of course the number of variables whose values 
are set must grow over time as the partitions become more complex. The 
result will follow because any partition for the restricted parity function 
requires degree equal to the number of the remaining variables. 
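The last claim, that a DNF clause for parity must mention every remaining variable, can be checked by brute force for small n. The Python sketch below is only an illustration with hypothetical names; it enumerates every partial assignment and finds the shortest one that forces parity to 1.

```python
from itertools import product

def parity(bits):
    return sum(bits) % 2

def shortest_parity_clause(n):
    """Length of the shortest partial assignment that forces parity of n bits to 1."""
    best = None
    for partial in product((0, 1, None), repeat=n):      # None marks an unset variable
        free = [i for i, v in enumerate(partial) if v is None]
        values = set()
        for fill in product((0, 1), repeat=len(free)):   # try every completion
            x = list(partial)
            for i, v in zip(free, fill):
                x[i] = v
            values.add(parity(x))
        if values == {1}:                                # parity is forced to 1
            length = n - len(free)
            best = length if best is None else min(best, length)
    return best
```

Leaving even one variable unset lets parity take both values, so no shorter clause can be an implicant.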
Proof of Theorem 5.1. We assume without loss of generality by 
Lemma 2.1 that M satisfies the common-write rule since the simulation 
only squares the number of processors. We will also assume that $p(n) \ge n$
and that when processors write a value they tag it with the time step during
which they are writing. This does not conflict with the common-write rule and
can only transmit additional information.
We will define a restriction $\pi_t$ for each step t of the computation such
that after step t and after $\pi_t$ is applied, the cell partitions will all have
degree less than the number of unset bits. The lower bound will follow
since parity has minimal DNF clauses which depend on all the unset bits.
Define $\pi_0$ so that $\pi_0(i) = *$ for all i (i.e., all bits are unset) and $\pi_{-1} = \pi_0$.
$\pi_t$ will be chosen from the random $q_t$-specializations of $\pi_{t-1}$, where $q_t$ will
be defined later. For $t \ge 0$, let $p_t = \max_i \delta(P(M, i, t)\lceil_{\pi_{t-1}})$, $c_t = \max_j \delta(C(M, j, t)\lceil_{\pi_t})$,
and $m_t$ be the number of bits unset by $\pi_t$.
Let $b_t = 3^t \log_n p(n)$ and $q_t = \frac{1}{20 b_t}\, p(n)^{-1/b_t} = \frac{1}{20 b_t}\, n^{-3^{-t}}$.
CLAIM. For $t \ge 0$ we can choose $\pi_t$ so that $p_t \le b_t$, $c_t \le 2b_t$, and
$m_t \ge n 2^{-t} \prod_{i=1}^{t} q_i$.
We show this by induction. The base case is $p_0 = 0 < c_0 = 1 \le b_0$ and
$m_0 = n$. The induction step is the following. Let $t \ge 1$. Since the cell which a
processor reads during step t is dependent solely upon its current state, the
new state which the processor assumes will depend only on the old state
and the equivalence class in which the value in the cell read lies. Since the
clauses in the cell partition have at most $c_{t-1}$ literals corresponding to unset
input bits, the longest DNF clause needed to describe the new state’s
equivalence class will have at most $p_{t-1} + c_{t-1}$ literals corresponding to
input bits which were unset after step t - 1. Thus it follows that
$$p_t \le p_{t-1} + c_{t-1} \le b_{t-1} + 2b_{t-1} = b_t.$$
From this recurrence it is easy to see that the concurrent reads do not give
the CRCW PRAM much of its power. An identical recurrence would also 
hold for CREW PRAMs. It is the concurrent writes which cause all the
difficulty. 
We now try to bound $c_t$. Cells into which no processor writes during
step t on any input in $I\lceil_{\pi_{t-1}}$ will be unaffected by the write step and so will
have equivalence classes with clauses bounded by $c_{t-1}$. Thus we need only
consider cells into which processors may write on some input in $I\lceil_{\pi_{t-1}}$. We
say that such cells are used during write step t.
For the cells used during write step t, we first consider equivalence 
classes which correspond to values written by processors during step t. 
Because M satisfies the common-write rule, the set of inputs on which some
value v is written into a particular cell is the union of the sets of inputs on
which each processor writes v into that cell. In fact, because of the tagging
by time step during writes, the equivalence class in the cell corresponding
to v is exactly such a union. Since the set of inputs on which a particular
processor performs any action is a union of equivalence classes in that
processor’s partition, the set of inputs on which some processor writes v
into a particular cell is a union of equivalence classes of processors. Thus
the equivalence class corresponding to v is a union of classes with DNF
clauses bounded by $p_t$, and so its DNF clauses have maximum length at
most $p_t \le b_t$.
For each cell used during write step t it remains to consider the 
equivalence classes which correspond to the cases when no processor has 
actually written into the cell during step t. Each such class is the intersec- 
tion of some old cell class and the set of inputs on which no processor 
writes into the cell during step t. We will choose $\pi_t$ in such a way that the
set of inputs on which no processor writes into the cell is described by a
DNF formula whose clauses are bounded by $b_t$. Then each class within the
cell will have DNF clauses bounded by $c_{t-1} + b_t$. To achieve this we note
that the set of inputs on which some processor writes into a cell is a union
of processor classes and so is already described by a DNF formula with
clauses bounded by $p_t$. Then it follows that G, the formula describing the
set of inputs on which no processor writes into the cell, is the negation of a
formula with short DNF clauses and so has CNF clauses bounded by
$p_t \le b_t$. We now apply the following lemma which is essentially the main
lemma of Hastad (1986).
LEMMA 5.1 [Hastad (1986)]. Let G be any CNF formula each of
whose clauses has at most r literals. Let $\rho$ be a random restriction chosen
from $R_q$. Then
$$\Pr[G\lceil_\rho \text{ has minimal DNF with any clause of length} \ge s] \le (5qr)^s.$$
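Thus if we choose a random $q_t$-specialization $\rho$ of $\pi_{t-1}$,
$$\Pr[\delta(C(M, j, t)\lceil_\rho) \ge c_{t-1} + b_t] \le (5 q_t b_t)^{b_t} = \left(\tfrac{1}{4}\, p(n)^{-1/b_t}\right)^{b_t} = 4^{-b_t}\, p(n)^{-1}.$$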
It is clearly only necessary to apply Hastad’s lemma once per cell used
during write step t. An upper bound on the number of cells used during
write step t is certainly the number of possible states of all processors at the
time of the write. This is just the number of classes in the processor
partitions at the time of the write. Because the DNF clauses describing each
equivalence class are bounded by $p_t \le b_t$ and a class is the union of the sets
of inputs satisfying each DNF clause, each class contains a fraction of at
least $2^{-b_t}$ of the possible inputs. The classes within a single processor
partition are disjoint so there are at most $2^{b_t}$ classes per processor. Thus at most
$p(n)\, 2^{b_t}$ cells are used during write step t. It follows that
$$\Pr[\exists \text{ a cell } C_j \text{ such that } \delta(C(M, j, t)\lceil_\rho) > c_{t-1} + b_t] \le 2^{-b_t}. \quad (1)$$
If $l$ is the number of bits unset by $\rho$ then by Chebyshev's inequality we
have 
$$\Pr[\, l \le m_{t-1} q_t / 2 \,] < \frac{4}{m_{t-1} q_t}. \qquad (2)$$
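The Chebyshev step can be illustrated with exact arithmetic. The sketch below assumes that each of the m currently unset bits stays unset independently with probability q, so that the number of surviving bits is binomial with mean mq and variance mq(1 - q). It compares the exact probability of falling to at most half the mean, the threshold $m_{t-1} q_t/2$ used in the next paragraph, with the Chebyshev estimate $4(1-q)/(mq)$. The independence model and the function names are assumptions of this sketch.

```python
from fractions import Fraction
from math import comb


def lower_tail(m: int, q: Fraction) -> Fraction:
    """Exact Pr[l <= m*q/2] for l ~ Binomial(m, q)."""
    threshold = m * q / 2
    return sum(
        (comb(m, k) * q ** k * (1 - q) ** (m - k)
         for k in range(m + 1) if k <= threshold),
        Fraction(0),
    )


def chebyshev_bound(m: int, q: Fraction) -> Fraction:
    """Variance / (mean/2)^2 = 4(1 - q) / (m q)."""
    return 4 * (1 - q) / (m * q)


if __name__ == "__main__":
    m, q = 40, Fraction(1, 4)  # hypothetical sample values
    print(float(lower_tail(m, q)), float(chebyshev_bound(m, q)))
```

Since the Chebyshev estimate is at most $4/(mq)$, it tends to 0 whenever the expected number of surviving bits mq grows with n.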
Since the probabilities in (1) and (2) add up to less than 1 for n sufficiently
large we may choose $\pi_t$ to be one of the random $q_t$-specializations
of $\pi_{t-1}$ so that neither condition holds. In this case we have
$c_t < c_{t-1} + b_t < 2b_t$ and $m_t > m_{t-1} q_t/2 \ge n\, 2^{-t} \prod_{i=1}^{t} q_i$ by the inductive
hypothesis. The claim follows by induction.
Now, parity requires t steps if $m_{t-1} > b_t$. This is satisfied provided
$$n\, 2^{-(t-1)} \prod_{i=1}^{t-1} q_i > b_t$$
or provided
$$n^{1 - (1/3 + 1/9 + \cdots + 1/3^{t-1})} > 40^{t-1} \prod_{i=1}^{t} b_i. \qquad (3)$$
The right side of (3) is of the form $3^{t^2/2} (c_1 \log_n p(n))^t$ for some constant $c_1$.
Since $p(n) = n^{O(1)}$, $\log_n p(n)$ is a constant. Condition (3) is then satisfied for
some $t = \Omega(\sqrt{\log n})$. Thus in order to compute parity we must have
$T(n) = \Omega(\sqrt{\log n})$. ∎
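To get a feel for how condition (3) translates into a bound of order $\sqrt{\log n}$, one can search for the largest t that satisfies it. The sketch below works with base-2 logarithms to avoid overflow and assumes $b_i = 3^i \log_n p(n)$, a choice inferred from the exponents $1/3^i$ on the left side of (3) and not stated in this passage. The parameter `k` stands for $\log_n p(n)$; all names are illustrative.

```python
import math


def condition_3_holds(log2_n: float, k: float, t: int) -> bool:
    """Check n^(1 - sum_{i=1}^{t-1} 3^-i) > 40^(t-1) * prod_{i=1}^{t} b_i
    after taking log2 of both sides, with the assumed b_i = 3^i * k."""
    exponent = 1.0 - sum(3.0 ** -i for i in range(1, t))
    lhs = exponent * log2_n
    rhs = (t - 1) * math.log2(40.0) + sum(
        i * math.log2(3.0) + math.log2(k) for i in range(1, t + 1)
    )
    return lhs > rhs


def largest_t(log2_n: float, k: float) -> int:
    """Largest t such that condition (3) holds for every step count up to t."""
    t = 0
    while condition_3_holds(log2_n, k, t + 1):
        t += 1
    return t


if __name__ == "__main__":
    # Hypothetical setting: p(n) = n, so k = 1.
    for log2_n in (64.0, 256.0, 1024.0, 4096.0):
        print(log2_n, largest_t(log2_n, 1.0), math.sqrt(log2_n))
```

The right side grows like $3^{t^2/2}$ while the left side is at least $n^{1/2}$, so the largest admissible t grows in proportion to $\sqrt{\log n}$; the constants in this sketch are not tuned to match the paper's.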
Remark I. It was not really necessary to require that $p(n) = n^{O(1)}$ to
obtain the above lower bound. In fact one can show the same lower bound
with p(n) as large as $n^{2^{\epsilon\sqrt{\log n}}}$ for some $\epsilon > 0$. On the other hand one does not
obtain a stronger lower bound using our techniques by polynomially 
bounding the number of memory cells as well as the number of processors. 
Remark II. Using essentially the same argument as given above (in fact 
a slightly simpler one) one can prove the same lower bounds for common-write
CRCW PRAMS which have no restrictions on the number of
processors but which have c(n) bounded by $n^{2^{\epsilon\sqrt{\log n}}}$ for some $\epsilon > 0$.
Remark III. Methods found by Li and Yesha (1986) independently of 
this work can allow one to obtain results similar to those in Theorem 5.1 
and Remark I. As actually stated in that paper the time bound is the same 
as that in Beame (1985); however, slightly tightening up the simulation of 
Abstract CRCW PRAMS by circuits and applying Hastad’s instead of 
Yao’s lower bounds for circuits will yield the improved bounds. It is not 
clear that one could obtain the results in Remark II by this method. 
Using the reductions described by Chandra, Stockmeyer, and 
Vishkin (1984) we may derive the following bounds among others. 
COROLLARY 5.1. Any CRCW PRAM which sorts or adds n input bits or
computes the product of two n-bit integers requires time $\Omega(\sqrt{\log n})$ if it uses
a number of processors which is bounded by a polynomial in n.
6. SUMMARY AND OPEN PROBLEMS 
The Abstract CRCW PRAM we have described seems to be a natural 
model in which to prove lower bounds about concurrent-read-concurrent- 
write machines. 
It is natural to try to improve on the parity lower bound for Abstract 
CRCW PRAMS to match the upper bound of log n/log log n given by 
Chandra, Stockmeyer, and Vishkin (1984). Hastad (1986) was able to do 
this for unbounded fan-in circuits but the processors in an Abstract CRCW 
PRAM are accumulating information about the input in a very different 
way from these circuits and for this reason it may be possible that the 
upper bound for these machines is not optimal. 
The simulations in Stockmeyer and Vishkin (1984) show that with a
restricted instruction set a CRCW PRAM with a polynomial number of 
processors does not need more than a polynomial number of memory cells. 
The bounds in Beame (1985) and Theorem 5.1 appear to be incomparable 
since they restrict different resources: one restricts processors but not cells
and the other restricts cells but not processors. It seems worthwhile 
elaborating general relationships between the number of processors and the 
number of cells in the case of unrestricted instruction set machines. 
The most interesting open questions concerning the Abstract CRCW 
PRAM are the following: Are there specific Boolean functions for which we 
can prove stronger lower bounds than those for parity? Are there non- 




I thank Steve Cook for several discussions which helped initiate this work and which helped 
formalize some of the main ideas in it. I also extend thanks to Al Borodin and Faith Fich for 
carefully reading drafts of this paper and for helpful comments which have corrected and 
improved its presentation. I am grateful to Friedhelm Meyer auf der Heide for pointing me to
Corollary 3.1.
RECEIVED August 13, 1985; ACCEPTED January 8, 1987 
REFERENCES 
AJTAI, M. (1983), $\Sigma_1^1$-formulae on finite structures, Ann. Pure Appl. Logic 24, 1-48.
BABAI, L. (1984), private communication.
BEAME, P. W. (1985), Lower bounds for very powerful parallel machines, manuscript.
CHANDRA, A. K., STOCKMEYER, L. J., AND VISHKIN, U. (1984), Constant depth reducibility,
SIAM J. Comput. 13 (2), 423-439.
COOK, S. A., DWORK, C., AND REISCHUK, R. (1984), Upper and lower time bounds for
parallel random access machines without simultaneous writes, SIAM J. Comput. 15 (1),
87-97.
FICH, F. E., MEYER AUF DER HEIDE, F., RAGDE, P., AND WIGDERSON, A. (1985), One, two,
three ... infinity: Lower bounds for parallel computation, in "Proc. 17th ACM-STOC,"
pp. 48-58.
FURST, M., SAXE, J. B., AND SIPSER, M. (1984), Parity, circuits, and the polynomial time
hierarchy, Math. Systems Theory 17 (1), 13-28.
HASTAD, J. (1986), Improved lower bounds for small depth circuits, in "Proc. 18th ACM-
STOC," pp. 6-20.
KUCERA, L. (1982), Parallel computation and conflicts in memory access, Inform. Process.
Lett. 14 (2), 93-96.
LI, M., AND YESHA, Y. (1986), New lower bounds for parallel computation, in "Proc. 18th
ACM-STOC," pp. 177-187.
MEYER AUF DER HEIDE, F., AND REISCHUK, R. (1984), On the limits to speed up parallel
machines by large hardware and unbounded communication, in "Proc. 25th FOCS,"
pp. 56-64.
MEYER AUF DER HEIDE, F., AND WIGDERSON, A. (1985), The complexity of parallel sorting, in
"Proc. 26th IEEE-FOCS," pp. 532-540.
PARBERRY, I. (1986), On the time required to sum n semigroup elements on a parallel machine
with simultaneous writes, in "Proc. Aegean Workshop on Computation."
RUZZO, W. L. (1986), private communication.
SAVAGE, J. E. (1976), "The Complexity of Computing," Wiley, New York.
SNIR, M. (1985), On parallel searching, SIAM J. Comput. 14 (3), 688-708.
STOCKMEYER, L. J., AND VISHKIN, U. (1984), Simulation of parallel random access machines
by circuits, SIAM J. Comput. 13 (2), 404-422.
VISHKIN, U., AND WIGDERSON, A. (1985), Trade-offs between depth and width in parallel
computation, SIAM J. Comput. 14 (2), 303-314.
YAO, A. C. (1985), Separating the polynomial-time hierarchy by oracles: Part I, in "Proc.
26th IEEE-FOCS," pp. 1-10.
