We prove lower bounds for the computation of simple functions on generalized versions of parallel random access machines which allow both concurrent reads and concurrent writes. In particular, we show that if the number of processors is limited by a polynomial in $n$, then computing the sum of $n$ $n$-bit integers requires time $\Omega(\log n)$ and computing the parity of $n$ input bits requires time $\Omega(\sqrt{\log n})$. The latter result, using reductions given by Chandra, Stockmeyer, and Vishkin (1984), implies that a host of problems, including sorting or adding $n$ input bits or multiplying two $n/2$-bit integers, also require time $\Omega(\sqrt{\log n})$ to compute.
Introduction
Parallel random access machines (PRAM's) are well accepted as good models for parallel computation. The procedural nature of the way in which they compute and the relatively natural way of enforcing uniformity conditions on them have made them popular for the design of parallel algorithms. They consist of many processors acting in concert and communicating through some shared memory. They operate much like familiar sequential RAM's except for the rules for concurrently accessing the shared memory. There are three main classifications of these rules: exclusive read-exclusive write (EREW), concurrent read-exclusive write (CREW), and concurrent read-concurrent write (CRCW).
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. © 1986 ACM 0-89791-193-8/86/0500/0169 $00.75
The most powerful of these machines (CRCW) can be simulated by the weakest (EREW) with a delay per step proportional to the logarithm of the number of processors. Cook, Dwork and Reischuk (1984) have shown that both EREW and CREW PRAM's require $\Omega(\log n)$ time to compute the OR of $n$ bits. Their lower bound holds independently not only of the number of processors and the size of the shared memory but even of the way the program is specified and of the instruction set of a processor. It is a statement about the restrictive nature of the communication itself.
The OR of $n$ bits can easily be computed in constant time on a CRCW PRAM, and it is easy to see how, with sufficiently many processors and cells, any Boolean function can be computed in constant time. It is natural then to consider CRCW PRAM's which have resources bounded by a polynomial in the size of the input and to try to prove lower bounds similar to those for PRAM's which only have exclusive writes. Previous lower bounds for CRCW PRAM's have approached this in different ways. Some have put extremely stringent restrictions on the number of processors or the size of the shared memory, e.g. Vishkin and Wigderson (1985), or Fich, Meyer auf der Heide, Ragde and Wigderson (1985). Other bounds, which hold with a polynomial number of processors or cells but which put severe restrictions on the instruction sets of the PRAM's, are due to Stockmeyer and Vishkin (1984) and Meyer auf der Heide and Reischuk (1984). More powerful primitives than those they allow seem to be perfectly reasonable, since the cost of concurrent accesses is presumably significantly greater than that of local computation. A symptom of the restricted nature of these CRCW PRAM's is that most Boolean functions require time $\Omega(n/\log n)$ to compute given a polynomial number of processors and cells, whereas all Boolean functions can be computed in $O(\log n)$ time by the EREW or CREW machines of Cook, Dwork and Reischuk using only $n$ processors and cells.
We consider computation of specific functions for the most general form of CRCW machine, which we call the Abstract CRCW PRAM. We first show that such a machine can compute any function of Boolean inputs in time $\log n - \log\log n + O(1)$ given a polynomial number of processors and cells, and that this bound is tight. Essentially the same machines are considered by Meyer auf der Heide et al., in which the authors prove an $\Omega(\sqrt{\log n})$ time lower bound for sorting $n$ integers and an $\Omega(\log n)$ lower bound for adding $n$ integers. Their results rely on Ramsey theory and as a consequence their bounds only hold for integers which are extremely large, so large that the problems in question are not polynomial time computable given a sequential machine with an honest complexity measure like the log-cost RAM.
We show that on the Abstract CRCW PRAM the sum of $n$ numbers requires $\Omega(\log n)$ time even if, for example, the numbers have only $n$ bits each. Using a result of Hastad (1986) concerning unbounded fan-in circuits, the main bound we prove is that on this extremely powerful model some very simple functions on $\{0,1\}^n$ require time $\Omega(\sqrt{\log n})$ to compute given a polynomial bound on the number of cells and processors. These functions include computing the parity of, sorting, or adding $n$ bits as well as multiplying two $n/2$-bit integers. These results are the first non-trivial lower bounds for computing Boolean functions on CRCW PRAM's which do not rely on restricted instruction sets of processors or resources smaller than the size of the problem input. They show that there is something inherent in the ways in which processors communicate with each other which limits their computational power.
The Abstract CRCW PRAM

Definition:
An Abstract CRCW PRAM is a shared memory machine with processors $P_1, \ldots, P_{p(n)}$ which communicate through memory cells $X_1, \ldots, X_{c(n)}$.
The input is initially stored in the first n cells of ----v (qi t +1) into some cell.
When several processors are attempting to write into a single cell at the same time step the one that succeeds will be the lowest numbered processor.
A processor may write anything into its cells when it writes, including a full description of the portion of the history of the computation which it knows, as well as its own processor number. The idea that these processor and cell partitions are the crucial aspects of a computation is implicit in much of the lower bound work in this area. For example, Snir (1985) has given a formal explanation of what a most powerful EREW PRAM can do which is essentially a method of describing the ways that the processor and cell partitions of that machine may be combined during a computation.
It will be useful to use the following restriction of the Abstract CRCW PRAM in order to simplify our discussions of these machines.
Definition: We say that an Abstract CRCW PRAM satisfies the common write rule if whenever several processors are attempting to write into a single cell at the same time step, the values that they are attempting to write are the same.
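The two write rules described above can be illustrated with a small simulation of a single write step. This is only a sketch: the function names `resolve_writes` and `crcw_or`, the request format, and the use of cell 0 as an output cell initialised to 0 are assumptions made for the example and are not part of the paper's model definition.

```python
def resolve_writes(requests):
    """Resolve one concurrent-write step under the priority rule.

    requests: iterable of (processor_number, cell, value) triples.
    Returns (updates, is_common): updates maps each written cell to the
    value written by its lowest-numbered writer; is_common is True when,
    for every cell, all competing writers attempted the same value
    (i.e. the step also obeys the common write rule).
    """
    winner = {}  # cell -> (processor_number, value) of the current winner
    values = {}  # cell -> set of all values attempted on that cell
    for proc, cell, value in requests:
        values.setdefault(cell, set()).add(value)
        if cell not in winner or proc < winner[cell][0]:
            winner[cell] = (proc, value)
    updates = {cell: value for cell, (_, value) in winner.items()}
    is_common = all(len(attempted) == 1 for attempted in values.values())
    return updates, is_common


def crcw_or(bits):
    """OR of the input bits in one write step (hypothetical helper).

    Processor i reads bit i and writes 1 into cell 0 only if it read a 1.
    Cell 0 is assumed to hold 0 initially.
    """
    requests = [(i, 0, 1) for i, bit in enumerate(bits, start=1) if bit == 1]
    updates, is_common = resolve_writes(requests)
    assert is_common  # every writer writes the same value 1
    return updates.get(0, 0)
```

For instance, `crcw_or([0, 0, 1, 0])` evaluates to 1 in a single write step, and that step satisfies the common write rule because every writer writes the same value.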
Lemma 2.1: [Kucera (1982)] Let $M$ be a general Abstract CRCW PRAM with $p(n)$ processors, $c(n)$ memory cells and taking $T(n)$ time. Then an Abstract CRCW PRAM which satisfies the common write rule can simulate $M$ by using $O(p(n)^2)$ processors, $c(n) + p(n)$ memory cells and taking $4T(n)$ time.
The Computation of Arbitrary Boolean Functions and Integer Addition
From the Lupanov bounds on combinational circuit complexity (see Savage (1976)), the obvious simulation of unbounded fan-in circuits by combinational circuits, and the simulations in Stockmeyer and Vishkin (1984), we see that restricted CRCW PRAM's require time $\Omega(n/\log n)$ to compute most Boolean functions given a polynomial number of processors. We now see that an Abstract CRCW PRAM has much greater computational power than CRCW PRAM's with restricted instruction sets.
Theorem 3.1: An Abstract CRCW PRAM which satisfies the common write rule and has $c(n) \ge p(n) \ge n$ can compute any function of $n$ Boolean inputs in time $\log n - \log\log(p(n)/n) + O(1)$.
Proof: Suppose there are $p(n) = n 2^k$ processors. We exhibit an algorithm which computes the function in $\log n - \log k + 3$ steps. Break up the input into chunks of $k$ bits each. Assign $k 2^k$ processors to each chunk, and associate a label $(i, a)$ with each processor for each $a \in \{0,1\}^k$ and each $i = 1, \ldots, k$.
(1) In the first step each processor with label $(i, a)$ reads the $i$-th bit within its chunk and writes a 1 into a cell labelled $a$ for its chunk if and only if the value it read disagreed with the $i$-th bit of $a$.
(2) In the second step one processor for each $a$ within each chunk reads cell $a$ and if it reads a 0 it writes $a$ into a single cell designated for that chunk.
The input has now been effectively compressed from $n$ bits in $n$ cells to $n/k$ $k$-bit integers in $n/k$ cells. The remainder of the algorithm now uses a standard binary fan-in of the $n/k$ cells which contain the description of the input to arrive at the situation where a complete description of the input is contained in one cell.
This takes $\log(n/k)$ steps since there is already one processor which knows the contents of each cell. Processor $P_1$ now reads this cell and writes the value of the function into $X_1$. Since $\log(n/k) = \log n - \log k$ the running time is as claimed.
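The algorithm in this proof can be traced with a short sequential simulation. The sketch below is illustrative only: the function name `compress_and_evaluate`, the representation of cell contents as Python tuples, and the step accounting (one count per round, with $n/k$ assumed to be a power of two for the count to equal $\log n - \log k + 3$) are assumptions made for the example.

```python
from itertools import product


def compress_and_evaluate(bits, k, f):
    """Simulate the chunk-compression algorithm on a list of input bits.

    bits: input bits, with len(bits) divisible by k.
    f:    the function to compute, applied to the full input tuple.
    Returns (f(input), number_of_simulated_steps).
    """
    n = len(bits)
    assert n % k == 0
    chunks = [tuple(bits[j:j + k]) for j in range(0, n, k)]
    steps = 0

    # Step (1): processor (i, a) of a chunk writes 1 into that chunk's
    # cell a iff bit i of the chunk disagrees with bit i of a.
    mismatch_cells = []
    for chunk in chunks:
        cells = {a: 0 for a in product((0, 1), repeat=k)}
        for a in cells:
            for i in range(k):
                if chunk[i] != a[i]:
                    cells[a] = 1  # common write: every writer writes 1
        mismatch_cells.append(cells)
    steps += 1

    # Step (2): the processor for a reads cell a; the one that reads 0
    # writes a into the single cell designated for the chunk.
    compressed = []
    for cells in mismatch_cells:
        writers = [a for a, flag in cells.items() if flag == 0]
        assert len(writers) == 1  # exactly one a matches the chunk
        compressed.append(writers[0])
    steps += 1

    # Binary fan-in: each round concatenates the contents of pairs of cells.
    level = compressed
    while len(level) > 1:
        level = [sum(level[j:j + 2], ()) for j in range(0, len(level), 2)]
        steps += 1

    # Final step: one processor reads the full description and writes f.
    steps += 1
    return f(level[0]), steps
```

With $n = 8$ and $k = 2$ the simulation uses $2 + \log(8/2) + 1 = 5$ steps, matching $\log n - \log k + 3$.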
Proof: The important fact about the definition of $f$ is that when it has been computed the contents of the first cell must induce a partition of the inputs which has $b^n$ distinct classes.
Let $P_t$ ($C_t$) be a least upper bound over all processors (cells) of the number of classes in the processor (cell) partitions induced on the input set at the end of time step $t$. Since the input is initially present in the first $n$ cells and since the processors initially have no access to it, it is clear that:
$P_0 = 1$ and $C_0 = b$
The cell which a processor reads during time step $t+1$ can only depend on its state at time $t$, so that on elements of one class in the partition only one cell may be read. Thus each class in the partition at time $t$ may be split into at most $C_t$ classes during time step $t+1$, since this is the maximum number of classes distinguishable for a cell at time $t$. It follows that:
$P_{t+1} \le P_t C_t$
A cell may have all p (n) processors writing into it during step t +1 and each processor may communicate the entire partition of the portion of the input on which it succeeds in writing into the cell. Also the cell may still maintain the partition of the portion of the input on which no processor writes. Thus the number of classes into which the contents of the cell may resolve the input satisfies:
$C_{t+1} \le p(n) P_{t+1} + C_t$
Easy calculation shows that for $t \ge 1$ we can bound $C_t$ by $(2p(n)b)^{2^{t-1}}$.
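The closed-form bound can be checked numerically by iterating the two recurrences. The sketch below assumes the read-step recurrence $P_{t+1} \le P_t C_t$ (as described in the paragraph on reads) and the write-step recurrence $C_{t+1} \le p(n) P_{t+1} + C_t$, both taken with equality as the worst case; the function names are hypothetical.

```python
def partition_bounds(p, b, steps):
    """Iterate the worst-case recurrences starting from P_0 = 1, C_0 = b.

    Returns a list of (P_t, C_t) pairs for t = 0, ..., steps.
    """
    P, C = 1, b
    history = [(P, C)]
    for _ in range(steps):
        P = P * C      # read step: each class splits into at most C_t classes
        C = p * P + C  # write step: p writers plus the "nobody wrote" part
        history.append((P, C))
    return history


def closed_form(p, b, t):
    """The claimed bound (2 p b)^(2^(t-1)) on C_t, for t >= 1."""
    return (2 * p * b) ** (2 ** (t - 1))


def min_steps(p, b, n):
    """Smallest T for which the closed-form bound reaches b^n classes."""
    t = 1
    while closed_form(p, b, t) < b ** n:
        t += 1
    return t
```

For example, with $p = 4$, $b = 2$ and $n = 16$ the bound $(2pb)^{2^{T-1}} = 16^{2^{T-1}}$ first reaches $2^{16}$ at $T = 3$, so no machine with these parameters can separate all $b^n$ inputs in fewer than 3 steps.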
In order to compute $f_b$ in $T$ steps, $C_T \ge b^n$ is needed. Therefore $2^{T-1}[1 + \log_b 2p(n)] \ge n$ and so
$T \ge \log n - \log(1 + \log_b 2p(n)) + 1$.
By choosing $b = 2$ in Theorem 3.2 we see that the bound in Theorem 3.1 is tight. Using Theorem 3.2 it is an easy step to prove a lower bound for the addition of integers.
and adding the resulting integers. The Corollary follows immediately since the largest such integer is bounded by $b^n - 1$ and so requires only $n \log b$ bits. []
Corollary 3.2: The sum of $n$ integers with $\omega(n \log n)$ bits each requires time $\log n$ to compute on an Abstract CRCW PRAM given a number of processors polynomial in $n$.
Corollary 3.3: The sum of $n$ integers with $n$ bits each requires time $\Omega(\log n)$ to compute on an Abstract CRCW PRAM with as many as $2^{n^\epsilon}$ processors for any $\epsilon < 1$.
A log n lower bound for integer addition on similar machines has also been proved by Parberry (1985) but the integers must have more than polynomially many bits in n for this to hold for all machines with a polynomial number of processors and cells.
It is interesting to note that, since the sum of $n$ integers with polynomially many bits has $O(\log n)$ depth using combinational circuits, each bit of the output of the sum can be computed in time $O(\log n / \log\log n)$ on a CRCW PRAM. It is merely the requirement that the entire output must appear in one cell which is responsible for the additional complexity.
Partitions, Lengths and Projections
We may describe each equivalence class in a partition of the input set I = {0,1} n by a Boolean formula which expresses the characteristic function of that equivalence class. When this formula is written in disjunctive normal form (DNF), the subset of the inputs which it describes may be written as the union of inputs which satisfy each clause.
Definition: For any partition $A$ of $J \subseteq I = \{0,1\}^n$ let the length of $A$, $l(A)$, be the length of the longest clause in a minimal DNF formula for the characteristic function of each equivalence class in $A$ when considered as a subset of $J$.
Remark: If a partition $B$ is a refinement of partition $A$ then $l(A) \le l(B)$.
A projection $\pi$ of the input set $I$ is a map:
$I^\pi = \{x \in I : \forall x_i \text{ set by } \pi,\ x_i = \pi(i)\}$.
If $F$ is a Boolean formula then $F^\pi$ is the formula obtained by replacing each literal corresponding to an input which $\pi$ sets by the truth value assigned by $\pi(i)$.
Lemma 4.1: Let $\pi$ be a projection and $A$ a partition of $I$. If $F$ is a DNF formula for the characteristic function of some equivalence class $C \in A$ then $F^\pi$ is a DNF formula for the characteristic function of the corresponding equivalence class $C^\pi \in A^\pi$ considered as a subset of $I^\pi$.
Definition: A random restriction $\rho$ chosen from $R_q$ is a function which independently assigns 0, 1, or * to each $i \in \{1, 2, \ldots, n\}$, where:
Let $\pi$ be a projection of the input set $I$. A random $q$-specialization $\pi'$ of $\pi$ is a projection which agrees with $\pi$ on all the inputs set by $\pi$ and which takes the values set by a randomly chosen $\rho \in R_q$ on the remaining inputs.
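The definitions of random restrictions and specializations can be made concrete with a short sketch. The probabilities defining $R_q$ are not legible in this copy of the text, so the code assumes the distribution commonly used with Hastad's lemma: each position is left unset (*) with probability $q$ and is otherwise set to 0 or 1 with equal probability. That distribution, the dictionary representation of projections, and all function names are assumptions made for the example.

```python
import random

STAR = "*"  # marks an input position that is left unset


def random_restriction(n, q, rng):
    """Draw rho from R_q (ASSUMED distribution: * w.p. q, else a fair bit)."""
    rho = {}
    for i in range(1, n + 1):
        if rng.random() < q:
            rho[i] = STAR
        else:
            rho[i] = rng.choice((0, 1))
    return rho


def specialize(pi, rho):
    """Return the projection that agrees with pi wherever pi sets an input
    and takes rho's value on the inputs that pi leaves unset."""
    return {i: (pi[i] if pi[i] != STAR else rho[i]) for i in pi}


def unset_positions(pi):
    """Positions that the projection leaves unset."""
    return [i for i in sorted(pi) if pi[i] == STAR]
```

A usage sketch: starting from `pi = {1: 0, 2: STAR, 3: STAR, 4: 1}` and `rho = random_restriction(4, 0.5, random.Random(0))`, the projection `specialize(pi, rho)` keeps inputs 1 and 4 fixed at 0 and 1 and copies `rho` on inputs 2 and 3, so its unset positions are a subset of those of `pi`.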
An important measure of the progress in the computation of an Abstract CRCW PRAM will be $l(A^\pi)$ where $A$ is $P(M, i, t)$ or $C(M, j, t)$ and $\pi$ is an appropriate projection.
Lower Bounds for Some Simple Functions
The parity function is the function on binary values $x_1, \ldots, x_n$ which produces their sum modulo 2.
Furst, Saxe and Sipser (1984) gave an $\Omega(\log^* n)$ lower bound on the depth of polynomial size unbounded fan-in circuits computing parity. Ajtai, extending the results in Ajtai (1983), and Babai (1984), improved the depth lower bound for polynomial size parity circuits to $\Omega(\sqrt{\log n})$. Independently, by applying and modifying the techniques of Furst, Saxe and Sipser (1984), we were able to derive an intermediate lower bound which also applies to the Abstract CRCW PRAM model described here.
Theorem 5.1: [Beame (1985)] If $M$ is an Abstract CRCW PRAM computing the parity function with $c(n) = n^{O(1)}$ and $p(n)$ unbounded then
The results for unbounded fan-in circuits were based on efforts to achieve exponential lower bounds for constant depth circuits computing parity. Yao, in a breakthrough paper (Yao (1985)), was able to give exponential lower bounds for constant depth parity circuits, but his results do not seem to imply anything better than $\Omega(\sqrt{\log n})$ lower bounds for polynomial size parity circuits. Also, because of the notion of approximation, Yao's proofs do not appear to translate well to the machine model with which we are concerned.
Using some of the techniques in Yao (1985), Hastad (1986) has obtained improved lower bounds for parity circuits. His improved results yield $\Omega(\log n / \log\log n)$ lower bounds for polynomial size parity circuits which match the upper bounds for such circuits given by Chandra, Stockmeyer and Vishkin (1984). He also eliminates the necessity for approximation as used by Yao. This makes his proofs amenable to modification and application to obtain the following result, which has a much shorter proof than Theorem 5.1.
Theorem 5.2: If $M$ is an Abstract CRCW PRAM which computes the parity function with $p(n) = n^{O(1)}$ and $c(n)$ unbounded then $T(n) = \Omega(\sqrt{\log n})$.
Proof:
We assume without loss of generality by Lemma 2.1 that M satisfies the common write rule since the simulation only squares the number of processors. We will also assume that p (n) > n and when processors write a value they tag it with the time step during which they are writing. This does not conflict with the common write and can only transmit additional information.
We will define a projection $\pi_t$ for each step $t$ of the computation such that after step $t$ and after $\pi_t$ is applied, the cell partitions will all have length less than the number of unset bits. The lower bound will follow since parity has minimal DNF clauses which depend on all the unset bits.
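The fact that every DNF clause for parity must mention all of the unset bits can be checked by brute force for small $n$. The sketch below enumerates every partial assignment (term) and records the lengths of those that are implicants of parity; the function names and the use of `None` for an unmentioned variable are choices made for this example.

```python
from itertools import product


def parity(bits):
    """Sum of the bits modulo 2."""
    return sum(bits) % 2


def implicant_lengths(n, target=1):
    """Lengths of all terms that force parity to equal `target`.

    A term assigns 0 or 1 to some variables and leaves the rest free
    (None). It is an implicant if every completion of the free variables
    gives parity == target. The length of a term is the number of
    variables it mentions.
    """
    lengths = set()
    for term in product((0, 1, None), repeat=n):
        free = [i for i, v in enumerate(term) if v is None]
        is_implicant = True
        for fill in product((0, 1), repeat=len(free)):
            x = list(term)
            for i, v in zip(free, fill):
                x[i] = v
            if parity(x) != target:
                is_implicant = False
                break
        if is_implicant:
            lengths.add(n - len(free))
    return lengths
```

For every $n$ tried, the only implicant length that appears is $n$ itself, since flipping any free variable changes the parity. This is why a cell partition of length less than the number of unset bits cannot yet describe the parity of those bits.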
Define $\pi_0, \pi_1, \ldots$ as follows: $\pi_0(i) = *$ for all $i$ (all bits are unset).
$\pi_t$ will be chosen from the random $q_t$-specializations of $\pi_{t-1}$, where $q_t$ will be defined later.
Claim: For $t \ge 0$ we can choose $\pi_t$ so that $p_t \le b_t$, $c_t < 2b_t$ and $m_t \ge n 2^{-t} \prod_{i=1}^{t} q_i$.
We show this by induction:
Base case: $p_0 = 0$, $c_0 = 1 < b_0$ and $m_0 = n$.
Induction step: Let $t \ge 1$. Since the cell which a processor reads during step $t$ is dependent solely upon its current state, the new state which the processor assumes will depend only on the old state and the equivalence class in which the value in the cell read lies.
Since the clauses in the cell partition have only $c_{t-1}$ literals corresponding to unset input bits, the longest DNF clause describing the new state's equivalence class will have at most $p_{t-1} + c_{t-1}$ literals corresponding to input bits which were unset after step $t$. Thus it follows that:
Pt _~ Pt-l + ct-I _~ 3bt-l ~ 3tl°gn c (n ) ~ b t .
From this recurrence it is easy to see that the concurrent reads do not give the Abstract CRCW PRAM much of its power. An identical recurrence would also hold for CREW PRAM's. It is the concurrent writes which cause all the difficulty.
We now try to bound c t .
Cells into which no processor writes during step $t$ on any input in $I^{\pi_{t-1}}$ will be unaffected by the write step and so will have equivalence classes with clauses bounded by $c_{t-1}$. Thus we only need to consider cells into which processors may write on some input in $I^{\pi_{t-1}}$. We say that such cells are used during write step $t$.
For the cells used during write step $t$, we first consider equivalence classes which correspond to values written by processors during step $t$. Because $M$ satisfies the common write rule, the set of inputs on which some value $v$ is written into a particular cell is the union of the sets of inputs on which each processor writes $v$ into that cell. In fact, because of the tagging by time step during writes, the equivalence class in the cell corresponding to $v$ is exactly such a union. Since the set of inputs on which a particular processor performs any action is a union of equivalence classes in that processor's partition, the set of inputs on which some processor writes $v$ into a particular cell is a union of equivalence classes of processors. Thus the equivalence class corresponding to $v$ is a union of classes with DNF clauses bounded by $p_t$, and so its DNF clauses have maximum length at most $p_t \le b_t$.
For each cell used during write step $t$ it remains to consider the equivalence classes which correspond to the cases when no processor has actually written into the cell during step $t$. Each such class is the intersection of some old cell class and the set of inputs on which no processor writes into the cell during step $t$. We will choose $\pi_t$ in such a way that the set of inputs on which no processor writes into the cell is described by a DNF formula whose clauses are bounded by $b_t$. Then each class within the cell will have DNF clauses bounded by $c_{t-1} + b_t$. To achieve this we notice that the set of inputs on which some processor writes into a cell is a union of processor classes and so is already described by a DNF formula with clauses bounded by $p_t$. Then it follows that $G$, the formula describing the set of inputs on which no processor writes into the cell, is the negation of a formula with short DNF clauses and so has CNF clauses bounded by $p_t \le b_t$. We now apply the following lemma, which is essentially the main lemma of Hastad (1986).
Lemma 5.1: [Hastad (1986) ] Let G be any CNF formula each of whose clauses has at most r literals. 
If r is the number of bits unset by p then by Chebyshev's inequality we have
Since the probabilities in (1) and (2) add up to less than 1 we may choose $\pi_t$ to be one of the random $q_t$-specializations of $\pi_{t-1}$ so that neither condition holds.
In this case we have $c_t \le c_{t-1} + b_t < 2b_t$ and $m_t \ge m_{t-1} q_t / 2 \ge n 2^{-t} \prod_{i=1}^{t} q_i$ by the inductive hypothesis.
The claim follows by induction. 
The right side of (3) is of the form $3^{t^2/2} (c_1 \log_n p(n))^t$ for some constant $c_1$. Since $p(n) = n^{O(1)}$, $\log_n p(n)$ is a constant. Condition (3) is then satisfied for some $t = \Omega(\sqrt{\log n})$. Thus in order to compute parity we must have $T(n) = \Omega(\sqrt{\log n})$. []
Remark: It was not really necessary to require that $p(n) = n^{O(1)}$ to obtain the above lower bound. In fact one can show the same lower bound with $p(n)$ as large as $n^{2^{\epsilon\sqrt{\log n}}}$ for some $\epsilon > 0$. On the other hand one does not obtain a stronger lower bound using our techniques by polynomially bounding the number of memory cells as well as the number of processors.
Using the reductions described by Chandra, Stockmeyer and Vishkin (1984) we may derive the following bounds amongst others.
Corollary 5.1: Any Abstract CRCW PRAM which sorts or adds $n$ input bits or computes the product of two $n$-bit integers requires time $\Omega(\sqrt{\log n})$ if it uses a number of processors which is bounded by a polynomial in $n$.
Summary and Open Problems
The Abstract CRCW PRAM we have described seems to be a natural model in which to prove lower bounds about concurrent-read-concurrent-write machines.
It is natural to try to improve on the parity lower bound for Abstract CRCW PRAM's to match the upper bound of $O(\log n / \log\log n)$ given by Chandra, Stockmeyer and Vishkin (1984). Hastad (1986) was able to do this for unbounded fan-in circuits, but the processors in an Abstract CRCW PRAM are accumulating information about the input in a very different way from these circuits and for this reason it may be possible that the upper bound for these machines is not optimal.
The simulations in Stockmeyer and Vishkin (1984) show that with a restricted instruction set a CRCW PRAM with a polynomial number of processors does not need more than a polynomial number of memory cells. The bounds in Theorems 5.1 and 5.2 appear to be incomparable since they restrict different resources: one restricts processors but not cells and the other restricts cells but not processors. It seems worthwhile elaborating general relationships between the number of processors and the number of cells in the case of unrestricted instruction set machines.
The most interesting open question concerning the Abstract CRCW PRAM is the following: Are there specific Boolean functions for which we can prove stronger lower bounds than those for parity -in particular are there non-trivial lower bounds which match the values for the best known algorithms?
