For the first time we separate the widely used shared-memory models of parallel computation, COMMON(m), ARBITRARY(m), and PRIORITY(m), for small m (the communication width), without the restrictive assumption that each processor has access to only one input. Rather, we follow S. A. Cook and C. Dwork (1982, in "Proceedings, 14th ACM Symp. on Theory of Computing, 1982," pp. 231-233) and U. Vishkin and A. Wigderson (1985, SIAM J. Comput. 14, No. 2, 303-314), who assume that the inputs are given in a read-only memory (ROM). The previous separation results of F. Fich, P. Ragde, and A. Wigderson (1984, in with ROM and ARBITRARY(1) (even without ROM). We also generalize a technique of Fich, Ragde, and Wigderson (op. cit.) to separate ARBITRARY(m) with ROM and PRIORITY(1) (even without ROM). These two separation results are tight. Fich, Ragde, and Wigderson (ibid.) have previously obtained tight separation results for the corresponding models without ROM. We also settle a conjecture of Vishkin and Wigderson about the parallel time (depth) needed to compute PARITY nondeterministically on a PRIORITY(1). They conjecture that the lower bound is $\Omega(\sqrt{n})$, which is the same as the deterministic tight lower bound. We prove a nondeterministic upper bound of $O(n^{1/3})$. We also prove a tight lower bound of Q(A) for PARITY on nondeterministic PRIORITY(1) without ROM and a lower bound of $\Omega(\log \log n)$ for PARITY on nondeterministic PRIORITY(1) with ROM. © 1987 Academic Press, Inc.
1. INTRODUCTION
In this paper, we prove new separation results between shared memory models of parallel computation. The models we consider are variants of the parallel RAM (PRAM), which differ from each other in the way they restrict simultaneous writing into the same shared memory address. As pointed out in (Fich et al., 1984), there are algorithms for all of these models in the literature. Hence it is important to know whether these models differ in power.
A PRAM consists of a set of processors P(i), i = 1, 2, ..., which are random access machines (RAMs), a collection of shared memory cells C(i), i = 1, 2, ..., and n read-only input cells (ROM), X(1), X(2), ..., X(n). Each step of the computation consists of four phases:
(1) each processor reads from some ROM cell; (2) each processor reads from some shared memory cell; (3) each processor performs a computation; (4) each processor may attempt writing into some shared memory cell.
Whenever more than one processor simultaneously attempts writing into the same shared memory cell there is a write conflict. The variants of the PRAM differ in the way they handle write conflicts. These variants are:
(1) CREW (concurrent read exclusive write). In this model it can never happen that at the same step more than one processor attempts writing into the same shared memory cell.
(2) COMMON. In this model it is required that all the processors which at the same step attempt writing into the same shared memory cell write the same value.
(3) ARBITRARY. In this model, among all the processors which at the same step attempt writing into the same shared memory cell, an arbitrary one succeeds.
(4) PRIORITY. In this model, among all processors which at the same step attempt writing into the same shared memory cell, the one with minimum index succeeds.
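The four write-conflict rules listed above can be made concrete with a small sketch. The function name `possible_results` and the representation of write attempts as `(processor_index, value)` pairs are assumptions made for illustration only; they do not come from the paper. The function returns the set of values that may end up in one shared memory cell after a single write phase.

```python
def possible_results(model, attempts):
    """Values that may be stored in one shared cell after a write phase.

    attempts: list of (processor_index, value) pairs aimed at the same cell.
    An empty set means nobody wrote, so the cell keeps its old contents.
    """
    if not attempts:
        return set()
    values = {value for _, value in attempts}
    if model == "CREW":
        # Concurrent writes to one cell must never occur.
        if len(attempts) > 1:
            raise ValueError("CREW forbids concurrent writes to one cell")
        return values
    if model == "COMMON":
        # All concurrent writers must write the same value.
        if len(values) > 1:
            raise ValueError("COMMON requires all writers to agree")
        return values
    if model == "ARBITRARY":
        # Any one of the writers may succeed.
        return values
    if model == "PRIORITY":
        # The writer with the minimum index succeeds.
        winner = min(attempts, key=lambda attempt: attempt[0])
        return {winner[1]}
    raise ValueError("unknown model: " + model)
```

An algorithm is correct on ARBITRARY only if it works for every element of the returned set, which is why that case returns all attempted values rather than choosing one.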
If the PRAM computes a function, the depth of the PRAM is the number of parallel steps used to compute the function.
All the above models are widely used for implementing parallel algorithms. For example, (Hirschberg et al., 1979) use CREW, (Shiloach and Vishkin, 1981; Galil, 1984) use COMMON, (Shiloach and Vishkin, 1982) use ARBITRARY, and (Awerbuch and Shiloach, 1983) use ARBITRARY and PRIORITY.
As mentioned in (Vishkin and Wigderson, 1985), the case in which the number of shared memory cells is 1 is of particular importance. (Vishkin and Wigderson, 1985) point out that the "Ethernet" can be considered as a PRAM with only one shared memory cell. They also mention that (Gottlieb et al., 1983; Kuck, 1977; Vishkin, 1980) imply that minimizing the size of shared memory may amount to hardware feasibility of the parallel machine.
Having this in mind, we continue, in the line of research of (Cook and Dwork, 1982; Vishkin and Wigderson, 1985; Fich et al., 1984; Fich et al., 1985), to prove lower bounds on depth and separation results for models with one shared memory cell or a constant number of shared memory cells. For X any of the above models, let X(m) be the model with m shared memory cells. While (Cook and Dwork, 1982; Vishkin and Wigderson, 1985) only consider models with ROM, (Fich et al., 1984; Fich et al., 1985) assume that the inputs are not given in a ROM. Rather, they assume that processor P(i) can only read input X(i) for i = 1, 2, ..., n. For models without ROM, (Fich et al., 1984) give tight separation results between COMMON(m) without ROM and ARBITRARY(1) without ROM, and between ARBITRARY(m) without ROM and PRIORITY(1) without ROM. (Fich et al., 1985) give a tight separation between COMMON(m) without ROM and PRIORITY(m) without ROM for $m < n^{\varepsilon}$, $\varepsilon < 1$. However, the situation for models with ROM turns out to be much more complicated. (Note that in the above previous results m = o(n), hence the shared memory is not large enough to save even a constant fraction of the input.) It can be easily shown that the models with ROM are more powerful than models without ROM, if the amount of shared memory is o(n). Hence we introduce a new technique which for the first time enables us to separate COMMON(m) with ROM from ARBITRARY(1) (even without ROM), and we generalize a technique of (Fich et al., 1984) to separate ARBITRARY(m) with ROM from PRIORITY(1) (even without ROM). Both separations are tight. We also note that, for instance, some functions which require log n depth on COMMON(1) without ROM can be computed in constant depth on COMMON(1) with ROM. As pointed out in (Vishkin and Wigderson, 1985), the introduction of ROM in models with o(n) shared memory is similar to a read-only input tape in off-line o(n) space bounded Turing machines.
A major motivation to study PRAMs with ROM is to distinguish between communication and information sharing.
We then proceed to investigate nondeterministic PRAMs. (Vishkin and Wigderson, 1985) ask about lower bounds for nondeterministic PRAMs and conjecture that the $\Omega(\sqrt{n})$ lower bound on the depth of a PRIORITY(1) with ROM which computes the PARITY function holds also for nondeterministic PRIORITY(1) with ROM. We show an $O(n^{1/3})$ upper bound for PARITY on this model. We then prove a tight lower bound on the depth for computing PARITY on nondeterministic PRIORITY(1) without ROM, and also prove a lower bound on depth for PARITY on PRIORITY(1) with ROM. Again, the case with ROM is much more difficult. We summarize our results in Tables I and II.
2. PRELIMINARIES AND DEFINITIONS
We start by defining deterministic models of synchronous parallel computers with shared memory. (See also Cook and Dwork, 1982; Vishkin and Wigderson, 1985; Fich et al., 1985.) The models consist of a collection of processors which can read from and write into shared memory. The only communication among the processors is through the shared memory. The various models differ in the way they resolve write conflicts, in which more than one processor attempts writing into the same shared memory cell simultaneously. By $\mathbb{N}$ we denote the set of positive integers {1, 2, ...}.
DEFINITION 2.1. A PRIORITY PRAM (PRIORITY, in short) consists of a set P = {P(1), P(2), ...} of processors, a number n of inputs, n read-only input cells (ROM) X(1), X(2), ..., X(n), a set C(1), C(2), ... of shared memory cells, an alphabet $\Sigma$ (usually infinite), and a depth T. Each processor P(i) has a set of states $Q_i$ and functions:
$XREAD_i: Q_i \to \{1, 2, ..., n\}$, the location of the next ROM cell to be read by P(i);
$CREAD_i: Q_i \to \mathbb{N}$, the location of the next shared memory cell to be read by P(i);
$WRITE_i: Q_i \to \mathbb{N} \times \Sigma \cup \{NOWRITE\}$, the location of the next shared memory cell to be written into by P(i) and the symbol to be written;
$NEXTSTATE_i: Q_i \times \Sigma \times \Sigma \to Q_i$, the next state of P(i).
The model operates in steps.
Step t for t = 1, 2, ..., T occurs in time period t. At time t = 0 the input cell X(i) (i = 1, 2, ..., n) contains the i-th input $x_i$, all the shared memory cells contain a distinguished symbol of $\Sigma$, and every processor P(i) is in a distinguished state $q_{i,0}$ in $Q_i$, which is called an initial state.
At step t each processor P(i) is in state $q_{i,t}$ in $Q_i$ and each shared memory cell C(j) contains a symbol $S_{j,t}$ in $\Sigma$.
The $q_{i,t}$, $S_{j,t}$ for t = 1, 2, ..., T are determined from the $q_{i,t-1}$, $S_{j,t-1}$ as follows:
(1) For all i, P(i) reads from input cell $X(XREAD_i(q_{i,t-1}))$.
(2) For all i, P(i) reads from shared memory cell $C(CREAD_i(q_{i,t-1}))$.
(3) For each $j \in \mathbb{N}$ determine the set $WR_{j,t}$ of indices of the processors which attempt writing into shared memory cell C(j) at time t. Formally, let $WR_{j,t} = \{i \mid WRITE_i(q_{i,t-1})$ is of the form $(j, \beta)$ for some $\beta$ in $\Sigma\}$. Then for each C(j) such that $WR_{j,t}$ is nonempty find the minimum index in $WR_{j,t}$. This is going to be the index of the processor which will actually write into C(j) at step t. Formally, for all j such that $WR_{j,t}$ is nonempty, let $W_{j,t} = \min(WR_{j,t})$ and let $WRITE_i(q_{i,t-1}) = (j, a_j)$, where $i = W_{j,t}$. Then $S_{j,t} = a_j$. For all j such that $WR_{j,t} = \emptyset$, $S_{j,t} = S_{j,t-1}$.
(4) For all i, processor P(i) changes state according to the values P(i) has read from the ROM and from the shared memory. Formally, $q_{i,t} = NEXTSTATE_i(q_{i,t-1}, x_u, S_{v,t-1})$, where $u = XREAD_i(q_{i,t-1})$, $v = CREAD_i(q_{i,t-1})$.
In the above definition, whenever i is in $WR_{j,t}$ we say that processor P(i) attempts writing into shared memory cell C(j) at time t. By (3) above, for each t and j, among all the processors which attempt writing into C(j) at step t, the one with the minimum index (which we call $W_{j,t}$) succeeds. There are other models which differ in the way the processor which succeeds in writing is determined:
ARBITRARY. An arbitrary processor among those processors which attempt writing into the same memory cell in the same step is selected.
COMMON. This model is restricted in such a way that all the processors which at the same step attempt writing into the same shared memory cell attempt to write the same value, and this value will be written into that memory cell.
CREW (concurrent read exclusive write). This model is restricted in such a way that for all t and j at most one processor attempts writing into C(j) at step t. If there is one such processor it will write.
For t = 0, 1, ..., let $H_t = \{S_{j,t} \mid j = 1, 2, ...\}$ be the vector of shared memory contents at time t, where $S_{j,t}$ is the contents of C(j) at time t. The history through step T of the computation of the parallel model M on input $(x_1, x_2, ..., x_n)$ is the vector $H_0, H_1, ..., H_T$ which results by letting $x_i$ be the contents of X(i) ($1 \le i \le n$) and determining $q_{i,t}$, $S_{j,t}$ (t = 0, 1, ..., T) as above.
Less powerful variants of the above models were studied in (Fich et al., 1984; Fich et al., 1985). For M any of the above models, M without ROM will denote the corresponding model for which the number of processors is equal to the number n of inputs, and such that rather than reading the input from a ROM, the cell X(i) belongs to P(i), which knows its contents $x_i$. But P(i) cannot read from X(j) for $j \ne i$.
For proving upper bounds, we sometimes only describe an algorithm for the model without ROM, since the corresponding model with ROM can simulate a depth T model without ROM in depth T + 1, using n processors and the same number of memory cells. At the first step simply let P(i) read $x_i$ from X(i).
Let M be any of the above models, with or without ROM. Let D be any domain, and let f be any function defined on $D^n$. We say that M computes f in depth T if on every input $\bar{x}$ in $D^n$, M will have $f(\bar{x})$ written in C(1) at step T. Formally, $S_{1,T} = f(\bar{x})$. It is important to notice that for the ARBITRARY model we require that $f(\bar{x})$ will be written in C(1) at step T, no matter which processors succeeded in writing at each step.
As usual, a language L, $L \subseteq D^n$, is recognized by M if M computes its characteristic function.
Another important consideration is the communication width (width, for short). For M any of the above models, M(m) will denote model M restricted to having only m shared memory cells C(1), C(2), ..., C(m).
We now define nondeterministic models of parallel computation. Using M for any of the previously defined models, NM will denote its nondeterministic variant. The nondeterministic variant of M is a generalization of M, obtained by fixing a constant $d \ge 2$ (called the branching factor of M) and such that the next state function for each processor is a function whose range is all d-tuples of states. Formally, using the notation of Section 2, $NEWSTATE_i$ is a function from $Q_i \times \Sigma \times \Sigma$ into $Q_i^d$. Also $q_{i,t}$ may be any element of $NEWSTATE_i(q_{i,t-1}, x_u, S_{v,t-1})$. If $NEWSTATE_i(q_{i,t-1}, x_u, S_{v,t-1}) = (e_1, e_2, ..., e_d)$ and $q_{i,t} = e_j$ ($1 \le j \le d$), we say that at step t, P(i) chose branch number j. We also use the word computation to denote a particular action of M on some input, by having P(i) choose some branch number at step t, for i = 1, 2, ..., t = 1, 2, ....
We also define a subset $\Sigma_A$ of the symbols in $\Sigma$ to be the set of accepting symbols. If M has depth T, a computation of M is called accepting if at step T, C(1) contains an accepting symbol. M of depth T is said to compute a function f if all symbols in $\Sigma_A$ are of the form (A, y), where A is a special character, and, for every input $\bar{x}$,
(i) there is an accepting computation of M on $\bar{x}$ for which at step T, C(1) contains $(A, f(\bar{x}))$;
(ii) every accepting computation of M on $\bar{x}$ has at step T, $(A, f(\bar{x}))$ in C(1).
Now let L be any language, say $L \subseteq D^n$. We say that the nondeterministic model M accepts L in depth T if on any input in L there exists at least one accepting computation of M, and for any input not in L there is no accepting computation.
THEOREM 3.1 (Fich et al., 1985). Let f be a surjective function from $\Sigma^n$ ($\Sigma = \{0, 1\}$) onto R. Then any COMMON(1) without ROM which computes f requires depth at least $\log_3 |R|$.
They consider the function $INDEX(x_1, x_2, ..., x_n)$ which is defined on $\Sigma^n$ ($\Sigma = \{0, 1\}$) as follows: $INDEX(x_1, x_2, ..., x_n) = \max\{j \mid x_i = 0$ for all $1 \le i < j\}$. Since INDEX is surjective onto {1, 2, ..., n + 1}, INDEX requires at least $\log_3 n$ steps on COMMON(1) without ROM, by Theorem 3.1. Since INDEX can be computed in constant depth on PRIORITY(1), the separation result of Fich et al. (1985) follows. We now show that Theorem 3.1 is not true for COMMON(1) with ROM. Consider the function $I_1(x_1, x_2, ..., x_n)$ defined as follows: if there exists i ($1 \le i \le n - 1$) such that $x_i = 1$ and $x_{i+1} = 0$ then $I_1(x_1, x_2, ..., x_n) = n + 1$. Otherwise, $I_1(x_1, x_2, ..., x_n) = INDEX(x_1, x_2, ..., x_n)$.
$I_1$ is surjective onto {1, 2, ..., n + 1}. However, we prove the following
FACT. $I_1$ can easily be computed in constant depth on a COMMON(1) with ROM, using n processors.
Proof. In parallel, P(i) reads X(i) (i = 1, 2, ..., n). Then in parallel, P(i) reads X(i + 1) (i = 1, ..., n - 1). Then any processor P(i) which reads (X(i), X(i + 1)) = (1, 0) writes n + 1 into C(1). Then all the processors read from C(1). If they read the value n + 1 then the computation halts. Otherwise the (unique) processor P(i) which reads (X(i), X(i + 1)) = (0, 1) writes i + 1 into C(1), or P(1) writes 1 if X(1) = 1.
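The procedure in the proof above can be checked mechanically. The following is a minimal sketch, not the authors' formal construction: the function names are illustrative, processors and cells are 0-based internally, and the last rule (P(n) writes n + 1 when it read X(n) = 0) is an assumption added here to cover the all-zero input, which the proof as given does not spell out. The helper `common_write` enforces the COMMON restriction that concurrent writers agree.

```python
from itertools import product

def index(x):
    """INDEX: 1-based position of the first 1, or n + 1 if x is all zeros."""
    return next((i + 1 for i, bit in enumerate(x) if bit == 1), len(x) + 1)

def i1(x):
    """Reference definition of I_1 from the text."""
    n = len(x)
    if any(x[i] == 1 and x[i + 1] == 0 for i in range(n - 1)):
        return n + 1
    return index(x)

def common_write(values):
    """COMMON rule: concurrent writers must agree; None means nobody wrote."""
    distinct = set(values)
    assert len(distinct) <= 1, "COMMON restriction violated"
    return distinct.pop() if distinct else None

def i1_on_common1_with_rom(x):
    n = len(x)
    # Two read steps: P(i) reads X(i), then X(i + 1) when it exists.
    left = list(x)
    right = [x[i + 1] if i + 1 < n else None for i in range(n)]
    # Write step 1: every processor that saw (1, 0) writes n + 1.
    cell = common_write([n + 1 for i in range(n) if (left[i], right[i]) == (1, 0)])
    if cell is not None:
        return cell  # all processors read n + 1 and halt
    # Write step 2: the input now has the form 0...01...1.
    attempts = [i + 2 for i in range(n - 1) if (left[i], right[i]) == (0, 1)]
    if left[0] == 1:
        attempts.append(1)          # P(1) writes 1 if X(1) = 1
    if left[n - 1] == 0:
        attempts.append(n + 1)      # assumption: P(n) covers the all-zero input
    return common_write(attempts)
```

In the second write step at most one processor writes, so the COMMON restriction is satisfied trivially; only the first write step relies on all writers agreeing on the value n + 1.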
Fich et al. (1984) previously proved a tight separation between COMMON(m) without ROM and ARBITRARY(1) which does not rely on Theorem 3.1. However, a ROM makes the situation much more complicated, and the technique of (Fich et al., 1984) apparently does not generalize to the case of having ROM. We have to introduce a new technique to prove the lower bound for COMMON(m) with ROM. The technique is based on an adversary argument which is used to prove a lower bound on COMMON(1) with ROM for the language of the threshold-2 function: $L_2 = \{(x_1, x_2, ..., x_n) \mid x_i \in \{0, 1\}$ ($1 \le i \le n$) and for some i, j, $1 \le j < i \le n$, $x_i = x_j = 1\}$.
It is easy to see that $L_2$ can be recognized on ARBITRARY(1) with n processors, even without ROM, in constant depth. We also note that a COMMON(1) with ROM and $\Omega(n^2)$ processors can recognize $L_2$ in constant depth, by reading all pairs $(x_i, x_j)$. If, however, the number of processors q satisfies $q \le n^a$, 0 < a < 2, we prove a lower bound of $\Omega((2 - a) \log n)$ on the depth of a COMMON(1) with ROM recognizing $L_2$. Hence $L_2$ cannot be recognized in constant depth on a COMMON(1) with ROM and at most $n^a$ processors with a < 2. Hence we get the separation result.
THEOREM 3.2. Any COMMON(1) with a ROM and $q = n^a$ processors (0 < a < 2) which recognizes $L_2$ requires depth $\Omega((2 - a) \log n)$.
Proof. Let M be a COMMON(1) with ROM which recognizes $L_2$. We will define, for t = 0, 1, ..., sets $S_t$, $PA_t$, and $I_t$. $S_t$ and $PA_t$ represent constraints on the input, which are imposed by an adversary at step t. $I_t$ is the set of all inputs which satisfy the constraints represented by $S_t$ and $PA_t$. The constraints are chosen by the adversary in such a way that
(1) All inputs in $I_t$ have the same history through step t.
(2) Furthermore, if t is not large enough, then $I_t$ will have to contain an input in $L_2$ and an input not in $L_2$, which implies that M cannot have depth t.
More specifically, $PA_t$ is a subset of {1, 2, ..., n} × {1, 2, ..., n}, $S_t$ is a subset of {1, 2, ..., n}, and $I_t$ is always defined by $I_t = \{(x_1, x_2, ..., x_n) \mid x_i \in \{0, 1\}$ for i = 1, 2, ..., n, and for some i in $S_t$, $x_i = 1$, and for all l not in $S_t$, $x_l = 0$, and there is no pair (j, k) in $PA_t$ such that $x_j = x_k = 1\}$.
We construct $S_t$ and $PA_t$ by induction on the number of steps t. For t = 1, 2, ... we will always have $S_t \subseteq S_{t-1}$ and $PA_{t-1} \subseteq PA_t$. Hence $I_t \subseteq I_{t-1}$ (t = 1, 2, ...).
THE INDUCTION HYPOTHESIS.
(1) All inputs in $I_{t-1}$ have the same history through step t - 1.
(2) All pairs (j, k) such that $j \ne k$ and some processor on some input $(x_1, ..., x_n) \in I_{t-1}$ reads during the first t - 1 steps from ROM positions X(j) and X(k) such that $x_j = x_k = 1$, are in $PA_{t-1}$.
We define $S_0 = \{1, 2, ..., n\}$ and $PA_0 = \emptyset$.
Assuming that $S_{t-1}$ and $PA_{t-1}$ are defined and satisfy the induction hypothesis, define $PA_t$ and $S_t$ according to the behavior of M on the inputs from $I_{t-1}$ during steps 1 through t. Let $PA_t = PA_{t-1} \cup \{(j, k) \mid j \ne k$ and on some input $(x_1, x_2, ..., x_n)$ in $I_{t-1}$, some processor, during steps 1 through t, reads ROM positions X(j) and X(k), and $x_j = x_k = 1\}$. (2) of the induction hypothesis for t - 1 would imply (2) of the induction hypothesis for t, provided that $I_t \subseteq I_{t-1}$. Clearly always $(j, k) \in PA_t$ if and only if $(k, j) \in PA_t$. Now if processor P(l) on input $(x_1, x_2, ..., x_n)$ in $I_{t-1}$ reads from ROM positions $X(i_1), X(i_2), ..., X(i_{t-1})$ during steps 1 through t - 1, then by definition of $PA_{t-1}$, $I_{t-1}$, and the induction hypothesis, if j, k are distinct indices in $\{i_1, i_2, ..., i_{t-1}\}$ we cannot have $x_j = x_k = 1$. Let, for r = 1, 2, ..., t - 1, $I_{t-1}[l, r] = \{(x_1, x_2, ..., x_n) \mid (x_1, ..., x_n)$ in $I_{t-1}$ and at step r, P(l) on input $(x_1, ..., x_n)$ reads from a ROM cell which contains a 1$\}$.
CLAIM 1. $|PA_t| \le q t (t - 1) < q t^2$ for t = 1, 2, ....
Proof. Since on any input in $I_{t-1}$, P(l) cannot read a value 1 from more than one ROM cell during steps 1 through t - 1 and the history on all inputs in $I_{t-1}$ is the same, we conclude that all inputs in $I_{t-1}[l, r]$ cause P(l) to go through the same sequence of states during steps 1 through t - 1. Hence on all inputs in $I_{t-1}[l, r]$, P(l) will read from the same ROM locations during steps 1 through t. By the above argument, each processor P(l) can contribute at most 2(t - 1) pairs to $PA_t$ which are not in $PA_{t-1}$. Since $|PA_0| = 0$ we get by induction $|PA_t| \le q t (t - 1) < q t^2$ for t = 1, 2, .... ∎
We now would like to define $S_t$ in such a way that all inputs in $I_t$ will have the same history through step t and such that $I_t \subseteq I_{t-1}$. Define $I'_t = \{(x_1, x_2, ..., x_n) \in I_{t-1} \mid$ there is no pair $(j, k) \in PA_t$ such that $x_j = x_k = 1\}$.
Intuitively, every input $(x_1, x_2, ..., x_n)$ in $I_{t-1}$ such that $x_j = x_k = 1$ for some $(j, k) \in PA_t$ is already confirmed to be in $L_2$ by M, and cannot help in our adversary argument. We now consider three cases.
Case 1. On some input $\bar{x} = (x_1, x_2, ..., x_n)$ in $I'_t$, some processor P(l) during steps 1 through t reads only 0's from the ROM, and writes in step t. Let U be the set of ROM positions from which P(l) has read on input $\bar{x}$ during steps 1 through t. Let $S_t = S_{t-1} - U$. Since $|U| \le t$, we have $|S_t| \ge |S_{t-1}| - t$. Also, $S_t \subseteq S_{t-1}$ and $PA_{t-1} \subseteq PA_t$. Hence $I_t \subseteq I'_t \subseteq I_{t-1}$. Now let $\bar{x}' = (x'_1, x'_2, ..., x'_n)$ be any input in $I_t$. Since $I_t \subseteq I_{t-1}$, $\bar{x}$ and $\bar{x}'$ have the same history through step t - 1. Hence by induction on the step number, the set of ROM locations from which P(l) reads on $\bar{x}'$ is also U. Also $x'_i = 0$ for all i in U. Hence on $\bar{x}'$, P(l) writes at step t the same value as on $\bar{x}$. We conclude that all inputs in $I_t$ have the same history through step t.
Case 2. There is a set $U \subseteq S_{t-1}$ such that $|U| \le |S_{t-1}|/2$, and by fixing all input positions in U to 0 no processor will write at step t on any input in $I'_t$ which has value 0 in all these fixed positions. Finally, if we let $S_t = S_{t-1} - U$, no processor will write at step t on any input in $I_t$. Hence all inputs in $I_t$ have the same history through step t. Clearly $I_t \subseteq I_{t-1}$.
Case 3. Neither one of the Cases 1, 2 holds. Consider the subset $S'_t$ of $S_{t-1}$ defined by $S'_t = \{i \mid i$ in $S_{t-1}$ and on some input $(x_1, ..., x_n)$ in $I'_t$ such that $x_i = 1$ some processor reads from ROM position X(i) during one of the first t steps, and writes at step t$\}$.
Suppose that $|S'_t| < |S_{t-1}|/2$. Then, since Case 1 does not hold, Case 2 must hold with $U = S'_t$. We conclude that $|S'_t| \ge |S_{t-1}|/2$. Let $W_t = \{(i, j) \mid$ on some input in $I'_t$ some processor by step t read a 0 from X(i) and 1 from X(j), or read 0 from X(j) and 1 from X(i)$\}$.
CLAIM 2. $|W_t| \le 2 q t^2$.
Proof. Since there are q processors, it is enough to show that each processor contributes at most $2t^2$ pairs to $W_t$. This is true since no processor read a 1 from two distinct ROM locations on any input in $I'_t$. Hence there are at most t possibilities for the vector of locations from which the processor read, and each vector contributes at most 2t pairs to $W_t$. ∎
As a guide in constructing $S_t$, we define i, j in $S'_t$ such that $i \ne j$ as independent if (i, j) and (j, i) are not in $PA_t \cup W_t$. Now we will construct $S_t$ by keeping an index $i_0$ in $S'_t$ which participates in a minimum number of pairs in $PA_t \cup W_t$, and then deleting from $S'_t$ all indices i such that $(i_0, i)$ or $(i, i_0)$ is in $PA_t \cup W_t$. Formally, $S_t = S'_t - \{i \mid (i_0, i) \in PA_t \cup W_t$ or $(i, i_0) \in PA_t \cup W_t\}$.
The selection of $i_0$ is done as follows: For $i \in S'_t$, let w(i) be the number of pairs of $PA_t \cup W_t$ in which i is a member. Now, consider $\sum_{i \in S'_t} w(i)$. Each pair in $PA_t \cup W_t$ contributes at most 2 to this sum, and $|PA_t \cup W_t| < 3 q t^2$. Hence $\sum_{i \in S'_t} w(i) < 6 q t^2$. Hence there exists an index $i_0$ in $S'_t$ which is a member of at most $6 q t^2 / |S'_t|$ pairs in $PA_t \cup W_t$. Define $S_t$ using $i_0$ as above. According to our previous notation, $I_t = \{(x_1, ..., x_n) \mid$ for some i in $S_t$, $x_i = 1$, for all j not in $S_t$, $x_j = 0$, and for all (k, l) in $PA_t$, $(x_k, x_l) \ne (1, 1)\}$.
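The averaging step above (pick the index that occurs in the fewest pairs, then discard everything paired with it) can be sketched as follows. The function name `prune_to_independent` and the use of Python sets of ordered pairs are illustrative assumptions; `pairs` stands in for $PA_t \cup W_t$ and `s_prime` for $S'_t$.

```python
def prune_to_independent(s_prime, pairs):
    """Choose i0 of minimum pair count in s_prime and drop every index paired with it.

    s_prime: set of candidate input positions.
    pairs:   set of ordered pairs (a, b) of positions.
    Returns (i0, s_t).
    """
    # w(i): number of pairs in which i is a member.
    count = {i: 0 for i in s_prime}
    for a, b in pairs:
        for member in {a, b}:
            if member in count:
                count[member] += 1
    # Averaging: the minimum is at most the mean, which is at most 2|pairs|/|s_prime|.
    i0 = min(sorted(s_prime), key=lambda i: count[i])
    conflicting = {b for a, b in pairs if a == i0} | {a for a, b in pairs if b == i0}
    conflicting.discard(i0)
    s_t = {i for i in s_prime if i not in conflicting}
    return i0, s_t
```

Since every deleted index accounts for at least one pair containing $i_0$, the number of deleted indices is at most $w(i_0)$, which is the quantity the text bounds by $6 q t^2 / |S'_t|$.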
CLAIM 3. All inputs in $I_t$ have the same history through step t.
Proof. Using the induction hypothesis, all inputs in $I_t$ have the same history through step t - 1. By definition of $S_t$ and $S'_t$, there must be an input $\bar{x} = (x_1, ..., x_n)$ in $I_t$ and some processor P(l) such that on input $\bar{x}$, P(l) reads a 1 from $X(i_0)$ during steps 1 through t, and writes at step t. Let $\{j_1, j_2, ..., j_r\}$, $r \le t$, be the distinct ROM locations from which P(l) has read during the first t steps on input $\bar{x}$, such that $j_1 = i_0$.
Then by definition of $i_0$ and $W_t$, $j_2, ..., j_r$ are not in $S_t$. Hence for all $\bar{y} = (y_1, ..., y_n)$ in $I_t$, $y_{j_k} = 0$ (k = 2, ..., r). Hence on all inputs $(y_1, ..., y_n)$ in $I_t$ for which $y_{i_0} = 1$, P(l) reads the same values from the ROM during the first t steps, and hence writes the same value at step t on all of them. Now consider an input $\bar{y} = (y_1, ..., y_n)$ in $I_t$ such that $y_{i_0} = 0$ and $y_{i_1} = 1$ for some $i_1$ in $S_t$. Some processor P(r) on input $\bar{y}$ reads from distinct ROM locations $\{k_1, ..., k_e\}$ during the first t steps, where $k_1 = i_1$, and writes at step t. Since $i_1 \in S_t$, $i_0$ is not in $\{k_1, ..., k_e\}$ by the definitions of $i_0$ and $W_t$. Let $\bar{y}'$ be the input obtained from $\bar{y}$ by changing the value of $y_{i_0}$ to 1. Since $(i_0, i)$ is not in $PA_t$ for all i in $S_t$, $\bar{y}' \in I_t$. On $\bar{y}'$, P(r) will still write at step t the same value as on $\bar{y}$, hence $\bar{y}$, $\bar{y}'$ have the same history through step t. But then, by the previous discussion, all inputs in $I_t$ have the same history through step t. ∎
Proof. In Case 1, $|S_t| \ge |S_{t-1}| - t \ge 7|S_{t-1}|/8 \ge |S_{t-1}|/4$ (since $t \le |S_{t-1}|/8$).
In Case2, /S,l>lS,PJ2. In Case3 (using IS;1 24 [,SP,l), 6qt2 IS,1 2 IX -jq = IW 3 l&II2 P-11 3 P-II22 -32 ls;l q--32 IS,-,I
CLAIM 5. There exists n0 such that for all n >nO and O<t<((2-a)/8)logn, ~S,~~n/4'and~S~~*~4~P~~~.
We will choose $n_0$ such that $(2 - a) \log n < n^{(2-a)/4}$ for all $n > n_0$. We now prove the claim by induction. $|S_0| = n$. Suppose $|S_{t-1}| \ge n/4^{t-1}$. Then
Since $L_2$ can easily be recognized in constant depth on an ARBITRARY(1) with n processors, we have
THEOREM 3.3. $L_2$ separates COMMON(1) with ROM and $n^a$ processors, 0 < a < 2, from ARBITRARY(1) with n processors, with or without ROM.
We also note that for $n^a$ processors with $1 \le a < 2$ the bound is tight up to a multiplicative constant, since COMMON(1) with at least n processors can recognize $L_2$ in depth O(log n) by using binary search (see Section 4) to find the minimum i such that $x_i = 1$, and then having any P(j) such that $j \ne i$ and $x_j = 1$ write into the shared memory cell. In fact, with ROM, a COMMON(1) with n/log n processors is sufficient.
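The text states that $L_2$ is recognizable in constant depth on ARBITRARY(1) without giving the procedure. One plausible two-phase version, in the spirit of the COMMON(1) procedure just described, is sketched below; it is an assumption made for illustration, not the authors' algorithm. The parameter `pick` is a hypothetical stand-in for the arbitrary choice of the winning writer, and the test tries several choices because an ARBITRARY algorithm must be correct whichever writer succeeds.

```python
from itertools import product

def in_l2(x):
    """Membership in L_2: at least two input bits equal 1."""
    return sum(x) >= 2

def recognize_l2(x, pick):
    """Two write phases on a single shared cell C(1), without ROM.

    pick: function selecting the winner among concurrent writers,
          modelling the ARBITRARY rule.
    """
    n = len(x)
    cell = None  # distinguished initial contents of C(1)
    # Phase 1: every P(i) with x_i = 1 attempts to write its own index.
    writers = [i for i in range(n) if x[i] == 1]
    if writers:
        cell = pick(writers)
    winner = cell  # every processor reads C(1)
    # Phase 2: a processor holding a 1 that did not win announces a second 1.
    writers = [j for j in range(n) if x[j] == 1 and j != winner]
    if writers:
        cell = "ACCEPT"  # all phase-2 writers write the same symbol
    return cell == "ACCEPT"
```

Only the first phase needs the ARBITRARY rule; in the second phase all writers agree, which is why a COMMON machine can take over once the index of some 1 is known.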
We now generalize our result to more memory cells.
COROLLARY 3.4. For all $m \ge 1$ and $1 \le a < 2$ there is a tight $\Omega(\log n/\log(m + 1))$ separation between COMMON(m) with ROM and $n^a$ processors, and ARBITRARY(1) with $n^a$ processors.
Proof. The separation is tight since (Fich et al., 1984) show how COMMON(m) with q processors can simulate one step of ARBITRARY(1) with q processors in $O(\log q/\log(m + 1))$ steps. In our case $q = n^a$, hence $\log q = a \log n$. Theorem 3.2 gives the separation for m = 1. The generalization of Theorem 3.2 to m > 1 is done by a similar adversary argument. At step t, we try to prevent any processor from writing into C(i) at step t by fixing a not too large fraction of the unfixed input positions to 0. For each C(i) for which this is impossible, find a set $S'_t(i)$ of unfixed input positions such that having a value 1 in any of them will cause some processor to write into C(i) at step t. Since the previous case did not hold for those C(i), one of those $S'_t(i)$ must have $|S'_t(i) - \bigcup_{j \ne i} S'_t(j)|$ large enough so that we can fix all the positions in $S'_t(j)$ (for all $j \ne i$) to 0 and proceed as in the proof of Theorem 3.2. ∎
4. SEPARATION BETWEEN ARBITRARY(m) WITH ROM AND PRIORITY(1)
In this section we show for the first time that a certain simple function which can be computed in constant depth on PRIORITY(1), even without ROM, requires depth at least T, where $T + \log T + 1 \ge \log n$, on ARBITRARY(1) with ROM, and, in general, $\Omega(\log n/\log(m + 1))$ depth on ARBITRARY(m) with ROM. (Fich et al., 1984) show an $\Omega(\log n)$ tight separation between ARBITRARY(1) without ROM and PRIORITY(1) and, in general, an $\Omega(\log n/\log(m + 1))$ tight separation between ARBITRARY(m) and PRIORITY(1).
We generalize their technique to prove a tight separation between ARBITRARY(m) with ROM and PRIORITY(1) even without ROM. For a given ARBITRARY(1) with ROM, the adversary fixes bits of the input to prevent any computation of a certain depth from being able to compute the function.
The bound we obtain for ARBITRARY(1) with ROM is tight up to an additive constant, and does not depend on the number of processors.
In the following theorem, by "All the inputs in some collection S have the same history through step t" we will mean: There are indices $l_1, l_2, ..., l_t$ such that on all inputs in S there is a possible computation in which processor $l_j$ will actually write in step j ($l_j = 0$ means no processor attempts writing at step j on any input in S) and the history will be the same on all inputs in S.
THEOREM 4.1. If $INDEX(x_1, x_2, ..., x_n)$ is computed in depth T on ARBITRARY(1) (with ROM) then $T + \log T + 1 \ge \log n$.
Proof. We use an adversary argument, which generalizes the technique of (Fich et al., 1984) to the case of having ROM. Let M be an ARBITRARY(1) which computes $INDEX(x_1, x_2, ..., x_n)$. We define subsets $S_0, S_1, ..., S_T$ of $\Sigma^n$ ($\Sigma = \{0, 1\}$) such that for all t, all inputs in $S_t$ have the same history through step t. These subsets are defined by induction. $S_0 = \Sigma^n$ is the set of all possible inputs. We inductively define subsets $K_t$, $J_t$ of {1, 2, ..., n}, where $K_t$ is the set of input positions with value fixed to 0 by step t, and $J_t$ is the set of input positions with value fixed to 1 by step t. We will also define indices $l_1, l_2, ..., l_t$ by induction.
We will let $J_0 = K_0 = \emptyset$ and for t = 0, 1, ..., $S_t = \{(x_1, x_2, ..., x_n) \mid x_j \in \{0, 1\}$ (j = 1, 2, ..., n) and $x_i = 0$ for all i in $K_t$ and $x_i = 1$ for all i in $J_t\}$. $S_t$, $K_t$, $J_t$ are defined inductively from $S_{t-1}$, $K_{t-1}$, $J_{t-1}$. We inductively assume that all inputs in $S_{t-1}$ have the same history through step t - 1, where processor $l_j$ actually writes at step j on all those inputs ($1 \le j \le t - 1$). If $l_j = 0$ then no processor attempts writing.
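Let $A_{t-1} = \{1, 2, ..., n\} - (K_{t-1} \cup J_{t-1})$. Let $L_{t-1}$, $R_{t-1}$ be such that $A_{t-1} = L_{t-1} \cup R_{t-1}$, every index in $L_{t-1}$ is lower than any index in $R_{t-1}$, and $|L_{t-1}| = \lfloor \frac{1}{2}|A_{t-1}| \rfloor$. Intuitively, $L_{t-1}$ is the low half of $A_{t-1}$ and $R_{t-1}$ is the high half of $A_{t-1}$. Several cases arise.
Case 1. At step t no processor writes on any input in $S_{t-1}$. In this case nothing is fixed: $K_t = K_{t-1}$, $J_t = J_{t-1}$, and $l_t = 0$.
If, however, at step t some processor writes on some input in $S_{t-1}$, then to every input $\bar{x} = (x_1, x_2, ..., x_n)$ in $S_{t-1}$ and every processor P(l) which writes on $\bar{x}$ at step t there correspond two sets:
$U(\bar{x}, l) = \{i \mid i \in A_{t-1}$, $x_i = 0$ and during steps 1 through t P(l) has read from X(i)$\}$;
$V(\bar{x}, l) = \{i \mid i \in A_{t-1}$, $x_i = 1$ and during steps 1 through t P(l) has read from X(i)$\}$.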
Clearly $|U(\bar{x}, l)| + |V(\bar{x}, l)| \le t$. We now have
Case 2. For all $\bar{x}$ in $S_{t-1}$ and l as above, $V(\bar{x}, l) \cap L_{t-1}$ is nonempty. Intuitively, this means that each processor, in order to write at step t on input $(x_1, x_2, ..., x_n)$ in $S_{t-1}$, requires that some bit in the lower half of the unfixed positions will have value 1. In this case, the adversary fixes all the bits in this lower half to 0, thus making sure that no processor will write at step t on any input. Formally: $K_t = K_{t-1} \cup L_{t-1}$, $J_t = J_{t-1}$. Clearly all inputs in $S_t$ have the same history through step t. Let $l_t = 0$.
Case 3. There exist $\bar{x}$, l as above such that $V(\bar{x}, l) \cap L_{t-1}$ is empty. Intuitively this means that all bits $x_i$ of $\bar{x}$ which were read by P(l) by step t and are in the lower half of the unfixed positions have value 0. In this case the adversary fixes those bits to 0 and actually fixes all the bits in $U(\bar{x}, l)$ to 0 and all bits in $V(\bar{x}, l)$ to 1. Then the adversary fixes all unfixed positions in $R_{t-1}$ to 0. Formally, $K_t = K_{t-1} \cup U(\bar{x}, l) \cup (R_{t-1} - V(\bar{x}, l))$, $J_t = J_{t-1} \cup V(\bar{x}, l)$. Let $l_t = l$.
Since $S_t \subseteq S_{t-1}$, by induction all inputs in $S_t$ have the same history through step t - 1, P(l) will read from the same ROM positions on all inputs in $S_t$, and will read the same values. Hence all inputs in $S_t$ have the same history through step t.
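We now count the number of positions which are unfixed by step t. This number is $|A_t|$, which we denote by $u_t$. Now $u_0 = |A_0| = n$. In Case 1 we have $u_t = u_{t-1}$. In Case 2, $u_t \ge (u_{t-1} - 1)/2$.
In Case 3, $u_t \ge \frac{1}{2}(u_{t-1} - 1) - t$. Now we prove by induction that $u_t \ge n/2^t - 2t$. Since $u_0 = n$ this assertion is true for t = 0. Assuming that $u_{t-1} \ge n/2^{t-1} - 2(t - 1)$, in each of the three cases we get $u_t \ge \frac{1}{2}(n/2^{t-1} - 2(t - 1) - 1) - t = n/2^t - 2t + \frac{1}{2} \ge n/2^t - 2t$.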
Now if $t + \log t + 1 < \log n$ then $2^t \cdot 2t < n$, hence $n/2^t > 2t$, hence $n/2^t - 2t > 0$, hence $u_t = |A_t| > 0$. So $A_t$ is nonempty, which means that there is at least one unfixed position. Now let $i_t = \min(A_t)$. By the above construction, for every $j \in J_t$ we have $i_t < j$. In other words, no input position which is lower than some position in $A_t$ is ever fixed to 1.
Now, INDEX can clearly be computed in constant depth on PRIORITY(1), even without ROM. Hence we get a separation result. Next we show that the lower bound obtained for INDEX is tight, up to an additive constant.
THEOREM 4.2. Whenever $T + \log T > \log n$, $INDEX(x_1, x_2, ..., x_n)$ can be computed on COMMON(1) with ROM, and hence by an ARBITRARY(1) with ROM, in depth T + 2, using n processors. For example T can be $\log n - \log \log n + 1$.
Proof. The COMMON(1) will perform a binary search for $\min\{i \mid x_i = 1\}$ with a speed-up using the ability of each processor to read any ROM cell $X(j)$ in one step. For the sake of simplicity, we demonstrate the idea for the case in which $n = 2^{2^k + j}$, $j < k$, $T = 2^k$. In this case $T + \log T = 2^k + k > 2^k + j = \log n$. The proof of the general case is an obvious generalization. The COMMON(1) with ROM which computes INDEX$(x_1, x_2, \ldots, x_n)$ in at most $T + 2$ steps uses the following idea.
Assume that $X(1), X(2), \ldots, X(n)$ and $P(1), P(2), \ldots, P(n)$ are ordered from left to right.
The following obvious PARALLEL BINARY SEARCH algorithm can, for $n = 2^u$, find in $l + 1$ steps an interval $J_h$ [...] If we had a COMMON(1) without a ROM, it looks like we would have to use $l = u$, and the binary search would take $u + 1 = \log n + 1$ steps. Fich et al. (1984) have obtained for COMMON(1) without a ROM a lower bound of $\log_2 n$. We now show how the ROM can be used to compute INDEX in $2^k + 1 = T + 1$ steps. We choose $l = T = 2^k$, while $u = \log n = 2^k + j$. The idea is that in parallel to the execution of PARALLEL BINARY SEARCH, for $h = 1, 2, \ldots, 2^l$, the processors $P((h-1)2^{u-l} + 1)$ can all, in $2^{u-l} = 2^{2^k + j - 2^k} = 2^j < 2^k = T$ steps, read in parallel $X(i)$ for $i = (h-1)2^{u-l} + 1, (h-1)2^{u-l} + 2, \ldots, h2^{u-l}$ in that order, thus determining $\min\{i \mid i \in J_h \text{ and } x_i = 1\}$. Now at step $T + 1$ an integer $h_0$ is determined by the binary search, such that INDEX$(x_1, \ldots, x_n)$ is in $J_{h_0}$ unless INDEX$(x_1, \ldots, x_n) = n + 1$. At this point $P((h_0 - 1)2^{u-l} + 1)$ knows INDEX$(x_1, x_2, \ldots, x_n)$ and writes it into $C(1)$. ∎
Theorem 4.2 implies that the lower bound of Theorem 4.1 is tight up to an additive constant. Also note that for $n = 2^{2^k + j}$, $j < k$, a straightforward binary search approach with $l = \log n = 2^k + j$ will take $2^k + j + 1$ steps, while our algorithm will use at most $2^k + 2$ steps.
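The two concurrent phases of this proof (block leaders scanning the ROM, and a binary search over blocks through the single shared cell) can be illustrated with a sequential simulation. This is only a sketch under assumptions: the function names are hypothetical, and the rule used in the search (every processor holding a 1 in the left half of the current range writes the same mark) is an assumed reading of PARALLEL BINARY SEARCH, not the paper's exact formulation.

```python
def index_reference(x):
    """INDEX(x): smallest 1-based i with x_i = 1, or n + 1 if there is none."""
    for i, bit in enumerate(x, start=1):
        if bit == 1:
            return i
    return len(x) + 1


def index_with_rom(x, l):
    """Sequentially simulate the two phases for n = 2^u inputs split into 2^l blocks.

    Returns (answer, search_rounds, scan_length).
    """
    n = len(x)
    u = n.bit_length() - 1
    assert n == 2 ** u and 0 <= l <= u
    block = 2 ** (u - l)  # ROM cells scanned by each block leader

    # Phase A: the leader of block h reads its block from left to right.
    local_min = []
    for h in range(2 ** l):
        hits = [i + 1 for i in range(h * block, (h + 1) * block) if x[i] == 1]
        local_min.append(hits[0] if hits else None)

    # Phase B: binary search over blocks using one shared cell (assumed rule).
    lo, hi, rounds = 0, 2 ** l, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        cell_written = any(x[i] == 1 for i in range(lo * block, mid * block))
        lo, hi = (lo, mid) if cell_written else (mid, hi)
        rounds += 1

    answer = local_min[lo] if local_min[lo] is not None else n + 1
    return answer, rounds, block
```

For $k = 2$, $j = 1$ (so $n = 32$ and $l = T = 4$), the simulation uses 4 search rounds while each leader scans only $2^j = 2$ cells, which is the saving the proof exploits.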
We now generalize our separation to $m$ shared memory cells.
COROLLARY 4.3. For all $m \ge 1$ there is an $\Omega(\log n/\log(m + 1))$ depth separation between ARBITRARY(m) with ROM and PRIORITY(1), and this separation is tight.
Proof. Theorem 4.1 shows the separation for $m = 1$. The case $m > 1$ is a generalization of Theorem 4.1. The main idea is to divide the unfixed input positions into at most $m + 1$ (rather than 2) consecutive blocks. We omit the details. The fact that the separation is tight follows from a $\log n/\log(m + 1)$ simulation of PRIORITY(1) [...]
[...] (with ROM) and conjectured an $\Omega(\sqrt{n})$ lower bound. In this section, we introduce new techniques which enable us to establish lower bounds on the depth of NPRIORITY(1) without ROM, and NPRIORITY(1) with ROM, which compute PARITY. We also disprove the conjectured $\Omega(\sqrt{n})$ lower bound of op. cit. by presenting an $O(n^{1/3})$ upper bound for computing PARITY on NPRIORITY(1) with ROM. We start by considering NPRIORITY(1) without ROM.
[...] PARITY by an NPRIORITY(1) without ROM, where $n$ is the number of processors and inputs, and $d > 1$ is the branching factor.
Proof. We actually prove that the above depth is required even for accepting the language of even-parity strings. For each input of even parity, fix one accepting computation.
We will partition the inputs according to different accepting computations which are chosen for each input.
Step $t$ of a computation is characterized by the triple $(i_t, \mathit{input}, b)$, where $i_t$ is the index of the processor that succeeds in writing at step $t$ (w.l.o.g., $i_t = 1$ if no processor writes in step $t$), $b$ is a vector $(d_1, d_2, \ldots, d_t)$ meaning that processor $P(i_t)$ at step $k$ ($k = 1, 2, \ldots, t$) used branch $d_k$, and $\mathit{input}$ is the input bit to $P(i_t)$. In this proof, by computation we will mean a sequence of triples as above.
The number of such triples at level $k$ is $2nd^k$, where $n$ is the number of all possible choices of $i_k$; 2 is for the two possible binary inputs; $d^k$ is the number of different sequences of branching numbers which may be used by $P(i_k)$ up to level $k$.
CLAIM. At level $k$, there is a group $G_k$ of at least $2^n/(2n^k 2^k d^{k(k+1)/2})$ inputs of even parity having the same accepting computation up to step $k$. That is, the first $k$ triples of the accepting computations fixed for all the inputs in $G_k$ are the same.
Proof (of claim). There are $2^n/2$ inputs of even parity. The number of distinct computations up to step $k$ is at most $\prod_{j=1}^{k} 2nd^j = n^k 2^k d^{k(k+1)/2}$. The claim follows by picking the largest set of inputs corresponding to one computation. ∎
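The counting in this claim is easy to check numerically. The sketch below (helper names are illustrative, not from the paper) verifies the closed form of the product and evaluates the pigeonhole bound on the group size, including the last level at which the bound still guarantees two inputs sharing a computation.

```python
from fractions import Fraction


def num_computations(n, d, k):
    """Product over j = 1..k of the 2*n*d^j possible triples at level j."""
    total = 1
    for j in range(1, k + 1):
        total *= 2 * n * d ** j
    return total


def closed_form(n, d, k):
    """n^k * 2^k * d^(k(k+1)/2), the closed form used in the claim."""
    return n ** k * 2 ** k * d ** (k * (k + 1) // 2)


def group_size_bound(n, d, k):
    """Pigeonhole bound: 2^(n-1) even-parity inputs spread over all computations."""
    return Fraction(2 ** (n - 1), num_computations(n, d, k))


def last_level_with_two_inputs(n, d):
    """Largest k for which the bound still guarantees at least two inputs in G_k."""
    k = 0
    while group_size_bound(n, d, k + 1) >= 2:
        k += 1
    return k
```

For example, with $n = 16$ and $d = 2$ the bound gives at least 512 inputs at level 1 and at least 4 at level 2, and drops below 2 at level 3.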
Now we use this claim to prove Theorem 5.1. Consider the accepting computation defined in the claim. Assume that the computation depth is $t$. If $|G_t| \ge 2$, then at least two different inputs of even parity have the same accepting computation. Fix two such inputs $Z_1$, $Z_2$. $Z_1$ differs from $Z_2$ in at least one position. Choose such a position $p$ in $Z_1$. Change its value to the opposite value ($Z_2$'s corresponding value). This will not change the history for the following reasons:
(a) This bit is not one of the inputs for processors $i_1, i_2, \ldots, i_t$, since those are already fixed to be the same for $Z_1$ and $Z_2$.
(b) The processor P(p) will not influence the history since for both Z1 and Z2 P(p) did not do so.
(c) Other processors (except for P(p)) will not be influenced since the history of the computation is the same and their input bits are not changed.
Therefore, we conclude that [...] the NPRIORITY(1) guesses $y_j = (y_{j1}, y_{j2}, \ldots, y_{jk})$ ($j = 1, 2, \ldots, k$), and computes $Z_j = \sum_{i=1}^{k} y_{ji} \pmod 2$ as it guesses the $y_{ji}$'s. Then, sequentially, for $j = 1, 2, \ldots, k$, $P(jk)$ writes its guess into the shared-memory cell, and in parallel for $l = 1, 2, \ldots, k$, $P((j-1)k + l)$ writes a special symbol $R$ into the shared-memory cell if $x_{(j-1)k+l} \ne y_{jl}$. Every processor reads from the shared memory. If any processor reads $R$, the computation halts. After these $2k$ parallel steps, if no processor writes $R$ into the shared memory, it is confirmed that $(x_1, x_2, \ldots, x_n) = (y_{11}, y_{12}, \ldots, y_{1k}, y_{21}, \ldots, y_{2k}, \ldots, y_{kk})$. Hence $\sum_{i=1}^{n} x_i = \sum_{j=1}^{k} Z_j \pmod 2$. Now $P(k)$ writes $Z_1$ into the shared memory. Then for $j = 2, \ldots, k$, in this order, $P(jk)$ reads the contents of the shared memory, adds $Z_j$ to it (mod 2) and writes the result into the shared memory. After these $k$ steps, $Z = \sum_{j=1}^{k} Z_j \pmod 2$ is written into the shared memory. $P(1)$ reads $Z$ and writes $(A, Z)$ into the shared memory. ∎
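The guess-and-verify scheme just described can be sketched as a simulation of a single nondeterministic branch for $n = k^2$ inputs. The function name and the way the guess is passed in as an argument are assumptions made for illustration; the step counter follows the $2k$ verification steps and $k$ accumulation steps stated above.

```python
import math


def guess_and_verify_parity(x, guess):
    """Simulate one nondeterministic branch on n = k*k inputs.

    guess[j] is the k-bit vector guessed for block j.
    Returns (parity, steps) if the branch accepts, or (None, steps) if it halts.
    """
    n = len(x)
    k = math.isqrt(n)
    assert k * k == n and len(guess) == k

    z = [sum(y) % 2 for y in guess]  # Z_j, computed locally while guessing
    steps = 0
    for j in range(k):
        steps += 1  # the leader of block j writes its guess
        steps += 1  # processors whose input disagrees write the symbol R
        block = x[j * k:(j + 1) * k]
        if any(block[l] != guess[j][l] for l in range(k)):
            return None, steps  # some processor read R: this branch halts

    total = z[0]
    steps += 1  # Z_1 is written into the shared cell
    for j in range(1, k):
        total = (total + z[j]) % 2  # read the cell, add Z_j, write back
        steps += 1
    return total, steps
```

A branch whose guess equals the input runs for $3k$ counted steps and ends with the correct parity in the shared cell; any other guess is rejected during verification.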
We now show that PARITY can be computed on NCOMMON(1) with ROM in $O(n^{1/3})$ depth, thus disproving a conjecture of Vishkin and Wigderson (1985).
THEOREM 5.3. PARITY can be computed in depth $O(n^{1/3}/\log d)$ on NCOMMON(1) with ROM, using $n^{2/3}$ processors and branching factor $d$.
Proof. For simplicity of exposition we assume $n = u^3$ and $d = 2$. Partition the ROM positions into $u^2$ equal distinct groups of $u$ bits each. In $u = n^{1/3}$ parallel steps each of the $u^2$ processors reads the bits in one of those groups and computes their sum modulo 2. Different processors read different groups. Let $P(j)$ have sum $y_j$ ($j = 1, 2, \ldots, u^2$). The output should be $\sum_{j=1}^{u^2} y_j \pmod 2$. Using the algorithm of Theorem 5.2, this sum can be computed nondeterministically in $\sqrt{u^2} = u = n^{1/3}$ parallel steps. ∎
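The reduction in this proof, from $n = u^3$ ROM bits to $u^2$ group parities, can be sketched as follows. The function names are illustrative; the final sum is computed directly here, whereas the proof obtains it nondeterministically with the algorithm of Theorem 5.2.

```python
def group_parities(rom, u):
    """Split u^3 ROM bits into u^2 groups of u bits and return each group's parity."""
    assert len(rom) == u ** 3
    return [sum(rom[g * u:(g + 1) * u]) % 2 for g in range(u * u)]


def parity_via_groups(rom, u):
    """Parity of the whole input, recovered from the u^2 group parities."""
    return sum(group_parities(rom, u)) % 2
```

Each processor's scan of its own group corresponds to one entry of `group_parities`, which is why the first phase costs only $u$ ROM reads per processor.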
In the following proof, a labeled hypergraph $(V, E)$ is given by a set $V$ of vertices and a set $E$ of edges, where every edge is of the form $e = \langle (v_1, v_2, \ldots, v_a), w \rangle$ for $v_i$ ($i = 1, 2, \ldots, a$) being distinct members of $V$ ($a$ is called the size of $e$) and $w$ is an (arbitrary) label. The degree of a vertex is the number of distinct edges on which it is incident. See (Berge, 1973) for standard terminology for hypergraphs.
THEOREM 5.4. It requires $\Omega(\log\log n/\log d)$ depth to compute the PARITY function of $n$ bits by an NPRIORITY(1) with ROM and with branching factor $d$ and $n$ processors.
Proof. Again, we prove that the lower bound holds even if the NPRIORITY(1) only accepts the language of even-parity strings. The output is presented in $C(1)$, the only shared memory cell, when $M$ stops. The input is presented in the ROM. For each input $X = (x_1, x_2, \ldots, x_n)$ of even parity, we fix one accepting computation $\mathrm{COMP}_X$. A computation $\mathrm{COMP}_X$ is characterized as triples $\mathit{triple}_k = (i_k, (y_1, y_2, \ldots, y_k), (d_1, d_2, \ldots, d_k))$ for $k = 1, 2, \ldots, T$, which have the following meaning. At the $k$th step of the computation $\mathrm{COMP}_X$, $\mathit{triple}_k$ is used. Processor $P(i_k)$ read input bit $y_j$ and took branch $d_j$ (both from $\mathit{triple}_k$) at step $j$ for $1 \le j \le k$, and at the $k$th step either $i_k = 1$ and no processor writes, or $P(i_k)$ succeeds in writing. This definition of a computation is similar to that given in Theorem 5.1. Notice that for a fixed computation on $X$, $\mathrm{COMP}_X$ uniquely determines the history of this computation and the addresses accessed by $P(i_k)$ up to step $k$ for $k = 1, 2, \ldots, T$. In the following, for each input $X$ of even parity, we consider only a fixed accepting computation $\mathrm{COMP}_X$. Similar to the claim in Theorem 5.1, we have the following claim.
CLAIM 1. At step $k$, there is a group $G_k$ of at least $2^n/f(n, d, k)$, where $f(n, d, k) = 2n^k \prod_{j=1}^{k} (2d)^j$, inputs of even parity having the same accepting computation COMP up to step $k$. That is, the first $k$ triples of the accepting computations fixed for the inputs in $G_k$ are the same.
Proof. The proof is similar to the proof of the claim in Theorem 5.1. There is only one difference. In this proof, we have a ROM to store the input, and every processor can look at many bits of the input. This is why we have a vector $(y_1, y_2, \ldots, y_k)$. Up to step $k$, we have $k$ triples. A triple at step $j$, $1 \le j \le k$, can have $n 2^j d^j$ values. There are $2^{n-1}$ even-parity inputs. So there is a group $G_k$ of inputs that have the same COMP up to step $k$ of size at least
$$\frac{2^{n-1}}{\prod_{j=1}^{k} n 2^j d^j} = \frac{2^n}{f(n, d, k)},$$
where $f(n, d, k) = 2n^k \prod_{j=1}^{k} (2d)^j$. ∎
Up to here the proof has been similar to Theorem 5.1. A naive approach would be to try to claim that if two inputs of even parity have the same computation, then, as in Theorem 5.1, we can change the parity of one input (by changing a bit) without changing the computation. But observe that we cannot change a bit as freely as before, because the input bits are in the ROM, which can be read by many processors. This is why this proof turns out to be much more complicated than the proof of Theorem 5.1.
W.l.o.g., we assume that all inputs take precisely $T$ steps. Now fix the COMP of depth $T$ with the maximum $G_T$ as in Claim 1. So all inputs in $G_T$ have computation COMP. By fixing COMP we also fixed the history $H_1, H_2, \ldots, H_T$, where $H_j = (l, v)$. We construct a labeled hypergraph $HG = (V, E)$ as follows:
(1) $V = \{p_i \mid p_i$ is a position in ROM holding an input bit$\}$. $|V| = n$.
(2) $E = \{\langle (q_1, q_2, \ldots, q_a), (w_1, w_2, \ldots, w_a) \rangle$, where $a = (2d)^j$ $\mid$ there is a processor $P(l)$ such that with the fixed history $H_1, H_2, \ldots, H_{j-1}$, $(q_1, \ldots, q_a)$ is the collection of all ROM positions from which $P(l)$ can read up to step $j$ on any input using any branch numbers through step $j$, and such that if position $q_i$ contains $w_i$ ($1 \le i \le a$), $P(l)$ has no choice but to change COMP by succeeding in writing a value different from $H_j$ at step $j\}$.
The purpose of each hyperedge is to state that certain inputs with "bad" bit combinations in some locations would force some processor to change COMP. The size of the edges at step $j$ is at most $(2d)^j$ because at most $(2d)^j$ positions in ROM can possibly be reached by one processor. W.l.o.g., we assumed that $a = (2d)^j$. [...] We have $|V_{\mathrm{free}}| \ge n/2$. W.l.o.g., we assume $|V_{\mathrm{free}}| = n/2$. Let $V_{\mathrm{fix}} = V - V_{\mathrm{free}}$. So $|V_{\mathrm{fix}}| = n/2$. Now we assign values to the locations in $V_{\mathrm{fix}}$ in such a way that they are consistent with as many members of $G_T$ as possible. Among the $2^{n/2}$ possible assignments, we choose the assignment, $F$, that is consistent with a maximum number of members in $G_T$. By Claim 1, there are at least $2^n/f(n, d, T)$ inputs in $G_T$. Hence there is an assignment $F: V_{\mathrm{fix}} \to \{0, 1\}$ of values to the positions in $V_{\mathrm{fix}}$ such that there are at least $2^{n/2}/f(n, d, T)$ even-parity inputs in $G_F$, where $G_F$ is the collection of all even-parity inputs in $G_T$ which are consistent with $F$. Also let $Z_F$ be the set of all inputs of even parity which are consistent with $F$.
We now localize the edges and restrict $HG$ to have the vertex set $V_{\mathrm{free}}$. Define $HG|_{V_{\mathrm{free}}} = (V_{\mathrm{free}}, E|_F)$, where $E|_F$ is defined by the rules:
(a) If $(\bar{e}, \bar{w}) \in E$ and no position fixed by $F$ (i.e., in $V_{\mathrm{fix}}$) belongs to $\bar{e}$, then $(\bar{e}, \bar{w}) \in E|_F$.
(b) For $(\bar{e}, \bar{w}) = ((e_1, \ldots, e_a), (w_1, \ldots, w_a))$, if the locations $e_{i_1}, \ldots, e_{i_c}$ are fixed by $F$, then (i) if $F$ fixes $e_{i_j}$ to $w_{i_j}$ for $j = 1, 2, \ldots, c$, then we create a new hyperedge $(\bar{e}', \bar{w}')$ by simply deleting the components $e_{i_1}, \ldots, e_{i_c}, w_{i_1}, \ldots, w_{i_c}$ from $(\bar{e}, \bar{w})$ and put $(\bar{e}', \bar{w}')$ into $E|_F$;
(ii) otherwise, do nothing. That is, throw away the hyperedge $(\bar{e}, \bar{w})$.
(c) Only the hyperedges constructed in (a) and (b) are in $E|_F$.
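Rules (a) through (c) amount to a simple restriction procedure on labeled hyperedges. The sketch below is one possible rendering; the representation of an edge as a pair of tuples and of $F$ as a dictionary is an assumption made for illustration.

```python
def restrict_edges(edges, fixed):
    """Restrict labeled hyperedges to the free positions under a partial assignment.

    edges: list of (positions, labels) pairs of equal-length tuples.
    fixed: dict mapping each fixed position to the bit assigned by F.
    """
    restricted = []
    for positions, labels in edges:
        keep_pos, keep_lab = [], []
        consistent = True
        for p, w in zip(positions, labels):
            if p in fixed:
                if fixed[p] != w:
                    consistent = False  # rule (b)(ii): F contradicts the edge
                    break
                # rule (b)(i): F agrees, so this component is deleted
            else:
                keep_pos.append(p)
                keep_lab.append(w)
        if consistent:
            # rule (a) is the special case where nothing was deleted
            restricted.append((tuple(keep_pos), tuple(keep_lab)))
    return restricted
```

An edge whose positions are all fixed consistently with its labels survives here as an empty edge; how the paper treats that corner case is not stated in the text above.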
From the above discussion, we conclude:
CLAIM 3. (i) Each node in $V_{\mathrm{free}}$ has degree at most $\deg(T, d)$ in the new graph $HG|_{V_{\mathrm{free}}}$. (ii) With the assignment $F$, among the total $2^{n/2}$ inputs of even parity that are consistent with $F$, at least $2^{n/2}/f(n, d, T)$ of them cause COMP.
We now bound the number of inputs in $Z_F$ which cannot have a computation COMP (and hence are not in $G_F$). Note that only those inputs that are inconsistent with all the hyperedges may have a computation COMP. Consider $HG|_{V_{\mathrm{free}}}$. If there is a node in $V_{\mathrm{free}}$ that has degree zero, which means that there is no restriction on this position $p$ at all, then this bit can be either 1 or 0 without affecting the COMP, and as in Theorem 5.1, we can reach a contradiction easily. Therefore we assume that every one of the $n/2$ nodes has some hyperedges incident on it. So there are [...] (4) $J \leftarrow J\langle w \rangle$.
This process can be repeated at least $g(n, d, T) = n/(2\deg(T, d)\,\mathrm{len}^2(T, d))$ times: since each edge can be adjacent to at most $\mathrm{len}(T, d)$ nodes and each node has degree no more than $\deg(T, d)$, only $\deg(T, d)\,\mathrm{len}(T, d)$ hyperedges can be deleted each time.
Each time the above process is repeated, a hyperedge disjoint from all the previously chosen hyperedges is chosen. So at least $1/2^{\mathrm{len}(T, d)}$ of the remaining inputs in $J$ cannot have a computation COMP. That is, $J$ is reduced by a factor of $r(T, d) = 1 - 2^{-\mathrm{len}(T, d)}$. Therefore up to step $T$, we have at most $2^{n/2}\, r(T, d)^{g(n, d, T)}$ inputs that may have a computation COMP. But by Claim 3, this must be greater than or equal to $2^{n/2}/(2n^T (2d)^{T^2})$:
$$2^{n/2}\, r(T, d)^{g(n, d, T)} \ge \frac{2^{n/2}}{2n^T (2d)^{T^2}}, \qquad (1)$$
where $g(n, d, T) = n/(2\deg(T, d)\,\mathrm{len}^2(T, d))$.
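The effect of choosing pairwise disjoint hyperedges can be checked by brute force on a small example: each disjoint edge of size $a$ independently rules out a $2^{-a}$ fraction of the assignments to the free bits. The sketch below assumes this reading of the reduction factor and ignores the parity constraint; positions are 0-based and the function names are illustrative.

```python
from itertools import product


def surviving_assignments(m, disjoint_edges):
    """Count assignments of m free bits that avoid the bad pattern of every edge."""
    count = 0
    for bits in product((0, 1), repeat=m):
        avoids_all = all(
            any(bits[p] != w for p, w in zip(pos, lab))
            for pos, lab in disjoint_edges
        )
        if avoids_all:
            count += 1
    return count


def product_formula(m, disjoint_edges):
    """2^m times the product of (1 - 2^-|e|); exact when the edges share no positions."""
    total = 2 ** m
    for pos, _ in disjoint_edges:
        total = total * (2 ** len(pos) - 1) // 2 ** len(pos)
    return total
```

With 5 free bits and two disjoint edges of sizes 2 and 3, both functions give $32 \cdot \frac{3}{4} \cdot \frac{7}{8} = 21$ surviving assignments.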
Now it is not hard to see that for a fixed $d$, $T$ cannot be a constant. Taking logarithms on both sides of (1) we obtain
$$g(n, d, T)\log\frac{2^{\mathrm{len}(T, d)}}{2^{\mathrm{len}(T, d)} - 1} \le 1 + T\log n + T^2\log 2d.$$
From (2) and (3) 
Since $\mathrm{len}(T, d) = (2d)^T$, $\deg(T, d) < T\,2^{(2d)^T}(2d)^T$, and $g(n, d, T) = n/(2T\,2^{(2d)^T}(2d)^{3T})$, the LHS of (4) is greater than or equal to $\log n - \log T - (2d)^T - 3T\log 2d - 2$.
Combining (4) and (5), we must have, for some $C$, $\log n \le (2d)^T + O(\log[T^2 \log n \log d]) + \log T + 3T\log 2d + 2 \le C(2d)^T$ (assuming $2^T > \log\log n$). Hence $T = \Omega(\log\log n/\log d)$. ∎
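The last implication can be spelled out as follows, taking logarithms to base 2 and writing $C$ for the constant in the preceding inequality; the final step uses $1 + \log d \le 2\log d$ for $d \ge 2$.

```latex
\log n \le C(2d)^T
\;\Longrightarrow\;
T\log(2d) \ge \log\log n - \log C
\;\Longrightarrow\;
T \ge \frac{\log\log n - \log C}{1 + \log d}
  = \Omega\!\left(\frac{\log\log n}{\log d}\right).
```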
CONCLUDING REMARKS
We have proved separation results between parallel models with ROM. As mentioned in (Vishkin and Wigderson, 1985), input availability can affect the complexity of problems; if we assume that the inputs are given in a ROM, they are available to all the processors, and we can concentrate on the communication among the processors. We also treat nondeterministic models. Many lower bounds for deterministic models allow in each step an arbitrary amount of local computation (i.e., no shared memory cells accessed) by each processor. The reason for doing so is that the model used satisfies the minimal set of requirements, and the lower bounds still hold without restricting the power of each RAM or requiring some uniformity. For nondeterministic models, an arbitrary amount of local computation at each basic step may result in an unbounded branching factor (the private computation is a binary tree of unbounded depth). Since any realistic model would have only a fixed number of nondeterministic choices in one step, it definitely makes sense to bound the branching factor by a constant, while still allowing each branch to be arbitrarily long as long as it does not hurt the lower bound. A model with an unbounded branching factor is unrealistic: each processor can, in one step, guess an unbounded number of bits. On such a model, any function can be computed on an NCOMMON(1) without ROM in one step, by having one processor guess all the input in one step and write it into the shared memory, and then any processor whose input is not consistent with this guess objects.
As for deterministic models, an obvious open question is to show that COMMON($f(n)$), ARBITRARY($f(n)$), and PRIORITY($f(n)$) have different powers for any $f(n)$. In a subsequent paper (Li and Yesha, 1986) we partially answer this question by showing that with inputs in ROM, COMMON($n^{\epsilon}$), ARBITRARY($n^{\epsilon}$), and PRIORITY($n^{\epsilon}$) are different for $\epsilon < 1$. However, these separations are not as tight as the separations for ROM models with one shared-memory cell in Theorems 3.2 and 4.1, or as the separations for models without ROM with $m$ shared memory cells in (Fich et al., 1985).
