Trahan, J.L., M.C. Loui and V. Ramachandran, Multiplication, division and shift instructions in parallel random access machines, Theoretical Computer Science 100 (1992) 1-44.
We prove that polynomial time on a parallel random access machine (PRAM) with unit-cost multiplication and division or on a PRAM with unit-cost shifts is equivalent to polynomial space on a Turing machine (PSPACE). This extends the result that polynomial time on a basic PRAM is equivalent to PSPACE to hold when the PRAM is allowed multiplication or division or unrestricted shifts. It also extends to the PRAM the results that polynomial time on a random access machine (RAM) with multiplication is equivalent to PSPACE and that polynomial time on a RAM with
*A preliminary version of a portion of this work appeared in the proceedings of the 22nd
Introduction
An important model of parallel computation is the parallel random access machine (PRAM), which comprises multiple processors that execute instructions synchronously and share a common memory. Formalized by Fortune and Wyllie [9] and Goldschlager [10], the PRAM is a much more natural model of parallel computation than older models such as combinational circuits and alternating Turing machines [19] because the PRAM abstracts the salient features of a modern multiprocessor computer.
The PRAM provides the foundation for the design of highly parallel algorithms [14]. This model permits the exposure of the intrinsic parallelism in a computational problem because it simplifies the communication of data through a shared memory.
In this paper, we study the effect of the instruction set on the performance of the PRAM. The basic PRAM has unit-cost addition, subtraction, Boolean operations, comparisons, and indirect addressing. To quantify differences in computational performance, we determine the time complexities of simulations between PRAMs with different instruction sets. We focus on the computational complexity of simulations between enhanced PRAMs with the following additional unit-time operations: multiplication, division, arbitrary left shift, arbitrary right shift.
Further, to better understand the effects of parallelism in the PRAM, it is necessary to view the model in relation to the sequential RAM. We bound the time for an enhanced RAM to simulate a similarly enhanced PRAM. Thus, we study enhanced PRAMs by stripping away one feature at a time, and we are able to observe the gain in time-bounded computational power due to individual features.
Let PRAM[op] denote the class of PRAMs with the basic instruction set augmented with the set op of instructions. Let PRAM[op]-TIME(T(n)) denote the class of languages recognized by PRAM[op]s in time O(T(n)) on inputs of length n, PRAM[op]-PTIME the union of PRAM[op]-TIME(T(n)) over all polynomials T(n), and PRAM[op]-POLYLOGTIME the union of PRAM[op]-TIME(T(n)) over all T(n) that are polynomials in log n. Let RAM[op], RAM[op]-TIME(T(n)), RAM[op]-PTIME, and RAM[op]-POLYLOGTIME denote the analogous classes for the sequential random access machine (RAM) model.
We prove that polynomial time on a PRAM with unit-time multiplication and division or on a PRAM with unit-time unrestricted shifts is equivalent to polynomial space on a Turing machine (TM). Consequently, a PRAM with unit-time multiplication and division and a PRAM with unit-time unrestricted shifts are at most polynomially faster than a standard PRAM, which does not have these powerful instructions. These results are surprising for two reasons. First, for a sequential RAM, adding unit-time multiplication (*) or unit-time unrestricted left shift (↑) seems to increase its power:
RAM-PTIME = PTIME [6], RAM[*]-PTIME = PSPACE [12], RAM[↑]-PTIME = PSPACE [15, 22], whereas adding one of these operations to a PRAM does not increase its power by more than a polynomial in time. Second, despite the potential speed offered by massive parallelism, a sequential RAM with unit-cost multiplication or unrestricted shifts is just as powerful, within a polynomial amount of time, as a PRAM with the same additional operation. We establish the following new facts about PRAMs. Recall that PSPACE = PRAM-PTIME and POLYLOGSPACE = PRAM-POLYLOGTIME [9]. Let ↓ denote unrestricted right shift.
and PRAM[↑, ↓] may be viewed as "doubly parallel". The results in (1) are therefore also significant in that introducing unbridled parallelism to a random access machine with unit-time multiplication or unit-time unrestricted shift decreases the running time by at most a polynomial amount.
The results in (2) are notable because of their possible implications for the robust class NC, which can be characterized by several different models of parallel computation [S].
is polynomially related to time on the PRAM[op], since the class of languages accepted in polynomial time on any of these models is equivalent to PSPACE [12, 22]. We obtain tighter time bounds by the simulations given here. The simulations are performed through uniform, bounded fan-in circuits. We prove that a RAM[op] can efficiently simulate a uniform, bounded fan-in circuit and then show that the circuits that simulate a PRAM[op] meet the uniformity conditions. In another paper [26], we combined multiplication and shifts in the instruction set and proved NEXPTIME ⊆ PRAM[*, ↑, ↓]-PTIME ⊆ EXPSPACE. Thus, the combination of multiplication and shifts is more powerful, to within a polynomial in time, than either instruction separately.
The thesis by Trahan [25] gives the detailed proofs of lemmas for which only outlines of proofs appear in this paper.
Definitions and a key lemma

PRAM definitions
We study a deterministic PRAM similar to that of Stockmeyer and Vishkin [24] .
A PRAM consists of an infinite collection of processors P_0, P_1, ..., an infinite set of shared memory cells, c(0), c(1), ..., and a program which is a finite set of instructions labeled consecutively with 1, 2, 3, .... All processors execute the same program. Each processor has a program counter. Each processor P_m has an infinite number of local registers: r_m(0), r_m(1), .... Each cell c(j), whose address is j, contains an integer con(j), and each register r_m(j) contains an integer rcon_m(j).
For convenience we use a PRAM with concurrent read and concurrent write (CRCW) in which the lowest-numbered processor succeeds in a write conflict. Since we are concerned with at least polylog time, there are no significant differences between the concurrent read/concurrent write (CRCW), the concurrent read/exclusive write (CREW), and the exclusive read/exclusive write (EREW) PRAMs because the EREW model can simulate the CRCW model with a penalty of only a logarithmic factor in time (log of the number of processors attempting to simultaneously read or write) [S, 27]. If one or more processors attempt to read a cell at the same time that a processor is attempting to write the same cell, then all reads are performed before the write. Initially, the input, a nonnegative integer, is in c(0). For all m, register r_m(0) contains m. All other cells and registers contain 0, and only P_0 is active. A PRAM accepts its input if and only if P_0 halts with its program counter on an ACCEPT instruction.
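The two conflict rules just stated (all reads before any write, and the lowest-numbered writer wins) can be illustrated with a small sequential sketch. The function name crcw_step and the dictionary-based memory are our own illustrative choices, not part of the model's definition; the treatment of negative addresses follows the convention given later in this section.

```python
def crcw_step(memory, reads, writes):
    """Simulate one synchronous step of a priority CRCW shared memory.

    memory: dict address -> integer contents (missing cells hold 0)
    reads:  dict processor number -> address to read
    writes: dict processor number -> (address, value) to write
    Returns a dict processor number -> value read.
    """
    # All reads see the memory as it was before any write of this step.
    results = {}
    for p, addr in reads.items():
        results[p] = memory.get(addr, 0) if addr >= 0 else 0
    # Apply writers from highest to lowest processor number, so that the
    # lowest-numbered processor writes last and therefore wins a conflict.
    for p in sorted(writes, reverse=True):
        addr, value = writes[p]
        if addr >= 0:  # a write to a negative address does nothing
            memory[addr] = value
    return results
```

A real PRAM resolves the conflict in one time unit; the loop here only reproduces the outcome.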
In time O(log n), a processor can compute the smallest n such that con(0) ≤ 2^n - 1; the PRAM takes this n as the length of the input. Whenever con(i) is interpreted in two's complement representation, we number the bits of con(i) consecutively with 0, 1, 2, ..., where bit 0 is the rightmost (least significant) bit.
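As a concrete reading of this definition of input length, the following sketch computes the smallest n with x ≤ 2^n - 1 for a nonnegative integer x. It illustrates the definition only; it does not reproduce the O(log n)-time procedure mentioned above, and the name input_length is hypothetical.

```python
def input_length(x):
    """Smallest n >= 0 such that x <= 2**n - 1, for a nonnegative integer x."""
    if x < 0:
        raise ValueError("the input is assumed to be a nonnegative integer")
    n = 0
    while x > (1 << n) - 1:
        n += 1
    return n
```

For nonnegative x this agrees with Python's built-in x.bit_length().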
We allow indirect addressing of registers and shared memory cells through register contents. The notation c(r_m(j)) refers to the cell of shared memory whose address is rcon_m(j), and r(r_m(j)) refers to the register of P_m whose address is rcon_m(j).
The basic PRAM model has the following instructions. When executed by processor P_m, an instruction that refers to register r(i) uses r_m(i).
r(i) ← k (load a constant)
r(i) ← r(j) (load the contents of another register)
r(i) ← c(r(j)) (indirect read from shared memory)
c(r(i)) ← r(j) (indirect write to shared memory)
r(i) ← r(r(j)) (indirect read from local memory)
r(r(i)) ← r(j) (indirect write to local memory)
Processor P_0 can perform a FORK operation only once. This restriction is necessary to prevent the activation of multiple processors with identical processor numbers. This is also the reason why P_m halts when it performs a FORK. With the FORK instruction, at most 2^t processors are active at time t in a computation of a PRAM. In some variants of the PRAM model, the input is initially located in the first n cells, one bit per cell. We therefore have the instruction "r(i) ← BIT(r(j))" in order to allow the PRAM to transform the input to this format in O(log n) time. This instruction was also used by Reischuk [18].
For an integer d, define its length len(d) as the minimum integer w such that -2^{w-1} ≤ d ≤ 2^{w-1} - 1. Thus, d has a two's complement representation with w bits. Let w = max{len(rcon_m(j)), len(rcon_m(k))}. Let #d denote the two's complement representation of d. To perform a Boolean operation on rcon_m(j) and rcon_m(k), the PRAM performs the operation bitwise on the w-bit two's complement representations of rcon_m(j) and rcon_m(k). The PRAM interprets the resulting integer x in w-bit two's complement representation and writes x in r_m(i). We need at least w bits so that the result is correctly positive or negative.
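The Boolean-operation convention above can be sketched as follows. The helper names (length, to_twos, from_twos, boolean_op) are ours, and length implements len(d) as read from the definition above; this is an illustration under those assumptions, not the paper's formal semantics.

```python
import operator


def length(d):
    """Minimum w such that -2**(w-1) <= d <= 2**(w-1) - 1."""
    w = 1
    while not (-(1 << (w - 1)) <= d <= (1 << (w - 1)) - 1):
        w += 1
    return w


def to_twos(d, w):
    """The w-bit two's complement pattern of d, as a nonnegative integer."""
    return d & ((1 << w) - 1)


def from_twos(bits, w):
    """Interpret a w-bit pattern as a two's complement integer."""
    sign_bit = (bits >> (w - 1)) & 1
    return bits - (1 << w) if sign_bit else bits


def boolean_op(a, b, op):
    """Apply a bitwise operation on the w-bit representations of a and b."""
    w = max(length(a), length(b))
    bits = op(to_twos(a, w), to_twos(b, w)) & ((1 << w) - 1)
    return from_twos(bits, w)
```

For example, boolean_op(5, -3, operator.or_) works on the 4-bit patterns 0101 and 1101 and yields 1101, which is read back as -3.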
Let us assume that the division instruction returns the quotient. Let ↑ (↓) denote the unrestricted left (right) shift operation:
(the integer part of rcon_m(j) ÷ 2^{rcon_m(k)}) into r_m(i). The instruction can also be viewed as placing into r_m(i) the result of shifting the binary integer rcon_m(j) to the left (right) by rcon_m(k) bit positions.
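Read arithmetically, the two shifts amount to multiplication and integer division by a power of two. A minimal sketch, assuming a nonnegative shift distance k and a nonnegative operand x (for negative x the meaning of "integer part" would have to be fixed separately); the function names are ours.

```python
def shift_left(x, k):
    """Left shift by k positions, viewed as multiplication by 2**k."""
    return x * (2 ** k)


def shift_right(x, k):
    """Right shift by k positions, viewed as the integer part of x / 2**k.

    Assumes x >= 0, so that floor division and truncation coincide.
    """
    return x // (2 ** k)
```

On nonnegative integers these agree with Python's x << k and x >> k.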
At each step, each active processor simultaneously executes the instruction indicated by its program counter in one unit of time, then increments its program counter by one, unless the instruction causes a jump. On an attempt to read a cell at a negative address, the processor reads the value 0; on an attempt to write a cell at a negative address, the processor does nothing. The assumption of unit-time instruction execution is an essential part of our definition. In a sense, our work is a study of the effects of this unit-cost hypothesis on the computational power of time-bounded PRAMs as the instruction set is varied. For ease of description, we sometimes allow a PRAM a small constant number of separate memories, which can be interleaved.
This allowance entails no loss of generality and only a constant factor time loss.
A PRAM Z has time bound T(n) if for all inputs ω of length n, a computation of Z on ω halts in T(n) steps. Z has processor bound P(n) if for all inputs ω of length n, Z activates at most P(n) processors during a computation on ω. We assume that T(n) and P(n) are both time-constructible in the simulations of a PRAM[op] by a PRAM, so that all processors have values of T(n) and P(n). Let R be a PRAM[*]. By repeated application of the multiplication instruction, R can generate integers of length O(n 2^{T(n)}) in T(n) steps. By indirect addressing, processors in R can access cells with addresses up to 2^{n 2^{T(n)}} in T(n) steps, although R can access at most O(P(n)T(n)) different cells during its computation.
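The length bound O(n 2^{T(n)}) reflects the fact that each multiplication can at most double the length of the longest integer held so far. The sketch below squares the largest n-bit integer repeatedly and records the bit lengths; the function name is hypothetical and the experiment is only an illustration of the growth rate.

```python
def bit_lengths_under_squaring(n, steps):
    """Bit lengths obtained by squaring the largest n-bit integer `steps` times."""
    x = (1 << n) - 1  # largest nonnegative integer of length n
    lengths = [x.bit_length()]
    for _ in range(steps):
        x = x * x  # one unit-cost multiplication on a PRAM[*]
        lengths.append(x.bit_length())
    return lengths
```

After t squarings the length is at most n 2^t and at least (n - 1) 2^t, so T(n) multiplications reach length proportional to n 2^{T(n)}.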
In subsequent sections, these cell addresses will be too long for the simulating machines to write. Therefore, we first construct a PRAM[*] R' that simulates R and uses only short addresses. Similarly, a PRAM[↑, ↓] can generate extremely long integers and use them as indirect addresses, so we simulate this by a PRAM[↑, ↓] that uses only short addresses.
Associative Memory Lemma. Let op ⊆ {*, ÷, ↑, ↓}. For all T(n) and P(n), every language recognized with P(n) processors in time T(n) by a PRAM[op] R can be recognized in time O(T(n)) by a PRAM[op] R' that uses O(P^2(n)T(n)) processors and accesses only cells with addresses in 0, ..., O(P(n)T(n)).
Proof. Let R be an arbitrary PRAM[op] with time bound T(n) and processor bound P(n). We construct a PRAM[op] R' that simulates R in time O(T(n)) with P^2(n)T(n) processors, but accesses only cells with addresses in 0, ..., O(P(n)T(n)). R' employs seven separate shared memories: mem_1, ..., mem_7. Let c_b(k) denote the kth cell of mem_b and con_b(k) the contents of that cell. R' organizes the cells of mem_1 and mem_2 in pairs to simulate the memory of R: the first component, c_1(k), holds the address of a cell in R; the second component, c_2(k), holds the contents of that cell.
Actually, in order to distinguish address 0 from an unused cell, c_1(k) holds one plus the address. Let pair(k) denote the kth memory pair. R' organizes the cells of mem_3, mem_4, and mem_5 in triples to simulate the local registers of R: the first component, c_3(k), holds the processor number; the second component, c_4(k), holds the address of a register in R; the third component, c_5(k), holds the contents of that register. Let triple(k) denote the kth memory triple. Since R can access at most O(P(n)T(n)) cells in T(n) steps, R' can simulate the cells used by R with O(P(n)T(n)) memory pairs and triples. R' uses memories mem_6 and mem_7 for communication among the processors.
Let P_m denote processor m of R; let P'_m denote processor m of R'. We now describe the operation of R'. In O(log P(n)) steps, R' activates P(n) processors, called primary processors. In the next log(P(n)T(n)) steps, each primary processor activates P(n)T(n) secondary processors, each of which corresponds to a memory pair and a memory triple. Primary processor P'_m corresponds to the processor of R numbered (m/(P(n)T(n))) - P(n). The processors numbered m + k, for all k, 0 ≤ k ≤ P(n)T(n) - 1, are the secondary processors belonging to primary processor P'_m. Each secondary processor P'_j belonging to P'_m, j = m + k, handles pair(k) and triple(k). We call k the assignment number of P'_j. P'_j computes its assignment number in constant time.
Observe that if i < m and P'_i and P'_m are primary processors, then the processor of R to which P'_i corresponds is numbered lower than the processor of R to which P'_m corresponds, and all secondary processors belonging to P'_i are numbered lower than all secondary processors belonging to P'_m. We exploit this ordering to handle concurrent writes by processors in R.
Suppose R' is simulating step t of R in which P_g writes v in c(f). Then the corresponding primary processor P'_m of R' writes f + 1 into c_1(P(n)(T(n) - t) + g + 1) and v into c_2(P(n)(T(n) - t) + g + 1). That is, at step t of R, all primary processors of R' write only cells with addresses in P(n)(T(n) - t) + 1, ..., P(n)(T(n) - t + 1), with the lowest-numbered primary processor writing in the lowest-numbered cell in the block. The memory holds a copy every time a processor attempts to write c(f). By this ordering, the copy of a cell in R with the current contents (most recently written by lowest-numbered processor) is in a lower-numbered cell of mem_2 of R' than each of the other copies. The secondary processor that handles this current copy is lower-numbered than each of the secondary processors handling other copies. If at some later step a primary processor P'_m desires to read con(f) of R, then its secondary processors read all copies of con(f) and concurrently write their values in c_6(m). By the write priority rules in which the lowest-numbered processor of those simultaneously attempting to write a cell succeeds, the secondary processor reading the current value of con(f) succeeds in the write.
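The address-pair scheme for shared memory can be sketched sequentially as follows. The class name and its methods are ours, and the linear scan in read stands in for the P(n)T(n) secondary processors that R' uses to perform the same lookup in constant time via a priority write; this is an illustration under those assumptions, not the construction itself.

```python
class AssociativeMemory:
    """Sequential sketch of the (address, contents) pairs kept by R'."""

    def __init__(self, P, T):
        self.P, self.T = P, T
        # Pair k lives at index k; index 0 is unused. A stored address of 0
        # marks an unused pair, which is why we store f + 1 rather than f.
        self.addr = [0] * (P * T + 1)
        self.val = [0] * (P * T + 1)

    def write(self, t, g, f, v):
        """Processor g of R (0 <= g < P) writes v to cell f at step t (1 <= t <= T)."""
        k = self.P * (self.T - t) + g + 1
        self.addr[k] = f + 1
        self.val[k] = v

    def read(self, f):
        """Current contents of cell f: the matching pair with the lowest index."""
        for k in range(1, len(self.addr)):
            if self.addr[k] == f + 1:
                return self.val[k]
        return 0  # a cell that was never written holds 0
```

Later steps use lower-numbered blocks and, within a step, lower-numbered processors use lower-numbered pairs, so the lowest matching index always holds the current contents, even when the address f itself is astronomically large.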
Similarly, suppose R' is simulating a step of R in which P_g writes v in r_g(f). Then P'_m writes g in c_3(P(n)(T(n) - t) + g + 1), f + 1 in c_4(P(n)(T(n) - t) + g + 1), and v in c_5(P(n)(T(n) - t) + g + 1). If at some later step P'_m desires to read rcon_g(f), then its secondary processors read all copies of rcon_g(f) and concurrently write their values in c_7(m).
When a processor P_g of R executes an instruction r(i) ← r(j) ○ r(k), it reads rcon_g(j) and rcon_g(k), computes v := rcon_g(j) ○ rcon_g(k), and writes v in r_g(i). The corresponding processor P'_m of R' simulates this step as follows. Using mem_6 and mem_7 to communicate with its secondary processors and exploiting the write priority rules, P'_m copies rcon_g(j) of R to r_m(1) and rcon_g(k) of R to r_m(2). P'_m then computes v := rcon_m(1) ○ rcon_m(2), writing v in r_m(1). Next, if i is negative, then P'_m does nothing. Otherwise, suppose R' is simulating step t of R. Each primary processor keeps track of t in its local memory. Then P'_m writes g in c_3(P(n)(T(n) - t) + g + 1), i + 1 in c_4(P(n)(T(n) - t) + g + 1), and v in c_5(P(n)(T(n) - t) + g + 1) to complete the simulation of step t. Thus, R' uses a constant number of steps to simulate a step of R and only
Observation 1: R' needs only addition and subtraction to construct any address that it uses.
Observation 2: Each processor of R' uses only a constant number of local registers.
Hagerup [11] proved a result for the same problem as the Associative Memory Lemma, but he fixes the number of processors and lets the time grow. Let S(n) denote the highest-numbered memory cell used in a computation of PRAM[op] R, and let B(n) be any function such that B(n) ≥ 2 and is computable with the resources given below. Specifically, Hagerup proved that for all T(n), P(n), and S(n), every language recognized with P(n) processors in time T(n) with memory cells numbered at most
T(n)) by a PRAM[op] R' that uses P(n) processors and accesses only cells with addresses in 0, ..., O(P(n)T(n)B(n)).
Circuit definitions
We use the following definitions relating to circuits [19].
• A circuit family C is a set {C_1, C_2, ...} of circuits, where C_n has n inputs and one output. We restrict the gate numbering so that the largest gate number is (Z(n))^{O(1)}, where Z(n) is the size of C_n. Thus, a gate number coded in binary has length O(log Z(n)).
• A bounded fan-in circuit is a circuit where the indegree of all gates is at most 2. For each gate g in C_n, let g(λ) denote g, g(L) denote the left input to g, and g(R) denote the right input to g. If C_n has at most Z(n) gates and depth D(n), then the size complexity of C is Z(n) and the depth complexity is D(n).
• An unbounded fan-in circuit is a circuit where indegree is unbounded. For each gate g in C_n, let g(λ) denote g, and let g(p), p = 0, 1, 2, ..., denote the pth input to g. If C_n has at most Z(n) wires and depth D(n), then the size complexity of C is Z(n) and the depth complexity is D(n).
• The family C = {C_1, C_2, ...} of bounded (unbounded) fan-in circuits of size Z(n) is VM-uniform if there is a RAM[↑, ↓] that on input I(Z(n)) returns an output string in O(log Z(n)) time indicating for each pair (g, h) whether (n, g, L, h) is in LBDC and whether (n, g, R, h) is in LBDC (indicating for each pair (g, h) the value of p such that
• A gate g is at level j of C_n if the longest path from any circuit input to g has length j. Gate g is at height j of C_n if the longest path from g to the output has length j.
• Let C_n be a bounded fan-in circuit consisting entirely of AND, OR, and inp gates with depth D(n). We construct the circuit CT(C_n), the circuit tree of C_n, from C_n. Let gate a be the output gate of C_n and let a be of type φ ∈ {AND, OR} with inputs from gates b and c. Then the output gate of CT(C_n) has name (0, a), type φ, and inputs from gates named (1, b) and (2, c). Thus, gate (0, a) is the gate at height 0 of CT(C_n) and gates (1, b) and (2, c) are the gates at height 1 of CT(C_n). Now suppose that we have constructed all gates at height j of CT(C_n), and we wish to construct the gates at height j + 1. Each gate (i, e) at height j corresponds to a gate e in C_n. If e is of type φ ∈ {AND, OR}, then gate (i, e) is of type φ. Suppose gate e has inputs from gates f and g. Then the inputs to gate (i, e) of CT(C_n) at height j + 1 are the gates (2i + 1, f) and (2i + 2, g). If gate e is of type inp, that is, an input, and j < D(n), then (i, e) is of type OR (if (i, e) is at an even-numbered level) or type AND (if (i, e) is at an odd-numbered level), and the inputs to gate (i, e) at height j + 1 are the gates (2i + 1, e) and (2i + 2, e). If gate e is of type inp and j = D(n), then (i, e) is of type inp and CT(C_n) has no gates at height j + 1 connected to gate (i, e). Figure 1 contains an example of a circuit tree.
Note that in a double-rail circuit every gate has exactly two inputs, except the input gates.
• A layered circuit is a double-rail circuit such that all gates at level i, for all odd i, are AND-gates and all gates at level i, for all even i, are OR-gates, and each input to a gate at level i is connected to an output of a gate at level i - 1.
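The circuit-tree construction in the definitions above can be sketched as follows. The dictionary encoding of circuits and the function names are our own assumptions; in particular, we take the level of a tree gate at height j to be D(n) - j when choosing the type of a padding gate, which does not affect the computed value because both inputs of a padding gate are the same input gate.

```python
def circuit_tree(circuit, out, depth):
    """Build CT(C_n) from a bounded fan-in circuit.

    circuit maps a gate name e to ("inp",) or to (type, f, g) with type in
    {"AND", "OR"}; out is the output gate and depth is D(n).
    Returns a dict mapping a tree gate (i, e) to ("inp", e) or to
    (type, left_name, right_name).
    """
    tree = {}
    frontier = [(0, out)]
    for height in range(depth + 1):
        next_frontier = []
        for (i, e) in frontier:
            kind = circuit[e][0]
            if kind in ("AND", "OR"):
                _, f, g = circuit[e]
                kids = ((2 * i + 1, f), (2 * i + 2, g))
            elif height < depth:
                # Padding gate: an input reached too early is repeated.
                level = depth - height
                kind = "OR" if level % 2 == 0 else "AND"
                kids = ((2 * i + 1, e), (2 * i + 2, e))
            else:
                tree[(i, e)] = ("inp", e)
                continue
            tree[(i, e)] = (kind, kids[0], kids[1])
            next_frontier.extend(kids)
        frontier = next_frontier
    return tree


def eval_tree(tree, name, inputs):
    """Evaluate the tree gate `name`; inputs maps input gate names to booleans."""
    node = tree[name]
    if node[0] == "inp":
        return inputs[node[1]]
    a = eval_tree(tree, node[1], inputs)
    b = eval_tree(tree, node[2], inputs)
    return (a and b) if node[0] == "AND" else (a or b)
```

The result is a complete binary tree of depth D(n) in which every root-to-leaf path has the same length, computing the same function as the original circuit.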
Lemma 2.1 (Trahan [25]). Let C = {C_1, C_2, ...} be a VM-uniform (MRAM-uniform) family of bounded fan-in circuits of size Z(n) and depth D(n) recognizing language L. There exists a family of VM-uniform (MRAM-uniform), bounded fan-in, layered circuits F = {F_1, F_2, ...} of size O(Z(n)) and depth O(D(n)) recognizing language L.
Simulation of uniform circuit by RAM[op]
In this section, we restructure a uniform, bounded fan-in circuit, then simulate the restructured circuit on a RAM[op]. We first describe the simulation of a VM-uniform circuit on a RAM[↑, ↓], then to balance the time spent generating the circuit with the time spent running the circuit.
Simulation of VM-uniform circuit by RAM[↑, ↓]
Let C = {C_1, C_2, ...} be a VM-uniform family of bounded fan-in circuits of size Z(n) and depth D(n) recognizing language L. We now describe how a RAM[↑, ↓] can simulate C.
Simulation. Fix an input length n. Circuit C_n has size Z(n) and depth D(n).
By Lemma 2.1, there exists a VM-uniform layered circuit F_n with size O(Z(n)) and depth O(D(n)) that recognizes language L^{(n)}. Machine R simulates C_n via F_n. For simplicity, let us say that F_n has depth D(n) and size Z(n), and that all gates of F_n are numbered from {0, 1, ..., Z(n) - 1}. Let us first outline the simulation.
Stage 1. R generates a Z(n) × Z(n) ancestor matrix A in which each entry (g, h) indicates whether gate h is an input of gate g in F_n.
Stage 2. R obtains matrix A_{log Z(n)}, the distance log Z(n) ancestor matrix.
(i).
Matrix G is stored in a single register in row major order. We view the contents of this register both as a matrix of discrete elements and as a single bit string. We call the portion of a matrix comprising one row a box. We call the portion of a box containing one element of a row a slot.
We introduce a simple procedure COLLAPSE, which R uses to extract information from matrix G. Procedure COLLAPSE(α, z) takes as input the value α, where #α = α_{z^2-1} ... α_1 α_0, and returns the value β, where #β = β_{z^2-1} ... β_1 β_0, with bits β_{kz} = ∨_{j=0}^{z-1} α_{kz+j} for 0 ≤ k ≤ z - 1, and β_i = 0 if i ≠ kz. R can perform COLLAPSE(α, z) in O(log z) time by shifting and ORing, then masking away all bits β_i for i ≠ kz. Let out denote the name of the output gate of F_n. Nonzero entries in row out of G correspond to the ancestors of gate out at distance log Z(n); that is, the gates at the boundary between C(v - 1) and C(v - 2). To extract CT(C(v - 1)), R masks away all but row out of G. Let S(v - 1) denote this value.
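The "shifting and ORing, then masking" idea behind COLLAPSE can be sketched on Python integers as below. This follows our reading of the definition (bit kz of the result is the OR of bits kz, ..., kz + z - 1 of the argument); the doubling schedule for the shift distances and the way the mask is built are our own choices, not necessarily those of R.

```python
def collapse(alpha, z):
    """OR the z bits of each z-bit slot of alpha into the slot's lowest bit.

    alpha is a nonnegative integer of at most z*z bits. The result has bit
    k*z equal to the OR of bits k*z .. k*z + z - 1 of alpha, for
    0 <= k <= z - 1, and every other bit equal to 0.
    """
    x = alpha
    span = 1  # invariant: bit p of x is the OR of bits p .. p + span - 1 of alpha
    while span < z:
        d = min(span, z - span)  # never let the window grow beyond z bits
        x |= x >> d
        span += d
    # Mask with a 1 in the lowest position of every slot. A RAM would build
    # this by doubling as well; here it is written out directly.
    mask = sum(1 << (k * z) for k in range(z))
    return x & mask
```

The loop runs O(log z) times, since the window at least doubles until the final iteration. For z = 3 and α = 010 000 101 in binary, the slots 101, 000, 010 (from the right) collapse to 001 000 001.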
In general, assume that we have S(i+ l), and we want to extract S(i) from G. First, R computes the OR of all boxes of S( i+ 1). Let q(i) denote this value. R computes 
(i). Each S(i) contains a description of CT(C(i)). (Note: CT(C(i)) is a collection of circuit trees, one for each output of C(i).) In Stage 4, R runs each CT(C(i)) in sequence. R begins by manipulating the input ω to be in the form necessary to run on CT(C(0)) by SPREAD(#ω, Z(n)).
The input ω to F_n is 2n bits long (n input bits and their complements).
We describe how R runs the circuit by slices. We must take the output ψ(i) from C(i) and convert it into the form needed for the input to CT(C(i + 1)). We let ω(i) denote this input.
Let us define a function 
The result ψ'(i - 1) has Z(n) output bits from ψ(i - 1). These are Z(n) bits apart, one per slot in a single box. Now R concatenates Z(n) copies of ψ'(i - 1); call this ψ''(i - 1). Each element is Z^2(n) bits long, the length of a box. R computes S(i)' = S(i) ∧ ψ''(i - 1); hence, #S(i)' has a 1 in position jZ^2(n) + kZ(n) + l if gate j is at the bottom of C(i), gate k is the lth ancestor of j at the top of C(i), and the input to gate k is a 1. R ORs all slots in each box together in O(log Z(n)) steps, producing bit vector ω(i). By our construction, ω(i) is the input to CT(C(i)). Recall that CT(C(i)) consists of alternating layers of AND and OR gates. We run CT(
steps by shifting, then ANDing and ORing. It takes time O(log Z(n)) to manipulate the output from one slice into the form needed for the input to the next slice and time O(log Z(n)) to run a slice. Since there are D(n)/log Z(n) slices, it takes time O(D(n)) to run a circuit on the input, given the distance log Z(n) ancestor matrix G.
Proof. We construct R by the method described above. For fixed n, R simulates C_n via F_n in O(log Z(n) log log Z(n)) time to create matrix G, then O(D(n)) time to run F_n on input ω, given G. Thus, the overall time is O(D(n) + log Z(n) log log Z(n)) steps. □
Simulation of MRAM-uniform circuit by RAM[*]
In this section, we adapt the simulation of a VM-uniform circuit by a RAM[↑, ↓] R that recognizes
Proof. Without loss of generality, assume that R has two memories: mem_1 and mem_2. R performs the simulation described in Section 3.1, using a precomputed
To perform a right shift by j bits, R shifts all other values in mem_1 left by j bits, then notes that the rightmost j bits of all registers are to be ignored [12]. This takes constant time because, by reusing registers, R uses only a constant number of registers in mem_1. In O(log Z(n)) time, R computes the values 2^{Z(n)} and 2^{Z^2(n)}, since Z(n) and Z^2(n) are the basic shift distances. In the course of the computation, R performs shifts by Z(n)·2^i, 0 ≤ i ≤ log Z(n), for each value of i. R computes the necessary shift value on each iteration from the previous value. Thus, the simulation by R takes the same amount of time as the simulation described in Section 3.1: O(D(n) + log Z(n) log log Z(n)).
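The way a machine with multiplication but no shift can obtain its shift multipliers in O(log Z(n)) steps can be sketched by repeated squaring. The sketch assumes, for simplicity, that Z = 2^m is a power of two; the function names are hypothetical and this is only one plausible reading of the step described in the proof.

```python
def shift_multipliers(m):
    """Assuming Z = 2**m, return (2**Z, 2**(Z*Z)) using only multiplication."""
    p = 2
    for _ in range(m):  # after i squarings, p == 2**(2**i)
        p = p * p
    two_to_z = p  # 2**Z
    for _ in range(m):  # m more squarings raise 2**Z to the power Z
        p = p * p
    return two_to_z, p


def left_shift_by_multiplication(x, multiplier):
    """Shift x left by j bits, given multiplier == 2**j."""
    return x * multiplier
```

Squaring the current multiplier doubles the shift distance, which matches the remark that each shift value Z(n)·2^i is computed from the previous one.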
Simulation of PRAM[*] by PRAM
Let R be a PRAM[*] operating in time T(n) on inputs of length n and using at most P(n) processors.
Let R' be a PRAM[*] that uses only short addresses and simulates R according to the Associative Memory Lemma. Thus, R' uses O(P^2(n)T(n)) processors, O(T(n)) time, and only addresses in 0, 1, ..., O(P(n)T(n)). Each processor of R' uses only q registers, where q is a constant.
We construct a PRAM Z that simulates R via R' in O(T^2(n)/log T(n)) time, using O(P^2(n) T^2(n) n^2 4^{T(n)} log T(n)) processors.
We view Z as having q + 4 separate shared memories: mem_0, ..., mem_{q+3}. Our view facilitates description of the algorithm to follow. The idea of the proof is that Z stores the cell contents of R' with one bit per cell and acts as an unbounded fan-in circuit to manipulate the bits.
Initialization.
Z partitions mem_q into O(P(n)T(n)) sections of n 2^{T(n)} cells each. Let S(i) denote the ith section. A section is sufficiently long to hold any number generated in T(n) steps by R', one bit per cell, in n 2^{T(n)}-bit two's complement representation. We are now prepared to describe the simulation by Z of a general step of R'. Consider a processor P_g of R' and the corresponding primary processor P_m of Z. The actions of P_m and its secondary processors depend on the instruction executed by P_g of R'. P_m notifies its secondary processors of the instruction.
The following cases arise.
r(i) ← r(j) + r(k):
Chandra et al. [3] gave an unbounded fan-in circuit of size O(x (log* x)^2) and constant depth for adding two integers of length x. Stockmeyer and Vishkin [24] proved that an unbounded fan-in circuit of depth D(n) and size S(n) can be simulated by a CRCW PRAM in time O(D(n)) with O(n + S(n)) processors. By the combination of these two results, the secondary processors perform addition in constant time with their concurrent write ability. This addition requires O(n 2^{T(n)} (log*(n 2^{T(n)}))^2) processors.
r(i) ← r(j) ∧ r(k):
The secondary processors perform a Boolean AND in one step.
Other Boolean operations are performed analogously.
r(i) ← r(j) - r(k):
The secondary processors add rcon_g(j) and the two's complement of rcon_g(k). This takes constant time.
Comparisons (CJUMP r(i) > r( j), label):
For 1 ≤ k ≤ n 2^{T(n)}, the secondary processor of the first section that normally handles the kth cell of the section handles the (n 2^{T(n)} - k + 1)th cell. Thus, the lowest-numbered processor reads the most significant bit. Each secondary processor allocated to the first section compares corresponding bits of B_i(g) and B_j(g), then writes the outcome of the comparison only if the bits differ. By the CRCW priority rule, after the secondary processors write concurrently, the value written corresponds to the most significant bit at which the operands differ. Thus, the outcome of rcon_g(i) > rcon_g(j) is determined by employing the concurrent write rules of the PRAM. Other comparisons are performed analogously and all comparisons can be simulated in constant time.
Proof. For inputs of length y, Schönhage and Strassen [20] gave a multiplication algorithm that may be implemented as a logspace-uniform bounded fan-in circuit with depth O(log y) and size O(y log y log log y). Chandra et al. [4]
Memory Lemma, R' accesses only addresses of length O(log P(n)T(n)). If P_g wishes to perform an indirect read from c(r(i)), then P_m and its associated processors perform a SQUASH on B_i(g) in time O(log log P(n)T(n)).
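Returning to the comparison case above, the priority-write trick can be sketched sequentially as follows. The sketch assumes nonnegative operands of a fixed width (the sign bit of a two's complement operand would need the opposite treatment), and the function name is ours.

```python
def greater_than(a, b, width):
    """Decide a > b for 0 <= a, b < 2**width by mimicking a priority write."""
    outcome = None  # the shared cell; None means that no processor wrote
    # Processor p inspects bit width - 1 - p, so processor 0 (the lowest
    # number, hence the highest priority) sees the most significant bit.
    # Visiting writers from highest to lowest number lets the
    # lowest-numbered writer overwrite all others, as the priority rule does.
    for p in reversed(range(width)):
        bit = width - 1 - p
        a_bit = (a >> bit) & 1
        b_bit = (b >> bit) & 1
        if a_bit != b_bit:  # write only where the operands differ
            outcome = a_bit > b_bit
    return bool(outcome)  # no write at all means a == b, so a > b is false
```

On the PRAM all processors write in the same step, so the comparison takes constant time; the loop only reproduces which write survives.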
If processors P_f and P_g of R' simultaneously attempt to write c(j), then the corresponding processors P_l and P_m of Z simultaneously attempt to write S(j) of mem_q. If f < g, then l < m, and all secondary processors of P_l are numbered less than all secondary processors of P_m. Thus, in R', P_f succeeds in its write, and in Z, P_l and its secondary processors succeed in their writes.
Theorem 4.2. For all T(n) ≥ log n,
Proof. According to the above discussion, Z simulates R via R'. Initialization takes O(log(P(n)T(n)) + T(n) + log n) = O(T(n)) time. Z performs indirect addressing in O(log T(n)) time, multiplication in O(T(n)/log T(n)) time, and all other operations in constant time. Thus, Z uses time O(T(n)/log T(n)) to simulate each step of R'.
Z uses O(P^2(n)T(n)) primary processors, each with O(n^2 4^{T(n)} T(n) log T(n)) secondary processors.
If T(n) = O(log n), then P(n) is a polynomial in n, and Z simulates R in time O(log^2 n/log log n) with polynomially many processors. Thus, an algorithm running in time O(log n) on a PRAM[*] is in NC^2. If T(n) = O(log^k n), then Z simulates R in time O(log^{2k} n/(2k log log n)) with O(n^{2+4 log^{k-1} n} log^{2k} n log log n) processors. So, our simulation does not show that an algorithm running in time O(log^k n), k > 1, on a PRAM[*] is in NC because of the superpolynomial processor count. An interesting open problem is to show either that PRAM[*]-POLYLOGTIME = NC by reducing the processor count to a polynomial or that NC is strictly included in PRAM[*]-POLYLOGTIME by proving that the simulation requires a superpolynomial number of processors.
Simulations of PRAM[*] by circuits and Turing machine
We now describe simulations of a PRAM[*] by a logspace-uniform family of unbounded fan-in circuits, a logspace-uniform family of bounded fan-in circuits, and a Turing machine.
Lemma 4.3 (Stockmeyer and Vishkin [24]). Let Z be a PRAM with time bound T(n), processor bound P(n), and word-length bound L(n). There is an unbounded fan-in circuit C_n that simulates Z in depth O(T(n)) and size O(P(n)T(n)L(n)(L^2(n) + P(n)T(n))).
Note: Minor changes are necessary in the simulation of Stockmeyer and Vishkin to account for differences between their PRAM definition and ours, but these cause no change in the overall depth or size of the simulating circuit. Stockmeyer and Vishkin presented the simulation of a nonuniform PRAM by a nonuniform family of circuits. For our PRAM definition, in which all processors share a constant size program, the simulating circuit is logspace-uniform.
Lemma 4.4. For each n and T(n) ≥ log n, every language recognized by a PRAM[*] R in time T(n) with P(n) processors can be recognized by a logspace-uniform, unbounded fan-in circuit UC_n of depth O(T^2(n)/log T(n)) and size O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))).
Proof. The depth bound follows from Theorem 4.2 and Lemma 4.3. We now establish the size bound. Let R' be the PRAM[*] described in Theorem 4.2 that simulates R according to the Associative Memory Lemma, using O(T(n)) time with O(P^2(n)T(n)) processors and word length O(n 2^{T(n)}). Fix an input length n. Let UC_n be a logspace-uniform, unbounded fan-in circuit that simulates R' by the construction given by Stockmeyer and Vishkin [24] (Lemma 4.3), with one modification.
For each time step of R', we add to UC_n a block of depth O(T(n)/log T(n)) and size O(n^2 4^{T(n)} T(n) log T(n)) that handles multiplication (Lemma 4.1). Thus, UC_n has depth O(T^2(n)/log T(n)) and size O(P(n)T(n)[L(n)(L^2(n)+P(n)T(n)) + n^2 4^{T(n)} T(n) log T(n)]) = O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))), since P(n) ≤ 2^{T(n)}. □
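The final equality in this size bound can be checked term by term. The following worked substitution is a sketch, not part of the original proof: it assumes that the processor and word-length bounds of R' are the quantities substituted for P(n) and L(n) in the size expression, and it uses n ≤ 2^{T(n)} (from T(n) ≥ log n). The shorthand T, P', and L is introduced here only for illustration.

```latex
\begin{align*}
P' &= O\bigl(P^2(n)\,T\bigr) \le O\bigl(4^{T}\,T\bigr),
\qquad L = O\bigl(n\,2^{T}\bigr),\\
P'\,T\,L\,\bigl(L^2 + P'\,T\bigr)
  &= O\bigl(4^{T}T \cdot T \cdot n\,2^{T} \cdot (n^2\,4^{T} + 4^{T}\,T^2)\bigr)
   = O\bigl(n\,T^2\,32^{T}\,(n^2 + T^2)\bigr),\\
P'\,T \cdot n^2\,4^{T}\,T\log T
  &= O\bigl(n^2\,16^{T}\,T^3\log T\bigr)
   \le O\bigl(n\,T^4\,32^{T}\bigr).
\end{align*}
```

The last inequality holds because n ≤ 2^T and log T ≤ T, so the multiplication blocks are absorbed by the first term.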
Lemma 4.5. For each n and T(n) ≥ log n, every language recognized by a PRAM[*] R in time T(n) with P(n) processors can be recognized by a logspace-uniform, bounded fan-in circuit BC_n of depth O(T^2(n)) and size O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))).
Proof. Fix an input length n. Let UC_n be the unbounded fan-in circuit described in Lemma 4.4
O(P^2(n)T^2(n))} = O(T^2(n) 4^{T(n)}).
Hence, these parts of the circuit can be implemented as a logspace-uniform, bounded fan-in circuit of depth O(T(n)). The multiplication blocks may be implemented as logspace-uniform, bounded fan-in circuits of depth O(T(n)) (Lemma 4.1). Let BC_n be this bounded fan-in implementation of UC_n.
Since P(n) ≤ 2^{T(n)}, BC_n simulates each step of R' in depth O(T(n)); hence, BC_n simulates R via R' in depth O(T^2(n)) and size O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))). □
Theorem 4.6. For all T(n) ≥ log n, PRAM[*]-TIME(T(n)) ⊆ DSPACE(T^2(n)).
Proof. Theorem 4.6 follows from Lemma 4.5 and Borodin's [2] result that a logspace-uniform, bounded fan-in circuit of depth D(n) can be simulated in space O(D(n)) on a Turing machine when D(n) = Ω(log n). □
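The intuition behind a depth-to-space result of this kind is that a bounded fan-in circuit can be evaluated depth-first, so that only the current root-to-input path needs to be remembered. The Python sketch below illustrates that idea only; it is not Borodin's construction, which recomputes gate descriptions from the uniformity condition rather than storing the circuit. The dictionary-based circuit format and the names `evaluate` and `XOR_CIRCUIT` are assumptions made for illustration.

```python
# Hypothetical circuit format (an assumption for this sketch):
#   gate name -> ("IN", index) | ("NOT", a) | ("AND", a, b) | ("OR", a, b)

def evaluate(circuit, gate, inputs, depth=0):
    """Evaluate `gate` depth-first; return (value, deepest recursion level reached)."""
    kind = circuit[gate][0]
    if kind == "IN":
        return inputs[circuit[gate][1]], depth
    if kind == "NOT":
        value, deepest = evaluate(circuit, circuit[gate][1], inputs, depth + 1)
        return (not value), deepest
    left, right = circuit[gate][1], circuit[gate][2]
    # Only one child is being explored at any moment, so the live state
    # is a single path whose length is bounded by the circuit depth.
    lv, ld = evaluate(circuit, left, inputs, depth + 1)
    rv, rd = evaluate(circuit, right, inputs, depth + 1)
    value = (lv and rv) if kind == "AND" else (lv or rv)
    return value, max(ld, rd)

# x0 XOR x1 written as (x0 OR x1) AND NOT (x0 AND x1); depth 3.
XOR_CIRCUIT = {
    "x0": ("IN", 0),
    "x1": ("IN", 1),
    "or": ("OR", "x0", "x1"),
    "and": ("AND", "x0", "x1"),
    "nand": ("NOT", "and"),
    "out": ("AND", "or", "nand"),
}

value, deepest = evaluate(XOR_CIRCUIT, "out", [True, False])
```

The recursion depth equals the circuit depth, which is the quantity that the space bound tracks.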
Simulation of PRAM[*] by RAM[*]
In this section, we simulate a PRAM[*] by an MRAM-uniform, bounded fan-in circuit family, then simulate this circuit family by a RAM[*]. We also simulate a basic PRAM by a RAM[*].
Let C = {C_1, C_2, ...} be the family of unbounded fan-in circuits described in Lemma 4.3 that simulates a PRAM that runs in time T(n) with P(n) processors. We construct a family of bounded fan-in circuits BC' = {BC'_1, BC'_2, ...} from C. The fan-in of any gate in C_n is at most max{O(L(n)), O(P(n)T(n))} = max{O(n + T(n)), O(P(n)T(n))}.
We replace each gate with fan-in f in C_n by a tree of gates of depth log f in BC'_n. The depth of BC'_n is O(T(n) log(P(n)T(n))), and the size is O(P(n)T(n)L(n)(L^2(n)+P(n)T(n))), the same size as C_n.
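The fan-in reduction can be illustrated with a short Python sketch: a gate with f inputs is replaced by a balanced tree of fan-in-2 gates, which has depth ⌈log2 f⌉ and uses f − 1 gates. The function name `tree_reduce` and the list-based representation are assumptions for illustration, not notation from the construction itself.

```python
def tree_reduce(values, op):
    """Combine `values` with a balanced tree of fan-in-2 `op` gates.

    Returns (result, depth, gates_used).
    """
    level = list(values)
    depth = 0
    gates = 0
    while len(level) > 1:
        # Pair up neighbours; each pair costs one fan-in-2 gate.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        gates += len(nxt)
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element passes through to the next level
        level = nxt
        depth += 1
    return level[0], depth, gates

# An OR gate with fan-in 8 becomes a tree of depth 3 with 7 gates.
result, depth, gates = tree_reduce([False] * 7 + [True], lambda a, b: a or b)
```

Since the tree for a gate of fan-in f adds only f − 1 gates, the total gate count grows by at most a constant factor relative to the number of wires.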
Lemma 4.7. BC' is MRAM-uniform.
Proof. We first establish that the unbounded fan-in circuit C is MRAM-uniform, then establish that the bounded fan-in circuit BC' is MRAM-uniform.
Fix a PRAM Y and an input size n. The simulating circuit C_n comprises T(n)
is O(P(n)L(n)(L^2(n)+P(n)T(n))),
where L(n) is the word size of Y, and the total size of C_n is T(n) times this amount.
The general form of a gate name is specified in Fig. 2 . 
. . . , Z(n)^{O(1)}}. To test connectivity, R compares corresponding slots of #g and #h for all pairs (g, h) simultaneously. R separates the pairs for which the comparison is true from the pairs for which the comparison is false by building an appropriate mask in time O(log Z(n)).
Thus, C is MRAM-uniform. A gate name in BC'_n is the concatenation of the unbounded fan-in gate name in C_n and the name of the gate within the bounded fan-in tree that replaces the unbounded fan-in gate (Fig. 3).
We prove MRAM-uniformity by the same method as above, with modifications to test slot E, the portion of the gate name giving the gate name within the tree of depth log f. By this algorithm, we see that the family BC' of bounded fan-in circuits is MRAM-uniform since C is MRAM-uniform. □
Theorem 4.8. For all T(n) ≥ log n, every language recognized with P(n) processors in time T(n) by a PRAM can be recognized by a RAM[*] in time O(T(n) log(P(n)T(n))).
Proof. By Lemma 4.7, BC', the family of bounded fan-in circuits that simulates a PRAM, is MRAM-uniform. Let BC denote the family of bounded fan-in circuits described in Lemma 4.5 that simulates a PRAM[*] in depth O(T^2(n)) and size O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))).
By Theorem 3.4, a RAM[*] can simulate BC' in time O(T(n) log(P(n)T(n))).
We construct a family of bounded fan-in circuits MC from BC. Fix an input size n. The circuit MC_n is exactly the same as the circuit BC_n except that MC_n uses a different multiplication block for reasons of MRAM-uniformity. Insert bounded fan-in circuits performing carry-save multiplication in MC_n. Each block has depth O(T(n)) and size O(n^2 4^{T(n)}). Thus, MC_n has depth O(T^2(n)) and size O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))).
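Carry-save multiplication reduces the partial products three at a time with bitwise 3:2 compressors and performs a single carry-propagating addition at the end. The Python sketch below shows the idea at the word level; it is an illustrative assumption about the general technique, not the gate-level circuit block used in MC_n, and the function names are hypothetical.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: return (s, carry) with s + carry == a + b + c."""
    s = a ^ b ^ c                                # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carries, shifted into place
    return s, carry

def carry_save_multiply(x, y):
    """Multiply nonnegative integers; return (product, number of compressor layers)."""
    # One shifted copy of x for every 1 bit of y.
    partials = [x << i for i in range(y.bit_length()) if (y >> i) & 1]
    layers = 0
    while len(partials) > 2:
        nxt = []
        # Compress each full group of three operands into two.
        for i in range(0, len(partials) - 2, 3):
            nxt.extend(carry_save_add(*partials[i:i + 3]))
        # Operands left over (zero, one, or two) pass to the next layer.
        nxt.extend(partials[len(partials) - len(partials) % 3:])
        partials = nxt
        layers += 1
    # A single carry-propagating addition finishes the product.
    return sum(partials), layers

product, layers = carry_save_multiply(13, 11)
```

Each layer shrinks the number of operands by roughly a factor of 2/3, so the number of layers is logarithmic in the number of partial products.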
Lemma 4.9. For each n, every language recognized by a PRAM[*] R in time T(n) with P(n) processors can be recognized by a bounded fan-in, MRAM-uniform circuit MC_n of depth O(T^2(n)) and size O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))).
Proof. The proof of the PRAM[*] simulation is similar to that given for Lemma 4.5. By an argument similar to the proof of Lemma 4.7, MC is MRAM-uniform. □
Theorem 4.10. For all T(n) ≥ log n, PRAM[*]-TIME(T(n)) ⊆ RAM[*]-TIME(T^2(n)).
Proof. By Lemma 4.9, a PRAM[*] running in time T(n) with P(n) processors can be simulated by a bounded fan-in, MRAM-uniform circuit MC_n of depth O(T^2(n)) and size O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))).
ByTheorem3. 
The simulation of Theorem 4.10 is more efficient.
Division
In this section, we study the division instruction. We are interested in the division instruction for two reasons. First, division is a natural arithmetic operation. Second, Simon [23] processors, O(T(n)) time, and only addresses in 0, 1, ..., O(P(n)T(n)).
Simulation of PRAM[*, ÷] and PRAM[÷] by PRAM
We begin by describing the simulation of a PRAM[*, ÷] by a PRAM. The idea of the proof is that we modify the simulation of a PRAM[*] by a PRAM (Section 4.1). Because this simulation depends on the relationship between circuits and PRAMs, we are interested in the Boolean circuit complexity of division. Beame et al. [1] developed a circuit for dividing two n-bit numbers in depth O(log n). This circuit, however, is polynomial-time uniform, and we need the stronger condition of logspace-uniformity. Reif [17] devised a logspace-uniform, depth O(log n log log n) division circuit, and Shankar and Ramachandran [21] improved the size bound of this circuit.
Theorem 5.2. For all T(n) ≥ log n, PRAM[*, ÷]-TIME(T(n)) ⊆ PRAM-TIME(T^2(n)).
Proof. By the simulation above, Z simulates each step of MD' in time O(T(n)) with
We now present the simulation of a PRAM[÷] by a PRAM. Let D be a PRAM[÷] that uses T(n) time and P(n) processors. We construct a PRAM Z that simulates D in time O(T(n) log(n + T(n))). Z acts as a circuit to simulate the computation of D.
Simulation. We modify the simulation of a PRAM[*] by a PRAM from Section 4.1. In T(n) steps, a PRAM[*] can build integers of length n 2^{T(n)}, whereas a PRAM[÷] can build only integers of length O(n + T(n)). As a result, Z partitions the memory into blocks containing only O(n + T(n)) cells each. Z activates P(n) primary processors, each with O((1/δ^4)(n + T(n))^{1+δ}) secondary processors. The simulation proceeds along the same lines as in Section 4.1 except for division instructions. By Lemma 5.1, Z can perform a division in time O(log(n + T(n))).
Theorem 5.3. For all T(n) ≥ log n, PRAM[÷]-TIME(T(n)) ⊆ PRAM-TIME(T(n) log(n + T(n))).
Proof. By the simulation above, Z simulates D in time O(T(n) log(n + T(n))) with O((P(n)/δ^4)(n + T(n))^{1+δ}) processors. □
Simulation of PRAM[*, ÷] and PRAM[÷] by circuits and Turing machine
Next, we consider the simulation of a PRAM[*, ÷] by circuits and a Turing machine. We construct a TM M that simulates MD via MD' in T^2(n) log T(n) space by modifying the simulation of a PRAM[*] by a TM (Section 4.2). We need the following lemmas.
Lemma 5.4 (Shankar and Ramachandran [21]). A logspace-uniform, bounded fan-in circuit can compute the quotient of two x-bit operands in depth O(log x log log x) and size O((1/δ^4) x^{1+δ}), for any δ > 0.
Lemma 5.5. For each n, every language recognized by a PRAM[*, ÷] MD in time T(n) can be recognized by a logspace-uniform, bounded fan-in circuit DC_n of depth O(T^2(n) log T(n)).
Proof. Fix an input length n. Let BC_n be the logspace-uniform, bounded fan-in circuit described in Lemma 4.5 that simulates a PRAM[*]. Let DC_n be BC_n with additional circuit blocks for division. To handle division instructions with operands of length at most x = n 2^{T(n)}, we use the logspace-uniform, O(log x log log x) depth, bounded fan-in division circuit specified in Lemma 5.4. Circuit DC_n is at most a constant factor larger in size than BC_n. Since log x = O(T(n)) for T(n) ≥ log n, each division block has depth O(T(n) log T(n)). Hence, DC_n uses depth O(T(n) log T(n)) to simulate each step of MD. □
Lemma 5.6. A logspace-uniform, unbounded fan-in circuit can compute the quotient of two x-bit operands in depth O(log x) and size O((1/δ^4) x^{2+δ}), for any δ > 0.
Proof. Lemma 5.6 follows from Lemma 5.4 by the transformation due to Chandra et al. [4] from a bounded fan-in circuit to an unbounded fan-in circuit. The transformation preserves logspace-uniformity. □
Lemma 5.7. For each n, every language recognized by a PRAM[*, ÷] MD in time T(n) with P(n) processors can be recognized by a logspace-uniform, unbounded fan-in circuit UD_n of depth O(T^2(n)) and size O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))).
Proof. Fix an input size n. Let UD_n be the logspace-uniform circuit UC_n of Lemma 4.4 with additional circuit blocks for division, using the circuits described in Lemma 5.6. For operands of size at most x = n 2^{T(n)}, the depth of each division block becomes O(T(n)). Overall, the depth of UD_n is O(T^2(n)), and the size is the same as that of UC_n. □
Theorem 5.8. For all T(n) ≥ log n, PRAM[*, ÷]-TIME(T(n)) ⊆ DSPACE(T^2(n) log T(n)).
Proof in PRAM-TIME(T(n)) [9], we can obtain an O(T^2(n) log T(n)) time simulation of a PRAM[*, ÷] by a PRAM. The direct simulation of Theorem 5.2 is more efficient. Through Theorem 5.2 and the simulation of PRAM-TIME(T(n)) in DSPACE(T^2(n)) [9], we obtain an O(T^4(n)) space simulation of a PRAM[*, ÷] by a TM. The simulation of Theorem 5.8 is more efficient.
Let PC = {PC_1, PC_2, ...} be the family of bounded fan-in circuits that simulates the family C of unbounded fan-in circuits described in Lemma 4.3 (from [24]). For a fixed input size n, the depth of PC_n is O(T(n) log(P(n)T(n))) and the size is O(P(n)T(n)L(n)(L^2(n)+P(n)T(n))).
Theorem 5.9. For each n, every language recognized by a PRAM[÷] D in time T(n) with P(n) processors can be recognized by a logspace-uniform, bounded fan-in circuit DB_n of depth O(T(n) log P(n) + T(n) log(n + T(n)) log log(n + T(n))).
Proof. Fix an input length n. Let PC_n be the bounded fan-in circuit described above that simulates a PRAM. Let DB_n be PC_n with additional circuit blocks for division.
To handle division instructions with operands of length at most x = n + T(n), we use the logspace-uniform, O(log x log log x) depth, bounded fan-in division circuit specified in Lemma 5.4. Circuit DB_n is at most a constant factor larger in size than PC_n. Hence, DB_n uses depth O(log P(n) + log(n + T(n)) log log(n + T(n))) to simulate each step of D. □
Lemma 5.10. An off-line Turing machine can compute the quotient of two n-bit operands in O(log n log log n) space.
Proof. Borodin [2] proved that an off-line TM can simulate a logspace-uniform circuit with bounded fan-in and depth D(n) in space O(D(n)). Combined with the logspace-uniform, O(log n log log n) depth division circuit of Shankar and Ramachandran [21], we have the lemma. □
Theorem 5.11. For all T(n) ≥ log n, PRAM[÷]-TIME(T(n)) ⊆ DSPACE(T^2(n)).
Proof. Fortune and Wyllie [9] simulated each PRAM running in time T(n) by a TM running in space O(T^2(n)). They used recursive procedures of depth O(T(n)) using space O(T(n)) at each level of recursion. If we augment the simulated PRAM with division, then by Lemma 5.10, an additional O(log(n + T(n)) log log(n + T(n))) space is needed at each level, so O(T(n)) space at each level still suffices. Hence, with linear space compression, a TM with space T^2(n) can simulate a PRAM[÷] running in time T(n). □
Simulation of PRAM[*, ÷] by RAM[*, ÷]
In this section, we establish the MRAM-uniformity of a bounded fan-in circuit described by Shankar and Ramachandran [21] that performs division in depth O(log n log log n) and size O((1/δ^4) n^{1+δ}), for δ > 0. This is the major step leading to a simulation of a PRAM[*, ÷] by a RAM[*, ÷]. Given two n-bit inputs, u and v, the division problem is to compute their n-bit quotient, u/v. This reduces to the problem of efficiently computing the n-bit reciprocal of v. Shankar and Ramachandran first normalize v to a number in [1/2, 1), set x = 1 − v, and then compute 1/(1 − x) = 1 + x + x^2 + x^3 + ... + x^{n−1}. The key portion of their division algorithm, and the only portion for which we explicitly prove the circuit implementation to be MRAM-uniform, is an algorithm for computing the sth power of an r-bit number modulo 2^r + 1. We must show the following circuit components to be MRAM-uniform: discrete Fourier transform (DFT), DFT^{-1}, square root, x^s mod 2^r + 1 for restricted r and s, and the circuit component corresponding to the case r < s^2.
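The reduction of a reciprocal to a truncated power series can be checked numerically. The following Python sketch uses exact rational arithmetic; it illustrates only the series itself, not the circuit of Shankar and Ramachandran, and the function name `reciprocal_by_series` is an assumption for illustration. For v in [1/2, 1), x = 1 − v is at most 1/2, so the first n terms approximate 1/v with error x^n/(1 − x) ≤ 2^{1−n}.

```python
from fractions import Fraction

def reciprocal_by_series(v, n):
    """Approximate 1/v for v in [1/2, 1) by 1 + x + ... + x**(n-1), with x = 1 - v."""
    x = 1 - v
    total = Fraction(0)
    power = Fraction(1)      # holds x**t at the start of iteration t
    for _ in range(n):
        total += power
        power *= x
    return total

v = Fraction(3, 4)
approx = reciprocal_by_series(v, 16)
error = 1 / v - approx       # equals x**16 / (1 - x) exactly
```

The error term shows why n terms suffice for roughly n bits of the reciprocal after normalization.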
DFT uniformity
We establish here that a bounded fan-in circuit implementing the DFT is MRAM-uniform. Below we state a DFT algorithm from Cooley and Tukey [7] as described by Quinn [16]. To establish the MRAM-uniformity of this circuit, we must establish the uniformity of gate connections within each block. The uniformity of the connections between blocks is clear. We name gates such that the binary representation of the name can be partitioned into fields. The gate names are O(log k) = O(d) bits long. The [REVERSE] block corresponds to the following steps in the algorithm: block is clear because w is a power of two. In the above discussion, we have shown that a RAM[*] can compute the input to a single gate g within the necessary time bounds when #g is appropriately masked to reveal only certain fields. To establish MRAM-uniformity, however, the RAM[*] must take the input I comprising all pairs of integers O(log k) bits long, mask these integers, and compute the inputs for the gates corresponding to all exposed fields simultaneously.
The masking is similar to that done to establish the MRAM-uniformity of the circuit simulating a PRAM [ *] in Section 4.3. Observe that any of the above procedures can operate on a long integer, comprising a set of fields separated by zeros. Hence, we obtain that the DFT circuit implementing the algorithm described above is MRAM-uniform.
Inverse DFT
For the inverse DFT circuit, the above discussion establishes the uniformity of most of its sections. The following step of the inverse DFT algorithm is one significant exception: (REVERSE(i, d))/k. Note that k = O(log^{3/4} x) since r = log x and we are in the case r ≥ s^2. Hence, the inverses needed can be obtained by generating a table for all possible relevant values. Since k is very small, this table can be generated with a uniform, polynomial (in r) size circuit. Thus, the inverse DFT circuit is MRAM-uniform.
Square root, x_i 2^i mod 2^k + 1, and x_i 2^{-i} mod 2^k + 1 uniformity
As for the inverse DFT, the values needed to compute the square roots are very small and are obtained by generating a table for all relevant values. Thus, the square root circuit is MRAM-uniform.
Next, we want to establish the MRAM-uniformity of the portions of the circuit that compute x_i 2^i mod 2^k + 1 and x_i 2^{-i} mod 2^k + 1. The values of x_i are in the range 0, ..., 2^k − 1; the values of i are in the range 0, ..., k − 1. Since x_i 2^{-i} mod 2^k + 1 = x_i 2^{2k−i} mod 2^k + 1, we need only be concerned with computing x_i 2^i mod 2^k + 1. From the bounds on i and x_i, x_i 2^i is at most 2k bits long. Split x_i 2^i into two k-bit portions, denoting the lower-order portion by r_0 and the higher-order portion by r_1. Since 2^k ≡ −1 mod 2^k + 1, we have x_i 2^i = r_1 2^k + r_0 ≡ (r_0 − r_1) mod 2^k + 1, which can be computed with a subtraction, a comparison, and an addition. Thus, this portion of the circuit is clearly MRAM-uniform.
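The reduction modulo 2^k + 1 by one subtraction, one comparison, and one addition can be sketched in Python as follows. The function name `mul_pow2_mod` and the use of Python integers in place of circuit wires are assumptions for illustration.

```python
def mul_pow2_mod(x, i, k):
    """Return x * 2**i mod (2**k + 1), for 0 <= x < 2**k and 0 <= i < k."""
    product = x << i                  # at most 2k bits long
    r0 = product & ((1 << k) - 1)     # lower-order k bits
    r1 = product >> k                 # higher-order bits
    diff = r0 - r1                    # subtraction, using 2**k ≡ -1
    if diff < 0:                      # comparison
        diff += (1 << k) + 1          # addition of the modulus
    return diff

k = 5
value = mul_pow2_mod(19, 3, k)        # 19 * 8 mod 33
```

Because 2^{2k} ≡ 1 modulo 2^k + 1, multiplication by a negative power of two can be rewritten as multiplication by a nonnegative power, so the same three operations cover both cases.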
Uniformity of the case r < s^2
At this point, we consider the case r < s^2 of the modular power algorithm. Again
Let DC denote the family of bounded fan-in circuits described in Lemma 5.5 that simulates a PRAM[*, ÷] in depth O(T^2(n) log T(n)) and size O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))). We construct a family of bounded fan-in circuits DC' from DC.
Fix an input size n. The circuit DC'_n is exactly the same as the circuit DC_n except that DC'_n uses carry-save multiplication blocks for reasons of MRAM-uniformity (as in Lemma 4.9). DC'_n has depth O(T^2(n) log T(n)) and size O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))).
Lemma 5.13. For each n, every language recognized by a PRAM[*, ÷] MD in time T(n) with P(n) processors can be recognized by the bounded fan-in, MRAM-uniform circuit DC'_n of depth O(T^2(n) log T(n)) and size O(n T^2(n) 32^{T(n)} (n^2 + T^2(n))).
Proof. The proof of the PRAM[*, ÷] simulation is similar to that given for Lemma 4.5. By an argument similar to the proof of Lemma 4.7 using Lemma 5.12, DC' is MRAM-uniform. □
Theorem 5.14. For all
If a constant interval has length 1, then the entire interval is a single bit, which is an interesting bit. We call such an interesting bit a singleton. We mark interesting bits that are singletons. We define the MIB encoding as E(0) = 0s, E(01) = 1s,
The marks permit the simulator to efficiently determine from E(x) and E(y) whether x + 1 = y, using the procedure PLUS_ONE given below.
. E(a,)q,, r.
A root node is associated with the entire encoding E(d) and holds nothing. A nonroot node holds one of: 0s; 1s; r, the start bit; or q_j, the mark of the jth interesting bit. If a node holds 0s, 1s, or r, then it is a leaf. If a node holds q_j, then it is an internal node, and its children are the nodes of E(a_j). Figure 4 contains a sketch of the encoding tree of E(01100). For a node α corresponding to E(a_k)q_k, the value of the subtree rooted at α, val(α), is a_k, the value of E(a_k). Thus, val(α) is the position of an interesting bit.
We define level 0 of a tree as the root. We define level j of a tree as the set of all children of nodes in level j-1 of the tree.
A pointer into an encoding specifies a path starting at the root of the tree. For instance, the pointer 7.5.9 specifies a path x_0, x_1, x_2, x_3 in which x_0 is the root, x_1 is the 7th child (from the right) of x_0, x_2 is the 5th child of x_1, and x_3 is the 9th child of x_2. A pointer also specifies the subtree rooted at the last node of the path.
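Pointer resolution amounts to a walk down the tree, choosing at each level a child counted from the right. The Python sketch below shows one way to express it; the dictionary node layout (children stored left to right), the labels, and the function name `follow_pointer` are assumptions for illustration and are not the block layout used by the simulation.

```python
def follow_pointer(root, pointer):
    """Return the node reached from `root` by a dotted pointer such as '3.2'.

    Each component k selects the k-th child counted from the right.
    The empty pointer denotes the root itself.
    """
    steps = [int(s) for s in pointer.split(".")] if pointer else []
    node = root
    for k in steps:
        node = node["children"][-k]   # children are stored left to right
    return node

def leaf(label):
    return {"label": label, "children": []}

# A small hypothetical tree: the root has children A, B, C (left to right),
# and A has children A1, A2.
EXAMPLE = {
    "label": "root",
    "children": [
        {"label": "A", "children": [leaf("A1"), leaf("A2")]},
        leaf("B"),
        leaf("C"),
    ],
}

target = follow_pointer(EXAMPLE, "3.2")   # 3rd child from the right, then its 2nd
```

The node returned also identifies the subtree rooted at it, which matches the second reading of a pointer.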
For an integer d, suppose E(d)=(E(a,)q,,
. . ..E(aI)qI.r).
We define
We now state three lemmas, analogous to those of Simon [22], that bound the size of an encoding. Lemma 6.1 bounds the depth of an encoding and the number of interesting bits in a number generated by a PRAM[↑, ↓]. Let bool be a set of Boolean operations. The proof of Lemma 6.1 is straightforward.
(ii) If ○ is a Boolean operation, then depth(rcon_g(i)) ≤ max{depth(rcon_g(j)), depth(rcon_g(k))} and intbits(rcon_g(i)) ≤ intbits(rcon_g(j)) + intbits(rcon_g(k)).
(iii) If ○ is −, then depth(rcon_g(i)) ≤ 1 + max{depth(rcon_g(j)), depth(rcon_g(k))} and intbits(rcon_g(i)) ≤ intbits(rcon_g(j)) + intbits(rcon_g(k)).
and intbits(rcon_g(i)) ≤ 1 + intbits(rcon_g(j)).
Part (i) of Lemma 6.2 bounds the number of subtrees below first level nodes in an encoding; Part (ii) bounds the number of subtrees below fth level nodes in an encoding, f > 1. The proof of Lemma 6.2 follows from Lemma 6.1. Let E(β) be either E(d) or a subtree of E(d). By Lemmas 6.1 and 6.2, intbits(β) ≤ n 2^{T(n)}. The depth of the encoding is at most 2T(n), and every internal node has at most n 2^{T(n)} children. Therefore, the encoding may have up to O(4^{T^2(n)}) nodes, assuming T(n) ≥ log n. □
Lemma
Suppose a processor P, executes r(i)+r( j) 0 r(k), where 0~(+,~,1,--,bool},E(rcon,(i))=(E(u,),...,E(a,);wi), E(rcon,(j))=(Wb,), . . . . E(b,);w,), and E(rcon,(k))=(E(c,), . . . . E(c,); wk), where a,, b,, and c, denote the positions of the vtk interesting bits of rcon,(i), rcon,( j), and rcon,( k), respectively. (i) For E(u,) (that is, the vtk subtree at level 1 ofE(rcon,(i))), (a) if 0 is + , then intbits( a,) < 1 + max, { intbits( b,), intbits( c,)}, (b) if 0 is t or 1, then intbits( a,) 6 maxq { intbits( b,)} + intbits( rcon,( k))}, (c) if 0 is a Boolean operation, then intbits( a,) < maxq { intbits( b,), intbits( c,)}, and
(d) if 0 is -, then intbits(u,) < 1 + max, { intbits( b,), intbits( c,)}. (ii) For E(p) a subtree at level f > 1, (a) if 0 is +, -,
. , O(P(n) T(n))
Let q be a constant such that each processor in S' uses only q registers. By Lemma 6.3, for numbers generated by S (and therefore S'), the encoding may have up to O(4^{T^2(n)}) nodes. We construct a PRAM Z that simulates S via S' in O(T^2(n)) time, using O(P^2(n)T(n)4^{T^2(n)}) processors. For ease of description, we allow Z to have q + 7 separate shared memories, mem_0, ..., mem_{q+6}, which can be interleaved. This entails no loss of generality and only a constant-factor time loss.
Initialization.
Z partitions mem_0 into O(P(n)T(
Z activates O(P^2(n)T(n)) primary processors, one for each processor used by S'. In one of the shared memories, these processors construct an address table. The jth entry of this table is j·4^{T^2(n)}, the address of the first cell of the jth block in every memory. The maximum address is O(P^2(n)T(n)4^{T^2(n)}), so Z computes this address (and the entire table) in O(T^2(n)) time.
Each primary processor now deploys 4^{T^2(n)} secondary processors, one for each cell in a block, in O(T^2(n)) time. To implement a broadcast in constant time, each primary processor P_m uses c_{q+2}(m) as a communication cell. When the secondary processors are not otherwise occupied, they concurrently read this cell at each time step, waiting for a signal from the primary processor to indicate their next tasks. Consider a complete d-ary tree Λ with depth 2T(n). We number the nodes of Λ starting with the root as node 1, in the order of a right-to-left breadth-first traversal.
Node number j has children numbered dj -(d -2), . . . , dj, dj + 1; its parent is numbered
We view a block as a linear array storing Λ with d = 4^{T(n)}. Node numbers correspond to locations in the array. Let node(j) denote the node whose number is j.
Let num(α) denote the node number of node α. For each primary processor, the jth secondary processor, 1 ≤ j ≤ 4^{T^2(n)}, handles node(j). Let proc(α) denote the secondary processor assigned to node α.
Each encoding is a subtree of Λ because all encoding nodes have fewer than 4^{T(n)} children. Let p(α) denote the parent of node α; let rc(α) denote the rightmost child of node α; let lc(α) denote the leftmost nonempty child of node α. When a primary processor and its secondary processors update E(con(i)) or E(rcon_g(i)), they also update num(lc(α)) for every node α. Let right(α) denote num(α) − num(rc(p(α))). That is, right(α) denotes which child α is of p(α), counting from the right. Similarly, let left(α) denote num(α) − num(lc(p(α))). That is, left(α) denotes which child α is of p(α), counting from the left. Using mem_{q+2} for communication with primary processor P_m, corresponding to processor P_g of S', proc(node(j)), 1 ≤ j ≤ 4^{T^2(n)}, writes num(rc(node(j))), num(p(node(j))), and num(rc(p(node(j)))) in cells with index j of separate memories in O(T(n)) time. Then the processor for each node j can compute right(j).
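The array layout of a complete d-ary tree can be sketched in Python. The child range follows the numbering stated above (node j has children dj − (d − 2), ..., dj + 1). The parent formula in the sketch is derived here from that child range and is an assumption about the expression intended in the text; the function names are likewise hypothetical.

```python
def children(j, d):
    """Node numbers of the d children of node j (root is node 1)."""
    return list(range(d * j - (d - 2), d * j + 2))

def parent(j, d):
    """Parent of node j > 1, derived by inverting the child numbering."""
    return (j + d - 2) // d

# With d = 3: node 1 has children 2, 3, 4 and node 2 has children 5, 6, 7.
first_level = children(1, 3)
second_level = children(2, 3)
```

Because node numbers double as array indices, a processor assigned to cell j can locate its parent and children with a constant number of arithmetic operations once d is known.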
All the addresses of cells accessed by S' can be constructed using only addition and subtraction.
In order to quickly perform indirect addressing, Z generates all cell and register contents in standard two's complement representation, except for results of shifts. The two's complement representation of local register r_g(i) of S', if rcon_g(i) is constructed without shifts, is stored in c_{q+6}(g(q+1)+i). The two's complement representation of shared memory cell c(j) of S', if con(j) is constructed without shifts, is stored in c_{q+6}((j+1)(q+1)). If the value v in a register or a shared memory cell is the result of a shift, then S' does not use v as an address, and S' uses no other value computed from v as an address.
As the final initialization step, Z converts the input to the MIB encoding, writing the encoding into B_0(0). Z writes the input integer in c_{q+6}(q+1).
Simulation. In a general step of processor P_g of S', P_g executes instruction instr. Assume for now that instr has the form r(i) ← r(j) ○ r(k). To simulate this step, the corresponding primary processor P_m of Z and its secondary processors perform four tasks:
Task 1. If ○ is not a shift, then perform ○ on con_{q+6}(g(q+1)+j) and con_{q+6}(g(q+1)+k), writing the result in con_{q+6}(g(q+1)+i).
Task 2. Merge the first level of the encodings E(rcon_g(j)) and E(rcon_g(k)).
Task 3. Determine where the interesting bits of E(rcon_g(i)) occur in the merged encodings and compute their marks.
Task 4. Compress these marked interesting bits into the proper structure.
Z uses procedures MERGE in task 2 and COMPRESS in task 4. Depending on the operation ○, Z may also use procedures BOOL and ADD in task 3. These procedures are described below.
Procedures MERGE, COMPRESS, BOOL, and ADD call procedure COMPARE, which we now specify. Let j and k be nonnegative integers, and let ψ_1 and ψ_2 be encoding pointers. If m = λ, the empty string, then COMPARE(j, ψ_1, k, ψ_2, m) compares the value of subtree E(con(j)).ψ_1 with the value of subtree E(con(k)).ψ_2. Similarly, if m ≠ λ, then COMPARE compares the value of subtree E(rcon_m(j)).ψ_1 with the value of subtree E(rcon_m(k)).ψ_2. Suppose m = λ; the case m ≠ λ is similar. For each node α in the first level of E(con(j)).ψ_1 simultaneously, proc(α) determines left(α). Then proc(α) computes num(β) such that node β is in the first level of E(con(k)).ψ_2 and left(β) = left(α) by reading num(lc(E(con(k)).ψ_2)). Next, proc(α) recursively compares the values of the subtrees rooted at α and β. COMPARE is recursive in the depth of the encoding, taking constant time at each level. Consequently, COMPARE(j, ψ_1, k, ψ_2, m) takes O(T(n)) time.
In task 2, Z merges the first level of the encodings E(rcon_g(j)) and E(rcon_g(k)). Z does this to compare the positions of interesting bits in rcon_g(j) and rcon_g(k). This comparison is necessary to determine the positions of the interesting bits in rcon_g(i).
The subtrees rooted at the first level of E(d) form a list sorted in increasing order by their values. MERGE(j, k, i) returns, in B_i(g), the list of up to O(2^{T(n)}) subtrees resulting from merging the first levels of E(rcon_g(j)) and E(rcon_g(k)). Each subtree in the merged list retains indications of whether it is from j or k, whether it is the end of a constant interval of 0's or 1's, and its (singleton) mark. By comparing each subtree of the first level of E(rcon_g(j)) with each subtree of the first level of E(rcon_g(k)) in O(T(n)) time, Z can perform a MERGE in O(T(n)) time. (Note: Each subtree in the merged list also indicates whether its value is equal to that of the next subtree in the list.) We introduce one more procedure before describing the computation of the interesting bits of rcon_g(i). Let I(d) denote the MIB encoding of d without the marks.
PLUS_ONE(k, ψ_1, i, ψ_2) writes I(val(E(rcon_g(k)).ψ_1) + 1) in the location set aside for subtree ψ_2 in B_i(g). That is, given E(d), for d an integer, PLUS_ONE writes I(d + 1). PLUS_ONE does not write singleton marks. Z uses PLUS_ONE to generate I(d + 1) to test for equality with E(x), x an integer. The processors ignore marks to interpret E(x) as I(x). At most, the two rightmost interesting bits of d + 1 are different from those of d. Encoding I(d + 1) is easily generated by adding or deleting interesting bits and possibly recursively adding 1 by observing whether d starts with a 0 or a 1 and whether the first 0 is a singleton. PLUS_ONE is recursive with depth T(n), the depth of the encodings. PLUS_ONE uses constant time at each level, so O(T(n)) time overall.
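The claim that incrementing changes only a few interval boundaries near the low-order end can be explored with a simplified model. In the Python sketch below, a "boundary" is a bit position whose bit differs from the next higher bit; this is an assumed stand-in for the interesting bits of the MIB encoding, whose exact definition is not reproduced here, and the function name `boundaries` is hypothetical.

```python
def boundaries(d, width):
    """Positions p (0 = least significant) where bit p of d differs from bit p + 1.

    `width` must exceed the bit length of d so that leading zeros are included.
    """
    return {p for p in range(width)
            if ((d >> p) & 1) != ((d >> (p + 1)) & 1)}

def changed_boundaries(d, width):
    """Boundary positions that differ between d and d + 1."""
    return boundaries(d, width) ^ boundaries(d + 1, width)

# 0b0110111 + 1 = 0b0111000: only the boundary just above the trailing 1s moves.
example = changed_boundaries(0b0110111, 12)
```

In this simplified model the set of boundaries changes in only a bounded number of positions per increment, which is what allows a recursive procedure to spend constant time per level of the encoding.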
We now are ready to describe how Z accomplishes task 3. Assume without loss of generality that i, j, and k are different. Z's actions in task 3 depend on the operation ○ in instr. Define an interval-pair to be the intersection of a constant interval in rcon_g(j) and a constant interval in rcon_g(k). For example, three interval-pairs, denoted by a, b, and c, are shown below:
Proof. Fix an input length n. We construct UC_n from C_n of Lemma 6.5. We reduce the fan-in in the portions of the circuit that simulate updates in the shared memory of Z. The circuit described in Lemma 4.3 allows all processors to attempt to simultaneously write the same cell. This does not occur in Z. During the execution of each procedure of Z over T(n) time steps, either 4^{T^2(n)} secondary processors concurrently write the same cell once or 4^{T(n)} secondary processors concurrently write the same cell at each of O(T(n)) levels of recursion. For the cases in which 4^{T^2(n)} secondary processors concurrently write the same cell in one time step, let these processors fan in their results by writing in groups of 4^{T(n)} processors over T(n) time steps. Thus, we can modify Z such that at most 4^{T(n)} processors attempt to write the same cell at each time step, keeping the time for each procedure at O(T(n)). By the construction given in Lemma 4.3, this leads to a maximum fan-in for any gate in UC_n of O(4^{T(n)} T^2(n)) if T(n) > n or O(4^{T(n)} T(n) n) if T(n) < n. The circuit remains uniform after modifications to Z because the processors concurrently writing are all secondary processors belonging to the same primary processor. UC_n has depth O(T^2(n)) and size O(P^4(n)T^6(n)16^{T^2(n)}(n + T^2(n))). □
O(T(n) log(P(n)T(n))), and the size is O(P(n)T(n)L(n)(L^2(n)+P(n)T(n))), the same size as C_n.
Lemma 6.9. BC' is VM-uniform.
Proof. By a proof similar to that of Lemma 4.7, C_n and BC'_n are VM-uniform. □
Theorem 6.10. For all T(n) ≥ log n and P(n) ≤ 2^{T(n)}, every language recognized with P(n) processors in time T(n) by a PRAM can be recognized by a RAM[↑, ↓] in time O(T(n) log(P(n)T(n))).
Proof. By Lemma 6.9, BC', the family of bounded fan-in circuits that simulates a PRAM, is VM-uniform.
By
Proof. Let UC = {UC_1, UC_2, ...} be the family of unbounded fan-in circuits described in Lemma 6.6 that simulates a uniform PRAM[↑, ↓]. UC has the same form as C, except in the blocks labeled [Update-Common], handling updates to common memory. We reduce the inputs to the gates in this block because of restrictions on the processors that may simultaneously write a cell. As noted in the proof of Lemma 6.9, family C of unbounded fan-in circuits is VM-uniform. It is easy to compute the processors that may simultaneously write a cell, so UC is also VM-uniform.
Since UC is VM-uniform, by an argument similar to the proof of Lemma 4.7, BC is VM-uniform. □
↓]-TIME(T(n)) ⊆ RAM[↑, ↓]-TIME(T^4(n)). The simulation of Theorem 6.12 is more efficient.
Summary and open problems
Summary
In this paper, we compared the computational power of time-bounded parallel random-access machines (PRAMs) with different instruction sets. We proved that polynomial time on a PRAM[*] or on a PRAM[*, ÷] or on a PRAM[↑, ↓] is equivalent to polynomial space on a Turing machine (PSPACE). In particular, we showed the following bounds. Let each simulated machine run for T(n) steps on inputs of length n; let T denote T(n) in the table below. The simulating machines are basic PRAM, Turing machine, uniform family of bounded fan-in circuits, and RAM augmented with the same set of instructions.
The bounds for the simulating machine are expressed in time, space, or depth, as shown in parentheses by the machine type.
be reduced to a polynomial in P(n)T(n)?
(2) Can a logspace-uniform, fan-in 2, O(log n)-depth circuit perform division? Beame et al. [1] developed a polytime-uniform division circuit. We could improve Theorems 5.2, 5.3, and 5.8 with a logspace-uniform, O(log n)-depth division circuit.
(3) What are the corresponding lower bounds on any of these simulations? Are any of the bounds optimal?
(4) As one of the first results of computational complexity theory, the linear speed-up theorem for Turing machines [13] states that for every multitape Turing machine of time complexity T(n) ≫ n and every constant c > 0, there is a multitape Turing machine that accepts the same language in time cT(n). The linear speed-up property of Turing machines justifies the widespread use of order-of-magnitude analyses of algorithms. Do PRAMs also enjoy the linear speed-up property?
