Abstract. It is shown that every unit-cost random-access machine (RAM) that runs in time T can be simulated by a concurrent-read exclusive-write parallel random-access machine (CREW PRAM) in time O(T^{1/2} log T). The proof is constructive; thus it gives a mechanical way to translate any sequential algorithm designed to run on a unit-cost RAM into a parallel algorithm that runs on a CREW PRAM and obtain a nearly quadratic speedup. One implication is that there does not exist any recursive function that is "inherently not parallelizable."
1. Introduction.
1.1. Motivation. For some problems, the direct parallelization of a sequential algorithm gives a faster parallel algorithm. An example is matrix multiplication. The brute-force sequential algorithm for matrix multiplication runs in O(n^3) time for n × n matrices. It is straightforward to parallelize this sequential algorithm to get an O(log n)-time parallel algorithm using O(n^3 / log n) processors. On the other hand, some problems are very difficult to parallelize. For example, depth-first search does not seem to lend itself to parallelization [7]. In this paper, we address the following question: Are all sequential algorithms parallelizable?
Cook and Reckhow [3] defined the unit-cost random-access machine (RAM). Fortune and Wyllie [6] introduced the parallel random-access machine (PRAM). These two models are, respectively, the most commonly used machine models for analyzing sequential and parallel algorithms. Thus the above question can be rephrased as follows: Given any unit-cost RAM R that runs in time T, is it always possible to construct a PRAM that simulates R in time T′ = o(T)? We answer this question affirmatively by exhibiting such a construction with T′ = O(T^{1/2} log T). Several variants of the PRAM have appeared in the literature since it was first introduced. The original model of Fortune and Wyllie has become known as the concurrent-read exclusive-write (CREW) PRAM, which is the model we use in our construction.
Parberry and Schnitger [15] considered the WRAM, a powerful variant of the PRAM. The WRAM differs from the CREW PRAM in three respects:
1. The WRAM is a concurrent-read concurrent-write (CRCW) priority PRAM [5] .
2. The WRAM has a richer instruction set for arithmetic operations. The CREW PRAM supports only addition and subtraction, whereas the WRAM also allows unit-time unrestricted right shifts and modulus operations.
3. The WRAM and the CREW PRAM differ in the manner in which the processors are activated. In the WRAM, an arbitrary number of processors are self-activated at the beginning of the computation. In the CREW PRAM, only one processor is active initially. An active processor activates an idle processor explicitly by executing a Fork instruction. Consequently, in t steps, a PRAM can activate at most 2^t processors.
Parberry and Schnitger showed that every Turing machine that runs in time T can be simulated in constant time by a WRAM with 2^{O(T)} processors. The best-known simulation of unit-cost RAMs by Turing machines incurs a cubic overhead in the running time [3]. It follows that every unit-cost RAM with time complexity T can be simulated in constant time by a WRAM with 2^{O(T^3)} processors. It is desirable to reduce this huge number of processors used in the simulation for two reasons. The first reason is, obviously, to reduce the hardware requirement.
The second and more important reason is that the ability of the WRAM to use an arbitrary number of processors renders this model unreasonably powerful. The above result of Parberry and Schnitger essentially says that every decidable problem can be decided in constant time by a WRAM. This anomaly arises mainly from allowing self-activated processors. The parallel-computation thesis [8, 14] asserts that the class of languages accepted by any reasonable parallel-machine model in polynomial time is equivalent to PSPACE, where PSPACE, as usual, denotes the class of languages that can be accepted by deterministic Turing machines in polynomial space. The WRAM violates the parallel-computation thesis and is considered unreasonably powerful [13]. In contrast, the PRAM is considered reasonable because it obeys the parallel-computation thesis [6]. So the challenge is to speed up a unit-cost RAM by a PRAM with a reasonable number of processors; the number of processors should be small enough so that all processors can be activated explicitly within the simulation time.
1.2. Comparison with previous results. Dymond and Tompa [4] showed that every deterministic Turing machine running in time T can be simulated by a CREW PRAM in time O(T^{1/2}). However, the random-access memory of the PRAM is much more flexible than the linear tapes of the Turing machine, which forbid random access into individual tape cells. It was unclear whether it is the parallelism, the more flexible storage structure, or the combination of both that realizes such a quadratic speedup. Our result demonstrates that parallelism alone suffices to achieve an almost quadratic speedup.
To the best of our knowledge, in all previous speedup results [4, 10, 13, 15, 16, 18, 20], the machine being simulated is limited to the Turing machine. All these results depend on the fact that the changes in the configuration of a Turing machine in t steps are localized to the 2t − 1 cells around each tape head. In contrast, the random-access memory of a unit-cost RAM allows the RAM to change the contents of registers with widely different addresses in consecutive steps. The versatility of the random-access memory of a unit-cost RAM has defied all prior attempts to speed up a unit-cost RAM by a PRAM. This paper presents the first speedup theorem of unit-cost RAMs by PRAMs.
Reif [17] demonstrated that every probabilistic unit-cost RAM that runs in time T can be simulated by a probabilistic CREW PRAM in time t(T, L) = O((T log T log(LT))^{1/2}), where L is the largest integer manipulated by the probabilistic RAM during its computation. It is straightforward to modify Reif's proof to show that every unit-cost RAM running in time T can be simulated by a CREW PRAM in time t(T, L). With unit-time addition, however, a RAM can generate integers as large as 2^{O(T)} in time T. Reif's result does not guarantee a speedup since t(T, L) = O(T (log T)^{1/2}) when L = 2^{O(T)}. Our result gives a definite speedup of unit-cost RAMs by PRAMs, regardless of the value of L. It is routine to generalize our proof to establish a speedup theorem of probabilistic unit-cost RAMs by probabilistic CREW PRAMs. This paper subsumes the above result of Reif. Thus all algorithms (deterministic and probabilistic) are parallelizable.
In summary, all previous simulation results suffer from one or more of the following drawbacks:
1. No definite speedup is guaranteed (Reif [17]).
2. The machine being sped up is limited to the Turing machine [4, 10, 13, 15, 16, 18, 20].
3. The speedup result fails to isolate the effect of parallelism; that is, apart from the parallelism, the simulator enjoys some additional advantage over the machine being simulated, for example, a more flexible storage structure (Dymond and Tompa [4]).
4. The simulator is too strong to be called reasonable because it violates the parallel-computation thesis (Parberry and Schnitger [15]).
Our result does not suffer from any of the above drawbacks.
The rest of this paper is organized as follows. Section 2 defines the RAM and the PRAM models precisely. In section 3, we build up a repertoire of techniques for programming a PRAM efficiently. We use these techniques in section 4 to establish our main result: for every unit-cost RAM R with time complexity T, we construct a CREW PRAM that simulates R in time O(T^{1/2} log T). We conclude with a few comments in section 5. All logarithms are taken to base 2.
2. Definitions.
2.1. The unit-cost RAM. A RAM R consists of a memory and a program. The memory is an infinite sequence of registers (r(i)), i = 0, 1, .... The address of r(i) is the integer i. Each register can hold an integer. We also write r(i) for the content of register r(i), and |r(i)| for the absolute value of that content. The program consists of a finite number of statements, numbered 1, 2, ..., Q. Each statement contains one instruction. The allowed instructions are shown in Table 1. The input of R is a binary number α = α_0 α_1 ... α_{n−1}, where each α_i ∈ {0, 1}. Initially, r(0), r(1), ..., r(K − 1) hold some constant values required in the computation of R, where K is a constant that depends on R; r(K + i) holds α_i for 0 ≤ i < n, and r(K + n) holds −1 to mark the end of the input. All other registers contain 0. A unit-cost RAM executes each instruction in one step. Each step takes unit time. Thus step t takes a unit-cost RAM from time t − 1 to time t. The running time of a unit-cost RAM is the number of steps performed.
Table 1. Instructions of a RAM.
Instruction    Meaning
Jump q         If r(0) ≤ r(1), then jump to statement q.
Accept         Accept and halt.
Reject         Reject and halt.
2.2. The PRAM.
A PRAM P comprises a collection of processors P(0), P(1), ..., which communicate via a global memory (g(i)). The initial contents of the global memory are as follows: the first K′ global registers hold some constants, where K′ is another constant that depends on P; the next n + 1 global registers hold the n input bits, followed by the end-of-input marker, and all other global registers contain 0. Every processor is a unit-cost RAM. Each P(p) has its own local memory (r_p(i)) and can use every global memory register in the same manner as it uses a local memory register. In addition, each processor has an extra Fork q instruction for processor activation. Initially, only P(0) is active. Whenever a processor executes a Fork q instruction, a new processor is activated and starts running at statement q. When P(p) executes the Fork instruction the tth time, processor P(2^{t−1}(2p + 1)) is activated. The processor id (PID) of P(p) is the integer p. When P(p) is activated, its local register r_p(0) is initialized with its PID p, and all other local registers of P(p) contain 0. The PRAM P accepts if and only if P(0) executes an Accept instruction.
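The activation rule can be checked on small cases. The following Python sketch is an illustration only: the function names are hypothetical, and the schedule in which every active processor forks once per round is an assumption made to show the doubling behavior. It computes the PID activated by the tth Fork of P(p), inverts that mapping, and lists the processors active after a few rounds.

```python
def forked_pid(p, t):
    """PID of the processor activated by the t-th Fork of P(p), for t >= 1."""
    return 2 ** (t - 1) * (2 * p + 1)


def parent_of(pid):
    """Invert forked_pid: return (p, t) with forked_pid(p, t) == pid, for pid >= 1."""
    t = 1
    while pid % 2 == 0:      # strip the factor 2^(t-1)
        pid //= 2
        t += 1
    return (pid - 1) // 2, t  # the remaining odd part is 2p + 1


def active_after(rounds):
    """PIDs active after `rounds` rounds, assuming every active processor
    executes exactly one Fork per round (an assumed schedule)."""
    forks_done = {0: 0}      # pid -> number of Fork instructions executed so far
    for _ in range(rounds):
        for p, t in list(forks_done.items()):
            forks_done[p] = t + 1
            forks_done[forked_pid(p, t + 1)] = 0
    return set(forks_done)
```

Because every positive integer factors uniquely as 2^{t−1}(2p + 1), no two Fork instructions activate the same processor, and under the assumed schedule the active set doubles in each round.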
In a PRAM, several processors may attempt to access the same memory cell at the same time. A PRAM may allow concurrent-read and concurrent-write (CRCW) operations, concurrent-read and exclusive-write (CREW) operations, or exclusive-read and exclusive-write (EREW) operations [2, 22, 25]. In a CRCW PRAM, some mechanism is necessary to resolve the simultaneous write conflicts [2, 8, 21]. Fich et al. [5] studied the relationships between CRCW PRAMs with different conflict-resolution mechanisms.
In what follows, we restrict our attention to CREW PRAMs. Unless otherwise stated, our results also hold for CRCW PRAMs.
3. Techniques for programming PRAMs. In this section, we present several techniques for programming the PRAM. First, we show how to perform the following operations quickly on a PRAM: logical AND, summation, and multiplication of "small" integers. Second, we describe a fast implementation of multidimensional memory on a PRAM. Third, we explain how every processor can extract useful information from its PID efficiently.
3.1. Logical AND, summation, and multiple memories. It is convenient to interpret integers as logical values. We interpret a nonzero integer as true and 0 as false.
Lemma 3.1 (folklore). Suppose in a PRAM P, the global memory registers g(1), g(2), ..., g(n) store n integers k_1, k_2, ..., k_n. Then P can find the sum and the logical AND of these n integers in O(log n) time.
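One standard way to obtain the O(log n) bound of Lemma 3.1 is a balanced-tree reduction. The Python sketch below is a sequential simulation under that assumption, not the proof itself; the names pram_reduce and logical_and are hypothetical. It counts the number of parallel rounds, where all combinations within a round are independent and could be carried out by different processors in a constant number of PRAM steps.

```python
from operator import add


def pram_reduce(values, combine):
    """Simulate a balanced-tree reduction; return (result, parallel_rounds).

    In each round the conceptual processor i combines cells 2i and 2i+1,
    so the combinations within a round do not depend on one another.
    Assumes at least one value.
    """
    cells = list(values)
    rounds = 0
    while len(cells) > 1:
        nxt = []
        for i in range(0, len(cells) - 1, 2):
            nxt.append(combine(cells[i], cells[i + 1]))
        if len(cells) % 2 == 1:
            nxt.append(cells[-1])        # odd cell carried over unchanged
        cells = nxt
        rounds += 1
    return cells[0], rounds


def logical_and(x, y):
    """Nonzero integers are true and 0 is false, as in section 3.1."""
    return 1 if (x != 0 and y != 0) else 0
```

For example, pram_reduce([3, 1, 4, 1, 5, 9, 2, 6], add) needs three rounds for eight inputs, since the number of live cells halves in every round.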
By interleaving memory registers, Cook and Reckhow [3] demonstrated that a unit-cost RAM with a single memory can simulate a unit-cost RAM with multiple memories with merely a constant factor overhead in the running time. By applying the same technique to the PRAM, it is easy to prove the following lemma.
Lemma 3.2 (folklore). Let γ > 1. Every PRAM with time complexity T and γ global memories (g_1(i)), (g_2(i)), ..., (g_γ(i)) can be simulated in time O(T) by a PRAM with one global memory.
3.2. Multiplication of small integers. Trahan et al. [24] studied PRAMs with unit-time multiplication. By the following lemma, we may assume that ordinary PRAMs can perform unit-time multiplication of "small" integers.
Lemma 3.3. Let P be a PRAM that (i) runs in time T and (ii) can perform unit-time multiplication on T -bit integers. Then P can be simulated by an ordinary PRAM in time O(T ).
Proof. [...] Table, and the Least-Significant-Bit Table are initialized, then for 0 ≤ i < 2^{T+1}, each P(i) computes the square of its PID using the paper-pencil multiplication method (repeated shift and add) and stores the result in sq(i). This takes O(T) time. Hence all four tables can be precomputed in O(T) time.
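The way the table of squares is consulted is not visible in the text above. A common technique, assumed here and not confirmed by the source, is the identity xy = ((x + y)^2 − x^2 − y^2)/2, which reduces a multiplication of two T-bit integers to three lookups, two subtractions, and one halving. The table range 0 ≤ i < 2^{T+1} is consistent with this, since x + y < 2^{T+1}. A minimal Python sketch with a deliberately tiny, hypothetical T:

```python
T = 4                                    # hypothetical bit length, kept tiny

SIZE = 2 ** (T + 1)                      # the table covers 0 <= i < 2^(T+1)


def shift_and_add_square(i):
    """Square i by the paper-and-pencil method: repeated shift and add."""
    result, addend, multiplier = 0, i, i
    while multiplier > 0:
        if multiplier % 2 == 1:          # least significant bit is set
            result += addend
        addend += addend                 # shift the addend left by one bit
        multiplier //= 2                 # shift the multiplier right by one bit
    return result


# sq[i] plays the role of the global register sq(i) written by P(i).
sq = [shift_and_add_square(i) for i in range(SIZE)]


def multiply(x, y):
    """Multiply two T-bit integers with lookups, subtraction, and one halving.

    Assumption: the identity (x+y)^2 - x^2 - y^2 = 2xy is how sq is used.
    """
    assert 0 <= x < 2 ** T and 0 <= y < 2 ** T
    twice_xy = sq[x + y] - sq[x] - sq[y]
    return twice_xy // 2                 # a one-bit right shift
```

The loop in shift_and_add_square runs once per bit of i, which matches the O(T) precomputation time stated in the proof.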
We have assumed that P′ knows the value of T a priori. This assumption can be removed easily; P′ just tries successive powers of two as an estimate of T. This modification does not increase the asymptotic running time of P′.
3.3. Multidimensional memory. Robson [19] showed that ordinary RAMs can simulate multidimensional RAMs with only a constant-factor overhead in the running time. However, the proof of Robson cannot be adapted directly to prove the analogous result for PRAMs. Briefly, the reason is as follows. To simulate a RAM R with two-dimensional memory (r(i, j)) by an ordinary RAM R′ with memory (r′(i)), Robson devised a mapping from the r(i, j)'s to the r′(i)'s. This mapping depends on the sequence of r(i, j)'s accessed during the computation of R, and R′ constructs this mapping incrementally as it simulates R step by step. Consider applying the same idea to simulate a PRAM P with two-dimensional global memory (g(i, j)) by an ordinary PRAM P′ with global memory (g′(i)). If we simulate each processor of P by a corresponding processor of P′ as in the proof of Robson, then different processors of P may access the g(i, j)'s in different ways, and hence different processors of P′ may have different mappings. Thus some processor of P′ may think that the value of g(0, 0) is stored in g′(0), whereas another processor of P′ thinks that the same value is stored in g′(1). Obviously, such a simulation of P by P′ does not work. Nevertheless, the analogous result for PRAMs does hold, as shown by the next lemma.
Lemma 3.4. Every d-dimensional PRAM P running in time T can be simulated by an ordinary PRAM P′ in time O(T).
Proof. P′ uses processor P′(i) to simulate the corresponding processor P(i) of P. Every P′(i) simulates P(i) step by step. It suffices to explain how to emulate d-dimensional memories by one-dimensional memories. We demonstrate how P′(i) emulates an access of P(i) to the d-dimensional global memory of P by an access to the one-dimensional global memory of P′. P′(i) uses its one-dimensional local memory to emulate the d-dimensional local memory of P(i) in a similar fashion.
P has global memory (g(i_1, i_2, ..., i_d)); P′ has global memory (g′(i)). In time T, P(i) can produce integers no longer than CT bits for some constant C. Define b = 2^{CT} and η = i_1 + i_2 b + ... + i_d b^{d−1}; P′(i) emulates an access to g(i_1, i_2, ..., i_d) by an access to g′(η). Since every i_k is less than 2^{CT}, η is at most dCT bits long. By Lemma 3.3, we may assume that P′(i) can perform multiplication on dCT-bit integers in O(1) time. Thus P′(i) can compute η in O(1) time. Again, we have presumed that the value of T is available. This assumption can be removed in the same way as in the proof of Lemma 3.3.
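The address translation in this proof can be illustrated with a short Python sketch. It assumes the radix-b encoding η = i_1 + i_2 b + ... + i_d b^{d−1} with b = 2^{CT}, which is one natural reading of the proof; the function names and the small parameter values in the usage are hypothetical.

```python
def flatten_address(indices, C, T):
    """Map (i_1, ..., i_d) to eta = i_1 + i_2*b + ... + i_d*b^(d-1), b = 2^(C*T)."""
    b = 2 ** (C * T)
    eta = 0
    for i_k in reversed(indices):        # Horner's rule: d - 1 multiplications by b
        assert 0 <= i_k < b, "each coordinate must fit in C*T bits"
        eta = eta * b + i_k
    return eta


def unflatten_address(eta, d, C, T):
    """Recover (i_1, ..., i_d) from eta; shows that the mapping is injective."""
    b = 2 ** (C * T)
    out = []
    for _ in range(d):
        out.append(eta % b)
        eta //= b
    return tuple(out)
```

Because the mapping is a fixed function of the coordinates, every processor computes the same η for the same cell, which avoids the inconsistency described for the incremental mapping of Robson. Since d is a constant, the d − 1 multiplications take O(1) time once Lemma 3.3 is available.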
Lemma 3.4 shows that without loss of generality, we may assume that CREW PRAMs have multidimensional memories. Apparently, some authors have used this fact without proof [4, 17] .
3.4. Extracting information from the PID.
The advantage of a PRAM over a RAM is that in a PRAM, many processors can work together in parallel. Clearly, this advantage is defeated if all processors just do the same thing on the same data, in which case one processor is as good as many. To take advantage of the parallelism, therefore, different processors have to operate differently. This is easily achieved by exploiting the distinctness of the PIDs; each processor consults its PID to determine its operation. For our later purpose, we require each processor to be able to look at successive single bits and successive O(log T ) bits of its PID in order to determine its operation. Next, we demonstrate that every PRAM can be modified to fulfill this requirement.
Let P be a PRAM with time complexity T. In time T, P can activate at most 2^T processors. The PID of every processor is at most T bits long. We modify P as follows.
1. P activates all 2^T processors before any actual computation.
2. P starts its computation by initializing in O(1) time a Least-Significant-Bit Table, a Right-Shift Table, and a Left-Shift Table, all of size 2^T, as described in the proof of Lemma 3.3. Using the first two tables, each processor can extract successive single bits of its PID, spending O(1) time per bit.
3. P implements two additional tables with global memories (lsb′(i)) and (rs′(i)). For 0 ≤ i < 2^T, lsb′(i) holds the least significant ⌊log T⌋ bits of i, and rs′(i) holds ⌊i / 2^{⌊log T⌋}⌋, i.e., i right-shifted ⌊log T⌋ bits. These two tables can be precomputed in O(log T) time as follows. We presume the availability of the three tables mentioned in modification 2. For 0 ≤ i < 2^T, processor P(i) does the following:
(i) Right shift its PID ⌊log T⌋ times and store the result in rs′(i).
(ii) Left shift rs′(i) ⌊log T⌋ times, subtract the result from its PID, and store the difference in lsb′(i).
Then each processor can extract successive ⌊log T⌋ bits of its PID by table lookup, spending O(1) time per ⌊log T⌋ bits. We have assumed that P knows a priori the values of T and ⌊log T⌋. The knowledge of T is justifiable, as argued in the proof of Lemma 3.3, and ⌊log T⌋ is simply one less than the number of bits in the binary representation of T.
These modifications increase the running time of P by at most a constant factor.
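Modification 3 can be sketched in Python as follows. The lists rs and lsb stand for the global memories (rs′(i)) and (lsb′(i)); the value of T and the helper name chunks_of_pid are hypothetical, and Python's shift operators stand in for the repeated single-bit table lookups described in steps (i) and (ii).

```python
T = 10                                   # hypothetical time bound
W = T.bit_length() - 1                   # floor(log2 T)

# Step (i): rs[i] is i right-shifted W bits.
rs = [i >> W for i in range(2 ** T)]
# Step (ii): shift rs[i] back left W bits and subtract it from i.
lsb = [i - (rs[i] << W) for i in range(2 ** T)]


def chunks_of_pid(pid, count):
    """Return `count` successive W-bit chunks of pid, least significant first."""
    out = []
    for _ in range(count):
        out.append(lsb[pid])             # one lookup yields the next W bits
        pid = rs[pid]                    # one lookup discards those W bits
    return out
```

Each chunk costs two table lookups, which is the O(1) time per ⌊log T⌋ bits claimed above.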
4. Speedup of RAMs by PRAMs.
We now prove that the PRAM is always faster than the RAM.
Theorem 4.1. Every unit-cost RAM running in time T can be simulated by a CREW PRAM in time O(T^{1/2} log T).
Let R be a unit-cost RAM with memory (r(i)) and time complexity T = T(n). We devise a CREW PRAM P with multiple multidimensional memories that simulates R in time O(T^{1/2} log T). Theorem 4.1 then follows from Lemmas 3.2 and 3.4. Let A be a large enough constant so that every address in the program of R can be encoded in A bits; we choose A to be at least 3 log 3 + 1 to suit our later purpose. As the input length n tends to infinity, so does T since T(n) ≥ n. Consequently, if n exceeds some constant n_0, then AT^{1/2} > log(2(T + K + n + 1)), where K is a constant that depends on R as explained in section 2.1. It suffices to argue that P runs in O(T^{1/2} log T) time for n > n_0 since we can modify P to handle inputs of length at most n_0 by table lookup. We assume that P knows the value of T^{1/2} in advance. Otherwise, P tries successive powers of two as an estimate of T^{1/2}. Now consider the unit-cost RAM R (with time complexity T). In T^{1/2} steps, the changes in the configuration of R are not localized to O(T^{1/2}) consecutive registers; the PRAM P cannot build a transition table for local configurations as in the case of the Turing machine. In time T, R can construct integers as large as 2^{Θ(T)}. With indirect addressing, R may use these integers as addresses and assign to register r(i) an integer j, where 0 ≤ i, j ≤ 2^{Θ(T)}. In T steps, R can write to Θ(T) different registers. Hence there are at least (2 [...]
A transition table that maps the current configuration of R to the configuration T^{1/2} steps afterwards will have 2^{Ω(T [...] t and the contents of all registers at time t. Denote the configuration of R at time t by config(t).
Our simulation comprises two phases. In phase I, P uses O(T^{1/2} log T) time to activate its processors, which are divided into T^{1/2} groups. For m = 1, 2, ..., T^{1/2}, the processors in group m perform some preprocessing such that after the preprocessing, config(mT^{1/2}) can be computed from config((m − 1)T^{1/2}) in O(log T) time. All groups do the preprocessing simultaneously. In phase II, P finds config(T) as follows. The initial configuration of R, config(0), can be determined trivially. For m = 1, 2, ..., T^{1/2}, P computes config(mT^{1/2}) from config((m − 1)T^{1/2}) in O(log T) time. Let q* be the statement number in config(T). P accepts if and only if statement q* contains an Accept instruction. Both phases take O(T^{1/2} log T) time. Next, we present an efficient representation of the configuration of R and then provide the details of phases I and II.
4.2. Representing the configuration of R. The PRAM P uses a data structure CONFIG to represent the configuration of R. One difficulty is that P cannot use a single register to store the content of a corresponding register of R. This is because R can generate integers as large as 2^{O(T)} in time T, but P can produce integers no larger than [...]
Without loss of generality, we assume that R represents negative integers using sign-and-magnitude representation; thus R works with nonnegative integers exclusively. With this simplifying assumption, every block in our blockwise representation is a nonnegative integer.
Initially, all registers of R contain 0, except for r(0), r(1), ..., r(K + n). By convention, the first step of R is step 1, and r(0), r(1), ..., r(K + n) are first written to in step 0 (i.e., they are initialized at time 0). For 0 ≤ i ≤ K + n, let u_i denote r(i). For i > K + n, if a new register of R is written to in step i − K − n, then let u_i denote this register; otherwise, u_i is undefined. To describe the configuration of R at time t, it suffices to specify the statement number of R at time t and the address and content at time t of u_i for 0 ≤ i ≤ t + K + n.
CONFIG consists of three global memories (a(i, j)), (c(i, j)), and (b(i)). To represent the configuration of R at time t, register b(0) holds the statement number of R at time t. If 0 ≤ i ≤ t + K + n and u_i is defined, then c(i, j) holds the jth block of B bits in u_i. Thus u_i = Σ_j c(i, j) 2^{jB}. For brevity, we say that c(i) holds u_i, or c(i) = u_i, implying the blockwise representation. Register a(i) holds the address of u_i in the same blockwise format. If t < i − K − n ≤ T or u_i is undefined, then a(i) holds −1, and c(i) is not used. The number of registers that CONFIG uses is therefore O(NT) = O(T^{3/2}). Henceforth, when we mention config(t), we imply the above representation.
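The blockwise format used by c(i, j) and a(i, j) can be sketched in a few lines of Python. The function names and the small values of B and N in the usage are hypothetical; the sketch assumes nonnegative integers, blocks indexed from j = 0, and the least significant block first.

```python
def to_blocks(value, B, N):
    """Split a nonnegative integer into N blocks of B bits, least significant first."""
    assert 0 <= value < 2 ** (B * N), "value must fit in N blocks of B bits"
    mask = (1 << B) - 1
    return [(value >> (j * B)) & mask for j in range(N)]


def from_blocks(blocks, B):
    """Reassemble the integer: the sum over j of blocks[j] * 2^(j*B)."""
    return sum(block << (j * B) for j, block in enumerate(blocks))
```

With this layout, row i of c stores to_blocks(u_i, B, N), so each register of P holds a B-bit value even when u_i itself is far longer.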
4.3. Phase I.
4.3.1. Static, dynamic, and effective instructions. Due to conditional jumps, a program statement may be executed more than once. For clarity, we distinguish between a static instruction and a dynamic instruction. The former is a static entity in a program statement of R. The latter is an executed instruction, that is, an instance of a static instruction during the computation of R. A static instruction may correspond to none or many dynamic instructions.
Divide the dynamic instructions of R into three types:
1. Accept, Reject, and Jump;
2. direct and indirect load and store instructions: r(i) ← r(j), r(i) ← (r(0)), and (r(0)) ← r(j);
3. arithmetic operations: r(i) ← r(j) + r(k) and r(i) ← r(j) − r(k).
Consider the effect of each dynamic instruction on the memory of R. A type-1 instruction does not change the content of any register. As far as the effect on the memory is concerned, a type-1 instruction is equivalent to a u_0 ← u_0 instruction. An instruction of type 2 copies the content of one register to another. Without loss of generality, we assume that r(K − [...] effective instructions of the above forms. Note that one static instruction may correspond to several dynamic instructions, each equivalent to a different effective instruction.
[...] to mT^{1/2}, and β_m is a binary string that encodes the outcomes of all conditional jumps between time (m − 1)T^{1/2} and mT^{1/2}. For uniformity, we view every static and dynamic instruction as a conditional jump. An Accept instruction, for example, may be viewed as a conditional jump where the condition is always false, and the destination of the jump is statement 1. In this way, β_m is always of length T^{1/2}. The triple π_m specifies the behavior of R between time (m − 1)T^{1/2} and mT^{1/2}. Using the information contained in π_m, group m performs some preprocessing that enables config(mT^{1/2}) to be computed from config((m − 1)T^{1/2}) quickly. One problem is that group m does not know π_m in advance. To surmount this problem, group m uses enough processors to try all possible triples. Let Q be the number of statements in the program of R. The number of possible triples is thus Q × 2 [...]
Group m uses T^{O(T^{1/2})} processors, which can be activated in O(T^{1/2} log T) time. Each processor is responsible for a distinct triple, which is encoded in the processor's PID. All processors in all groups carry out their preprocessing simultaneously in parallel.
We focus on one specific processor P_π of group m, which is responsible for one particular triple π = (q, β, σ). Notice the difference in notation: the PID of P(p) is p, whereas the PID of P_π is not π, but the triple π is encoded in the PID of P_π. P_π decodes its PID to obtain q, β, and σ. To obtain the sequence of T^{1/2} effective instructions σ, P_π extracts the least significant O(T^{1/2} log T) bits from its PID, O(log T) bits at a time. Every effective instruction can be encoded in O(log T) bits since there are O(T^3) different effective instructions. To recover the individual bits of β, P_π extracts the next T^{1/2} bits from its PID, one bit at a time. The next ⌊log Q⌋ + 1 bits of the PID constitute q. Using the techniques prescribed in section 3.4, P_π can decode its PID in O(T^{1/2}) time. P_π saves all decoded information in tables so that it can access each bit of β and each effective instruction in O(1) time by table lookup.
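The decoding order described above can be sketched in Python. The parameter names steps (for T^{1/2}) and instr_bits (for the O(log T) bits per effective instruction) are hypothetical, as is the convention that the first effective instruction sits in the least significant chunk; encode_triple exists only to build test PIDs.

```python
def decode_triple(pid, steps, instr_bits, Q):
    """Split pid into (q, beta, sigma).

    steps      -- T^(1/2), the number of effective instructions in sigma
    instr_bits -- bits per encoded effective instruction (O(log T) in the text)
    Q          -- number of statements in the program of R
    """
    sigma = []
    for _ in range(steps):                       # O(log T) bits at a time
        sigma.append(pid & ((1 << instr_bits) - 1))
        pid >>= instr_bits
    beta = []
    for _ in range(steps):                       # one bit at a time
        beta.append(pid & 1)
        pid >>= 1
    q_bits = Q.bit_length()                      # floor(log2 Q) + 1
    q = pid & ((1 << q_bits) - 1)
    return q, beta, sigma


def encode_triple(q, beta, sigma, instr_bits):
    """Inverse of decode_triple (assumed layout), used to build example PIDs."""
    pid = q
    for bit in reversed(beta):
        pid = (pid << 1) | bit
    for code in reversed(sigma):
        pid = (pid << instr_bits) | code
    return pid
```

Both loops run T^{1/2} times with constant work per iteration, in line with the O(T^{1/2}) decoding time stated above.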
To facilitate our discussion, we say the triple π "happens" if the actual behavior of R conforms with the information contained in π. Now π may or may not happen. In phase I, P_π performs some preprocessing so that in phase II, once config((m − 1)T^{1/2}) has been computed, P_π is able to decide in O(log T) time whether π actually happens and, if so, computes config(mT^{1/2}) from config((m − 1)T^{1/2}) in O(log T) time. Because of the way we represent the configuration of R (section 4.2), to compute config(mT^{1/2}) from config((m − 1)T^{1/2}), it suffices to determine the following:
1. the statement number of R at time mT^{1/2};
2. for (m − 1)T^{1/2} < i − K − n ≤ mT^{1/2}, the address of u_i if u_i is defined;
3. for 0 ≤ i ≤ mT^{1/2} + K + n, the content of u_i at time mT^{1/2} if u_i is defined.
Below we explain the preprocessing that enables P_π to determine each of the above three items efficiently in phase II, assuming π actually happens.
4.3.3. The statement number.
Starting from statement q, P_π steps through the program of R statement by statement, following the flow of control defined by β. Meanwhile, P_π keeps track of the statement number of R. After T^{1/2} steps, P_π obtains the statement number of R at time mT^{1/2}. This preprocessing takes O(T^{1/2}) time.
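The walk through the program can be sketched as follows in Python. It assumes, as in the uniform view of instructions as conditional jumps, that each statement has a jump destination and that a bit of β equal to 1 means the jump is taken; the names final_statement and jump_target and the four-statement program in the usage are hypothetical.

```python
def final_statement(q, beta, jump_target):
    """Follow the control flow fixed by beta for len(beta) steps.

    q           -- statement number at the start of the interval
    beta        -- beta[s] is 1 if the jump in step s is taken, else 0
    jump_target -- jump_target[k] is the jump destination of statement k
    """
    current = q
    for taken in beta:
        # A taken jump goes to the destination; otherwise fall through.
        current = jump_target[current] if taken else current + 1
    return current
```

For instance, with destinations {1: 3, 2: 1, 3: 1, 4: 2}, starting at statement 1 and following β = 1, 0, 1 visits statements 3 and 4 and ends at statement 2. The walk does not check whether β is consistent with the register contents; that check belongs to phase II.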
4.3.4. The addresses of the u_i's.
In O(log T) time, P_π activates T^{1/2} processors P_i, where (m − 1)T^{1/2} < i − K − n ≤ mT^{1/2}. Each P_i is responsible for finding the address of u_i.
We fix i and describe P_i. P_i considers the effective instruction in step s = i − K − n given by σ. Suppose this effective instruction is of the form u_{i′} ← u_{j′}. Other cases are handled similarly. If i′ ≠ i, then no new register is written to in step s, and u_i is undefined. Otherwise, u_i is the register first written to in step s. P_i steps through the program of R in the same manner as described in section 4.3.3 and finds the dynamic instruction in step s. If this dynamic instruction is of the form r(j) ← r(k) or r(j) ← (r(0)), then the address of u_i is j. The blockwise representation of j is readily obtained since all addresses in the program of R are at most A ≤ B bits long; the least significant B-bit block of j is just j itself, and all other blocks are 0. If the dynamic instruction in step s is of the form (r(0)) ← r(k), then the address of u_i is the content of r(0) at time s − 1. Denote by u_{i,t} the content of u_i at time t. In section 4.3.5, we explain the preprocessing for finding u_{i,mT^{1/2}}. P_i performs the preprocessing for finding the content of r(0) at time s − 1, namely u_{0,s−1}, in a similar fashion.
4.3.5. The contents of the u_i's. Since the sole arithmetic operations permitted are addition and subtraction, it follows that for a fixed π, u_{i,mT^{1/2}} is a linear combination of the u_{j,(m−1)T^{1/2}}'s. Let
u_{i,mT^{1/2}} = Σ_j C_ij u_{j,(m−1)T^{1/2}},
where the C_ij's are integer coefficients which depend only on π. In O(log T) time, [...]
We fix i and j and describe P_ij. P_ij creates an empty directed multigraph G and then processes the effective instructions specified by σ one by one. As P_ij considers each effective instruction, it inserts nodes and edges into G. P_ij marks each edge either "positive" or "negative." Node w is a positive child of node v if the edge (v, w) is positive. A negative child is defined analogously. Let v^+ and v^−, respectively, be the set of positive and negative children of v.
P_ij maintains a counter τ to keep track of the step corresponding to the effective instruction currently under consideration. P_ij initializes τ to (m − 1)T^{1/2} + 1 and increments τ after every effective instruction. This will become clear after we explain how P_ij constructs G. After processing all T^{1/2} effective instructions, P_ij uses G to obtain C_ij.
4.3.6. Constructing G. P_ij considers the T^{1/2} effective instructions specified by σ one by one and constructs G as follows. For a u_{i′} ← u_{j′} instruction, P_ij does the following. Create node [...]
is processed in the same way except that both of the inserted edges are positive.
It is mechanical to verify that the above construction yields a graph which satisfies (1), and every node of the graph has out-degree at most two. We illustrate the above construction with an example for P_π of group m = 1 with T^{1/2} = 5. Figure 1 shows the effective instructions specified by π. The graph constructed by P_ij appears in Fig. 2. In this example, C_01 = 5, C_02 = −2, C_11 = 3, and C_12 = −1.
Fig. 1. The T^{1/2} = 5 effective instructions specified by π and their effect on the memory (example).
4.3.7. Computing C_ij. We explain how P_ij uses G to compute C_ij. P_ij checks whether G contains a node [u_i, τ′] for some τ′ > (m − 1)T^{1/2}.
Case I (no such node exists). By construction of G, P_ij will create node [u_i, τ′] if R writes to u_i in step τ′. The hypothesis thus implies that R does not write to u_i between time (m − 1)T^{1/2} and mT^{1/2}. It follows that u_{i,mT^{1/2}} = u_{i,(m−1)T^{1/2}}. Ergo, C_ij = 0 for j ≠ i, and C_ii = 1.
Case II (otherwise). Let τ i be maximum such that G contains node [u i ,τ i ]. Similar arguments as in Case I give u i ,mT 1 / 2 = u i ,τ i . Consider the subgraph H of G induced by node [u i ,τ i ] and all its descendants. By construction, G (and hence H) is a directed acyclic multigraph. P ij sorts the nodes in H topologically and labels each edge in H with an integer as follows. P ij considers the nodes in H in topological order. For each node v, P ij labels the outgoing edges of v. When P ij considers node v, all incoming edges of v are labeled since P ij considers the Fig. 2 . The graph G constructed by P ij after processing the effective instructions in Fig. 1 .
nodes in topological order. Evidently, the first node considered is [u i ,τ i ]. P ij labels every positive and negative outgoing edge of [u i ,τ i ] with 1 and −1, respectively. For each remaining node v in topological order, let λ(H, v) be the sum of the labels on the incoming edges of v in H. P ij labels each positive and negative outgoing edge of v with λ(H, v)a n d− λ ( H, v) , respectively. Figure 3 shows the result of applying the above labeling algorithm to the graph in Fig. 2 . For s>0, let H s be the subgraph of H induced by all labeled edges after s nodes are considered. The leaves of H s , denoted by L(H s ), are the nodes in H s with no outgoing edges. The following invariant is a consequence of (1): After s nodes are considered, u i ,τ i = v∈L(Hs) λ(H s ,v) v . Therefore, after all edges in H are labeled,
] is a leaf of H; otherwise, C ij = 0. For the example in Fig. 3, P 01 determines that C 01 = 4 + 1 = 5, and P 02 concludes that C 02 = −2.
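The labeling procedure described above can be sketched in Python as follows. This is a minimal sequential sketch under stated assumptions, not the procedure P ij runs on the PRAM: the function name `leaf_coefficients`, the edge-list encoding `(source, target, sign)`, and the tiny example graph are all hypothetical and do not reproduce Fig. 2.

```python
from collections import defaultdict, deque

def leaf_coefficients(root, edges):
    """Propagate signed labels from `root` through a DAG (multi-edges allowed).

    edges: list of (source, target, sign) with sign = +1 (positive edge)
           or -1 (negative edge).
    Returns {leaf: lambda(H, leaf)} for the subgraph H reachable from root.
    """
    out = defaultdict(list)
    for src, dst, sign in edges:
        out[src].append((dst, sign))

    # H = root together with all its descendants.
    reach, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for w, _ in out[v]:
            if w not in reach:
                reach.add(w)
                stack.append(w)

    indeg = {v: 0 for v in reach}
    for v in reach:
        for w, _ in out[v]:
            indeg[w] += 1

    lam = {v: 0 for v in reach}
    lam[root] = 1  # outgoing edges of the root are labeled +1 / -1

    leaves = {}
    queue = deque([root])  # processes nodes in topological order
    while queue:
        v = queue.popleft()
        if not out[v]:
            leaves[v] = lam[v]
        for w, sign in out[v]:
            lam[w] += sign * lam[v]  # edge label is +lam[v] or -lam[v]
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return leaves

# Hypothetical example: r = a + b, a = x - y, b = x + x, so r = 3x - y.
example = leaf_coefficients(
    "r",
    [("r", "a", 1), ("r", "b", 1),
     ("a", "x", 1), ("a", "y", -1),
     ("b", "x", 1), ("b", "x", 1)],
)
print(example)  # x receives coefficient 3, y receives coefficient -1
```

The label accumulated at each leaf plays the role of a coefficient C ij; a root with no outgoing edges yields coefficient 1 for itself, which mirrors Case I.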
In the above labeling algorithm, the sum of the absolute values of all the labels on the edges of H s is at most triple that of H s−1 since every node in H has out-degree at most two. The number of nodes in H is |H| ≤ |G| ≤ 3T 1/2 because at most three nodes are created for each of the T 1/2 effective instructions. Therefore, $|C_{ij}| \le 3^{3T^{1/2}}$
for all i and j. Each C ij is at most B = AT 1/2 bits long, since A ≥ 3 log 3 + 1. The number of edges in G is O(|G|), since each node has bounded out-degree. Constructing G and H, topologically sorting H, labeling the edges of H, and com-
Thus the bottleneck in phase I is the activation of enough processors to try all possible π, which takes O(T 1/2 log T ) time.
4.3.8. Table precomputation. In phase II, P has to extract efficiently the most and least significant B bits of a 2B-bit integer. In phase I, P precomputes two tables (h1(i)) and (h2(i)) so that the first and second half of i can be extracted in O(1) time by table lookup. P uses O(T 1/2 ) time to activate processors P (0), P (1), ..., P (2 2B − 1) and builds up a Left-Shift Table and a Right-Shift Table of size 2 2B as in the proof of Lemma 3.3. Next, for 0 ≤ i < 2 2B , each P (i) extracts in O(T 1/2 ) time the first and second halves of its PID as follows. The first half is obtained by shifting the PID right B times using the Right-Shift Table. The second half is obtained by shifting the first half left B times and subtracting the (shifted) first half from the PID. P (i) stores the first and second halves in h1(i) and h2(i), respectively. Hence the two tables (h1(i)) and (h2(i)) can be precomputed in O(B) = O(T 1/2 ) time.
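The table precomputation can be illustrated with a short Python sketch. It is a sequential stand-in under stated assumptions: the function name `build_half_tables` is hypothetical, the loop over `pid` replaces the 2 2B parallel processors P (i), and truncating left-shift entries modulo 2 2B is an assumption that does not affect the entries actually consulted.

```python
def build_half_tables(B):
    """Return (h1, h2): the high and low B bits of every 2B-bit integer."""
    size = 1 << (2 * B)

    # One-step shift tables; each lookup models an O(1) table access.
    right_shift = [i >> 1 for i in range(size)]
    # Assumption: overflowing entries are truncated; they are never consulted
    # because only values below 2**B are shifted left B times.
    left_shift = [(i << 1) % size for i in range(size)]

    h1 = [0] * size
    h2 = [0] * size
    for pid in range(size):  # on the PRAM, each P(i) handles its own PID
        first = pid
        for _ in range(B):  # shift right B times
            first = right_shift[first]
        shifted = first
        for _ in range(B):  # shift the first half left B times
            shifted = left_shift[shifted]
        h1[pid] = first
        h2[pid] = pid - shifted  # subtract the shifted first half
    return h1, h2

h1, h2 = build_half_tables(3)
print(h1[0b101011], h2[0b101011])  # high half 0b101 = 5, low half 0b011 = 3
```

Each entry costs O(B) table lookups to fill, matching the O(B) = O(T 1/2 ) precomputation time stated above; afterward a single lookup splits any 2B-bit value.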
4.4. Phase II. The data structure CONFIG has O(T 3/2 ) registers. In phase II, P initializes CONFIG in parallel using O(log T ) time so that CONFIG contains config(0). For m = 1, 2, ..., T 1/2 , the $T^{O(T^{1/2})}$ processors in group m do the following:
1. Each processor P π in group m checks in O(log T ) time whether π actually happens.
2. If so, P π computes config(mT 1/2 ) from config((m − 1)T 1/2 ) (stored in CONFIG) in O(log T ) time and updates CONFIG accordingly.
After T 1/2 updates, CONFIG contains config(T ). P accepts if and only if statement q * contains an Accept instruction, where q * is the statement number in config(T ). Notice that in step 1, exactly one P π determines that π happens. So in step 2, no write conflicts arise when updating CONFIG.
Next, we demonstrate that P π can compute config(mT 1/2 ) from config((m − 1)T 1/2 ) in O(log T ) time, provided that π actually happens. In section 4.5, we prove that O(log T ) time suffices to verify whether π actually happens. The preprocessing of section 4.3 gives the following:
1. u i ,mT 1/2 = Σ j C ij u j ,(m − 1)T 1/2 ;
2. in phase I, P π dispatches processor P ij to calculate C ij ;
3. C ij is a B-bit integer.
The product C ij c(j, k) is thus a 2B-bit integer. In phase II, the P ij 's cooperate to compute u i ,mT 1/2 in O(log T ) time as follows. The P ij 's use four multidimensional global memories (g(i 1 ,i 2 ,i 3 ,i 4 )), (g ′ (i 1 ,i 2 ,i 3 ,i 4 )), (h(i 1 ,i 2 ,i 3 )), and (h ′ (i 1 ,i 2 ,i 3 )). Let p be the PID of P π . In O(log T ) time, every P ij activates O(T ) processors P ′ k , where 0 ≤ k ≤ T + K + n. Each P ′ k multiplies C ij with c(j, k) and puts the most and least significant B bits of the product in g ′ (p, i, j, (k + 1)) and g(p, i, j, k), respectively. By Lemma 3.3, we may assume that the multiplication requires O(1) time. Extracting the most and least significant B bits also takes O(1) time as discussed in section 4.3.8. Then
From (2), (3), and (4),
Next, P π uses O(log T ) time to deploy O(T 3/2 ) processors P ′ ik , where 0 ≤ i ≤ T +K +n and 0 ≤ k ≤ N . Each P ′ ik computes the sum
in O(log T ) time (Lemma 3.1). The sum of 2(T + K + n + 1) integers, each B bits long, is at most B + log(2(T + K + n + 1)) ≤ 2B bits long. P ′ ik extracts the most and least significant B bits of φ ik and places them in h ′ (p, i, (k + 1)) and h(p, i, k), respectively. Therefore, by (5) and (6),
Consider the carries into and out of the kth B-bit block when we add ψ and ψ ′ together. By (7), the kth block of
except that we have to adjust for the carries into and out of the kth block. A carry into the block amounts to an increment by 1, whereas a carry out of the block is offset by subtracting 2 B . The value 2 B is precomputed during phase I in O(B) = O(T 1/2 ) time by repeated doubling. In section 4.4.2, we show that all block-to-block carries can be determined in O(log T ) time. To update c(i) with u i ,mT 1/2 , every P ′ ik finds the kth block of u i ,mT 1/2 (by adding h ′ (p, i, k) and h(p, i, k) and adjusting for the carries) and updates c(i, k) accordingly. Hence P π is able to compute config(mT 1/2 ) from config((m − 1)T 1/2 ) in O(log T ) time during phase II. In the above discussion, we have presumed that all C ij 's are positive. Strictly speaking, to calculate u i ,mT 1/2 = Σ j C ij c(j), we have to sum up the positive and the negative components separately using the above method, do a blockwise subtraction, and adjust for the block-to-block borrows. The calculation of the borrows is analogous to that of the carries.
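The blockwise computation of one register value can be sketched in Python as below. This is an illustrative sequential sketch, assuming nonnegative coefficients and a fixed register index i: the function names, the list layout of `g`, `gp` (for g ′), `h`, and `hp` (for h ′), and the final sequential carry loop are assumptions; on the PRAM the carries are resolved in parallel and the halves are extracted by table lookup rather than by `&` and `>>`.

```python
def to_blocks(x, B, n):
    """Split a nonnegative integer into n blocks of B bits, low block first."""
    mask = (1 << B) - 1
    return [(x >> (B * k)) & mask for k in range(n)]

def from_blocks(blocks, B):
    """Inverse of to_blocks."""
    return sum(b << (B * k) for k, b in enumerate(blocks))

def combine_blocks(coeffs, regs, B, N):
    """Compute sum_j coeffs[j] * value(regs[j]) block by block.

    coeffs[j] plays the role of C_ij (0 <= C_ij < 2**B);
    regs[j][k] plays the role of c(j, k) for k = 0..N.
    """
    mask = (1 << B) - 1
    g = [[0] * (N + 2) for _ in coeffs]   # low halves of the products
    gp = [[0] * (N + 2) for _ in coeffs]  # high halves, moved up one block
    for j, C in enumerate(coeffs):
        for k in range(N + 1):
            prod = C * regs[j][k]          # a 2B-bit product
            g[j][k] = prod & mask
            gp[j][k + 1] = prod >> B

    h = [0] * (N + 3)
    hp = [0] * (N + 3)
    for k in range(N + 2):
        phi = sum(g[j][k] + gp[j][k] for j in range(len(coeffs)))
        h[k] = phi & mask
        hp[k + 1] = phi >> B

    # Sequential stand-in for the block-to-block carry adjustment.
    out, carry = [], 0
    for k in range(N + 3):
        s = h[k] + hp[k] + carry
        out.append(s & mask)
        carry = s >> B
    return out

B, N = 4, 2
values = [0xABC, 0x7FF]  # hypothetical register contents
regs = [to_blocks(v, B, N + 1) for v in values]
result = from_blocks(combine_blocks([5, 3], regs, B, N), B)
print(result == 5 * 0xABC + 3 * 0x7FF)  # True
```

Moving each high half to block k + 1 is what lets every block be handled independently; only the final carry adjustment couples neighboring blocks.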
4.4.2. Computing the carries. Consider adding two O(T )-bit integers together. By parallel-prefix computation [1, 11], it is possible to determine all the bit-to-bit carries in O(log T ) time, provided that the individual bits of the integers are immediately accessible. In our case, however, the integers are represented in a blockwise instead of bitwise format. To apply the parallel-prefix technique, we formulate the computation of the block-to-block carries as a prefix-sum problem in a way slightly different from that in the bitwise case. The idea is to let a block take the place of a bit. Define a binary operation ⊗ on {ḡ,s,p} as follows:
It is routine to check that ⊗ is associative. For 0 ≤ k ≤ N + 1, let
Intuitively, x k = ḡ if a carry is "generated" in the kth block; x k = p if a carry is "propagated" through the kth block (i.e., there is a carry out of the kth block if and only if there is a carry into the kth block); and x k = s if a carry is "stopped" in the kth block (i.e., no carry out of the kth block regardless of whether there is a carry into the kth block). Let x −1 = s, and for
This implies y k = ḡ if and only if there is a carry out of the kth block. By parallel-prefix computation, we can determine all the y k 's, and hence all block-to-block carries, in O(log N ) = O(log T ) time.
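The block-level carry computation can be sketched in Python as follows. This sketch assumes the standard carry-lookahead form of the operator (the higher block's status wins unless it is "propagate"), which may differ in presentation from the definition of ⊗ used in the text; the names `op`, `block_status`, `prefix_scan`, and `carries_out` are hypothetical, and the doubling loop only simulates the O(log N ) parallel rounds sequentially.

```python
G, S, P = "g", "s", "p"  # generate, stop, propagate

def op(x, y):
    """Combine status x of the lower blocks with status y of the next block.

    Assumed standard carry-lookahead operator: y decides unless it propagates.
    """
    return x if y == P else y

def block_status(a, b, B):
    """Classify the sum of two B-bit blocks."""
    total = a + b
    if total >= (1 << B):
        return G  # carry out regardless of carry in
    if total == (1 << B) - 1:
        return P  # carry out exactly when there is a carry in
    return S      # never a carry out

def prefix_scan(xs):
    """Inclusive prefix combination in ceil(log2 n) doubling rounds."""
    ys = list(xs)
    d = 1
    while d < len(ys):
        # Every position is updated from the previous round's values,
        # as all processors would do simultaneously on a PRAM.
        ys = [ys[k] if k < d else op(ys[k - d], ys[k]) for k in range(len(ys))]
        d *= 2
    return ys

def carries_out(a_blocks, b_blocks, B):
    """carries_out[k] is True iff block k produces a carry (low block first)."""
    xs = [S] + [block_status(a, b, B) for a, b in zip(a_blocks, b_blocks)]
    ys = prefix_scan(xs)  # xs[0] plays the role of x_{-1} = s
    return [y == G for y in ys[1:]]

# 0x80F + 0x7F1 = 0x1000: every 4-bit block produces a carry.
print(carries_out([0xF, 0x0, 0x8], [0x1, 0xF, 0x7], 4))  # [True, True, True]
```

Because the operator is associative, the prefix values can be combined in any bracketing, which is what allows the logarithmic number of rounds.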
4.5. Verifying π. During phase I, P π performs some additional preprocessing so that during phase II, P π can decide in O(log T ) time whether π actually happens. We first outline the verification process and then supply the details.
and mT 1/2 . We say that π "happens up to time t" if the behavior of R from time (m − 1)T 1/2 to time t agrees with π. Similarly, we say that π "happens in step t" if the behavior of R from time t − 1 to time t agrees with π. P π uses T 1/2 processors P * t , where (m − 1)T 1/2 ≤ t < mT 1/2 . Each P of these T 1/2 answers in O(log T ) time (Lemma 3.1) and decides whether π actually happens.
4.5.2. Preprocessing for verification. P π activates all P * t 's in phase I using O(log T ) time. Recall that in phase I, P π performs some preprocessing based on the triple π = (q, β, σ); if π happens, then this preprocessing enables P π to compute config(mT 1/2 ) from config((m − 1)T 1/2 ) in O(log T ) time. During phase I, every P * t performs the analogous preprocessing using the triple (q, β(t), σ(t)), where β(t) and σ(t) are prefixes of β and σ, respectively, that define the behavior of R between time (m − 1)T 1/2 and t. If π happens up to time t, then this preprocessing enables P * t to compute config(t) from config((m − 1)T 1/2 ) in O(log T ) time. As argued in section 4.3, this preprocessing takes O(T 1/2 ) time.
4.5.3. The actual verification. In section 4.4, we discussed how P π computes config(mT 1/2 ) from config((m − 1)T 1/2 ), provided that π actually happens. In an analogous manner, each P
The dynamic instruction in step t + 1 corresponds to the static instruction in statement q ′ . Consider the effective instruction in step t + 1 specified by σ. P * t checks that the form of this effective instruction is "compatible" with the static instruction in statement q ′ . Table 2 shows the four categories of compatible instruction pairs. P * t performs some further checks according to the category of the compatible pair.
Table 2. Compatible effective and static instruction pairs.
In phase II, the PRAM P computes config(T ) in O((T log ρ)/ρ) time as follows. For m = 1, 2, ..., T/ρ, group m computes config(mρ) from config((m − 1)ρ) in O(log ρ) time. Let q * be the statement number in config(T ). P accepts if and only if statement q * contains an Accept instruction. This simulation takes O(ρ log T + (T log ρ)/ρ) time and uses $T^{O(\rho)}$ processors.
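As a quick check, the stated bound O(ρ log T + (T log ρ)/ρ) can be evaluated for two choices of ρ. The following is only a substitution into that bound; the constant c is a label introduced here for illustration.

```latex
% Substituting into the stated bound  \rho \log T + (T \log \rho)/\rho .
\begin{align*}
\rho = T^{1/2}: &\quad
  T^{1/2}\log T + \frac{T \cdot \tfrac{1}{2}\log T}{T^{1/2}}
  = \tfrac{3}{2}\, T^{1/2}\log T = O\bigl(T^{1/2}\log T\bigr), \\
\rho = c \ \text{(constant)}: &\quad
  c\log T + \frac{T\log c}{c} = O\!\left(\frac{T\log c}{c}\right),
  \qquad \text{using } T^{O(c)} \text{ processors, i.e., polynomially many.}
\end{align*}
```

The first case recovers the O(T 1/2 log T ) simulation time; the second corresponds to a constant-factor speedup with a polynomial number of processors.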
5. Discussion.
5.1. Parallelism always helps. We have shown that a CREW PRAM can always speed up a sequential computation on a unit-cost RAM. We mentioned in section 1.1 that the unit-cost RAM is the most commonly used machine model for analyzing sequential algorithms. There are, however, other machine models of sequential computation, for example, the Turing machine, tree Turing machine, multidimensional Turing machine, and log-cost RAM. In a separate paper [12], we show that a sequential computation on each of these other models can also be sped up by a corresponding parallel machine model:
1. Every tree Turing machine that runs in time T can be simulated by an alternating Turing machine in time O(T/log T ).
2. Every d-dimensional Turing machine that runs in time T can be simulated by an alternating Turing machine in time O(T 5 d log* T /log T ).
3. Every log-cost RAM that runs in time T can be simulated by an alternating log-cost RAM in time O(T log log T/log T ).
We conclude that parallelism always helps us speed up a sequential computation.
5.2. Speedup using a polynomial number of processors. It is well known that the Turing machine enjoys the constant speedup theorem [26]: Let ε > 0 and M be a Turing machine with time complexity T ; then M can be simulated by another Turing machine in time εT + n. Hence efforts on speeding up the Turing machine have focused on asymptotic speedup [4, 10, 16]. The unit-cost RAM, however, does not enjoy the constant speedup theorem [23]; that is, there exist an ε > 0 and a unit-cost RAM R with time complexity T such that R cannot be simulated by any unit-cost RAM in time εT + n. Thus it is not trivial to speed up the computation of a unit-cost RAM by a constant factor. Theorem 4.2 shows that it is possible to speed up a unit-cost RAM by an arbitrary constant factor with a CREW PRAM using a polynomial number of processors.
5.3. Is our result optimal? We have constructed a simulator that runs in time O(T 1/2 log T ). We do not know whether our result is optimal, but we believe that it is difficult to reduce the simulation time by more than a log T factor because this would imply improvements over some best-known results, as explained below. We would like to call the reader's attention to the following previously established results:
1. Every CREW PRAM that runs in time T can be simulated by a Turing machine in space O(T 2 ) (Fortune and Wyllie [6]).
2. Every Turing machine that runs in time T can be simulated by a unit-cost RAM in time O(T/log T ) (Hopcroft, Paul, and Valiant [10]).
3. Every Turing machine that runs in time T can be simulated by another Turing machine in space O(T/log T ) (Hopcroft, Paul, and Valiant [9] ).
4. Every Turing machine that runs in time T can be simulated by a CREW PRAM in time O(T 1/2 ) (Dymond and Tompa [4]).
These are the best-known results for the respective simulations. For our problem, namely, simulation of unit-cost RAMs by CREW PRAMs, reducing the simulation time to o((T log T ) 1/2 ), together with the first result of Hopcroft et al. above (item 2), implies an improvement over the result of Dymond and Tompa (item 4). By the same reasoning, if we manage to reduce the simulation time to o(T 1/2 ), then we can simulate every Turing machine with time complexity T by a CREW PRAM in time o((T/log T ) 1/2 ). It then follows from the above result of Fortune and Wyllie (item 1) that for Turing machines, time T can be simulated in space o(T/log T ), improving the second result of Hopcroft et al. above (item 3). This would be a significant breakthrough in simulating time by space for Turing machines.
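The two compositions in the preceding paragraph can be written out as a short derivation. The symbols T R (running time of the simulating unit-cost RAM) and T P (running time of the simulating CREW PRAM) are labels introduced here for illustration; the hypothetical improved simulation times are the ones named in the text.

```latex
% T: Turing machine running time. T_R, T_P: assumed labels for the RAM and PRAM times.
\begin{align*}
\text{Item 2:}\quad & T_R = O\!\left(\frac{T}{\log T}\right), \qquad \log T_R = O(\log T). \\
\text{If RAM} \to \text{PRAM took } o\bigl((T_R\log T_R)^{1/2}\bigr):\quad
  & T_P = o\!\left(\Bigl(\frac{T}{\log T}\cdot\log T\Bigr)^{1/2}\right) = o\bigl(T^{1/2}\bigr)
  \quad\text{(improves item 4).} \\
\text{If RAM} \to \text{PRAM took } o\bigl(T_R^{1/2}\bigr):\quad
  & T_P = o\!\left(\Bigl(\frac{T}{\log T}\Bigr)^{1/2}\right), \qquad
  \text{space } O\bigl(T_P^{2}\bigr) = o\!\left(\frac{T}{\log T}\right)
  \quad\text{(item 1; improves item 3).}
\end{align*}
```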
