Parallelism Always Helps by Mak, Louis
September 1993 UILU-ENG-93-2239 
ACT-129
Applied Computation Theory
PARALLELISM 
ALWAYS HELPS
Louis Mak
Coordinated Science Laboratory 
College of Engineering
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Approved for Public Release. Distribution Unlimited.
REPORT DOCUMENTATION PAGE
1a. Report security classification: Unclassified
1b. Restrictive markings: None
3. Distribution/availability of report: Approved for public release; distribution unlimited
4. Performing organization report numbers: UILU-ENG-93-2239, ACT-129
6a. Name of performing organization: Coordinated Science Lab, University of Illinois
6b. Office symbol: N/A
6c. Address: 1101 W. Springfield Ave., Urbana, IL 61801
7a. Name of monitoring organization: National Science Foundation
7b. Address: Washington, DC 20050
8a. Name of funding/sponsoring organization: National Science Foundation
8c. Address: Washington, DC
11. Title: Parallelism Always Helps
12. Personal author: Mak, Louis
13a. Type of report: Technical
14. Date of report: October 1993
15. Page count: 39
18. Subject terms: random access machine, parallel random access machine, parallelization, simulation, computational complexity, time complexity
Parallelism Always Helps
Louis Mak*
Department of Computer Science and Coordinated Science Laboratory 
University of Illinois at Urbana-Champaign, Urbana, IL 61801 
email: mak@grinch.csl.uiuc.edu
September 23, 1993
Abstract
It is shown that every unit-cost random access machine (RAM) that runs in time T can be simulated by a concurrent-read exclusive-write parallel random access machine (CREW PRAM) in time $O(T^{1/2} \log T)$. The proof is constructive; thus, it gives a mechanical way to translate any sequential algorithm designed to run on a unit-cost RAM into a parallel algorithm that runs on a CREW PRAM and obtains a nearly quadratic speedup. One implication is that there does not exist any recursive function that is “inherently not parallelizable.”
Categories and Subject Descriptors: F.1.1 [Computation by Abstract Devices]: Models of Computation - bounded-action devices (e.g., Turing machines, random access machines); relations among models; F.1.2 [Computation by Abstract Devices]: Modes of Computation - parallelism and concurrency; relations among modes; F.1.3 [Computation by Abstract Devices]: Complexity Classes - relations among complexity measures
General Terms: Theory
Additional Key Words and Phrases: Parallel random access machines, parallelization, complexity 
theory
*Supported by the National Science Foundation under Grant CCR-8922008. Author’s address: Coordinated Science
Laboratory, University of Illinois at Urbana-Champaign, 1308 West Main St., Urbana, IL 61801.
1 Introduction
1.1 Motivation
For some problems, the direct parallelization of a sequential algorithm gives a faster parallel algorithm. An example is matrix multiplication. The brute-force sequential algorithm for matrix multiplication runs in $O(n^3)$ time for $n \times n$ matrices. It is straightforward to parallelize this sequential algorithm to get an $O(\log n)$ time parallel algorithm using $O(n^3 / \log n)$ processors. On the other hand, some problems are very difficult to parallelize. For example, depth-first search does not seem to lend itself to parallelization [6]. In this paper, we address the following question: Are all sequential algorithms parallelizable?
Cook and Reckhow [2] defined the unit-cost random access machine (RAM). Fortune and Wyllie 
[5] introduced the parallel random access machine (PRAM). These two models are respectively the 
most commonly used machine models for analyzing sequential and parallel algorithms. Thus, the 
above question can be rephrased as follows: Given any unit-cost RAM R that runs in time T, is it always possible to construct a PRAM that simulates R in time $T' = o(T)$? We answer this question affirmatively by exhibiting such a construction with $T' = O(T^{1/2} \log T)$. Several variants of the PRAM have appeared in the literature since it was first introduced. The original model of Fortune and Wyllie has become known as the concurrent-read exclusive-write (CREW) PRAM, which is the model we use in our construction.
Parberry and Schnitger [14] considered the WRAM, a powerful variant of the PRAM. The 
WRAM differs from the CREW PRAM in three respects:
1. The WRAM is a concurrent-read concurrent-write (CRCW) priority PRAM [4].
2. The WRAM has a richer instruction set for arithmetic operations. The CREW PRAM supports only addition and subtraction, whereas the WRAM also allows unit-time unrestricted right shifts and modulus operations.
3. The WRAM and the CREW PRAM differ in the manner in which the processors are activated. In the WRAM, an arbitrary number of processors are self-activated at the beginning of the computation. In the CREW PRAM, only one processor is active initially. An active processor activates an idle processor explicitly by executing a Fork instruction. Consequently, in t steps, a PRAM can activate at most $2^t$ processors.
Parberry and Schnitger showed that every Turing machine that runs in time T can be simulated in constant time by a WRAM with $2^{O(T)}$ processors. The best known simulation of unit-cost RAM’s by Turing machines incurs a cubic overhead in the running time [2]. It follows that every unit-cost RAM with time complexity T can be simulated in constant time by a WRAM with $2^{O(T^3)}$ processors. It is desirable to reduce this huge number of processors used in the simulation for two reasons. The first reason is, obviously, to reduce the hardware requirement.
The second and more important reason is that the ability of the WRAM to use an arbitrary 
number of processors renders this model unreasonably powerful. The above result of Parberry and 
Schnitger essentially says that every decidable problem can be decided in constant time by a WRAM. 
This anomaly arises mainly from allowing self-activated processors. The parallel computation thesis 
[7, 13] asserts that the class of languages accepted by any reasonable parallel machine model in 
polynomial time is equivalent to PSPACE, where PSPACE, as usual, denotes the class of languages 
that can be accepted by deterministic Turing machines in polynomial space. The WRAM violates 
the parallel computation thesis and is considered unreasonably powerful [12]. In contrast, the PRAM 
is considered reasonable because it obeys the parallel computation thesis [5]. So the challenge is 
to speed up a unit-cost RAM by a PRAM with a reasonable number of processors; the number of processors should be small enough so that all processors can be activated explicitly within the simulation time.
1.2 Comparison with Previous Results
Dymond and Tompa [3] showed that every deterministic Turing machine running in time T can
be simulated by a CREW PRAM in time $O(T^{1/2})$. However, the random access memory of the
PRAM is much more flexible than the linear tapes of the Turing machine, which forbid random 
access into individual tape cells. It was unclear whether it is the parallelism, or the more flexible 
storage structure, or the combination of both that realizes such a quadratic speedup. Our result 
demonstrates that parallelism alone suffices to achieve an almost quadratic speedup.
To the best of our knowledge, in all previous speedup results [3, 9, 12, 14, 15, 18, 20], the 
machine being simulated is limited to the Turing machine. All these results depend on the fact 
that the changes in the configuration of a Turing machine in t steps are localized to the $2t - 1$ cells
around each tape head. In contrast, the random access memory of a unit-cost RAM allows the 
RAM to change the contents of registers with widely different addresses in consecutive steps. The 
versatility of the random access memory of a unit-cost RAM has defied all prior attempts to speed 
up a unit-cost RAM by a PRAM. This paper presents the first speedup theorem of unit-cost RAM’s 
by PRAM’s.
Reif [16] demonstrated that every probabilistic unit-cost RAM that runs in time T can be simulated by a probabilistic CREW PRAM in time $t(T, L) = O((T \log T \log(LT))^{1/2})$, where L is the largest integer manipulated by the probabilistic RAM during its computation. It is straightforward to modify Reif’s proof to show that every unit-cost RAM running in time T can be simulated by a CREW PRAM in time $t(T, L)$. With unit-time addition, however, a RAM can generate integers as large as $2^{O(T)}$ in time T. Reif’s result does not guarantee a speedup, since $t(T, L) = O(T (\log T)^{1/2})$ when $L = 2^{O(T)}$. Our result gives a definite speedup of unit-cost RAM’s by PRAM’s, regardless of the value of L. It is routine to generalize our proof to establish a speedup theorem of probabilistic unit-cost RAM’s by probabilistic CREW PRAM’s. This paper subsumes the above result of Reif. Thus, all algorithms (deterministic and probabilistic) are parallelizable.
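To see why Reif’s bound gives no speedup in the worst case, one can substitute the largest integer a RAM can build by repeated addition into the bound. The derivation below is a sketch; the constant $c > 0$ is introduced here for illustration only and does not appear in the original text.

```latex
\begin{align*}
t(T, L) &= O\bigl((T \log T \cdot \log(LT))^{1/2}\bigr), \qquad L = 2^{cT}, \\
\log(LT) &= cT + \log T = O(T), \\
t(T, 2^{cT}) &= O\bigl((T \log T \cdot T)^{1/2}\bigr) = O\bigl(T (\log T)^{1/2}\bigr).
\end{align*}
```

The resulting bound is not $o(T)$, so in this case the simulation is not guaranteed to beat the sequential running time.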
In summary, all previous simulation results suffer from one or more of the following drawbacks:
1. No definite speedup is guaranteed (Reif [16]).
2. The machine being sped up is limited to the Turing machine [3, 9, 12, 14, 15, 18, 20].
3. The speedup result fails to isolate the effect of parallelism; that is, apart from the parallelism, the simulator enjoys some additional advantage over the machine being simulated, for example, a more flexible storage structure (Dymond and Tompa [3]).
4. The simulator is too strong to be called reasonable, because it violates the parallel computation 
thesis (Parberry and Schnitger [14]).
Our result does not suffer from any of the above drawbacks.
The rest of this paper is organized as follows. Section 2 defines the RAM and the PRAM models 
precisely. In Section 3, we build up a repertoire of techniques for programming a PRAM efficiently. 
We use these techniques in Section 4 to establish our main result: for every unit-cost RAM R with time complexity T, we construct a CREW PRAM that simulates R in time $O(T^{1/2} \log T)$. We conclude with a few comments in Section 5. All logarithms are taken to base 2.
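Instruction                      Meaning
$r(i) \leftarrow r(j)$           $r(i)$ gets $\langle r(j) \rangle$.
$r(i) \leftarrow (r(0))$         $r(i)$ gets $\langle r(j) \rangle$, where $j = |\langle r(0) \rangle|$.
$(r(0)) \leftarrow r(j)$         $r(i)$ gets $\langle r(j) \rangle$, where $i = |\langle r(0) \rangle|$.
$r(i) \leftarrow r(j) + r(k)$    $r(i)$ gets $\langle r(j) \rangle + \langle r(k) \rangle$.
$r(i) \leftarrow r(j) - r(k)$    $r(i)$ gets $\langle r(j) \rangle - \langle r(k) \rangle$.
Jump q                           If $\langle r(0) \rangle < \langle r(1) \rangle$, then jump to statement q.
Accept                           Accept and halt.
Reject                           Reject and halt.
Table 1: Instructions of a RAM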
2 Definitions
2.1 The Unit-Cost RAM
A RAM R consists of a memory and a program. The memory is an infinite sequence of registers $(r(i))$, $i = 0, 1, \ldots$. The address of $r(i)$ is the integer i. Each register can hold an integer. Let $\langle r(i) \rangle$ denote the content of $r(i)$ and $|\langle r(i) \rangle|$ denote the absolute value of $\langle r(i) \rangle$. The program consists of a finite number of statements, numbered $1, 2, \ldots, Q$. Each statement contains one instruction. The allowed instructions are shown in Table 1. The input of R is a binary number $\alpha = \alpha_0 \alpha_1 \cdots \alpha_{n-1}$, where each $\alpha_i \in \{0, 1\}$. Initially, $r(0), r(1), \ldots, r(K-1)$ hold some constant values required in the computation of R, where K is a constant that depends on R; $r(K+i)$ holds $\alpha_i$ for $0 \le i < n$, and $r(K+n)$ holds $-1$ to mark the end of the input. All other registers contain 0. A unit-cost RAM executes each instruction in one step. Each step takes unit time. Thus, step t takes a unit-cost RAM from time $t - 1$ to time t. The running time of a unit-cost RAM is the number of steps performed.
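The model above can be made concrete with a small interpreter. The following is a minimal sketch, not part of the original construction: the tuple encoding of instructions, the operation names ("copy", "load", "store", and so on), the step budget, and the sample program are all assumptions made for illustration.

```python
from collections import defaultdict

def run_ram(program, constants, input_bits, max_steps=10_000):
    """Run a unit-cost RAM; return (accepted, number_of_steps)."""
    r = defaultdict(int)                    # registers not set below contain 0
    K = len(constants)
    for i, value in enumerate(constants):   # r(0), ..., r(K-1) hold constants
        r[i] = value
    for i, bit in enumerate(input_bits):    # r(K+i) holds input bit i
        r[K + i] = bit
    r[K + len(input_bits)] = -1             # end-of-input marker
    pc, steps = 1, 0                        # statements are numbered from 1
    while steps < max_steps:
        op, *args = program[pc - 1]
        pc += 1
        steps += 1                          # every instruction costs one step
        if op == "copy":                    # r(i) <- r(j)
            i, j = args
            r[i] = r[j]
        elif op == "load":                  # r(i) <- (r(0)), indirect read
            r[args[0]] = r[abs(r[0])]
        elif op == "store":                 # (r(0)) <- r(j), indirect write
            r[abs(r[0])] = r[args[0]]
        elif op == "add":                   # r(i) <- r(j) + r(k)
            i, j, k = args
            r[i] = r[j] + r[k]
        elif op == "sub":                   # r(i) <- r(j) - r(k)
            i, j, k = args
            r[i] = r[j] - r[k]
        elif op == "jump":                  # Jump q, taken if <r(0)> < <r(1)>
            if r[0] < r[1]:
                pc = args[0]
        elif op == "accept":
            return True, steps
        elif op == "reject":
            return False, steps
    raise RuntimeError("step budget exhausted")

# Hypothetical program: accept iff the first input bit is 1.
# Constants: r(0) = 0, r(1) = 0, r(2) = 3 (the address K of the first input bit).
CONSTANTS = [0, 0, 3]
PROGRAM = [
    ("copy", 0, 2),     # 1: r(0) <- r(2), so r(0) = 3
    ("load", 1),        # 2: r(1) <- content of r(3)
    ("sub", 0, 0, 0),   # 3: r(0) <- 0
    ("jump", 6),        # 4: if 0 < r(1), go to statement 6
    ("reject",),        # 5
    ("accept",),        # 6
]

print(run_ram(PROGRAM, CONSTANTS, [1, 0]))   # (True, 5)
print(run_ram(PROGRAM, CONSTANTS, [0, 1]))   # (False, 5)
```

The running time reported is simply the number of executed instructions, which matches the unit-cost convention.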
2.2 The PRAM
A PRAM P comprises a collection of processors $P(0), P(1), \ldots$, which communicate via a global memory $(g(i))$. The initial contents of the global memory are as follows: the first $K'$ global registers hold some constants, where $K'$ is another constant that depends on P; the next $n + 1$ global registers hold the n input bits, followed by the end-of-input marker; and all other global registers contain 0. Every processor is a unit-cost RAM. Each $P(p)$ has its own local memory $(r_p(i))$ and can use every global memory register in the same manner as it uses a local memory register. In addition, each processor has an extra Fork q instruction for processor activation. Initially, only $P(0)$ is active. Whenever a processor executes a Fork q instruction, a new processor is activated and starts running at statement q. When $P(p)$ executes the Fork instruction for the t-th time, processor $P(2^{t-1}(2p + 1))$ is activated. The processor id (PID) of $P(p)$ is the integer p. When $P(p)$ is activated, its local register $r_p(0)$ is initialized with its PID p, and all other local registers of $P(p)$ contain 0. The PRAM P accepts if and only if $P(0)$ executes an Accept instruction.
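The PID rule for Fork can be checked with a short simulation. The sketch below assumes, for illustration, that every active processor forks once per round; the function name and the round structure are not from the original text.

```python
def activate_all(rounds):
    """Every active processor forks once per round; return the set of PIDs."""
    forks = {0: 0}                     # PID -> number of Fork instructions executed
    for _ in range(rounds):
        for p in list(forks):          # snapshot: new processors start next round
            forks[p] += 1
            t = forks[p]
            child = 2 ** (t - 1) * (2 * p + 1)   # PID activated by the t-th Fork of P(p)
            assert child not in forks  # no PID is ever activated twice
            forks[child] = 0
    return set(forks)

print(sorted(activate_all(3)))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Under this schedule the number of active processors doubles each round, which is the source of the $2^t$ bound on the processors a PRAM can activate in t steps.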
In a PRAM, several processors may attempt to access the same memory cell at the same time. 
A PRAM may allow concurrent-read and concurrent-write (CRCW) operations, concurrent-read 
and exclusive-write (CREW) operations, or exclusive-read and exclusive-write (EREW) operations 
[1, 22, 25]. In a CRCW PRAM, some mechanism is necessary to resolve the simultaneous write 
conflicts [1, 7, 21]. Fich et al. [4] studied the relationships between CRCW PRAM’s with different 
conflict-resolution mechanisms.
In the sequel, we restrict our attention to CREW PRAM’s. Unless otherwise stated, our results 
also hold for CRCW PRAM’s.
3 Techniques for Programming PRAM’s
In this section, we present several techniques for programming the PRAM. First, we show how to perform the following operations quickly on a PRAM: logical AND, summation, and multiplication on “small” integers. Secondly, we describe a fast implementation of multidimensional memory on a PRAM. Thirdly, we explain how every processor can extract useful information from its PID efficiently.
3.1 Logical AND, Summation, and Multiple Memories
It is convenient to interpret integers as logical values. We interpret a nonzero integer as true and 0 
as false.
Lemma 1 [folklore] Suppose in a PRAM P, the global memory registers $g(1), g(2), \ldots, g(n)$ store n integers $k_1, k_2, \ldots, k_n$. Then P can find the sum and logical AND of these n integers in $O(\log n)$ time.
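One standard way to obtain the bound of Lemma 1 is a balanced binary tree of pairwise combinations. The lemma is stated as folklore without proof, so the sketch below is an assumption about the intended technique; it simulates the synchronous rounds sequentially, and the helper names are hypothetical.

```python
def tree_reduce(values, combine):
    """Combine n >= 1 values pairwise in synchronous rounds; return (result, rounds)."""
    g = list(values)                       # stands in for g(1), ..., g(n)
    n, stride, rounds = len(g), 1, 0
    while stride < n:
        # One parallel step: the processor at index i reads g[i + stride]
        # and writes only g[i], so all writes are exclusive.
        for i in range(0, n - stride, 2 * stride):
            g[i] = combine(g[i], g[i + stride])
        stride *= 2
        rounds += 1
    return g[0], rounds

def add(a, b):
    return a + b

def logical_and(a, b):
    return int(a != 0 and b != 0)          # nonzero is true, 0 is false

print(tree_reduce([3, 0, 4, 1, 2], add))           # (10, 3)
print(tree_reduce([1, 5, 2, 7], logical_and))      # (1, 2)
```

The number of rounds is $\lceil \log n \rceil$; activating the n processors by Fork also fits within $O(\log n)$ steps.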
By interleaving memory registers, Cook and Reckhow [2] demonstrated that a unit-cost RAM 
with a single memory can simulate a unit-cost RAM with multiple memories with merely a constant 
factor overhead in the running time. By applying the same technique to the PRAM, it is easy to 
prove the following lemma.
Lemma 2 [folklore] Let $\gamma > 1$. A PRAM with time complexity T and $\gamma$ global memories $(g_1(i)), (g_2(i)), \ldots, (g_\gamma(i))$ can be simulated in time $O(T)$ by a PRAM with one global memory.
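The interleaving idea behind Lemma 2 can be sketched as an address map. The specific formula $\gamma i + (k - 1)$ and the class below are illustrative assumptions; the text only states that the memories are interleaved.

```python
class InterleavedMemory:
    """Emulate gamma global memories g_1, ..., g_gamma inside one memory."""

    def __init__(self, gamma):
        self.gamma = gamma
        self.cells = {}                    # the single memory; unset cells read as 0

    def address(self, k, i):
        # Register i of memory k lives at gamma*i + (k - 1). On a RAM the
        # product gamma*i costs gamma - 1 additions, which is a constant.
        assert 1 <= k <= self.gamma and i >= 0
        return self.gamma * i + (k - 1)

    def read(self, k, i):
        return self.cells.get(self.address(k, i), 0)

    def write(self, k, i, value):
        self.cells[self.address(k, i)] = value

mem = InterleavedMemory(3)
mem.write(1, 5, 10)
mem.write(2, 5, 20)
print(mem.read(1, 5), mem.read(2, 5), mem.read(3, 5))   # 10 20 0
```

Because distinct pairs (k, i) map to distinct addresses, exclusive writes in the multi-memory machine remain exclusive in the single-memory machine.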
3.2 Multiplication on Small Integers
Trahan et al. [24] studied PRAM’s with unit-time multiplication. By the following lemma, we may 
assume that ordinary PRAM’s can perform unit-time multiplication on “small” integers.
Lemma 3 Let P be a PRAM that (i) runs in time T, and (ii) can perform unit-time multiplication on T-bit integers. Then P can be simulated by an ordinary PRAM in time $O(T)$.
Proof We use a PRAM $P'$ with multiple memories to simulate P. Lemma 3 then follows from Lemma 2. $P'$ simulates P step by step. We only need to show that $P'$ can multiply two T-bit integers in $O(1)$ time. Multiplication reduces to squaring and halving via the identity $xy = ((x + y)^2 - x^2 - y^2)/2$. For two T-bit integers x and y, $x + y$ and $2xy$ are respectively at most $T + 1$ and $2T + 1$ bits long. It suffices to show that $P'$ can perform in $O(1)$ time (i) squaring on $(T + 1)$-bit positive integers and (ii) halving (right shift) on $(2T + 1)$-bit positive integers. Before simulating P, $P'$ precomputes a Square Table of size $2^{T+1}$ and a Right Shift Table of size $2^{2T+1}$. Then during the simulation, $P'$ can perform squaring and halving in $O(1)$ time by table lookup. It remains to demonstrate that the Square Table and the Right Shift Table can be precomputed in $O(T)$ time.
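The reduction of multiplication to three squarings and one halving can be illustrated as follows. This is a sketch with a small illustrative T; the tables are filled here with Python's built-in arithmetic only to keep the example short, whereas the proof builds them with additions and subtractions alone.

```python
T = 4                                            # illustrative bit length
SQ = [i * i for i in range(2 ** (T + 1))]        # Square Table, size 2^(T+1)
RS = [i // 2 for i in range(2 ** (2 * T + 1))]   # Right Shift Table, size 2^(2T+1)

def multiply(x, y):
    """Multiply two T-bit nonnegative integers with three SQ lookups and one RS lookup."""
    two_xy = SQ[x + y] - SQ[x] - SQ[y]           # (x + y)^2 - x^2 - y^2 = 2xy
    return RS[two_xy]                            # halve by table lookup

print(multiply(13, 11))   # 143
```

The table sizes are exactly what the bit-length bounds require: x + y fits in T + 1 bits and 2xy fits in 2T + 1 bits.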
$P'$ uses four global memories $(ls(i))$, $(rs(i))$, $(lsb(i))$, and $(sq(i))$ to implement four tables:
1. Left Shift Table: $\langle ls(i) \rangle = 2i$
2. Right Shift Table: $\langle rs(i) \rangle = \lfloor i/2 \rfloor$
3. Least Significant Bit Table: $\langle lsb(i) \rangle$ = least significant bit of i
4. Square Table: $\langle sq(i) \rangle = i^2$
$P'$ initializes the first three tables for $0 \le i < 2^{2T+1}$ as follows. In $O(T)$ time, $P'$ activates processors $P(0), P(1), \ldots, P(2^{2T+1} - 1)$. Each processor does the following:
1. Store PID + PID in ls(PID).
2. Store PID in rs(PID + PID) and rs(PID + PID + 1).
3. Store PID - ls(rs(PID)) in lsb(PID).
Obviously, Steps (1), (2), and (3) take $O(1)$ time. After the Left Shift Table, Right Shift Table, and Least Significant Bit Table are initialized, then for $0 \le i < 2^{T+1}$, each $P(i)$ computes the square of its PID using the paper-and-pencil multiplication method (repeated shift and add) and stores the result in $sq(i)$. This takes $O(T)$ time. Hence, all four tables can be precomputed in $O(T)$ time.
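The table initialization can be traced with a sequential simulation in which each loop stands for one synchronous step carried out by all processors. The function names are hypothetical, and the left shift inside the squaring routine is done by addition rather than by table lookup, an assumption made here so that intermediate values need not stay within the table range.

```python
def shift_add_square(x, bits, rs, lsb):
    """Square x by repeated shift and add, scanning `bits` bits of the multiplier."""
    total, shifted, rest = 0, x, x
    for _ in range(bits):
        if lsb[rest]:
            total = total + shifted
        shifted = shifted + shifted       # left shift by one bit via addition
        rest = rs[rest]                   # right shift by one bit via lookup
    return total

def build_tables(T):
    """Build ls, rs, lsb for indices below 2^(2T+1) and sq for indices below 2^(T+1)."""
    size = 2 ** (2 * T + 1)
    ls = [0] * size
    rs = [0] * (2 * size)                 # step 2 writes indices up to 2*PID + 1
    lsb = [0] * size
    for pid in range(size):
        ls[pid] = pid + pid               # step 1
    for pid in range(size):
        rs[pid + pid] = pid               # step 2
        rs[pid + pid + 1] = pid
    for pid in range(size):
        lsb[pid] = pid - ls[rs[pid]]      # step 3
    sq = [shift_add_square(i, T + 1, rs, lsb) for i in range(2 ** (T + 1))]
    return ls, rs, lsb, sq

ls, rs, lsb, sq = build_tables(3)
print(ls[5], rs[5], lsb[5], sq[5])   # 10 2 1 25
```

Only additions, subtractions, and table accesses are used, which is why an ordinary PRAM can carry out the initialization.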
We have assumed that $P'$ knows the value of T a priori. This assumption can be removed easily; $P'$ just tries successive powers of 2 as an estimate of T. This modification does not increase the asymptotic running time of $P'$, since the costs of the successive attempts form a geometric series dominated by the last attempt. □
3.3 Multidimensional Memory
A d-dimensional RAM is one with memory $(r(i_1, i_2, \ldots, i_d))$, where $i_1, i_2, \ldots, i_d \ge 0$. A d-dimensional PRAM is one with global memory $(g(i_1, i_2, \ldots, i_d))$; each processor of a d-dimensional PRAM is a d-dimensional RAM. Robson [19] showed that ordinary RAM’s can simulate multidimensional RAM’s with only a constant factor overhead in the running time. However, the proof of Robson cannot be adapted directly to prove the analogous result for PRAM’s. Briefly, the reason is as follows. To simulate a RAM R with 2-dimensional memory $(r(i, j))$ by an ordinary RAM $R'$ with memory $(r'(i))$, Robson devised a mapping from the $r(i, j)$’s to the $r'(i)$’s. This mapping depends on the sequence of $r(i, j)$’s accessed during the computation of R, and $R'$ constructs this mapping incrementally as it simulates R step by step. Consider applying the same idea to simulate a PRAM P with 2-dimensional global memory $(g(i, j))$ by an ordinary PRAM $P'$ with global memory $(g'(i))$. If we simulate each processor of P by a corresponding processor of $P'$ as in the proof of Robson, then different processors of P may access the $g(i, j)$’s in different ways, and hence, different processors of $P'$ may have different mappings. Thus, some processor of $P'$ may think that the value of $g(0, 0)$ is stored in $g'(0)$, whereas another processor of $P'$ thinks that the same value is stored in $g'(1)$. Obviously, such a simulation of P by $P'$ does not work.
Nevertheless, the analogous result for PRAM’s does hold, as shown by the next lemma.
Lemma 4 A d-dimensional PRAM P running in time T can be simulated by an ordinary PRAM $P'$ in time $O(T)$.
Proof $P'$ uses processor $P'(i)$ to simulate the corresponding processor $P(i)$ of P. Every $P'(i)$ simulates $P(i)$ step by step. It suffices to explain how to emulate d-dimensional memories by 1-dimensional memories. We demonstrate how $P'(i)$ emulates an access of $P(i)$ to the d-dimensional global memory of P by an access to the 1-dimensional global memory of $P'$. $P'(i)$ uses its 1-dimensional local memory to emulate the d-dimensional local memory of $P(i)$ in a similar fashion.
P has global memory $(g(i_1, i_2, \ldots, i_d))$; $P'$ has global memory $(g'(i))$. In time T, $P(i)$ can produce integers no longer than CT bits for some constant C. Define $b = 2^{CT}$ and $\eta(i_1, i_2, \ldots, i_d) = \sum_{j=1}^{d} i_j b^{j-1}$. We map $g(i_1, i_2, \ldots, i_d)$ of P to $g'(\eta(i_1, i_2, \ldots, i_d))$ of $P'$. It is easy to verify that the $\eta(i_1, i_2, \ldots, i_d)$’s are distinct for $0 \le i_1, i_2, \ldots, i_d < 2^{CT}$. To emulate an access to $g(i_1, i_2, \ldots, i_d)$ by $P(i)$, $P'(i)$ computes $\eta(i_1, i_2, \ldots, i_d)$ and accesses $g'(\eta(i_1, i_2, \ldots, i_d))$.
It remains to show that computing $\eta$ takes $O(1)$ time. By repeated doubling, $b = 2^{CT}$ can be precomputed in $O(T)$ time. For $i_1, i_2, \ldots, i_d < 2^{CT}$, $\eta$ is at most dCT bits long. By Lemma 3, we may assume that $P'(i)$ can perform multiplication on dCT-bit integers in $O(1)$ time. Thus, $P'(i)$ can compute $\eta$ in $O(1)$ time.
Again, we have presumed that the value of T is available. This assumption can be removed in the same way as in the proof of Lemma 3. □
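The address map used in the proof of Lemma 4 is a base-b positional encoding of the index tuple. A minimal sketch follows, with a small b chosen for illustration; in the proof b is $2^{CT}$, and the function name is hypothetical.

```python
from itertools import product

def eta(indices, b):
    """Flatten (i_1, ..., i_d), each in [0, b), into sum over j of i_j * b**(j-1)."""
    address, weight = 0, 1
    for i in indices:              # d is a constant, so this loop is O(1) steps
        assert 0 <= i < b
        address += i * weight
        weight *= b
    return address

print(eta((1, 2, 3), 4))   # 57

# Every tuple in [0, b)^d receives its own address, so all processors agree
# on where each multidimensional register is stored.
addresses = {eta(t, 4) for t in product(range(4), repeat=3)}
print(len(addresses))      # 64
```

Unlike an incrementally built mapping, this map is fixed in advance and identical for all processors, which avoids the inconsistency described before the lemma.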
Lemma 4 shows that without loss of generality, we may assume that CREW PRAM’s have 
multidimensional memories. Apparently, some authors have used this fact without proof [3, 16].
3.4 Extracting Information from PID
The advantage of a PRAM over a RAM is that in a PRAM, many processors can work together in 
parallel. Clearly, this advantage is defeated if all processors just do the same thing on the same data, 
in which case one processor is as good as many. To take advantage of the parallelism, therefore, 
different processors have to operate differently. This is easily achieved by exploiting the distinctness 
of the PID’s; each processor consults its PID to determine its operation. For our later purpose, we 
require each processor to be able to look at successive single bits and successive O(logT) bits of its 
PID in order to determine its operation. Next, we demonstrate that every PRAM can be modified 
to fulfill this requirement.
Let P be a PRAM with time complexity T. In time T, P can activate at most $2^T$ processors. The PID of every processor is at most T bits long. We modify P as follows.
1. P activates all $2^T$ processors before any actual computation.
2. P starts its computation by initializing in $O(1)$ time a Least Significant Bit Table, a Right Shift Table, and a Left Shift Table, all of size $2^T$, as described in the proof of Lemma 3. Using the first two tables, each processor can extract successive single bits of its PID, spending $O(1)$ time per bit.
3. P implements two additional tables with global memories $(lsb'(i))$ and $(rs'(i))$:
(i) $lsb'(i)$ = the least significant $\lfloor \log T \rfloor$ bits of i
(ii) $rs'(i) = \lfloor i / 2^{\lfloor \log T \rfloor} \rfloor$, i.e., i right shifted $\lfloor \log T \rfloor$ bits
These two tables can be precomputed in $O(\log T)$ time as follows. We presume the availability of the three tables mentioned in modification 2. For $0 \le i < 2^T$, processor $P(i)$ does the following:
(i) Right shift its PID $\lfloor \log T \rfloor$ times and store the result in $rs'(i)$.
(ii) Left shift $rs'(i)$ $\lfloor \log T \rfloor$ times, subtract the result from its PID, and store the difference in $lsb'(i)$.
Then each processor can extract successive $\lfloor \log T \rfloor$ bits of its PID by table lookup, spending $O(1)$ time per $\lfloor \log T \rfloor$ bits. We have assumed that P knows a priori the values of T and $\lfloor \log T \rfloor$. The knowledge of T is justifiable, as argued in the proof of Lemma 3, and $\lfloor \log T \rfloor$ is simply one less than the number of bits in the binary representation of T.
These modifications increase the running time of P by at most a constant factor.
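The two chunk tables and their use can be sketched as follows. The function names are hypothetical, and Python's integer division and addition stand in for the single-bit Right Shift Table lookups and left shifts of the text.

```python
def build_chunk_tables(T):
    """Return (w, rs_w, lsb_w) with w = floor(log2 T), for all PIDs below 2^T."""
    w = T.bit_length() - 1                # one less than the number of bits of T
    size = 2 ** T
    rs_w, lsb_w = [0] * size, [0] * size
    for pid in range(size):               # done by all processors in parallel
        high = pid
        for _ in range(w):                # w right shifts, O(log T) steps
            high = high // 2
        rs_w[pid] = high
        restored = high
        for _ in range(w):                # w left shifts
            restored = restored + restored
        lsb_w[pid] = pid - restored       # the w low-order bits of pid
    return w, rs_w, lsb_w

def pid_chunks(pid, count, rs_w, lsb_w):
    """Read `count` successive w-bit chunks of pid, least significant first."""
    chunks = []
    for _ in range(count):                # two table lookups per chunk
        chunks.append(lsb_w[pid])
        pid = rs_w[pid]
    return chunks

w, rs_w, lsb_w = build_chunk_tables(8)    # w = 3
print(pid_chunks(0b10110101, 3, rs_w, lsb_w))   # [5, 6, 2]
```

After the one-time $O(\log T)$ precomputation, each chunk costs a constant number of lookups, independent of the chunk width.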
4 Speedup of RAM’s by PRAM’s
We now prove that the PRAM is always faster than the RAM.
Theorem 5 Every unit-cost RAM running in time T can be simulated by a CREW PRAM in time $O(T^{1/2} \log T)$.
Let R be a unit-cost RAM with memory $(r(i))$ and time complexity $T = T(n)$. We devise a CREW PRAM P with multiple multidimensional memories that simulates R in time $O(T^{1/2} \log T)$. Theorem 5 then follows from Lemmas 2 and 4. Let M be a large enough constant so that every address in the program of R can be encoded in M bits; we choose M to be at least $3 \log 3 + 1$ to suit our later purpose. As the input length n tends to infinity, so does T, since $T(n) > n$. Consequently, if n exceeds some constant $n_0$, then $M T^{1/2} > \log(2(T + K + n + 1))$, where K is a constant that depends on R as explained in Section 2.1. It suffices to argue that P runs in $O(T^{1/2} \log T)$ time for $n > n_0$, since we can modify P to handle inputs of length at most $n_0$ by table lookup. We assume that P knows in advance the value of $T^{1/2}$. Otherwise, P tries successive powers of 2 as an estimate of $T^{1/2}$.
4.1 Overview of Simulation
Fix an input $\alpha = \alpha_0 \alpha_1 \cdots \alpha_{n-1}$ and consider the computation of R on $\alpha$. The configuration of R at time t consists of the statement number of R at time t and the contents of all registers at time t. Denote by config(t) the configuration of R at time t.
The simulation comprises two phases. In phase I, P uses $O(T^{1/2} \log T)$ time to activate $T^{1/2}$ groups of $T^{O(T^{1/2})}$ processors. For $1 \le m \le T^{1/2}$, the processors in group m perform some preprocessing such that after the preprocessing, $\mathit{config}(mT^{1/2})$ can be computed from $\mathit{config}((m-1)T^{1/2})$ in $O(\log T)$ time. All groups do the preprocessing simultaneously. In phase II, P finds config(T) as follows. The initial configuration of R, config(0), can be determined trivially. For $m = 1, 2, \ldots, T^{1/2}$, P computes $\mathit{config}(mT^{1/2})$ from $\mathit{config}((m-1)T^{1/2})$ in $O(\log T)$ time. Let $q^*$ be the statement number in config(T). P accepts if and only if statement $q^*$ contains an Accept instruction. Both phases take $O(T^{1/2} \log T)$ time. Next, we present an efficient representation of the configuration of R and then provide the details of phases I and II.
4.2 Representing the Configuration of R
P uses a data structure CONFIG to represent the configuration of R. One difficulty is that P cannot use a single register to store the content of a corresponding register of R. This is because R can generate integers as large as $2^{O(T)}$ in time T, but P can produce integers no larger than $T^{O(T^{1/2})}$ within the intended $O(T^{1/2} \log T)$ time bound. To overcome this difficulty, P divides every $O(T)$-bit integer into $N = O(T^{1/2})$ blocks, each $B = M T^{1/2}$ bits long. Without loss of generality, we assume that R represents negative integers using sign-and-magnitude representation; thus, R works with nonnegative integers exclusively. With this simplifying assumption, every block in our blockwise representation is a nonnegative integer.
Initially, all registers of R contain 0, except for $r(0), r(1), \ldots, r(K + n)$. By convention, the first step of R is step 1, and $r(0), r(1), \ldots, r(K + n)$ are first written to in step 0 (i.e., they are initialized at time 0). For $0 \le i \le K + n$, let $u_i$ denote $r(i)$. For $i > K + n$, if a new register of R is written to in step $i - K - n$, then let $u_i$ denote this register; otherwise, $u_i$ is undefined. To describe the configuration of R at time t, it suffices to specify the statement number of R at time t and the address and content at time t of $u_i$ for $0 \le i \le t + K + n$.
CONFIG consists of three global memories $(a(i, j))$, $(c(i, j))$, and $(b(i))$. To represent the configuration of R at time t, register $b(0)$ holds the statement number of R at time t. If $0 \le i \le t + K + n$ and $u_i$ is defined, then $c(i, j)$ holds the j-th block of B bits in $\langle u_i \rangle$. Thus, $\langle u_i \rangle = \sum_{j=0}^{N-1} \langle c(i, j) \rangle 2^{jB}$. For brevity, we say $c(i)$ holds $\langle u_i \rangle$, or $\langle c(i) \rangle = \langle u_i \rangle$, implying the blockwise representation. Register $a(i)$ holds the address of $u_i$ in the same blockwise format. If $t < i - K - n \le T$ or $u_i$ is undefined, then $a(i)$ holds $-1$, and $c(i)$ is not used. The number of registers that CONFIG uses is therefore $O(NT) = O(T^{3/2})$.
Henceforth, when we mention config(t), we imply the above representation.
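The blockwise representation amounts to writing a nonnegative integer in base $2^B$. The following sketch uses small illustrative values of B and N and hypothetical function names; it shows only the representation, not the arithmetic performed on it.

```python
def to_blocks(value, B, N):
    """Split a nonnegative integer into N blocks of B bits, least significant first."""
    assert 0 <= value < 1 << (B * N)
    mask = (1 << B) - 1
    return [(value >> (j * B)) & mask for j in range(N)]

def from_blocks(blocks, B):
    """Recombine blocks: the value is the sum over j of blocks[j] * 2**(j*B)."""
    return sum(block << (j * B) for j, block in enumerate(blocks))

blocks = to_blocks(0b1101_0110_0011, B=4, N=3)
print(blocks)                   # [3, 6, 13]
print(from_blocks(blocks, 4))   # 3427
```

Each block is small enough to fit in one register of P, while the full list of blocks carries the large integer held by a single register of R.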
4.3 Phase I
4.3.1 Static, Dynamic, and Effective Instructions
Due to conditional jumps, a program statement may be executed more than once. For clarity, we distinguish between a static instruction and a dynamic instruction. The former is a static entity in a program statement of R. The latter is an executed instruction, that is, an instance of a static instruction during the computation of R. A static instruction may correspond to no dynamic instruction or to many.
Divide the dynamic instructions of R into three types.
1. Accept, Reject, and Jump
2. direct and indirect load and store instructions ($r(i) \leftarrow r(j)$, $r(i) \leftarrow (r(0))$, and $(r(0)) \leftarrow r(j)$)
3. arithmetic operations ($r(i) \leftarrow r(j) + r(k)$ and $r(i) \leftarrow r(j) - r(k)$)
Consider the effect of each dynamic instruction on the memory of R. A type 1 instruction does not change the content of any register. As far as the effect on the memory is concerned, a type 1 instruction is equivalent to a $u_0 \leftarrow u_0$ instruction. An instruction of type 2 copies the content of one register to another. Without loss of generality, we assume that $r(K - 1)$ always holds 0. Reading from an uninitialized register is the same as reading from $r(K - 1) = u_{K-1}$, since all uninitialized registers contain 0. The effect of a type 2 instruction is thus $u_{i'} \leftarrow u_{j'}$ for some $0 \le i', j' \le T + K + n$. A type 3 instruction finds the sum or difference of the contents of two registers and stores the result in a third register. The effect of a type 3 instruction is either $u_{i'} \leftarrow u_{j'} + u_{k'}$ or $u_{i'} \leftarrow u_{j'} - u_{k'}$ for some $0 \le i', j', k' \le T + K + n$. Therefore, each dynamic instruction is, in effect, of the form $u_{i'} \leftarrow u_{j'}$, $u_{i'} \leftarrow u_{j'} + u_{k'}$, or $u_{i'} \leftarrow u_{j'} - u_{k'}$ for some $0 \le i', j', k' \le T + K + n$. Since $T = T(n) > n$, each index ranges over $O(T)$ values, so there are $O(T^3)$ effective instructions of the above forms. Note that one static instruction may correspond to several dynamic instructions, each equivalent to a different effective instruction.
4.3.2 The Preprocessing
We fix $m$ and describe the processors in group $m$. Let $\pi_m$ be the triple $(q_m, \beta_m, \sigma_m)$, where $q_m$ is the statement number of $R$ at time $(m-1)T^{1/2}$, $\sigma_m$ is the sequence of $T^{1/2}$ effective instructions from time $(m-1)T^{1/2}$ to $mT^{1/2}$, and $\beta_m$ is a binary string that encodes the outcomes of all conditional jumps between time $(m-1)T^{1/2}$ and $mT^{1/2}$. For uniformity, we view every static and dynamic instruction as a conditional jump. An ACCEPT instruction, for example, may be viewed as a conditional jump where the condition is always false, and the destination of the jump is statement 1. In this way, $\beta_m$ is always of length $T^{1/2}$. The triple $\pi_m$ specifies the behavior of $R$ between time $(m-1)T^{1/2}$ and $mT^{1/2}$. Using the information contained in $\pi_m$, group $m$ performs some preprocessing that enables $\mathit{config}(mT^{1/2})$ to be computed from $\mathit{config}((m-1)T^{1/2})$ quickly. One problem is that group $m$ does not know $\pi_m$ in advance. To surmount this problem, group $m$ uses enough processors to try all possible triples. Let $Q$ be the number of statements in the program of $R$. The number of possible triples is thus $Q \times 2^{T^{1/2}} \times O(T^3)^{T^{1/2}} = T^{O(T^{1/2})}$. Group $m$ uses $T^{O(T^{1/2})}$ processors, which can be activated in $O(T^{1/2} \log T)$ time. Each processor is responsible for a distinct triple, which is encoded in the processor's PID. All processors in all groups carry out their preprocessing simultaneously in parallel.
We focus on one specific processor $P_\pi$ of group $m$, which is responsible for one particular triple $\pi = (q, \beta, \sigma)$. Notice the difference in notation: the PID of $P(p)$ is $p$, whereas the PID of $P_\pi$ is not $\pi$, but the triple $\pi$ is encoded in the PID of $P_\pi$. $P_\pi$ decodes its PID to obtain $q$, $\beta$, and $\sigma$. To obtain the sequence of $T^{1/2}$ effective instructions $\sigma$, $P_\pi$ extracts the least significant $O(T^{1/2} \log T)$ bits from its PID, $O(\log T)$ bits at a time. Every effective instruction can be encoded in $O(\log T)$ bits, since there are $O(T^3)$ different effective instructions. To recover the individual bits of $\beta$, $P_\pi$ extracts the next $T^{1/2}$ bits from its PID, one bit at a time. The next $\lfloor \log Q \rfloor + 1$ bits of the PID constitute $q$. Using the techniques prescribed in Section 3.4, $P_\pi$ can decode its PID in $O(T^{1/2})$ time. $P_\pi$ saves all decoded information in tables so that it can access each bit of $\beta$ and each effective instruction in $O(1)$ time by table lookup.
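The decoding step can be illustrated with a minimal sketch. The bit layout follows the description above (the effective instructions in the least significant bits, then the jump outcomes, then the statement number). The names `encode_triple` and `decode_pid`, the fixed field width `w`, and the use of Python's built-in shifts in place of the Section 3.4 techniques are assumptions made purely for illustration.

```python
def encode_triple(q, beta, sigma, w):
    """Pack (q, beta, sigma) into a single integer PID.

    Hypothetical layout: sigma fills the least significant len(sigma) * w
    bits (w bits per effective-instruction code), beta the next len(beta)
    bits, and q the remaining high-order bits.
    """
    pid = q
    for bit in reversed(beta):
        pid = (pid << 1) | bit
    for code in reversed(sigma):
        assert 0 <= code < (1 << w)
        pid = (pid << w) | code
    return pid


def decode_pid(pid, steps, w):
    """Recover (q, beta, sigma); `steps` plays the role of T^(1/2)."""
    mask = (1 << w) - 1
    sigma = []
    for _ in range(steps):      # w bits at a time
        sigma.append(pid & mask)
        pid >>= w
    beta = []
    for _ in range(steps):      # one bit at a time
        beta.append(pid & 1)
        pid >>= 1
    return pid, beta, sigma     # whatever remains is q
```

Storing `beta` and `sigma` as lists mirrors the lookup tables mentioned above: once decoded, each entry is available by index.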
To facilitate our discussion, we say the triple $\pi$ “happens” if the actual behavior of $R$ conforms with the information contained in $\pi$. Now $\pi$ may or may not happen. In phase I, $P_\pi$ performs some preprocessing so that in phase II, once $\mathit{config}((m-1)T^{1/2})$ has been computed, $P_\pi$ is able to decide in $O(\log T)$ time whether $\pi$ actually happens and, if so, to compute $\mathit{config}(mT^{1/2})$ from $\mathit{config}((m-1)T^{1/2})$ in $O(\log T)$ time. Because of the way we represent the configuration of $R$ (Section 4.2), to compute $\mathit{config}(mT^{1/2})$ from $\mathit{config}((m-1)T^{1/2})$, it suffices to determine the following:
1. The statement number of $R$ at time $mT^{1/2}$.
2. For $(m-1)T^{1/2} < i-K-n \le mT^{1/2}$, the address of $u_i$ if $u_i$ is defined.
3. For $0 \le i \le mT^{1/2}+K+n$, the content of $u_i$ at time $mT^{1/2}$ if $u_i$ is defined.
Below we explain the preprocessing that enables $P_\pi$ to determine each of the above three items efficiently in phase II, assuming $\pi$ actually happens.
4.3.3 The Line Number
Starting from statement $q$, $P_\pi$ steps through the program of $R$ statement by statement, following the flow of control defined by $\beta$. Meanwhile, $P_\pi$ keeps track of the statement number of $R$. After $T^{1/2}$ steps, $P_\pi$ obtains the statement number of $R$ at time $mT^{1/2}$. This preprocessing takes $O(T^{1/2})$ time.
4.3.4 The Addresses of the $u_i$'s
In $O(\log T)$ time, $P_\pi$ activates $T^{1/2}$ processors $P_i$, where $(m-1)T^{1/2} < i-K-n \le mT^{1/2}$. Each $P_i$ is responsible for finding the address of $u_i$.
We fix $i$ and describe $P_i$. $P_i$ considers the effective instruction in step $s = i-K-n$ given by $\sigma$. Suppose this effective instruction is of the form $u_{i'} \leftarrow u_{j'}$. Other cases are handled similarly. If $i' \ne i$, then no new register is written to in step $s$, and $u_i$ is undefined. Otherwise, $u_i$ is the register first written to in step $s$. $P_i$ steps through the program of $R$ in the same manner as described in Section 4.3.3 and finds the dynamic instruction in step $s$. If this dynamic instruction is of the form $r(j) \leftarrow r(k)$ or $r(j) \leftarrow (r(0))$, then the address of $u_i$ is $j$. The blockwise representation of $j$ is readily obtained, since all addresses in the program of $R$ are at most $M \le B$ bits long; the least significant $B$-bit block of $j$ is just $j$ itself, and all other blocks are 0. If the dynamic instruction in step $s$ is of the form $(r(0)) \leftarrow r(k)$, then the address of $u_i$ is the content of $r(0)$ at time $s-1$. Denote by $\langle u_i, t \rangle$ the content of $u_i$ at time $t$. In Section 4.3.5, we explain the preprocessing for finding $\langle u_i, mT^{1/2} \rangle$. $P_i$ performs the preprocessing for finding $\langle r(0), s-1 \rangle = \langle u_0, s-1 \rangle$ in a similar fashion.
4.3.5 The Contents of the $u_i$'s
Since the sole arithmetic operations permitted are addition and subtraction, it follows that for a fixed $\pi$, $\langle u_i, mT^{1/2} \rangle$ is a linear combination of the $\langle u_j, (m-1)T^{1/2} \rangle$'s. Let
$$\langle u_i, mT^{1/2} \rangle = \sum_{j=0}^{T+K+n} C_{ij} \langle u_j, (m-1)T^{1/2} \rangle,$$
where the $C_{ij}$'s are integer coefficients which depend only on $\pi$. In $O(\log T)$ time, $P_\pi$ deploys $O(T^2)$ processors $P_{ij}$, where $0 \le i, j \le T+K+n$. Each $P_{ij}$ finds $C_{ij}$ in phase I.
We fix $i$ and $j$ and describe $P_{ij}$. $P_{ij}$ creates an empty directed multigraph $G$ and then processes the effective instructions specified by $\sigma$ one by one. As $P_{ij}$ considers each effective instruction, it inserts nodes and edges into $G$. $P_{ij}$ marks each edge either “positive” or “negative.” Node $w$ is a positive child of node $v$ if the edge $(v, w)$ is positive. A negative child is defined analogously. Let $v^+$ and $v^-$ respectively be the set of positive and negative children of $v$.
$P_{ij}$ maintains a counter $\tau$ to keep track of the step corresponding to the effective instruction currently under consideration. $P_{ij}$ initializes $\tau$ to $(m-1)T^{1/2}+1$ and increments $\tau$ after every effective instruction. $P_{ij}$ names each node either $[u_k, (m-1)T^{1/2}]$ or $[u_k, \tau]$ for some $0 \le k \le T+K+n$. Intuitively, node $[u_k, t]$ represents $\langle u_k, t \rangle$. For a node $v = [u_k, t]$, we write $\langle v \rangle$ for $\langle u_k, t \rangle$. The edges are marked such that
$$\langle v \rangle = \sum_{w \in v^+} \langle w \rangle - \sum_{w \in v^-} \langle w \rangle \quad \text{for each node } v. \qquad (1)$$
This will become clear after we explain how $P_{ij}$ constructs $G$. After processing all $T^{1/2}$ effective instructions, $P_{ij}$ uses $G$ to obtain $C_{ij}$.
4.3.6 Constructing G
$P_{ij}$ considers the $T^{1/2}$ effective instructions specified by $\sigma$ one by one and constructs $G$ as follows. For a $u_{i'} \leftarrow u_{j'}$ instruction, $P_{ij}$ does the following. Create node $[u_{i'}, \tau]$. If $G$ does not contain a node $[u_{j'}, \tau']$ for some $\tau' < \tau$, then create node $[u_{j'}, (m-1)T^{1/2}]$. Let $\tau_{j'} < \tau$ be maximum such that $G$ contains node $[u_{j'}, \tau_{j'}]$. Insert edge $([u_{i'}, \tau], [u_{j'}, \tau_{j'}])$ and mark it positive.
$P_{ij}$ processes a $u_{i'} \leftarrow u_{j'} - u_{k'}$ instruction as follows. Create node $[u_{i'}, \tau]$. If $G$ does not contain a node $[u_{j'}, \tau']$ for some $\tau' < \tau$, then create node $[u_{j'}, (m-1)T^{1/2}]$. Similarly, if $G$ does not contain a node $[u_{k'}, \tau']$ for some $\tau' < \tau$, then create node $[u_{k'}, (m-1)T^{1/2}]$. Let $\tau_{j'}, \tau_{k'} < \tau$ be maximum such that $G$ contains nodes $[u_{j'}, \tau_{j'}]$ and $[u_{k'}, \tau_{k'}]$. Insert a positive edge $([u_{i'}, \tau], [u_{j'}, \tau_{j'}])$ and a negative edge $([u_{i'}, \tau], [u_{k'}, \tau_{k'}])$. A $u_{i'} \leftarrow u_{j'} + u_{k'}$ instruction is processed in the same way except that both of the inserted edges are positive.
It is mechanical to verify that the above construction yields a graph which satisfies (1), and every node of the graph has outdegree at most two. We illustrate the above construction with an example for $P_\pi$ of group $m = 1$ with $T^{1/2} = 5$. Figure 1 shows the effective instructions specified by $\pi$. The graph constructed by $P_{ij}$ appears in Figure 2.
Step    Instruction
1       $u_0 \leftarrow u_1$
2       $u_0 \leftarrow u_0 + u_0$
3       $u_0 \leftarrow u_0 - u_2$
4       $u_1 \leftarrow u_1 + u_0$
5       $u_0 \leftarrow u_0 + u_1$

Effect on memory: $\langle u_0, 5 \rangle = 5\langle u_1, 0 \rangle - 2\langle u_2, 0 \rangle$ and $\langle u_1, 5 \rangle = 3\langle u_1, 0 \rangle - \langle u_2, 0 \rangle$.
In this example, $C_{01} = 5$, $C_{02} = -2$, $C_{11} = 3$, and $C_{12} = -1$.

Figure 1: The $T^{1/2} = 5$ effective instructions specified by $\pi$ and their effect on the memory (example).

Figure 2: The graph $G$ constructed by $P_{ij}$ after processing the effective instructions in Figure 1.
4.3.7 Computing $C_{ij}$
We explain how $P_{ij}$ uses $G$ to compute $C_{ij}$. $P_{ij}$ checks whether $G$ contains a node $[u_i, \tau']$ for some $\tau' > (m-1)T^{1/2}$.
Case I (No such node exists) By construction of $G$, $P_{ij}$ will create node $[u_i, \tau']$ if $R$ writes to $u_i$ in step $\tau'$. The hypothesis thus implies $R$ does not write to $u_i$ between time $(m-1)T^{1/2}$ and $mT^{1/2}$. It follows that $\langle u_i, mT^{1/2} \rangle = \langle u_i, (m-1)T^{1/2} \rangle$. Ergo, $C_{ij} = 0$ for $j \ne i$, and $C_{ii} = 1$.
Case II (Otherwise) Let $\tau_i$ be maximum such that $G$ contains node $[u_i, \tau_i]$. Similar arguments as in Case I give $\langle u_i, mT^{1/2} \rangle = \langle u_i, \tau_i \rangle$.
Consider the subgraph $H$ of $G$ induced by node $[u_i, \tau_i]$ and all its descendants. By construction, $G$ (and hence $H$) is a directed acyclic multigraph. $P_{ij}$ sorts the nodes in $H$ topologically and labels each edge in $H$ with an integer as follows. $P_{ij}$ considers the nodes in $H$ in topological order. For each node $v$, $P_{ij}$ labels the outgoing edges of $v$. When $P_{ij}$ considers node $v$, all incoming edges of $v$ are labeled, since $P_{ij}$ considers the nodes in topological order. Evidently, the first node considered is $[u_i, \tau_i]$. $P_{ij}$ labels every positive and negative outgoing edge of $[u_i, \tau_i]$ with 1 and $-1$ respectively. For each remaining node $v$ in topological order, let $A(H, v)$ be the sum of the labels on the incoming edges of $v$ in $H$. $P_{ij}$ labels each positive and negative outgoing edge of $v$ with $A(H, v)$ and $-A(H, v)$ respectively. Figure 3 shows the result of applying the above labeling algorithm to the graph in Figure 2.

Figure 3: The result of applying the labeling algorithm of Section 4.3.7 to the graph in Figure 2.
For $s > 0$, let $H_s$ be the subgraph of $H$ induced by all labeled edges after $s$ nodes are considered. The leaves of $H_s$, denoted by $L(H_s)$, are the nodes in $H_s$ with no outgoing edges. The following invariant is a consequence of (1): After $s$ nodes are considered, $\langle u_i, \tau_i \rangle = \sum_{v \in L(H_s)} A(H_s, v) \langle v \rangle$. Therefore, after all edges in $H$ are labeled, $\langle u_i, mT^{1/2} \rangle = \sum_{v \in L(H)} A(H, v) \langle v \rangle$. By construction, every leaf of $H$ is named $[u_j, (m-1)T^{1/2}]$ for some $j$. Hence, $C_{ij} = A(H, [u_j, (m-1)T^{1/2}])$ if $[u_j, (m-1)T^{1/2}]$ is a leaf of $H$; otherwise, $C_{ij} = 0$. For the example in Figure 3, $P_{01}$ determines that $C_{01} = 4 + 1 = 5$, and $P_{02}$ concludes that $C_{02} = -2$.
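The graph construction and the labeling algorithm can be traced with a short sequential sketch. It is only an illustration under stated assumptions: the tuple encoding of effective instructions, the names `build_graph` and `coefficients`, and the use of decreasing time as the topological order are choices made here, not taken from the report. The instruction list reproduces the five-step example of Figure 1.

```python
from collections import defaultdict

# Figure 1 as (i, j, k, sign): u_i <- u_j if k is None, else u_i <- u_j + sign * u_k.
FIGURE_1 = [
    (0, 1, None, +1),   # step 1: u0 <- u1
    (0, 0, 0, +1),      # step 2: u0 <- u0 + u0
    (0, 0, 2, -1),      # step 3: u0 <- u0 - u2
    (1, 1, 0, +1),      # step 4: u1 <- u1 + u0
    (0, 0, 1, +1),      # step 5: u0 <- u0 + u1
]


def build_graph(instrs, start=0):
    """Build G. Nodes are (register index, time); edges[v] lists (child, +1 or -1)."""
    edges = {}
    latest = {}                       # register index -> time of its newest node

    def newest(r):
        if r not in latest:           # not seen yet: create the initial-time node
            latest[r] = start
            edges[(r, start)] = []
        return (r, latest[r])

    for tau, (i, j, k, sign) in enumerate(instrs, start=start + 1):
        out = [(newest(j), +1)]
        if k is not None:
            out.append((newest(k), sign))
        edges[(i, tau)] = out         # children are resolved before u_i is updated
        latest[i] = tau
    return edges, latest


def coefficients(instrs, i, start=0):
    """Return {j: C_ij} for the nonzero coefficients of u_i."""
    edges, latest = build_graph(instrs, start)
    if latest.get(i, start) == start:
        return {i: 1}                 # Case I: u_i is never written
    incoming = defaultdict(int)       # sum of labels on incoming edges of each node
    incoming[(i, latest[i])] = 1      # makes the root's edges carry labels +1 / -1
    # Every edge points to a strictly earlier time, so decreasing time is a
    # topological order of the subgraph reachable from the root.
    for v in sorted(edges, key=lambda node: -node[1]):
        if v not in incoming:         # not a descendant of the root
            continue
        for child, sign in edges[v]:
            incoming[child] += sign * incoming[v]
    return {r: a for (r, t), a in incoming.items() if t == start and a != 0}
```

For the Figure 1 instructions, `coefficients(FIGURE_1, 0)` accumulates the labels 4 and 1 on the two edges entering the node for $u_1$ at time 0, which agrees with $C_{01} = 4 + 1 = 5$ above.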
In the above labeling algorithm, the sum of the absolute values of all the labels on the edges of $H_s$ is at most triple that of $H_{s-1}$, since every node in $H$ has outdegree at most two. The number of nodes in $H$ is $|H| \le |G| \le 3T^{1/2}$, because at most three nodes are created for each of the $T^{1/2}$ effective instructions. Therefore, $|C_{ij}| \le 3^{3T^{1/2}}$ for all $i, j$. Each $C_{ij}$ is at most $B = MT^{1/2}$ bits long, since $M \ge 3 \log 3 + 1$.
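The last step can be spelled out as a short calculation. It is a sketch under two assumptions not stated explicitly here: logarithms are taken to base 2, and only the magnitude of a nonzero $C_{ij}$ is counted (signs are handled separately in Section 4.4.1).

```latex
\[
  \lfloor \log_2 |C_{ij}| \rfloor + 1
  \;\le\; 3T^{1/2}\log_2 3 + 1
  \;\le\; (3\log_2 3 + 1)\,T^{1/2}
  \;\le\; M\,T^{1/2} \;=\; B
  \qquad (T \ge 1).
\]
```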
The number of edges in $G$ is $O(|G|)$, since each node has bounded outdegree. Constructing $G$ and $H$, topologically sorting $H$, labeling the edges of $H$, and computing $A(H, [u_j, (m-1)T^{1/2}])$ all take $O(|G|) = O(T^{1/2})$ time. Thus, the bottleneck in phase I is the activation of enough processors to try all possible $\pi$, which takes $O(T^{1/2} \log T)$ time.
4.3.8 Table Precomputation
In phase II, $P$ has to extract efficiently the most and least significant $B$ bits of a $2B$-bit integer. In phase I, $P$ precomputes two tables $(h_1(i))$ and $(h_2(i))$ so that the first and second half of $i$ can be extracted in $O(1)$ time by table lookup. $P$ uses $O(T^{1/2})$ time to activate processors $P(1), \ldots, P(2^{2B}-1)$ and builds up a Left Shift Table and a Right Shift Table of size $2^{2B}$ as in the proof of Lemma 3. Next, for $0 \le i < 2^{2B}$, each $P(i)$ extracts in $O(T^{1/2})$ time the first and second halves of its PID as follows. The first half is obtained by shifting the PID right $B$ times using the Right Shift Table. The second half is obtained by shifting the first half left $B$ times and subtracting the shifted value from the PID. $P(i)$ stores the first and second halves in $h_1(i)$ and $h_2(i)$ respectively. Hence, the two tables $(h_1(i))$ and $(h_2(i))$ can be precomputed in $O(B) = O(T^{1/2})$ time.
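The half-extraction procedure is small enough to sketch directly. In this illustration, Python's one-position shifts stand in for single lookups in the Left and Right Shift Tables, and the names `split_halves` and `build_tables` are hypothetical.

```python
def split_halves(x, B):
    """Split a 2B-bit integer into (most significant B bits, least significant B bits)
    using only one-position shifts and one subtraction."""
    assert 0 <= x < (1 << (2 * B))
    high = x
    for _ in range(B):
        high >>= 1            # one Right Shift Table lookup per iteration
    shifted = high
    for _ in range(B):
        shifted <<= 1         # one Left Shift Table lookup per iteration
    low = x - shifted         # removes the high half, leaving the low B bits
    return high, low


def build_tables(B):
    """Sequential stand-in for the 2^(2B) processors P(i) filling h1 and h2."""
    h1, h2 = [], []
    for i in range(1 << (2 * B)):
        hi, lo = split_halves(i, B)
        h1.append(hi)
        h2.append(lo)
    return h1, h2
```

Each call performs $2B$ shifts and one subtraction, matching the $O(B)$ bound per processor; in the PRAM setting all entries are filled simultaneously rather than in a loop.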
4.4 Phase II
The data structure CONFIG has $O(T^{3/2})$ registers. In phase II, $P$ initializes CONFIG in parallel using $O(\log T)$ time so that CONFIG contains $\mathit{config}(0)$. For $m = 1, 2, \ldots, T^{1/2}$, the $T^{O(T^{1/2})}$ processors in group $m$ do the following:
1. Each processor $P_\pi$ in group $m$ checks in $O(\log T)$ time whether $\pi$ actually happens.
2. If so, compute $\mathit{config}(mT^{1/2})$ from $\mathit{config}((m-1)T^{1/2})$ (stored in CONFIG) in $O(\log T)$ time and update CONFIG accordingly.
After $T^{1/2}$ updates, CONFIG contains $\mathit{config}(T)$. $P$ accepts if and only if statement $q^*$ contains an ACCEPT instruction, where $q^*$ is the statement number in $\mathit{config}(T)$. Notice that in Step (1), exactly one $P_\pi$ determines that $\pi$ happens. So in Step (2), no write conflicts arise when updating CONFIG.
Next, we demonstrate that $P_\pi$ can compute $\mathit{config}(mT^{1/2})$ from $\mathit{config}((m-1)T^{1/2})$ in $O(\log T)$ time, provided that $\pi$ actually happens. In Section 4.5, we prove that $O(\log T)$ time suffices to verify whether $\pi$ actually happens. The preprocessing of Section 4.3.3 yields the statement number at time $mT^{1/2}$ directly. For $(m-1)T^{1/2} < i-K-n \le mT^{1/2}$, the preprocessing of Section 4.3.4 either gives the address of $u_i$ directly or reduces the problem of finding the address of $u_i$ to that of finding the content of $u_0$. It remains to explain how to determine the contents of the $u_i$'s.
4.4.1 Computing Contents of Registers
We now explain how $P_\pi$ computes the contents of the $u_i$'s at time $mT^{1/2}$ from the contents of the $u_i$'s at time $(m-1)T^{1/2}$. Suppose $\mathit{config}((m-1)T^{1/2})$ is available in CONFIG as described in Section 4.2. Then $c(j, k)$ holds the $k$th $B$-bit block of $\langle u_j, (m-1)T^{1/2} \rangle$. Recall that
1. $\langle u_i, mT^{1/2} \rangle = \sum_{j=0}^{T+K+n} C_{ij} \langle u_j, (m-1)T^{1/2} \rangle$.
2. In phase I, $P_\pi$ dispatches processor $P_{ij}$ to calculate $C_{ij}$.
3. $C_{ij}$ is a $B$-bit integer.
The product $C_{ij} \langle c(j, k) \rangle$ is thus a $2B$-bit integer. In phase II, the $P_{ij}$'s cooperate to compute $\langle u_i, mT^{1/2} \rangle$ in $O(\log T)$ time as follows. The $P_{ij}$'s use four multidimensional global memories $(g(i_1, i_2, i_3, i_4))$, $(g'(i_1, i_2, i_3, i_4))$, $(h(i_1, i_2, i_3))$, and $(h'(i_1, i_2, i_3))$. Let $p$ be the PID of $P_\pi$. In $O(\log T)$ time, every $P_{ij}$ activates $O(T)$ processors $P'_k$, where $0 \le k \le T+K+n$. Each $P'_k$ multiplies $C_{ij}$ with $\langle c(j, k) \rangle$ and puts the most and least significant $B$ bits of the product in $g'(p, i, j, k+1)$ and $g(p, i, j, k)$ respectively. By Lemma 3, we may assume that the multiplication requires $O(1)$ time. Extracting the most and least significant $B$ bits also takes $O(1)$ time as discussed in Section 4.3.8. Then
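$$\langle u_i, mT^{1/2} \rangle = \sum_{j=0}^{T+K+n} C_{ij} \langle c(j) \rangle \qquad (2)$$
$$\langle c(j) \rangle = \sum_{k=0}^{N-1} \langle c(j, k) \rangle 2^{kB} \qquad (3)$$
$$C_{ij} \langle c(j, k) \rangle = \langle g'(p, i, j, k+1) \rangle 2^{B} + \langle g(p, i, j, k) \rangle \qquad (4)$$
From (2), (3), and (4),
$$\langle u_i, mT^{1/2} \rangle = \sum_{j=0}^{T+K+n} \sum_{k=0}^{N} \left( \langle g'(p, i, j, k) \rangle + \langle g(p, i, j, k) \rangle \right) 2^{kB} \qquad (5)$$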
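Next, $P_\pi$ uses $O(\log T)$ time to deploy $O(T^{3/2})$ processors $P'_{ik}$, where $0 \le i \le T+K+n$ and $0 \le k \le N$. Each $P'_{ik}$ computes the sum
$$\phi_{ik} = \sum_{j=0}^{T+K+n} \left( \langle g'(p, i, j, k) \rangle + \langle g(p, i, j, k) \rangle \right)$$
in $O(\log T)$ time (Lemma 1). The sum of $2(T+K+n+1)$ integers, each $B$ bits long, is at most $B + \log(2(T+K+n+1)) \le 2B$ bits long. $P'_{ik}$ extracts the most and least significant $B$ bits of $\phi_{ik}$ and places them in $h'(p, i, k+1)$ and $h(p, i, k)$ respectively. Therefore,
$$\sum_{j=0}^{T+K+n} \left( \langle g'(p, i, j, k) \rangle + \langle g(p, i, j, k) \rangle \right) = \langle h'(p, i, k+1) \rangle 2^{B} + \langle h(p, i, k) \rangle \qquad (6)$$
Let $\psi = \sum_{k=0}^{N+1} \langle h(p, i, k) \rangle 2^{kB}$ and $\psi' = \sum_{k=0}^{N+1} \langle h'(p, i, k) \rangle 2^{kB}$. From (5) and (6),
$$\langle u_i, mT^{1/2} \rangle = \psi + \psi' = \sum_{k=0}^{N+1} \left( \langle h'(p, i, k) \rangle + \langle h(p, i, k) \rangle \right) 2^{kB} \qquad (7)$$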
Consider the carries into and out of the $k$th $B$-bit block when we add $\psi$ and $\psi'$ together. By (7), the $k$th block of $\langle u_i, mT^{1/2} \rangle$ is (roughly) $\langle h'(p, i, k) \rangle + \langle h(p, i, k) \rangle$, except that we have to adjust for the carries into and out of the $k$th block. A carry into the block amounts to an increment by 1, whereas a carry out of the block is offset by subtracting $2^B$. The value $2^B$ is precomputed during phase I in $O(B) = O(T^{1/2})$ time by repeated doubling. In Section 4.4.2, we show that all block-to-block carries can be determined in $O(\log T)$ time. To update $c(i)$ with $\langle u_i, mT^{1/2} \rangle$, every $P'_{ik}$ finds the $k$th block of $\langle u_i, mT^{1/2} \rangle$ (by adding $\langle h'(p, i, k) \rangle$ and $\langle h(p, i, k) \rangle$ and adjusting for the carries) and updates $c(i, k)$ accordingly. Hence, $P_\pi$ is able to compute $\mathit{config}(mT^{1/2})$ from $\mathit{config}((m-1)T^{1/2})$ in $O(\log T)$ time during phase II.
In the above discussion, we have presumed that all $C_{ij}$'s are positive. Strictly speaking, to calculate $\langle u_i, mT^{1/2} \rangle = \sum_{j=0}^{T+K+n} C_{ij} \langle c(j) \rangle$, we have to sum up the positive and the negative components separately using the above method, do a blockwise subtraction, and adjust for the block-to-block borrows. The calculation of the borrows is analogous to that of the carries.
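The arithmetic behind this blockwise scheme can be checked with a sequential sketch for non-negative coefficients. The function names and the list-based stand-ins for the global memories $g$, $g'$, $h$, $h'$ are assumptions made for illustration; the loops run one after another here, whereas the PRAM performs each stage in parallel, and the final carry pass is done sequentially rather than by parallel prefix.

```python
def to_blocks(x, B, n_blocks):
    """Split x into n_blocks B-bit blocks, least significant first."""
    mask = (1 << B) - 1
    return [(x >> (k * B)) & mask for k in range(n_blocks)]


def combine_blockwise(coeffs, values, B, N):
    """Blocks (least significant first) of sum_j coeffs[j] * values[j].

    Assumes 0 <= coeffs[j] < 2**B, 0 <= values[j] < 2**(N*B), and
    2 * len(coeffs) <= 2**B so that every column sum fits in 2B bits.
    """
    mask = (1 << B) - 1
    n = len(coeffs)
    c = [to_blocks(v, B, N) for v in values]
    g = [[0] * (N + 1) for _ in range(n)]       # low halves of the products
    g_hi = [[0] * (N + 1) for _ in range(n)]    # high halves, shifted one block up
    for j in range(n):
        for k in range(N):
            prod = coeffs[j] * c[j][k]          # a 2B-bit product
            g[j][k] = prod & mask
            g_hi[j][k + 1] = prod >> B
    h = [0] * (N + 2)
    h_hi = [0] * (N + 2)
    for k in range(N + 1):
        phi = sum(g_hi[j][k] + g[j][k] for j in range(n))   # column sum
        h[k] = phi & mask
        h_hi[k + 1] = phi >> B
    result, carry = [], 0
    for k in range(N + 2):                       # add the two block rows with carries
        total = h_hi[k] + h[k] + carry
        result.append(total & mask)
        carry = total >> B
    return result
```

Reassembling the returned blocks as $\sum_k \text{block}_k 2^{kB}$ should reproduce the directly computed linear combination, which is how the sketch can be validated.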
4.4.2 Computing the Carries
Consider adding two $O(T)$-bit integers together. By parallel prefix computation [10, 17], it is possible to determine all the bit-to-bit carries in $O(\log T)$ time, provided that the individual bits of the integers are immediately accessible. In our case, however, the integers are represented in a blockwise instead of bitwise format. To apply the parallel prefix technique, we formulate the computation of the block-to-block carries as a prefix sum problem, in a way slightly different from that in the bitwise case. The idea is to let a block take the place of a bit. Define a binary operation $\otimes$ on $\{g, s, p\}$ as follows:
$$x \otimes y = \begin{cases} y & \text{if } y \ne p \\ x & \text{otherwise} \end{cases} \qquad (8)$$
It is routine to check that the operation defined in (8) is associative. For $0 \le k \le N+1$, let
$$x_k = \begin{cases} g & \text{if } \langle h'(p, i, k) \rangle + \langle h(p, i, k) \rangle \ge 2^B \\ p & \text{if } \langle h'(p, i, k) \rangle + \langle h(p, i, k) \rangle = 2^B - 1 \\ s & \text{otherwise} \end{cases}$$
Intuitively, $x_k = g$ if a carry is “generated” in the $k$th block; $x_k = p$ if a carry is “propagated” through the $k$th block (i.e., there is a carry out of the $k$th block if and only if there is a carry into the $k$th block), and $x_k = s$ if a carry is “stopped” in the $k$th block (i.e., no carry out of the $k$th block regardless of whether there is a carry into the $k$th block). Let $x_{-1} = s$, and for $-1 \le k \le N+1$, let $y_k = x_{-1} \otimes x_0 \otimes x_1 \otimes \cdots \otimes x_k$. By (8), $y_k = x_{k'}$, where $k' \le k$ is maximum such that $x_{k'} \ne p$. This implies $y_k = g$ if and only if there is a carry out of the $k$th block. By parallel prefix computation, we can determine all the $y_k$'s, and hence all block-to-block carries, in $O(\log N) = O(\log T)$ time.
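A minimal sketch of this carry computation follows. The generate/stop/propagate classification and the operation of (8) are taken from the text; the doubling-style scan, the string symbols, and the function names are illustrative assumptions. Each round of the scan is a list comprehension here, standing in for one parallel step with a processor per position.

```python
def combine(x, y):
    """The operation of (8): the right operand wins unless it is 'p'."""
    return y if y != "p" else x


def classify(a, b, B):
    """Symbol for one block position when adding blocks a and b."""
    total = a + b
    if total >= (1 << B):
        return "g"
    if total == (1 << B) - 1:
        return "p"
    return "s"


def prefix_scan(xs, op):
    """Inclusive prefix combination in ceil(log2 n) doubling rounds."""
    ys = list(xs)
    d = 1
    while d < len(ys):
        ys = [ys[k] if k < d else op(ys[k - d], ys[k]) for k in range(len(ys))]
        d *= 2
    return ys


def add_blocks(a_blocks, b_blocks, B):
    """Add two blockwise integers (least significant block first).

    Returns (result blocks, carry out of the last block)."""
    xs = ["s"] + [classify(a, b, B) for a, b in zip(a_blocks, b_blocks)]  # x_{-1} = s
    ys = prefix_scan(xs, combine)                 # ys[k + 1] holds y_k
    out = []
    for k, (a, b) in enumerate(zip(a_blocks, b_blocks)):
        carry_in = 1 if ys[k] == "g" else 0       # carry out of block k - 1
        carry_out = 1 if ys[k + 1] == "g" else 0  # carry out of block k
        out.append(a + b + carry_in - carry_out * (1 << B))
    return out, ys[-1] == "g"
```

Because the operation is associative but not commutative, the scan always combines the earlier range on the left with the later range on the right.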
4.5 Verifying $\pi$
During phase I, $P_\pi$ performs some additional preprocessing so that during phase II, $P_\pi$ can decide in $O(\log T)$ time whether $\pi$ actually happens. We first outline the verification process and then supply the details.
4.5.1 The Outline
Now $\pi$ specifies the behavior of $R$ between time $(m-1)T^{1/2}$ and $mT^{1/2}$. We say $\pi$ “happens up to time $t$” if the behavior of $R$ from time $(m-1)T^{1/2}$ to time $t$ agrees with $\pi$. Similarly, we say $\pi$ “happens in step $t$” if the behavior of $R$ from time $t-1$ to time $t$ agrees with $\pi$. $P_\pi$ uses $T^{1/2}$ processors $P_t$, where $(m-1)T^{1/2} \le t < mT^{1/2}$. Each $P_t$ checks whether $\pi$ happens in step $t+1$, assuming that $\pi$ happens up to time $t$. Each $P_t$ obtains a true or false answer. Clearly, $\pi$ happens if and only if all these answers are true. $P_\pi$ calculates the logical AND of these $T^{1/2}$ answers in $O(\log T)$ time (Lemma 1) and decides whether $\pi$ actually happens.
4.5.2 Preprocessing for Verification
$P_\pi$ activates all $P_t$'s in phase I using $O(\log T)$ time. Recall that in phase I, $P_\pi$ performs some preprocessing based on the triple $\pi = (q, \beta, \sigma)$; if $\pi$ happens, then this preprocessing enables $P_\pi$ to compute $\mathit{config}(mT^{1/2})$ from $\mathit{config}((m-1)T^{1/2})$ in $O(\log T)$ time. During phase I, every $P_t$ performs the analogous preprocessing using the triple $(q, \beta(t), \sigma(t))$, where $\beta(t)$ and $\sigma(t)$ are prefixes of $\beta$ and $\sigma$ respectively that define the behavior of $R$ between time $(m-1)T^{1/2}$ and $t$. If $\pi$ happens up to time $t$, then this preprocessing enables $P_t$ to compute $\mathit{config}(t)$ from $\mathit{config}((m-1)T^{1/2})$ in $O(\log T)$ time. As argued in Section 4.3, this preprocessing takes $O(T^{1/2})$ time.
4.5.3 The Actual Verification
In Section 4.4, we discussed how $P_\pi$ computes $\mathit{config}(mT^{1/2})$ from $\mathit{config}((m-1)T^{1/2})$, provided that $\pi$ actually happens. In an analogous manner, each $P_t$ computes $\mathit{config}(t)$ from $\mathit{config}((m-1)T^{1/2})$ during phase II, assuming that $\pi$ actually happens up to time $t$. Using $\mathit{config}(t)$, $P_t$ verifies whether $\pi$ happens in step $t+1$ in $O(\log T)$ time.
If $\pi$ actually happens, then every $P_t$ obtains a positive answer (true), and $P_\pi$ deduces that $\pi$ actually happens. Otherwise, let $t'$ be maximum such that $\pi$ happens up to time $t'$. Then $P_{t'}$ determines $\mathit{config}(t')$ correctly and discovers that $\pi$ does not happen in step $t'+1$. $P_{t'}$ answers false, and $P_\pi$ infers that $\pi$ does not happen. Note that for $t > t'$, the preprocessing of $P_t$ yields nothing useful, and $P_t$ cannot compute $\mathit{config}(t)$ correctly. This does not concern us, however, since the negative answer of $P_{t'}$ renders other answers immaterial. We merely need to guarantee that $P_t$ finishes its preprocessing within $O(T^{1/2})$ time and produces some answer within $O(\log T)$ time. This is readily accomplished by having each $P_t$ count the number of steps it executes.
4.5.4 Verifying a Single Step
It remains to explain how $P_t$ uses $\mathit{config}(t)$ to check whether $\pi = (q, \beta, \sigma)$ happens in step $t+1$. Recall the following facts:
1. The integer $q$ specifies the statement number of $R$ at time $(m-1)T^{1/2}$.
2. We treat every static and dynamic instruction as a conditional jump, and $\beta$ is a binary string of length $T^{1/2}$. For each dynamic instruction from step $(m-1)T^{1/2}+1$ to step $mT^{1/2}$, $\beta$ stipulates whether the condition of the jump is true or false.
3. Every dynamic instruction is equivalent to an effective instruction, and $\sigma$ gives the sequence of effective instructions from step $(m-1)T^{1/2}+1$ to step $mT^{1/2}$.
4. Every uninitialized register of $R$ contains 0, and $r(K-1) = u_{K-1}$ contains 0 throughout the computation of $R$.
To decide whether $\pi$ happens in step $t+1$, $P_t$ performs the following checks:
1. Check for $q$: Let $q'$ be the statement number in $\mathit{config}(t)$. If $t = (m-1)T^{1/2}$, then $P_t$ checks that $q = q'$.
2. Check for $\beta$: Let $s = t - (m-1)T^{1/2} + 1$. The $s$th bit of $\beta$ specifies whether the condition is true or false for the dynamic instruction in step $t+1$. This dynamic instruction corresponds to the static instruction in statement $q'$. $P_t$ checks that the $s$th bit of $\beta$ is 1 if and only if statement $q'$ indeed contains a JUMP instruction, and the condition of the jump is true, i.e., $\langle u_0, t \rangle < \langle u_1, t \rangle$ according to $\mathit{config}(t)$.
3. Check for $\sigma$: $P_t$ checks that the effective instruction in step $t+1$ specified by $\sigma$ is equivalent to the dynamic instruction in step $t+1$.
The first two checks need no further explanation. We supply the details of the third check below. In Section 4.3.1, we showed that every dynamic instruction is equivalent to an effective instruction of the form $u_{i'} \leftarrow u_{j'}$, $u_{i'} \leftarrow u_{j'} + u_{k'}$, or $u_{i'} \leftarrow u_{j'} - u_{k'}$ for some $0 \le i', j', k' \le T+K+n$. The dynamic instruction in step $t+1$ corresponds to the static instruction in statement $q'$. Consider the effective instruction in step $t+1$ specified by $\sigma$. $P_t$ checks that the form of this effective instruction is compatible with the static instruction in statement $q'$. Table 2 shows the four categories of compatible instruction pairs. $P_t$ performs some further checks according to the category of the compatible pair.
Category    Effective Instruction                  Static Instruction
1           $u_0 \leftarrow u_0$                   ACCEPT, REJECT, and JUMP
2           $u_{i'} \leftarrow u_{j'}$             $r(i) \leftarrow r(j)$, $r(i) \leftarrow (r(0))$, and $(r(0)) \leftarrow r(j)$
3           $u_{i'} \leftarrow u_{j'} + u_{k'}$    $r(i) \leftarrow r(j) + r(k)$
4           $u_{i'} \leftarrow u_{j'} - u_{k'}$    $r(i) \leftarrow r(j) - r(k)$

Table 2: Compatible effective and static instruction pairs.
Category 1 No further check is necessary.
Category 2 The static instruction in statement $q'$ is either $r(i) \leftarrow r(j)$, $r(i) \leftarrow (r(0))$, or $(r(0)) \leftarrow r(j)$; $\sigma$ stipulates that the effective instruction in step $t+1$ is $u_{i'} \leftarrow u_{j'}$. Let $a_w$ be the address of the register that is written to in step $t+1$, and $a_r$ be the address of the register that is read from in step $t+1$. More precisely,
$$a_w = \begin{cases} i & \text{if the static instruction in statement } q' \text{ is } r(i) \leftarrow r(j) \text{ or } r(i) \leftarrow (r(0)) \\ \langle r(0), t \rangle & \text{if the static instruction in statement } q' \text{ is } (r(0)) \leftarrow r(j) \end{cases}$$
$$a_r = \begin{cases} j & \text{if the static instruction in statement } q' \text{ is } r(i) \leftarrow r(j) \text{ or } (r(0)) \leftarrow r(j) \\ \langle r(0), t \rangle & \text{if the static instruction in statement } q' \text{ is } r(i) \leftarrow (r(0)) \end{cases}$$
According to $\sigma$, $R$ reads from $u_{j'}$ and writes to $u_{i'}$ in step $t+1$. $P_t$ compares $a_r$ with the addresses of all $u_k$'s in $\mathit{config}(t)$ in parallel. By definition, the addresses of all $u_k$'s are distinct. If $a_r$ equals the address of $u_k$ for some $k$, then $P_t$ checks that $j' = k$. Otherwise, $R$ reads from an uninitialized register in step $t+1$; $P_t$ checks that $j' = K-1$.
If $i' = t+1+K+n$, then according to $\sigma$, a new register is written to in step $t+1$; $P_t$ checks that $a_w$ is different from the addresses of all $u_k$'s in $\mathit{config}(t)$. If $i' < t+1+K+n$, then according to $\sigma$, $R$ writes to $u_{i'}$ in step $t+1$, but not for the first time; $P_t$ checks that $a_w$ is the address of $u_{i'}$ in $\mathit{config}(t)$. If $i' > t+1+K+n$, then $\sigma$ stipulates that in step $t+1$, $R$ writes to $u_{i'}$, which by definition is the register first written to in step $i'-K-n > t+1$. The information contained in $\sigma$ contradicts itself; $P_t$ simply answers false.
$P_t$ uses $N$ processors (comparators) to compare two addresses in the blockwise format for equality. Each comparator checks for equality in a corresponding block and obtains a true or false answer; $P_t$ calculates the logical AND of these answers in $O(\log N) = O(\log T)$ time (Lemma 1). To compare $a_r$ and $a_w$ against the addresses of all $u_k$'s in parallel, $P_t$ requires $O(NT) = O(T^{3/2})$ comparators. Observe that although these comparators are used in phase II, all comparators can be pre-activated in phase I using $O(\log T)$ time. We will need this observation in Section 4.6.
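The blockwise equality test can be pictured as one comparator per block followed by a pairwise AND reduction. The sketch below is an illustration only: the pairwise-halving scheme is one standard way to obtain a logarithmic number of rounds and is assumed here rather than taken from Lemma 1, and the function names are hypothetical.

```python
def and_reduce(answers):
    """Logical AND by pairwise combination; returns (result, number of rounds)."""
    level = list(answers)
    rounds = 0
    while len(level) > 1:
        # One parallel round: each processor combines a pair of neighbours.
        level = [all(level[k:k + 2]) for k in range(0, len(level), 2)]
        rounds += 1
    return (level[0] if level else True), rounds


def addresses_equal(a_blocks, b_blocks):
    """Compare two blockwise addresses; one comparator per block position."""
    per_block = [x == y for x, y in zip(a_blocks, b_blocks)]
    return and_reduce(per_block)
```

With $N$ blocks the reduction finishes after $\lceil \log_2 N \rceil$ rounds, which is the source of the $O(\log N)$ term above.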
Categories 3 and 4 These cases are similar to those in Category 2.
Hence, $P_\pi$ can verify whether $\pi$ actually happens in $O(\log T)$ time. This concludes the proof of Theorem 5. □
4.6 Time-Processor Tradeoffs
In this section, we discuss how to reduce the number of processors used in the simulation at the 
expense of increasing the simulation time. The following theorem is a generalization of Theorem 5.
Theorem 6 Let $\rho > 0$. Every unit-cost RAM that runs in time $T$ can be simulated by a CREW PRAM in time $O(\rho \log T + (T \log \rho)/\rho)$ with $T^{O(\rho)}$ processors.
Proof The proof of Theorem 6 is similar to that of Theorem 5. We explain how to modify the proof of Theorem 5 to establish Theorem 6. Instead of dividing each integer into $N = O(T^{1/2})$ blocks, we divide every $O(T)$-bit integer into $N' = O(\rho)$ blocks, each $O(T/\rho)$ bits long. We use $T/\rho$ groups of processors. During phase I, group $m$ performs the preprocessing based on the triple $(q'_m, \beta'_m, \sigma'_m)$, where $q'_m$ is the statement number at time $(m-1)\rho$, $\beta'_m$ is a binary string that encodes the outcomes of all conditional jumps from time $(m-1)\rho$ to time $m\rho$, and $\sigma'_m$ is the sequence of $\rho$ effective instructions between time $(m-1)\rho$ and $m\rho$. In phase I, group $m$ uses $O(\rho \log T)$ time to activate $T^{O(\rho)}$ processors to try all possible triples. Similar analysis as in Section 4.3 reveals that the preprocessing takes $O(\rho)$ time. Again, the bottleneck in phase I is the activation of enough processors to try all triples, which takes $O(\rho \log T)$ time.
The content of each $u_i$ at time $m\rho$ is a linear combination of the $\langle u_j, (m-1)\rho \rangle$'s. Let $\langle u_i, m\rho \rangle = \sum_{j=0}^{T+K+n} C'_{ij} \langle u_j, (m-1)\rho \rangle$. Observe that such a linear combination has at most $O(\rho)$ nonzero coefficients, since all arithmetic operations between time $(m-1)\rho$ and $m\rho$ involve at most $O(\rho)$ registers. Let $J = \{ j \mid C'_{ij} \ne 0 \}$. In Section 4.4.1, we described how to compute $\sum_{j=0}^{T+K+n} C_{ij} \langle u_j, (m-1)T^{1/2} \rangle$ in $O(\log N + \log(T+K+n)) = O(\log T)$ time. Using the same method, we can compute $\langle u_i, m\rho \rangle = \sum_{j \in J} C'_{ij} \langle u_j, (m-1)\rho \rangle$ in $O(\log N' + \log |J|) = O(\log \rho)$ time. Verification of the triple also takes $O(\log \rho)$ time. Note that the verification of the triple requires $O(T\rho)$ comparators, which can be pre-activated in $O(\log T + \log \rho)$ time during phase I. The preprocessing thus enables group $m$ to compute $\mathit{config}(m\rho)$ from $\mathit{config}((m-1)\rho)$ in $O(\log \rho)$ time during phase II.
In phase II, the PRAM $P$ computes $\mathit{config}(T)$ in $O((T \log \rho)/\rho)$ time as follows. For $m = 1, 2, \ldots, T/\rho$, group $m$ computes $\mathit{config}(m\rho)$ from $\mathit{config}((m-1)\rho)$ in $O(\log \rho)$ time. Let $q^*$ be the statement number in $\mathit{config}(T)$. $P$ accepts if and only if statement $q^*$ contains an ACCEPT instruction. This simulation takes $O(\rho \log T + (T \log \rho)/\rho)$ time and uses $T^{O(\rho)}$ processors. □
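As a consistency check on the tradeoff, one can substitute the Theorem 5 setting into the time bound of Theorem 6. Writing $\rho$ for the parameter of Theorem 6 (the number of steps handled per group) and taking $\rho = T^{1/2}$, the calculation below recovers the $O(T^{1/2} \log T)$ running time of the simulator constructed earlier; it is a worked instance only, not part of the original proof.

```latex
\[
  \rho \log T + \frac{T \log \rho}{\rho}
  \;\Big|_{\rho = T^{1/2}}
  \;=\; T^{1/2} \log T + T^{1/2} \cdot \tfrac{1}{2} \log T
  \;=\; \tfrac{3}{2}\, T^{1/2} \log T
  \;=\; O\!\left(T^{1/2} \log T\right).
\]
```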
5 Discussion
5.1 Parallelism Always Helps
We have shown that we can always speed up a sequential computation on a unit-cost RAM by a 
CREW PRAM. We mentioned in Section 1.1 that the unit-cost RAM is the most commonly used
machine model for analyzing sequential algorithms. There are, however, other machine models of 
sequential computation, for example, Turing machine, tree Turing machine, multidimensional Turing 
machine, and log-cost RAM. In a separate paper [11], we show that a sequential computation on 
each of these other models can also be sped up by a corresponding parallel machine model:
1. Every tree Turing machine that runs in time $T$ can be simulated by an alternating Turing machine in time $O(T/\log T)$.
2. Every $d$-dimensional Turing machine that runs in time $T$ can be simulated by an alternating Turing machine in time $O(T\,5^{d \log^* T}/\log T)$.
3. Every log-cost RAM that runs in time $T$ can be simulated by an alternating log-cost RAM in time $O(T \log\log T/\log T)$.
We conclude that parallelism always helps us speed up a sequential computation.
5.2 Speedup Using a Polynomial Number of Processors
It is well-known that the Turing machine enjoys the constant speedup theorem [26]: Let $\varepsilon > 0$ and $M$ be a Turing machine with time complexity $T$; then $M$ can be simulated by another Turing machine in time $\varepsilon T + n$. Hence, efforts on speeding up the Turing machine have focused on asymptotic speedup [3, 9, 15]. The unit-cost RAM, however, does not enjoy the constant speedup theorem [23]; that is, there exist an $\varepsilon > 0$ and a unit-cost RAM $R$ with time complexity $T$ such that $R$ cannot be simulated by any unit-cost RAM in time $\varepsilon T + n$. Thus, it is not trivial to speed up the computation
of a unit-cost RAM by a constant factor. Theorem 6 shows that it is possible to speed up a unit- 
cost RAM by an arbitrary constant factor with a CREW PRAM using a polynomial number of 
processors.
5.3 Is Result Optimal?
We have constructed a simulator that runs in $O(T^{1/2} \log T)$ time. We do not know whether our result is optimal, but we believe that it is difficult to reduce the simulation time by more than a $\log T$ factor, because this would imply improvements over some best known results, as explained below. We would like to call the reader's attention to the following previously established results:
1. Every CREW PRAM that runs in time $T$ can be simulated by a Turing machine in space $O(T^2)$ (Fortune and Wyllie, 1978 [5]).
2. Every Turing machine that runs in time $T$ can be simulated by a unit-cost RAM in time $O(T/\log T)$ (Hopcroft, Paul, and Valiant, 1975 [9]).
3. Every Turing machine that runs in time $T$ can be simulated by another Turing machine in space $O(T/\log T)$ (Hopcroft, Paul, and Valiant, 1977 [8]).
4. Every Turing machine that runs in time $T$ can be simulated by a CREW PRAM in time $O(T^{1/2})$ (Dymond and Tompa, 1985 [3]).
These are the best known results for the respective simulations. For our problem, namely, simulation of unit-cost RAM's by CREW PRAM's, reducing the simulation time to $o((T \log T)^{1/2})$, together with the first result of Hopcroft et al. above, implies an improvement over the result of Dymond and Tompa. By the same reasoning, if we manage to reduce the simulation time to $o(T^{1/2})$, then we can simulate every Turing machine with time complexity $T$ by a CREW PRAM in time $o((T/\log T)^{1/2})$. It then follows from the above result of Fortune and Wyllie that for Turing machines, time $T$ can be simulated in space $o(T/\log T)$, improving the second result of Hopcroft et al. above. This would be a significant breakthrough in simulating time by space for Turing machines.
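The two implications can be written out as chains of substitutions. Here $T' = O(T/\log T)$ denotes the running time of the unit-cost RAM obtained from a time-$T$ Turing machine via the 1975 result of Hopcroft, Paul, and Valiant; the use of $\log T' = \Theta(\log T)$ is the only step added beyond the results quoted above.

```latex
\begin{align*}
  \text{RAM} \to \text{PRAM in } o\big((T'\log T')^{1/2}\big):\quad
    & o\Big(\big(\tfrac{T}{\log T}\cdot \log T\big)^{1/2}\Big) = o\big(T^{1/2}\big)
      && \text{(improves Dymond and Tompa)} \\
  \text{RAM} \to \text{PRAM in } o\big(T'^{1/2}\big):\quad
    & t = o\Big(\big(\tfrac{T}{\log T}\big)^{1/2}\Big)
      \;\Longrightarrow\; \text{space } O(t^2) = o\Big(\tfrac{T}{\log T}\Big)
      && \text{(improves time versus space)}
\end{align*}
```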
34
References
[1] Borodin, A., and Hopcroft, J. E. Routing, merging, and sorting on parallel models of 
computation. J. Comput. System Sci. 30 (1985), 130-145.
[2] Cook, S. A., and Reckhow, R. A. Time bounded random access machines. J. Comput. 
System Sci. 7 (1973), 354-375.
[3] Dymond, P. W., and Tompa, M. Speedups of deterministic machines by synchronous parallel 
machines. J. Comput. System Sci. 30 (1985), 149-161.
[4] Fich, F. E., Ragde, P., and Wigderson, A. Relations between concurrent-write models 
of parallel computation. SIAM J. Comput. 17 (1988), 606-627.
[5] Fortune, S., and Wyllie, J. Parallelism in random access machines. In Proc. 10th Ann. 
ACM Symp. on Theory of Computing (1978), pp. 114-118.
[6] Gibbons, A., and Rytter, W. Efficient Parallel Algorithms. Cambridge University Press, 
Cambridge, England, 1988.
[7] Goldschlager, L. M. A universal interconnection pattern for parallel computers. J. Assoc. 
Comput. Mach. 29 (1982), 1073-1086.
[8] Hopcroft, J., Paul, W., and Valiant, L. On time versus space. J. Assoc. Comput. Mach. 
24, 2 (1977), 332-337.
[9] Hopcroft, J. E., Paul, W. J., and Valiant, L. G. On time versus space and related 
questions. In Proc. 16th Ann. IEEE Symp. on Foundations of Computer Science (1975), pp. 57-64.
[10] Leighton, F. T. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, 
Hypercubes. Morgan Kaufmann, San Mateo, California, 1992.
[11] Mak, L. Are parallel machines always faster than sequential machines? Submitted to SIAM 
J. Comput. for publication (1993).
[12] Parberry, I. Parallel speedup of sequential machines: a defense of the parallel computation 
thesis. ACM SIGACT News 18 (1986), 54-67.
[13] Parberry, I. Parallel Complexity Theory. Wiley, New York, 1987.
[14] Parberry, I., and Schnitger, G. Parallel computation with threshold functions. J. Comput. 
System Sci. 36 (1988), 278-302.
[15] Paul, W., and Reischuk, R. On alternation II. Acta Inform. 14 (1980), 391-403.
[16] Reif, J. H. On synchronous parallel computations with independent probabilistic choice. SIAM 
J. Comput. 13 (1984), 46-56.
[17] Reif, J. H., Ed. Synthesis of Parallel Algorithms. Morgan Kaufmann, San Mateo, California, 
1993.
[18] Robson, J. M. Fast probabilistic RAM simulation of single tape Turing machine computations. 
Inform. and Control 63 (1984), 67-87.
[19] Robson, J. M. Random access machines with multi-dimensional memories. Inform. Process. 
Lett. 34 (1990), 265-266.
[20] Robson, J. M. Deterministic simulation of a single tape Turing machine by a random access 
machine in sub-linear time. Inform. and Comput. 99 (1992), 109-121.
[21] Shiloach, Y., and Vishkin, U. Finding the maximum, merging, and sorting in parallel 
computation model. J. Algorithms 2 (1981), 88-102.
[22] Snir, M. On parallel searching. SIAM J. Comput. 14 (1985), 688-708.
[23] Sudborough, I. H., and Zalcberg, A. On families of languages defined by time-bounded 
random access machines. SIAM J. Comput. 5 (1976), 217-230.
[24] Trahan, J. L., Loui, M. C., and Ramachandran, V. Multiplication, division, and shift 
instructions in parallel random access machines. Theoret. Comput. Sci. 100 (1992), 1-44.
[25] Vishkin, U. Implementation of simultaneous memory address access in models that forbid it. 
J. Algorithms 4 (1983), 45-50.
[26] Yap, C. K. Theory of Complexity Classes. Oxford University Press, Oxford, England, to 
appear.
