The log cost measure has been viewed as a more reasonable method of measuring the time complexity of an algorithm than the unit cost measure. The more widely used unit cost measure becomes unrealistic if the algorithm handles extremely large integers. Parallel machines have not been examined under the log cost measure. In this paper, we investigate the Parallel Random Access Machine under the log cost measure. Let the instruction set of a basic PRAM include addition, subtraction, and Boolean operations. We relate resource-bounded complexity classes of log cost PRAMs to complexity classes of Turing machines and circuits. We also relate log cost PRAMs with different instruction sets by simulations that are much more efficient than is possible in the unit cost case. Let $LCRCW^k$ ($CRCW^k$) denote the class of languages accepted by a log cost (unit cost) basic CRCW PRAM in $O(\log^k n)$ time with a number of processors polynomial in $n$. We position the log cost PRAM in the hierarchy of parallel complexity classes as: $AC^k = CRCW^k \subseteq (NC^{k+1}, LCRCW^{k+1}) \subseteq AC^{k+1} = CRCW^{k+1}$.
Introduction
PRAMs and Log Cost
Usually, one measures the time complexity of a Random Access Machine (RAM) algorithm as the number of steps executed. Under this unit cost measure, a RAM takes one unit of time to execute an instruction on any pair of operands stored at any pair of locations in memory, regardless of the operand and address lengths. Alternatively, one may analyze time complexity under the log cost measure, in which the time cost of an instruction execution depends on the operand and address lengths. The name "log cost" arises because the length of an integer is the logarithm of its value. The log cost time complexity of an algorithm is proportional to the time required to run the algorithm on a computer with fixed register length.
Since a RAM may contain an arbitrary number of registers, each of which may contain an arbitrary integer, the difference between the two measures may be significant. If a computer Q with fixed word-length registers attempts to run a RAM algorithm, then Q would require on the order of log x registers to store a number x that is held in a single register of the RAM. Any attempt by Q to read or operate on x will take on the order of log x time units. In essence, running a RAM algorithm on Q requires multiple precision arithmetic, costing multiple units of time to execute each instruction. Further, since memory is typically configured in a hierarchy of levels, the time cost to access a register is a nonconstant function of the address. As ideal RAMs do not exist in reality, a unit cost time bound may not be realistic, particularly when an algorithm manipulates very long operands or accesses operands at very large addresses. This fact supports the use of the log cost measure as more realistic.
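As a small illustration of the multiple-precision argument above, the following Python sketch counts how many fixed-length registers a machine such as Q would need to hold the contents of one RAM register. The function name `words_needed` and the default word length of 64 bits are assumptions made for this example only; they are not part of the model.

```python
def words_needed(x: int, word_bits: int = 64) -> int:
    """Number of word_bits-bit registers needed to store the non-negative integer x."""
    if x < 0 or word_bits <= 0:
        raise ValueError("x must be non-negative and word_bits positive")
    length = max(1, x.bit_length())   # length of x in bits (0 is given length 1 here)
    return -(-length // word_bits)    # ceiling division


if __name__ == "__main__":
    # The register count grows with the length of x, i.e. with log x.
    for x in (1, 2**64 - 1, 2**64, 2**1000):
        print(x.bit_length(), words_needed(x))
```

Under these assumptions, any operation that must touch every word of x takes time that grows with log x, which is the behavior the log cost measure is meant to capture.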
Investigations of RAM instruction sets under the unit cost measure through simulations by and of the RAM have given rise to such long integers [1, 2, 3, 4]. Just as these considerations apply to sequential algorithms designed for a RAM, they also apply to parallel algorithms designed for a Parallel Random Access Machine (PRAM). Previous studies of the log cost measure have focused only on the sequential RAM.
In this paper, we present a detailed analysis of the computation capabilities of a PRAM under the log cost measure. A basic RAM instruction set contains addition, subtraction, Boolean operations, predicates, and indirect addressing. Let (P)RAM[op] denote a (P)RAM with the instruction set augmented by the set op of instructions. We relate time-bounded computation abilities of PRAMs with different instruction sets to each other and to resource-bounded computation abilities of Turing Machines (TMs) and uniform circuits. The instruction sets under consideration are the basic instruction set and the basic instruction set augmented with the following sets of instructions: {∗}, {∗, ÷}, {↑, ↓}, and {∗, ↑, ↓}, where ∗ and ÷ denote multiplication and division, and ↑ and ↓ denote left and right shift instructions, respectively. The study establishes close relations among PRAMs with different instruction sets. In addition, this paper shows that the class of languages accepted by polylogarithmic time-bounded and polynomial processor-bounded PRAMs under the log cost time measure is equivalent to NC, as is the case under the unit cost measure. This equivalence also extends to instruction sets augmented with multiplication and division or arbitrary shifts, an equivalence not known to hold under unit cost.
Previous Work
Researchers have related the RAM and PRAM, with various instruction sets, to each other and to other models of computation under the unit cost measure [1-10].
(Let X-PTIME denote the class of languages accepted by a model X in polynomial time. Let PSPACE and EXPSPACE denote the classes of languages accepted by a Turing Machine in, respectively, polynomial space and exponential space.) Hartmanis [...] or more since the maximum length integer generated under log cost is less than T(n) bits long. Consequently, we examine the contribution of individual instructions to time-bounded computations of the PRAM under the more realistic log cost measure. Section 2 formally defines the PRAM and introduces assumptions necessary to make precise the computation of a parallel machine under the log cost measure. Section 3 presents simulations involving PRAMs under the log cost measure, relating PRAMs with the basic instruction set and with additional multiplication, division, and shift instructions to each other and to Turing machines and uniform circuits. Section 4 relates time-bounded and processor-bounded PRAMs under the log cost measure to the class NC. Section 5 concludes the paper.
Model Definition
In this section we formally define our model and specify the conventions used throughout this paper. Although the PRAM model has been formalized by Fortune and Wyllie [6], it warrants a detailed definition, both to make clear the simulations to follow and because of the introduction of the log cost measure.
A Concurrent-Read, Concurrent-Write (CRCW) Parallel Random Access Machine comprises an infinite number of processors, an unbounded shared memory, and a finite program common to all processors. Each individual processor possesses an unbounded local memory, a program counter, and a read-only processor identity register PID. Each shared and local memory register is capable of holding an integer of arbitrary length. Let $P_i$ denote the ith processor of PRAM P. The PID of $P_i$ is preset to i. Let [...] An input of size n consists of n bits stored in $c_0$ at the beginning of program execution. When the execution commences, except for the input and the PID registers, each register of shared as well as local memory holds 0 and all necessary processors are active. The PRAM halts when $P_0$ executes the HALT instruction.
The time cost of an instruction under the log cost measure is the maximum length of any operand or address handled during that instruction execution. Therefore, operating on an m-bit operand at a location with an s-bit address takes $\max\{s, m\}$ units of time. We refer to the execution of one instruction by a processor as a step. Note: one step may consume several time units; different processors may begin steps at different time units; and different processors may run for the same number of time units while executing different numbers of steps. Time bound T(n) indicates the number of time units used in a computation (not necessarily the same as the number of instructions executed). When a cost measure is not explicitly mentioned, we imply the log cost measure by default.
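The charging rule above can be sketched in Python as follows. The helper names `length`, `log_cost`, and `total_times`, and the convention that the integer 0 has length 1, are assumptions introduced for this illustration.

```python
def length(v: int) -> int:
    """Length in bits of a non-negative integer (0 is assumed to have length 1)."""
    return max(1, v.bit_length())


def log_cost(operands, addresses) -> int:
    """Time units charged for one step: the maximum length of any operand or address."""
    return max(length(v) for v in list(operands) + list(addresses))


def total_times(steps):
    """Return (unit cost, log cost) for a sequence of (operands, addresses) steps."""
    unit = len(steps)                                        # one unit per step
    log = sum(log_cost(ops, addrs) for ops, addrs in steps)  # time units per step vary
    return unit, log


if __name__ == "__main__":
    # Three steps under unit cost, but many more time units under log cost.
    program = [([1000], [5]), ([3, 4], [2]), ([2**40], [7])]
    print(total_times(program))
```

The two numbers returned by `total_times` make the distinction between steps and time units concrete: the first counts instructions executed, the second is what a time bound T(n) measures here.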
For most instructions, the length of the output is no more than twice the length of the larger of the operands. For such instructions, charging time equivalent to the number of bits in the operands (that is, $\max\{\log A, \log B\}$ where A and B are the operands) suffices. But consider the shift instruction, where the output length is the sum of the length of one operand and the value, not the length, of the other operand. In this instance, charging $\max\{\log A, \log B\}$ time does not reflect the number of bits handled by the instruction. So we adopt the convention that the time cost accounts for the output length for instructions where the length of the computed output is far more than the length of the operands. For example, a left shift operation (denoted by ↑), $r_i \leftarrow r_i \uparrow r_j$, costs $\max\{\log r_i, r_j\}$ time to maintain the log cost spirit. Otherwise, the time cost would not truly reflect the amount of work done, and in circumstances where the result of the operation is a final output and is not accessed later, the resulting time complexity may be incorrect.
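To see why the shift convention matters, the following Python sketch charges $\max\{\log r_i, r_j\}$ for each left shift and applies it to a register that is repeatedly shifted by its own value. The function names are hypothetical, and $\log r_i$ is taken to be the bit length of $r_i$, which is an assumption of this sketch.

```python
def length(v: int) -> int:
    """Length in bits of a non-negative integer (0 is assumed to have length 1)."""
    return max(1, v.bit_length())


def left_shift_cost(r_i: int, r_j: int) -> int:
    """Cost of r_i <- r_i shifted left by r_j: the larger of length(r_i) and the VALUE r_j."""
    return max(length(r_i), r_j)


def repeated_self_shift(shifts: int):
    """Start with A = 1 and execute A <- A shifted left by A the given number of times.

    Returns the final value of A and the total log cost charged.
    """
    a, total = 1, 0
    for _ in range(shifts):
        total += left_shift_cost(a, a)
        a = a << a
    return a, total


if __name__ == "__main__":
    value, cost = repeated_self_shift(4)
    print(length(value), cost)   # length of the result versus time charged
```

In this example the length of the final value and the total time charged differ by one, whereas charging only the operand lengths would let four steps produce a number more than two thousand bits long at a small fraction of that cost.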
In a sequential machine, the log cost and unit cost time measures are just measures of time and play no role in the execution of a program. This does not hold for parallel machines because the time cost measure affects the timing of processor interaction. When $P_i$ is writing an m-bit integer in a cell, the write occurs over m units of time. Another processor $P_j$ might initiate an attempt to read the same cell during this time interval. In such a situation, the value read by $P_j$ is unclear.
To resolve such read and write ambiguities, we make the following assumptions.
For a write of a value x to $c_k$ by $P_i$ taking m units of time, we view the value x as being accumulated in a hypothetical buffer by $P_i$ until the end of the mth time unit. At that time, x is instantaneously loaded to $c_k$. For a read of the contents of $c_k$ by $P_i$ taking m units of time, the contents of $c_k$ are immediately copied to a hypothetical buffer at the start of the m time unit interval. Though nonexistent, such hypothetical buffers help to clarify events during overlapping read and write intervals. To illustrate, consider the following instruction executed by $P_1$: $c_k \leftarrow r_i$, where $r_i$ contains an integer m bits long and $\log i, \log k \le m$. Clearly, execution of this instruction will take m units of time. If the execution starts at time T, cell $c_k$ retains its old contents until time $T + m - 1$. At time $T + m$, the contents of $c_k$ are changed to the new value, which is the contents of register $r_i$ at time T. During execution of this instruction by $P_1$, if $P_2$ tries to read $c_k$ (for example at time $T + m - 2$), then $P_2$ will receive the old contents of $c_k$. If $P_2$ executes the same read operation at time $T + m + 1$, then it will receive the new contents of $c_k$. (Note: Bounds obtained on simulations in the following sections will still hold under other reasonable assumptions about concurrent accesses to a shared memory cell.)
Concurrent write conflicts are resolved using the Common write conflict resolution rule. That is, when two or more processors attempt to write to a shared memory cell concurrently (that is, at the same time or during overlapping time intervals), all processors must be writing the same value.
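The buffering assumptions for reads and writes can be mimicked by a small Python model of a single shared cell. This is only a sketch: the class name `SharedCell` and its methods are hypothetical, and the choice that a read starting exactly at the commit time $T + m$ sees the new value is an assumption of the sketch, since the text only fixes the behavior at $T + m - 2$ and $T + m + 1$.

```python
class SharedCell:
    """One shared memory cell with buffered writes and reads."""

    def __init__(self, value: int = 0):
        # History of (commit_time, value) pairs; the initial contents commit at time 0.
        self._history = [(0, value)]

    def write(self, start: int, value: int, duration: int) -> int:
        """Begin a write at time `start`; the value is loaded at time start + duration."""
        commit = start + duration
        self._history.append((commit, value))
        self._history.sort(key=lambda entry: entry[0])
        return commit

    def read(self, start: int) -> int:
        """A read copies the contents that are in the cell when the read starts."""
        current = self._history[0][1]
        for commit, value in self._history:
            if commit <= start:          # assumption: a commit at `start` is visible
                current = value
        return current


if __name__ == "__main__":
    T, m = 10, 5
    cell = SharedCell(7)
    cell.write(T, 42, m)                 # contents change at time T + m = 15
    print(cell.read(T + m - 2))          # old contents
    print(cell.read(T + m + 1))          # new contents
```

Under the Common rule, overlapping writes to the same cell carry the same value, so the order in which equal commit times are stored does not affect what a reader sees.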
Simulations
In this section we inspect the log cost PRAM more closely. Section 3.1 compares a TM with a PRAM[ ] and derives bounds for mutual simulations. Section 3.2 presents a simulation of a RAM with an augmented instruction set by a RAM[ ]. The following section extends this simple one-to-one simulation to a simulation of a PRAM with an augmented instruction set by a PRAM[ ]. The key to this extension is a lemma that preserves interprocessor timing in a simulation in which some steps of the simulating machine take longer to execute than the corresponding steps of the simulated machine, while others take the same time. The last subsection relates PRAMs and uniform circuits. These simulations form the basis for the complexity class relations in Section 4.
PRAM[ ] vs. TM
Relations between TMs and log cost sequential RAM[ ]s are well established, with an efficient simulation of a TM by a RAM[ ] by Katajainen et al. [11] and a sublinear time simulation of a one-tape TM by a RAM[ ] by Robson [12]. In this section, we investigate the relationship between a Turing Machine and a log cost PRAM. In particular, we are interested in establishing a relation between TM space and PRAM time that will be of use in subsequent sections.

Proof. This proof is based on the simulation of a TM by a unit cost PRAM[ ] given by Fortune and Wyllie [6]. We sketch this simulation to make clear the time complexity for a log cost PRAM[ ]. Without loss of generality, let M be an S(n) space-bounded TM with a unique final configuration. Since M uses S(n) space, there can be a maximum of $2^{O(S(n))}$ possible configurations. M accepts the input if and only if, starting in the initial configuration, it reaches the final configuration by a sequence of at most $2^{O(S(n))}$ steps. We construct a PRAM[ ] Q that simulates M using one processor for each possible configuration of M. Each processor unpacks the configuration represented by its PID, computes the successor configuration, and repacks the result into an integer that it stores in the shared memory in a location indexed by PID. Q determines "reachability" by pointer jumping. There can be [...]

[...] and $T(n) \ge n$. (Note that for a log cost PRAM, $T(n) \ge n$ since the cost of just reading the input is n.)

The unit cost simulation is recursive, requiring T(n) levels of recursion and space proportional to the maximum operand length (which is O(T(n))) at each level. For a T(n) time-bounded log cost PRAM, the maximum operand length is O(T(n)), irrespective of the kind of instructions present in the instruction set. A TM simulating a log cost PRAM can follow essentially the same simulation as for a unit cost PRAM, taking into account the fact that under log cost, a step of a processor may take several time units. Since no processor in the log cost PRAM executes more than T(n) steps, the depth of recursion is still T(n). A TM can compute m-bit multiplication or division in space O(m), so the space at each level remains O(T(n)).

Proof. Let P be a T(n) time-bounded, P(n) processor-bounded PRAM[op], where T(n) and op satisfy the conditions in the theorem statement. We construct a PRAM[ ] Q that simulates P in $O(T^2(n))$ time with P(n) + 1 processors. Q works in two phases.
In phase one, the processors construct T(n) and a set of T(n) mask registers, and in phase two, processor $Q_i$ simulates processor $P_i$ in a step-by-step manner. In the simulation, each time unit of P is mapped to O(T(n)) time units of Q. The extra processor of Q maintains a register indicating the current simulated time unit of P. Q uses the mask registers to evaluate the lengths of operands and addresses in order to determine proper timing for shared memory accesses.
In phase one, the processors compute the value T(n). Each processor then computes a set of T(n) mask registers in which the ith register holds $2^i$, for $1 \le i \le T(n)$. The extra processor designates a register in shared memory as the clock register, initializes this location to 0, and communicates this address to all other processors. This completes preprocessing. This phase takes $O(T^2(n))$ time.
In phase two, Q simulates the computation of P. Throughout this phase, the extra processor increments the clock register every O(T(n)) time units. (To time itself between successive clock register updates, the extra processor may read a T(n) bit long integer, such as a mask register contents, consuming T(n) time.) This clock register value will ensure the correctness of the order in which shared memory cells are accessed. We refer to the time during which the clock register holds value j as dilated time unit j.
Processor $Q_i$ simulates $P_i$ step-by-step. For instructions in the basic instruction set, $Q_i$ performs the same operation as $P_i$ in the same time. For instructions in set op, $Q_i$ simulates an instruction that takes time t in $P_i$ in time $O(t^2)$. For each step of $P_i$, $Q_i$ also determines the time spent by $P_i$ in executing that step. $Q_i$ next computes the maximum length of the operands and addresses involved in a step of $P_i$ by using the mask registers to AND the operand (or address) successively with selected masks by binary search. This length is the time t spent by $P_i$ to execute the instruction. $Q_i$ determines t in time $O(t \log t)$. At this point, $Q_i$ has spent $O(t^2 + t \log t)$ (that is, $O(t^2)$) time simulating this step of $P_i$. If $Q_i$ began this simulation during dilated time unit $t_j$, then $Q_i$ is currently in dilated time unit $y = t_j + O(t^2 / T(n))$. Since $t < T(n)$, $y < t_j + t$, so $Q_i$ has not yet reached the dilated time unit during which the next step is to be performed. $Q_i$ checks the clock register until dilated time unit $t_j + t$ is reached, then performs the next step.
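The length computation by binary search over the mask registers can be sketched as follows in Python. The text states only that operands are ANDed with selected masks; the particular test used here (clearing the low i bits with the mask $2^i - 1$ and checking whether anything remains) is an assumption of this sketch, as are the function names. It uses only AND, subtraction, and a zero test.

```python
def make_masks(T: int) -> dict:
    """Mask registers: the ith register holds 2**i, for 1 <= i <= T."""
    return {i: 1 << i for i in range(1, T + 1)}


def has_bit_at_or_above(x: int, mask: int) -> bool:
    """True if x >= mask, where mask = 2**i, using AND and subtraction only."""
    low = x & (mask - 1)        # the i low-order bits of x
    return x - low != 0         # something is left above them exactly when x >= 2**i


def length_by_masks(x: int, masks: dict, T: int) -> int:
    """Length in bits of x, for 0 <= x < 2**T, found with O(log T) mask tests."""
    lo, hi = 1, T               # the answer is the smallest L with x < 2**L
    while lo < hi:
        mid = (lo + hi) // 2
        if has_bit_at_or_above(x, masks[mid]):
            lo = mid + 1        # x has more than mid bits
        else:
            hi = mid            # x fits in mid bits
    return lo


if __name__ == "__main__":
    masks = make_masks(16)
    print([length_by_masks(x, masks, 16) for x in (0, 1, 5, 255, 256, 65535)])
```

The sketch takes the length of 0 to be 1 and does not model the time charged for each probe; it illustrates only how a logarithmic number of mask tests determines the length.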
Processors interact and communicate through shared memory accesses, so the chronological order of shared memory accesses should be preserved. To ensure the correctness of the simulation, when $P_i$ accesses a shared memory cell at time $t_j$, simulating processor $Q_i$ must access the cell during dilated time unit $t_j$.
For a shared memory read that starts at time $t_j$, $Q_i$ checks the clock register value until it indicates dilated time unit $t_j$, then it reads the desired cell contents. Since the cell addresses and contents must have length less than T(n), this read is completed during dilated time unit $t_j$.
For a shared memory write in a step that starts at time unit $t_j$ and consumes t units of time of $P_i$, $Q_i$ does the following. As described above, $Q_i$ determines that the step takes t time units on P. $Q_i$ checks the clock register until dilated time unit $t_j + t$ is reached, then performs the write. Since the length of the value being written is less than T(n), the write is completed within the same dilated time unit.
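The timing argument for dilated time units can be checked numerically with a short Python sketch. The constant `a`, the exact work bound `a * t * t`, the exact dilation `a * T`, and the assumption that $Q_i$ starts a step at the very beginning of dilated time unit $t_j$ are all simplifications introduced here; the text gives these quantities only up to constant factors.

```python
def dilated_unit_after_step(t_j: int, t: int, T: int, a: int = 4) -> int:
    """Dilated time unit that Q_i has reached after simulating one step of P_i.

    Assumptions of this sketch: Q_i begins at the start of dilated unit t_j,
    spends exactly a*t*t of its own time units on the step, and each dilated
    unit lasts exactly a*T of Q's time units.
    """
    if not 1 <= t < T:
        raise ValueError("the argument requires 1 <= t < T")
    work = a * t * t            # time Q_i spends simulating the step and computing t
    dilation = a * T            # length of one dilated time unit
    return t_j + work // dilation


def write_unit(t_j: int, t: int) -> int:
    """Dilated time unit in which the simulated access must take place."""
    return t_j + t


if __name__ == "__main__":
    T = 16
    for t in (1, 4, 8, 15):
        print(t, dilated_unit_after_step(0, t, T), write_unit(0, t))
```

For every step length t below T, the first value stays strictly below the second, which is the slack that lets $Q_i$ wait on the clock register before performing the access.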
Therefore, Q performs each of the operations of P while maintaining interprocessor timing. The preprocessing phase takes $O(T^2(n))$ [...]

Proof. We implement a division algorithm given by Savage [...]

When a processor has unrestricted shifts under unit cost, it can execute $A \leftarrow 1$ followed by $A \leftarrow A \uparrow A$ repeatedly and thereby build enormous numbers. When $A \leftarrow A \uparrow A$ is executed repeatedly t times, the numbers built are
Step 1    1                    $2^0$
Step 2    10                   $2^1$
Step 3    1000                 $2^3$
Step 4    100000000000         $2^{11}$
...       ...                  ...
Step t    100000...00000000    $2^{\{t\}}$

where $2^{\{t\}}$ represents the value generated after t shifts, which is greater than 2 raised to the tth stack power. But since the RAM[↑, ↓] functions under the log cost measure, the time required to execute successive shifts will increase proportionally to the value of the shift distance, and so in a log cost RAM[↑, ↓] the size of the numbers built remains comparable to the computation time.

Proof. We present the simulation for left shift (↑) using multiplication. Simulation of right shift (↓) is performed similarly using division. Let P be a RAM[↑, ↓] with time complexity T(n). We construct a RAM[∗, ÷] Q that simulates P. When [...]

Theorem 6. Let T(n) be a function that is constructible in $O(T(n) \log T(n))$ time by a PRAM[ ], and let $P(n) \ge T(n)$. A PRAM[↑, ↓] using T(n) time and P(n) processors can be simulated by a PRAM[∗, ÷] in $O(T(n) \log T(n))$ time using P(n) + 1 processors.
Proof. Let P be a T(n) time-bounded, P(n) processor-bounded PRAM[↑, ↓]. We construct a PRAM[∗, ÷] Q that simulates P much as described in the proof of Lemma 2, but uses the ∗ and ÷ instructions available in Q to speed up operations.
To develop the masks in phase one, Q uses T(n) processors, each of which creates one mask and stores it in the shared memory. Using successive squaring, the masks can be generated in $O(\log T(n))$ steps, each taking at most T(n) time. Thus, Q concurrently generates T(n) masks within $O(T(n) \log T(n))$ time. In phase two, when the instructions are executed, each time unit of P is mapped to $c \log T(n)$ time units of Q, where c is a constant, instead of O(T(n)) as in Lemma 2. $Q_i$ can perform each operation of processor $P_i$, other than a shift, in the same amount of time as $P_i$. Consider now a left shift by $P_i$, $A \leftarrow B \uparrow C$. This shift operation takes $\max\{\log B, C\}$ time units on $P_i$, and the simulation takes $O(\max\{\log B, C\})$ time units on $Q_i$ (assuming that the addresses are not larger than the operand lengths; if this assumption does not hold, the simulation by $Q_i$ still takes only a constant factor more time). As in the proof of Lemma 2, $Q_i$ computes the time consumption t of a step of $P_i$ in time $O(t \log t)$. Assume that $Q_i$ starts at dilated time unit $t_j$ a step of $P_i$ that begins at time $t_j$ and takes t time units, ending with a write to a shared memory cell. $Q_i$ takes $O(t \log t)$ time to simulate this step and compute t, ending at dilated time unit $t_j + O(t \log t)/(c \log T(n))$. Since $t < T(n)$, $t_j + O(t \log t)/(c \log T(n)) < t_j + t$, so $Q_i$ has not yet reached the dilated time unit during which the write is to be performed. Since this write may take up to t time units, and t may be greater than $c \log T(n)$, $Q_i$ may be forced to initiate the write at a dilated time unit prior to $t_j + t$ so that the write is completed during dilated time unit $t_j + t$. $Q_i$ computes the length of the value to be written and the length of the address; let z denote the maximum of these two values. $Q_i$ computes $y = t_j + t - z/(c \log T(n))$, then checks the global clock until dilated time unit y and initiates the write.
Note that $z \le t \le T(n)$, so $O(t \log t) + O(z \log z) < c\,t \log T(n)$ for an appropriate constant c, and so $Q_i$ completes its simulation of a step of $P_i$, time computation, and write within the available t dilated time units.
Q spends $O(T(n) \log T(n))$ time in the preprocessing phase, then $O(\log T(n))$ time for each time unit of P in the simulation phase. Hence, Q simulates P in $O(T(n) \log T(n))$ time using P(n) + 1 processors.
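The idea of replacing shifts by multiplication and division can be sketched in Python as follows. The power of two is built by binary exponentiation (repeated squaring), which is this sketch's choice and not necessarily the exact procedure of the proof; the function names are hypothetical, and operands are assumed to be non-negative.

```python
def power_of_two(c: int) -> int:
    """Compute 2**c with multiplication and integer division only (no shift instructions)."""
    if c < 0:
        raise ValueError("shift distance must be non-negative")
    result, base = 1, 2
    while c > 0:
        if c % 2 == 1:
            result = result * base     # include this power of the base
        c = c // 2
        if c > 0:
            base = base * base         # successive squaring
    return result


def left_shift(b: int, c: int) -> int:
    """B shifted left by C, simulated as B * 2**C."""
    return b * power_of_two(c)


def right_shift(b: int, c: int) -> int:
    """B shifted right by C, simulated as integer division of B by 2**C."""
    return b // power_of_two(c)


if __name__ == "__main__":
    print(left_shift(5, 3), right_shift(40, 3))
```

The loop in `power_of_two` runs a number of times logarithmic in the shift distance, which matches the flavor of the $O(\log T(n))$ squaring steps used to build the masks.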
PRAM[ ] vs. Uniform Circuits
Uniform circuits are a class of parallel machine models constructed using Boolean gates. A family of uniform circuits is an infinite collection of combinational circuits, one for each input size. The size and depth of the circuits form the basic resource measures, where the size of the circuit is related to the hardware measure and the depth of the circuit is related to the parallel time requirement to solve a given problem or to execute a given program. In this section we analyze the relation between log-space uniform unbounded fan-in circuits and PRAM[ ]s.
We use the following definitions relating to circuits [15]. A circuit family C is a set $\{C_1, C_2, \ldots\}$ of circuits, where $C_n$ has n inputs and one output. We restrict the gate numbering so that the largest gate number is $(Z(n))^{O(1)}$, where Z(n) is the size of $C_n$. Thus, a gate number coded in binary has length $O(\log Z(n))$. A bounded fan-in circuit is a circuit where the indegree of each gate is at most 2. If $C_n$ has at most Z(n) gates and depth D(n), then the size complexity of C is Z(n) and the depth complexity is D(n). An unbounded fan-in circuit is a circuit where the indegree is unbounded. If $C_n$ has at most Z(n) wires and depth D(n), then the size complexity of C is Z(n) and the depth complexity is D(n).
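The circuit definitions above can be made concrete with a small Python sketch of an unbounded fan-in circuit, together with its size (counted in wires) and depth. The `Gate` representation, the function names, and the convention that NOT gates contribute to depth are assumptions of this sketch and are not taken from [15].

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Gate:
    kind: str                                        # "AND", "OR", "NOT", or "INP"
    inputs: List[int] = field(default_factory=list)  # gate numbers feeding this gate;
                                                     # for "INP", the index of the input bit


def evaluate(circuit: Dict[int, Gate], output: int, x: List[int]) -> bool:
    """Evaluate the gate numbered `output` on the input bits x."""
    memo: Dict[int, bool] = {}

    def value(g: int) -> bool:
        if g not in memo:
            gate = circuit[g]
            if gate.kind == "INP":
                memo[g] = bool(x[gate.inputs[0]])
            elif gate.kind == "NOT":
                memo[g] = not value(gate.inputs[0])
            elif gate.kind == "AND":
                memo[g] = all(value(h) for h in gate.inputs)
            elif gate.kind == "OR":
                memo[g] = any(value(h) for h in gate.inputs)
            else:
                raise ValueError("unknown gate type: " + gate.kind)
        return memo[g]

    return value(output)


def wires(circuit: Dict[int, Gate]) -> int:
    """Size of an unbounded fan-in circuit, counted as its number of wires."""
    return sum(len(g.inputs) for g in circuit.values() if g.kind != "INP")


def depth(circuit: Dict[int, Gate], output: int) -> int:
    """Longest path from an input to the output gate (inputs have depth 0)."""
    def d(g: int) -> int:
        gate = circuit[g]
        return 0 if gate.kind == "INP" else 1 + max(d(h) for h in gate.inputs)
    return d(output)


# Example circuit: exclusive-or of two bits, (x0 OR x1) AND NOT (x0 AND x1).
XOR = {
    0: Gate("INP", [0]), 1: Gate("INP", [1]),
    2: Gate("OR", [0, 1]), 3: Gate("AND", [0, 1]),
    4: Gate("NOT", [3]), 5: Gate("AND", [2, 4]),
}
```

For the example circuit, `wires(XOR)` counts the seven connections between gates and `depth(XOR, 5)` follows the path through the inner AND gate, the NOT gate, and the output gate.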
We also define the following. A family of circuits C of size complexity Z(n) is log-space uniform if, given integers n, g, and i, there exists a TM that uses work space $O(\log Z(n))$ and outputs the type (AND, OR, NOT, inp) of gate g of $C_n$ and the number of the gate whose output is connected to the ith input of g.

Proof. Let C be a log-space uniform family of unbounded fan-in circuits of depth D(n) and size Z(n). Fix input size n. We construct a PRAM[ ] P that simulates $C_n$ based on the unit cost PRAM simulation by Stockmeyer and Vishkin [8]. Their simulation is described for nonuniform circuits and PRAMs, but if the circuit is uniform, then the PRAM is also uniform. P first constructs the description of $C_n$. Since C is a log-space uniform family of size complexity Z(n), a TM can construct the circuit description in $O(\log Z(n))$ space. By generalizing Theorem 1 to a simulation of a TM that produces the circuit description as an output, P simulates this TM and produces the description of $C_n$ in $O(\log^2 Z(n))$ time.
The remainder of the simulation proceeds as the simulation of Stockmeyer and Vishkin [8], where the number of steps is proportional to the depth of the circuit.

[...] $^{O(1)}$).
Proof. Stockmeyer and Vishkin [8] have shown that a unit cost PRAM[ ] with W(n) word size, P(n) processors, and T(n) time can be simulated by an unbounded fan-in circuit of depth O(T(n)) and size O(T(n)W(n) [...]). Their simulation is described for nonuniform PRAMs and circuits, but if the PRAM is uniform, then the circuit is also uniform. The same bounds work for the log cost PRAM[ ], provided the interprocessor communication timing is maintained.
Let P be the simulated PRAM[ ] and C the simulating family of circuits. A main component of C is a constant depth circuit block capable of executing any one of the instructions available in P's instruction set. We add a subblock that computes the size of the operands.
To ensure proper register updates when executing an instruction costing t time units, the circuit block corresponding to one time unit of P computes the result but delays the update by t units of time. C achieves this postponement by setting a counter to t and passing the result and counter to successive circuit blocks. Each circuit block (corresponding to one time unit of P) decrements the counter. When the counter reaches zero, the circuit block holding the result updates the proper memory register. Computing the operand length to determine t and setting up and checking a counter require only constant depth circuits. Since P uses P(n) [...] $^{O(1)}$) size and O(T(n)) depth suffice. □
Complexity Classes
In this section, we attempt to find a place for the PRAM under the log cost measure in the established hierarchy of classes for traditional models. We define the classes NC and AC, then show the place of complexity classes defined for the log cost PRAM in the hierarchy based on the bounds obtained through earlier simulations. This discussion should give an overall meaning to the individual simulations, comparisons, and results presented in the earlier sections.
The class NC is based on simultaneous utilization of two different resources toward program execution [16]. The classes NC and AC (which is a variation of the class NC) are very robust in allowing equivalent characterization by uniform circuits under various uniformity conditions, unit cost PRAMs, and alternating Turing machine models [...]

Theorem 9 showed that a T(n) time-bounded log cost CRCW PRAM with P(n) processors can be simulated by a family of log-space uniform unbounded fan-in circuits of depth $O(T(n))$ and size $(T(n)P(n))^{O(1)}$, since $W(n) \le T(n)$.
Conclusions
In this paper, we carried out a detailed study of a log cost CRCW PRAM[ ]. We established relations among PRAMs with different instruction sets, TMs, and uniform circuits. These containments indicate that while being more realistic, the log cost PRAM is still quite powerful and compares well with other models in terms of performance. Following are some interesting open problems.

[...]) size bound could possibly be eliminated altogether.
4. What are the effects of augmenting the basic instruction set with other instructions such as exponentiation?
