We study the relatively old problem of asymptotically reducing the runtime of serial computations with polynomial size Boolean circuits. To the best of our knowledge, no progress on this problem has been formally reported in the literature for general computational models, although we observe that early work of Chandra, Stockmeyer, and Vishkin implies the existence of non-uniform unbounded fan-in circuits of $t^{O(1)}$ size and $O(t/\log\log n)$ depth, for time $t$ Turing machines. We give an algorithmic size-depth tradeoff for parallelizing time $t$ random access Turing machines, a model at least as powerful as logarithmic cost RAMs. Our parallel simulation yields logspace-uniform $t^{O(1)}$ size, $O(t/\log t)$ depth Boolean circuits having semi-unbounded fan-in gates. In fact, for appropriate d, uniform
INTRODUCTION
A fundamental problem in Complexity Theory is to determine the extent to which arbitrary time-bounded computations can be parallelized. For example, the NC = P question asks if polynomial time computations can be captured by $(\log n)^{O(1)}$ depth, bounded fan-in, polynomial size circuits. Over the years, numerous parallel simulations have been given for speeding up general (serial) computational models [13, 14, 3, 9, 10, 8], each of which can be interpreted as a Boolean circuit family, where the number of processors in the parallel algorithm is polynomially related to the circuit size of the family. However, no known parallel simulation achieves an ω(1) speedup (that is, circuits of depth $o(t)$ for a time $t$ serial algorithm) with only a polynomial number of processors (respectively, gates), cf. [4, p.79]. On the other hand, Meyer auf der Heide [11] has shown that restricted RAMs (lacking a full set of operations and indirect addressing) can indeed be sped up by a $\log t/(\log\log t)^2$ factor using $t^{O(1)}$ processors, in a similarly restricted PRAM model.
In 1985, after presenting various speedups of deterministic Turing machines on parallel machines with an exponential number of processors, Dymond and Tompa [3] concluded that it would "be interesting to know what speedups can be achieved using a number of processors that is only polynomial in the time bound." While this is certainly an interesting theoretical question, it could also be of practical import, perhaps more so than parallel simulations which obtain a faster speedup but require exponentially many processors.
The only result we could find directly addressing this question is from Mak [10] in 1997, who shows a time $t$ RAM can be sped up to $\varepsilon t + n$ by a CREW PRAM of $t^{O(1)}$ processors, for ε > 0.
However, a scour of the literature revealed that work of Chandra, Stockmeyer, and Vishkin [2] yields non-uniform circuits of $o(t)$ depth and polysize for time $t$ Turing machines. They prove that any bounded fan-in circuit of size $n^{O(1)}$ and depth $d \geq \log n$ has an equivalent unbounded fan-in circuit of $n^{O(1)}$ size and depth $O(d/\log\log n)$. Briefly, [2] observe that a bounded fan-in circuit with depth $\varepsilon \log\log n$ and a single output gate can be seen as a Boolean function taking $O((\log n)^{\varepsilon})$ inputs, which in turn can be represented by an unbounded fan-in depth-2 circuit (in particular, a DNF) of size $2^{O((\log n)^{\varepsilon})}$. Every $\varepsilon \log\log n$ levels of the given depth $d$ circuit can therefore be replaced with a collection of $2^{O((\log n)^{\varepsilon})}$ size, depth 2 circuits. This results in an unbounded fan-in circuit of polynomial size and $O(d/\log\log n)$ depth, albeit one that is non-uniform. As it is well known that time $t$ Turing machines can be simulated by bounded fan-in circuits of depth $O(t)$ and size $O(t^2)$, the above transformation results in a simulation using $O(t/\log\log n)$ depth, unbounded fan-in polynomial size circuits.
Theorem 1. Let L be a language recognized by a Turing machine using time $t(n)$. Then L is recognized by a circuit family having unbounded fan-in, $t^{O(1)}(n)$ size, and $O(t/\log\log n)$ depth.
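The depth-2 replacement step in the argument of [2] can be illustrated on a toy scale. The following is a minimal sketch, not the construction from [2] itself: the function names `to_dnf` and `eval_dnf` and the example subcircuit are assumptions made for illustration. It reads a small bounded fan-in subcircuit as a Boolean function on k inputs and rewrites it as a DNF with at most $2^k$ terms by enumerating its truth table.

```python
from itertools import product

def to_dnf(f, k):
    """Return the DNF of f as a list of satisfying assignments.

    Each satisfying assignment stands for one AND term that fixes all k inputs.
    """
    return [bits for bits in product((0, 1), repeat=k) if f(*bits)]

def eval_dnf(terms, bits):
    """Depth-2 evaluation: one unbounded OR over the AND terms."""
    return int(any(all(b == t for b, t in zip(bits, term)) for term in terms))

# Hypothetical fan-in-2 subcircuit of depth 2 on k = 4 inputs.
def subcircuit(a, b, c, d):
    return (a & b) | (c & d)

k = 4
terms = to_dnf(subcircuit, k)

# The DNF agrees with the subcircuit on every input, and has at most 2^k terms.
agree = all(eval_dnf(terms, bits) == subcircuit(*bits)
            for bits in product((0, 1), repeat=k))
```

With $k = O((\log n)^{\varepsilon})$ inputs, the same enumeration gives the $2^{O((\log n)^{\varepsilon})}$ size bound quoted above. The enumeration is a table lookup per input length, which matches the text's remark that the resulting circuits are non-uniform.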
To obtain a more general, uniform circuit construction of smaller depth, we take a different approach. We show how to simulate time t random access Turing machines with uni- 
PRELIMINARIES
We use the definition [n] := {1, . . . , n}. We assume familiarity with basic concepts of complexity theory, though we will briefly review some notions required for this paper. Throughout, we implicitly assume all functions considered are those efficiently constructible under the appropriate resources.
Circuit Uniformity. In this paper, our circuit constructions will be logspace-uniform. A circuit family $\{C_n\}$ is logspace-uniform if there is a Turing machine $M$ that, given $1^n$ as input, writes a description of $C_n$ on its output tape. Furthermore, $M(1^n)$ runs in $O(\log n)$ space.
Deterministic Machine Model. Our main result simulates random access Turing machines. These are Turing machines with a fast access mechanism. This machine model is fairly powerful in that such machines can simulate logarithmic cost RAMs with only a constant factor overhead in runtime [15]. A random access TM has, along with the usual Turing machine equipment, $k + 1$ write-only binary index tapes, one for the input and $k$ for the worktapes. To access the $i$th cell of the input or a worktape, one writes $i$ in binary on the respective index tape (in $O(\log i)$ steps, if the index is blank), and switches to a special "access" state which moves the respective head to the $i$th cell of the respective tape in one timestep.
Background on Alternation
It will be convenient to describe circuits with alternating machines, a natural model of parallelism [1]. The reader may refer to Papadimitriou [12] for definitions. ATIME[t] and ASPACE[s] denote the classes of alternating machine time $t$ and space $s$, respectively; ATISP[t, s] denotes the sets accepted by alternating machines using both time $t$ and space $s$ simultaneously. $\Sigma_a$SPACE[s] denotes the class of sets accepted by alternating machines taking at most $a$ alternations and $O(s)$ space in every branch. We use the following relation between alternating computations and circuits, sketching its proof for completeness.
Theorem 2 (Generalizes [16] and [7]). For $a(n) \geq s(n) \geq \log n$, any language in $\Sigma_{a(n)}$SPACE[s(n)] is accepted by an unbounded fan-in, logspace-uniform circuit family of $O(a(n))$ depth and $a(n) \cdot 2^{O(s(n))}$ size.
Proof. The proof below is due to Immerman [7, p.87]. Let $M$ be an $s(n)$ space machine making $a(n)$ alternations. Let $C_1, C_2$ denote configurations of $M$ on an $n$-bit input. Define $E(C_1, C_2, x)$ to be a circuit that outputs 1 iff from $C_1$ there is a computation on input $x$ that reaches $C_2$ through existential states only. Similarly define $A(C_1, C_2, x)$ for universal states. $E$ and $A$ accept languages in NSPACE[s(n)] and coNSPACE[s(n)], respectively, and thus can be constructed uniformly, with unbounded fan-in, $2^{O(s(n))}$ size, and $O(s(n))$ depth. Let $I$ and $A_c$ be the unique initial and accepting configurations for $M$ on $n$-bit inputs. Now define a circuit ACCEPT via the following inductive definition:
It is clear that ACCEPT(I, a(n)) is true iff $M(x)$ accepts. Moreover, the resulting circuit is extremely regular, even logspace-uniform. Its size is dominated by the guessing of $C_2$, which requires $2^{O(s)}$ gates for each $i = 1, \ldots, a(n)$.
Semi-unbounded Fan-in Circuits.
A semi-unbounded fan-in circuit family has circuits where only OR gates are unbounded (AND gates have bounded fan-in). In the uniform setting, semi-unbounded and unbounded fan-in can make a difference; cf. Vollmer [17] . We note that the above theorem can be tweaked to get a result for semi-unbounded circuits, which we will use later.
Corollary 1. For $a(n) \geq s^2(n)$ and $s(n) \geq \log n$, any language in $\Sigma_{a(n)}$SPACE[s(n)] is accepted by a uniform circuit family of semi-unbounded fan-in, $O(a(n))$ depth, and $2^{O(s(n))}$ size.
Proof. Circuits $E$ and $A$ can be made to have bounded fan-in, size $2^{O(s(n))}$, and depth $O(s^2(n))$. Then the circuit for ACCEPT is only unbounded in ORs: namely, in the choice of $C_2$.
It follows that, to prove the claim of the introduction, it suffices for us to show DTIME[t] ⊆ $\Sigma_{t/(\varepsilon \log t)}$SPACE[ε log t].
Review of CRCW PRAMs
We assume a RAM model with the usual set of operations. We will work with the Concurrent Read, Concurrent Write (CRCW) PRAM. In such a model, we have a collection of RAM processors $P_1, \ldots, P_k, \ldots$ running a common program in parallel; the only difference between the processors is an instruction that allows $P_i$ to load its own number $i$ into a register. Each processor has its own local memory, and they all share a global memory. Global memory locations can be concurrently read and concurrently written, in that when more than one processor wants to write to the same location, we imagine a memory control that simply allows the processor of lowest index to write and ignores the rest (this is sometimes called the PRIORITY CRCW model).
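The write-conflict rule of the PRIORITY CRCW model can be sketched in a few lines. This is only an illustration of one write round under that rule; the function name `priority_write` and the request format `(processor_index, address, value)` are assumptions, not notation from the paper.

```python
def priority_write(memory, requests):
    """Apply one round of concurrent writes to shared memory.

    requests is a list of (processor_index, address, value) triples.
    When several processors target the same address, only the write
    from the processor of lowest index takes effect.
    """
    winners = {}
    for proc, addr, value in requests:
        if addr not in winners or proc < winners[addr][0]:
            winners[addr] = (proc, value)
    for addr, (_, value) in winners.items():
        memory[addr] = value
    return memory

# Processors 3 and 1 both write to address 0; processor 1 wins.
mem = priority_write({}, [(3, 0, "c"), (1, 0, "a"), (2, 5, "b")])
```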
Two measures of time are typically used for RAMs and PRAMs: unit cost and logarithmic cost. In the unit cost measure, every instruction takes one unit of time. Such a model can be "abused" in that it permits, for example, the addition of arbitrarily large integers in constant time, provided they are already present in memory. We use the logarithmic cost measure, which is more realistic but context sensitive: every instruction on integers i and j takes log i + log j units of time.
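The gap between the two cost measures can be made concrete with a small sketch. It uses the bit length of an integer as a stand-in for its logarithm, which is an assumption of this example (the text charges log i + log j); the function name `log_cost` is likewise hypothetical.

```python
def log_cost(i, j):
    """Logarithmic cost of one instruction on non-negative integers i and j.

    Assumption: log is approximated by bit length, with a minimum charge
    of 1 per operand so that operations on 0 are not free.
    """
    return max(1, i.bit_length()) + max(1, j.bit_length())

unit_cost = 1                                # unit cost: every instruction costs 1
small = log_cost(3, 5)                       # two short operands
large = log_cost(2 ** 1000, 2 ** 1000)       # two 1001-bit operands
```

Under the unit cost measure both additions would be charged 1, whereas the logarithmic charge grows with the operand length.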
MAIN RESULT
We shall now prove that uniform semi-unbounded fan-in circuits of [1]. This is a departure from all other parallel speedups of deterministic time that we are aware of, which instead build upon the proofs that DTIME[t] ⊆ ATIME[o(t)] (cf. [14, 3]) or DTIME[t] ⊆ SPACE[t/ log t] (cf. [6, 15, 5]), for a variety of computational models.
In the proof of DTIME[t] ⊆ ASPACE[log t], a time $t$ one-tape Turing machine $M$ is simulated by an alternating machine that runs $M$ "backwards", one step at a time. More precisely, the simulation starts in the accepting state of $M$ and the tape head at the leftmost tape cell (presumed to be blank) at timestep $t$. It then existentially guesses the content of adjacent cells and the transition taken in the previous timestep, then universally verifies each cell guess recursively. When translated to Boolean circuits, this construction yields circuits of $t^{O(1)}$ size and $O(t)$ depth. To reduce depth, we can guess several transitions at a time, in blocks. In parallel, we can verify the correctness of transitions within the current block, as well as chronologically earlier blocks that affect the current block. By properly choosing the block size and increasing the fan-in of gates, we simultaneously reduce the circuit depth and increase the size by at most a polynomial factor. With some care, this approach allows us to account for random accesses as well as sequential ones.
Theorem 3. For t(n) ≥ n and a(n) such that log n ≤ a(n) ≤ t(n)/ log t(n),
Proof. Without loss of generality, we suppose a(n) divides t(n). For simplicity, assume the given random access TM M has only one read-write tape which initially contains the input and t(n) blanks. The proof can easily be extended to any (finite) number of random access tapes.
Define the local action of $M$ on input $x$ at step $i \in [t]$ to be $(i, r, I)$, where $r$ is the transition taken and $I$ is the index tape content of $M(x)$ at step $i$. (In the general case where $M$ has $k$ tapes, a local action would contain $k$ index tape contents, e.g. $I_1, \ldots, I_k$.) Clearly, a local action takes at most $O(\log t)$ bits to describe. Let $S_{M,x}$ be the unique string of the form
where $\ell_i$ is the local action of $M(x)$ at step $i \cdot t(n)/a(n) + 1$, and $r_j$ is the vector of the next $t(n)/a(n) - 1$ transitions taken by $M(x)$ starting at step $j \cdot t(n)/a(n) + 2$. By assumption on $a(n)$, $|S_{M,x}|$ is $O(a(n) \log t(n) + t(n)) = O(t(n))$ bits. Define a block to be a substring of $S_{M,x}$ of the form $\ell_i r_i$. Our alternating machine $A$ to simulate $M$ will essentially reconstruct $S_{M,x}$, one block at a time. The machine $A$ uses two recursive procedures, VERIFY and LAST-WRITE, with the specifications:
• VERIFY(b, i) accepts iff $b$ is the $i$th block of $S_{M,x}$, and
• LAST-WRITE(I, i, σ) accepts iff σ is written in the chronologically last timestep where $I$ is the index tape, over timesteps $1, \ldots, i \cdot t/a$.
The procedures contain a number of deterministic checks, each of which is implementable in logarithmic space. We presume that if any check fails, the machine rejects the current branch. We also presume that a recursive call erases all worktape content, except for $t(n)$ and $a(n)$ (written in binary), and the relevant arguments for the call.
Machine A on input x initially sets a variable i to a(n), existentially writes b to tape, and calls VERIFY(b, i).
VERIFY(b, i), where $b = \ell\, r$:
If $i = a(n)$, check that the last transition of $r$ leads to an accept state.
If $i = 1$, accept iff $\ell$ starts in the initial state, and all symbol reads in $b$ are either blanks, appropriate bits of $x$, or symbols previously written in $b$.
Assume that the symbols read in $r$ are correct, and that $\ell$ is correct. Under these assumptions, check that all state changes, symbols written, and index tape modifications in $b$ are correct, i.e. the changes correspond to transitions of $M$.
Verify the pieces of the block in parallel:
Universally choose $j \in [t(n)/a(n)]$.
Verify $\ell$ is correct:
If $j = 1$, let $q$, σ, and $I$ be the state, symbol read, and index tape of $\ell$, respectively. Universally:
• Check the last write to the $I$th position was σ: call LAST-WRITE(I, i − 1, σ), AND
• Check $q$ and $I$ are correct: existentially guess block $b_{i-1}$. Check that the last transition of $b_{i-1}$ changes the state and index tape to the state $q$ and index tape $I$ in $\ell$. Call VERIFY($b_{i-1}$, i − 1).
Verify the correctness of symbols read in $r$:
If $j > 1$, let σ be the symbol read in transition $r[j - 1]$. Existentially guess the index tape $I$ for the timestep corresponding to transition $r[j - 1]$. Check $I$ is correct by simulating the index tape starting from $\ell$, using the transitions $r[1], \ldots, r[j - 2]$. Along the way, check if $I$ is the index tape for more than one timestep during this simulation.
If it is, check that σ was written in the last such occurrence of $I$ prior to $r[j - 1]$.
If it is not, call LAST-WRITE(I, i − 1, σ).
That completes VERIFY. Now we describe LAST-WRITE.
LAST-WRITE(I, i, σ): If $i = 1$, then accept iff σ is a blank or the appropriate input symbol of $x$.
Existentially guess block $b_i = \ell_i r_i$. Universally:
• Call VERIFY($b_i$, i), AND
• Check if the index tape is ever equal to $I$, when simulating the index tape starting from $\ell_i$ via the transitions in $r_i$. If so, accept iff σ is written in the last transition such that $I$ is the index tape. Otherwise, if the index tape is never $I$, call LAST-WRITE(I, i − 1, σ).
This completes the description of A. We now argue that A executes within the desired resources.
It is routine to verify that each deterministic check in VERIFY and LAST-WRITE can be performed in deterministic $O(\log t(n))$ space, as each one maintains at most $O(\log t(n))$ bits of information in its simulation of a block. Therefore these checks affect neither the number of alternations nor the overall space bound of $A$.
The number of alternations used by A can be discerned by induction on i. Observe each call of VERIFY(·, i) depends only on VERIFY(·, j) and LAST-WRITE(·, j, ·) calls, where j < i. A similar observation holds for LAST-WRITE. Inspection shows that, between every two consecutive calls of VERIFY or LAST-WRITE, at most a constant number of alternations occur. It follows by induction that VERIFY(b, i) and LAST-WRITE(I, i, σ) use at most O(i) alternations.
The proofs of correctness of VERIFY and LAST-WRITE are similarly straightforward, by induction on $i$. Recall that the string of local actions $S_{M,x}$ is unique for each $x$. Note that when a block is guessed and applied, we recursively verify its correctness in a separate branch. It follows that every block $b_i$ guessed in an accepting computation is the unique $i$th block within $S_{M,x}$.
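The quantity that LAST-WRITE certifies can be computed deterministically when the whole trace is available, which may help in reading the recursion. The sketch below is not the alternating procedure: it scans an explicit trace instead of guessing and verifying blocks. The trace format (one `(index_tape_value, symbol_written)` pair per timestep, so every step writes), the `initial` callback, and the base case at `i == 0` are all assumptions of this example and differ from the paper's indexing.

```python
def last_write(trace, block_len, I, i, initial):
    """Return the symbol most recently written at cell I during blocks 1..i.

    trace[s] is (index_tape_value, symbol_written) for timestep s + 1.
    initial(I) gives the content of cell I before the computation starts
    (a blank or an input symbol).
    """
    if i == 0:
        return initial(I)
    block = trace[(i - 1) * block_len : i * block_len]
    # Look for the last step in block i whose index tape equals I.
    for index, symbol in reversed(block):
        if index == I:
            return symbol
    # The index tape is never I in block i: defer to earlier blocks.
    return last_write(trace, block_len, I, i - 1, initial)

# Hypothetical trace of 6 steps, split into 3 blocks of length 2.
trace = [(0, "a"), (1, "b"), (0, "c"), (2, "d"), (1, "e"), (2, "f")]
blank = lambda cell: "_"
```

The two branches mirror the second bullet of LAST-WRITE: either the cell is touched within the current block, or the question is passed to block i − 1.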
The following is immediate from Corollary 1 and the above theorem.
Corollary 2. For a(n) and t(n) such that t(n) ≥ n and
2 , deterministic time $t$ random access Turing machines can be simulated by uniform $t^{O(1)} \cdot 2^{O(t/a)}$ size circuits of semi-unbounded fan-in and depth $O(a(n))$.
Hence, there exists a spectrum of size-depth tradeoffs for simulating time with semi-unbounded circuits. In particular, semi-unbounded fan-in circuits of polysize and depth O(t/ log t) are possible. Hence, a PRAM model where massive OR-parallelism is cheap but AND-parallelism is expensive would still be capable of performing our simulation.
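As a worked instance of the tradeoff, substituting $a(n) = t/\log t$ (the choice named in the text) into the size and depth bounds of Corollary 2 gives the polysize case. Constants hidden in the O-notation are not tracked in this sketch.

```latex
\[
  t^{O(1)} \cdot 2^{O(t/a)}
  \;=\; t^{O(1)} \cdot 2^{O(\log t)}
  \;=\; t^{O(1)},
  \qquad
  \text{depth } O(a) \;=\; O(t/\log t).
\]
```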
Carrying out the simulation on CRCW PRAMs
Corollary 3. The simulation of Theorem 3 can be performed by a logarithmic cost CRCW PRAM in $O(t/\log t)$ time. Therefore every log-cost time $t$ RAM can be simulated by a log-cost CRCW PRAM in $O(t/\log t)$ time with only $t^{O(1)}$ processors.
Proof. (Sketch) Our proof is in the spirit of Stockmeyer and Vishkin [16, Theorem 2]. Make a processor $P_{i,j}$ for each wire $(i, j)$ in the circuit from Theorem 3 simulating a given time $t$ RAM, where $i$ is the gate of lesser depth (the longest path from an input bit to $i$ is shorter than the longest path from an input to $j$). Gate $i$ in the circuit will correspond to the $i$th global memory location $g_i$. All $g_i$ are initially blank, except for those corresponding to input bits.
At its start, $P_{i,j}$ loads $i$ and $j$ into registers and determines the depth $d$ of $i$. (To ensure this can be done quickly, we may encode $i$ such that $d$ is a prefix in $i$'s binary encoding.) Next, $P_{i,j}$ computes $\log d$, and with it, $d/\log d$ as well. Observe these can be computed from a binary representation of $d$ in $(\log d)^{O(1)}$ time. Then $P_{i,j}$ counts down from $d/\log d$, by repeatedly decrementing the register that initially holds $d/\log d$. The point is that this decrementing stage takes $\Theta(d)$ time on a logarithmic cost RAM, so that $P_{i,j}$ waits just long enough (within a constant factor) for an answer to arrive at memory location $g_i$.
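The claim that the countdown takes time proportional to d under the logarithmic cost measure can be checked numerically. The sketch below charges the bit length of the register per decrement, which is an assumption standing in for the log cost of one instruction; the names `countdown_cost` and `delay_ratio` are hypothetical.

```python
import math

def countdown_cost(m):
    """Total logarithmic cost of decrementing a register from m down to 0.

    Assumption: one decrement of a register holding v costs v.bit_length().
    """
    return sum(v.bit_length() for v in range(1, m + 1))

def delay_ratio(d):
    """Ratio of the countdown cost from d / log d to the target delay d."""
    m = d // max(1, int(math.log2(d)))
    return countdown_cost(m) / d

# The ratio stays within constant bounds over a range of depths d = 2^k.
ratios = [delay_ratio(2 ** k) for k in range(8, 18)]
```

Roughly d / log d decrements, each costing about log d, sum to a delay within a constant factor of d, which is what the waiting argument needs.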
Without loss of generality, assume gate $i$ is an OR and gate $j$ is an AND; the proof is symmetric for the other three cases. After the $\Theta(d)$ time delay, $P_{i,j}$ checks if a 1 was written to $g_i$. (If gate $i$ was instead an AND gate, $P_{i,j}$ would have checked for a blank.) If a 1 was written, $P_{i,j}$ writes nothing to $g_j$. (If $j$ was an OR, $P_{i,j}$ would have written a 1.) Arguing inductively, we find that no write occurs to (the OR gate) $g_i$ if and only if gate $i$ gets set to 0 in the circuit evaluation, and no write occurs to (the AND gate) $g_j$ if and only if gate $j$ gets set to 1. NOT gates are handled in the obvious way.
By the uniformity of the circuit simulation, the above is doable by a CRCW PRAM of logarithmic cost where every processor runs in time O(t/ log t).
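The write convention used in the proof sketch (an OR gate is 1 iff some write reached its cell, an AND gate is 1 iff no write reached its cell) can be simulated sequentially. The following is a toy model under stated assumptions, not the PRAM program itself: the gate dictionary format and the name `evaluate` are hypothetical, NOT gates are omitted, and the timed delay of each processor is replaced by processing wires in order of the depth of their source gate.

```python
from itertools import product

def evaluate(gates, inputs):
    """Evaluate a circuit with one simulated processor per wire (i, j).

    gates maps a gate name to (kind, list_of_input_gates), where kind is
    "IN", "OR", or "AND". written[g] records whether any write reached g.
    """
    depth = {}

    def d(g):
        if g not in depth:
            kind, preds = gates[g]
            depth[g] = 0 if kind == "IN" else 1 + max(d(p) for p in preds)
        return depth[g]

    # Input cells start out written exactly when the input bit is 1.
    written = {g: kind == "IN" and inputs[g] == 1
               for g, (kind, _) in gates.items()}

    def value(g):
        # AND gates are 1 iff nothing was written; all others are 1 iff written.
        return int(not written[g]) if gates[g][0] == "AND" else int(written[g])

    wires = [(i, j) for j, (_, preds) in gates.items() for i in preds]
    # Processor P_{i,j} acts only after gate i has settled (depth order).
    for i, j in sorted(wires, key=lambda w: d(w[0])):
        v = value(i)
        if gates[j][0] == "OR" and v == 1:
            written[j] = True
        elif gates[j][0] == "AND" and v == 0:
            written[j] = True
    return {g: value(g) for g in gates}

# Hypothetical circuit: out = ((x1 OR x2) AND x3) OR x1.
CIRCUIT = {
    "x1": ("IN", []), "x2": ("IN", []), "x3": ("IN", []),
    "g1": ("OR", ["x1", "x2"]),
    "g2": ("AND", ["g1", "x3"]),
    "out": ("OR", ["g2", "x1"]),
}
```

Since every processor writes the same marker to a given cell, the priority rule for resolving concurrent writes does not affect the outcome in this model.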
It is instructive to compare the above corollary to Mak [9], who showed that every log-cost time $t$ RAM can be simulated by an alternating log-cost RAM in $O(t \log\log t/\log t)$ time. His proof mimics ideas from the time versus space work of Halpern et al. [5] on pointer machines. However, like all other work we have found in this direction, his simulation requires a superpolynomial number of processors, hence the resulting circuit requires at least this many gates.
CONCLUSION
We have of course given only a modest partial answer to the problem of parallelizing deterministic time with polynomial circuits. The obvious next step would be to achieve bounded fan-in circuits with similar size-depth parameters, if possible. This would imply, among other things, that the simulation can also be performed with a CREW (Concurrent Read, Exclusive Write) PRAM. As our simulation is quite different from past work, it is possible that some combination of our ideas with a simulation based on DTIME[t] ⊆ ATIME[t/ log t] could yield a better size-depth tradeoff.
ACKNOWLEDGEMENTS
The author appreciates the very useful commentary from anonymous referees for SPAA.
