Let $C$ be a circuit representing a straight-line program on $n$ inputs $x_1, x_2, \dots, x_n$. If for $1 \le i \le n$ an arrival time $t_i \in \mathbb{N}_0$ for $x_i$ is given, we define the delay of $x_i$ in $C$ as the sum of $t_i$ and the maximum number of gates on a directed path in $C$ starting in $x_i$. The delay of $C$ is defined as the maximum delay of one of its inputs.
Introduction
We consider circuits representing straight-line programs and refer to [26, 31] for basic definitions. Every such circuit is a directed acyclic graph whose vertices have been identified either with inputs, outputs or computation steps. The vertices identified with computation steps are called gates. The functions evaluated by gates belong to a finite set of functions, which is called the basis. Two classical measures associated with a circuit C are its size size(C), which is the number of its gates, and its depth depth(C), which is the maximum number of gates on a directed path in C.
One of the main motivations to study circuits is VLSI design where the main optimization issues are area consumption, power consumption and speed. Whereas size is an appropriate measure for area and power consumption, the relation between depth and speed is more problematic, since input signals may arrive at different times. The approach used in the industry to analyze the timing behaviour of a chip is the so-called static timing analysis [5, 6, 9] , which computes estimates for the arrival times for all relevant signals on a chip. This motivates the following definition.
Definition 1.
Given a circuit $C$ with inputs $x_1, x_2, \dots, x_n$ and given an integer arrival time $t_i \in \mathbb{N}_0 = \mathbb{N} \cup \{0\} = \{0, 1, 2, \dots\}$ for $x_i$ for $1 \le i \le n$, the delay of input $x_i$ for $1 \le i \le n$ in $C$ is defined as the sum of $t_i$ and the maximum number of gates on a directed path in $C$ starting at $x_i$. The delay $\mathrm{delay}(C)$ of $C$ is defined as the maximum delay of an input.
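Definition 1 can be evaluated directly on a circuit given as a directed acyclic graph. The following is a minimal sketch under stated assumptions: the function name `circuit_delay`, the dictionary encoding of the circuit, and the example circuit are all hypothetical, and output vertices are omitted so that every successor listed is a gate.

```python
from functools import lru_cache


def circuit_delay(successors, arrival):
    """Return (delay of the circuit, dict of per-input delays).

    successors: dict mapping a vertex to the list of gates it feeds
                (assumed encoding; output vertices are left out).
    arrival:    dict mapping each input vertex to its arrival time t_i.
    """

    @lru_cache(maxsize=None)
    def gates_from(v):
        # Maximum number of gates on a directed path starting at v,
        # not counting v itself.
        return max((1 + gates_from(g) for g in successors.get(v, ())), default=0)

    delays = {x: t + gates_from(x) for x, t in arrival.items()}
    return max(delays.values()), delays


# Hypothetical example: the circuit (x1 AND x2) OR x3 with gates g1 and g2.
example = {"x1": ["g1"], "x2": ["g1"], "x3": ["g2"], "g1": ["g2"]}
late_x3 = circuit_delay(example, {"x1": 0, "x2": 0, "x3": 3})[0]
late_x1 = circuit_delay(example, {"x1": 3, "x2": 0, "x3": 0})[0]
```

In this example the late input `x3` passes through one gate, while a late `x1` passes through two, so the same circuit has different delays for the two arrival time profiles.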
This definition perfectly corresponds to the worst-case static timing analysis necessary to guarantee correct functioning of a chip for all inputs: If we replace in Definition 1 the maximum number of gates with the maximum accumulated gate delay, then delay(C) equals the arrival time as calculated by static timing analysis. Therefore, this notion of delay is more appropriate for practical application than average case notions [8] .
Clearly, if $C$ is a circuit for some boolean function $f$ depending on the inputs $x_1, x_2, \dots, x_n$ with arrival times $t_1, t_2, \dots, t_n \in \mathbb{N}_0$, then

$$\max\{\mathrm{depth}(C), \max\{t_1, t_2, \dots, t_n\}\} \le \mathrm{delay}(C) \le \mathrm{depth}(C) + \max\{t_1, t_2, \dots, t_n\},$$

which implies that the delay of a minimum depth circuit for $f$ is at most twice the optimum delay. To some extent this remark justifies the use of circuits of small depth, most of which have been known for a long time (cf. e.g. [1, 10, 17, 27, 30]), to realize fundamental functions such as addition or multiplication on a chip. Nevertheless, arrival time differences are typically large compared to individual gate delays, and thus the speed and reliability of a chip can be improved considerably by taking arrival times into account. For some attempts to do so we refer the reader to [4, 13, 15, 18-21, 25, 29, 33].

The papers [18-21], for instance, are a good example of how arrival time differences are taken into account in the construction of a parallel multiplier. Without providing rigorous performance proofs, the authors describe heuristics to adapt the delay profile of an adder whose inputs are the partial products of the parallel multiplier to the delays caused by this multiplier. Whereas the problems and optimization margins related to uneven arrival time profiles are recognized in these papers, only rather restricted situations are solved, and no general mathematical framework capturing the relevant problem is provided. Providing such a framework is what we propose in the present paper.

The reader should be aware that the notion of "delay" proposed in Definition 1 is a simplified model. In fact, it is about the simplest mathematical abstraction capturing static timing analysis that is "clean" enough to allow a mathematical treatment.
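The factor of two mentioned in the remark on minimum depth circuits can be spelled out in one chain of inequalities. The names $C_d$ (a minimum depth circuit for $f$), $C^*$ (a minimum delay circuit for $f$) and $t_{\max} = \max\{t_1, t_2, \dots, t_n\}$ are introduced only for this derivation.

```latex
\begin{align*}
\mathrm{delay}(C_d)
  &\le \mathrm{depth}(C_d) + t_{\max}
     && \text{(upper bound on the delay of any circuit)} \\
  &\le \mathrm{depth}(C^*) + t_{\max}
     && \text{($C_d$ has minimum depth)} \\
  &\le 2 \max\{\mathrm{depth}(C^*),\, t_{\max}\} \\
  &\le 2\, \mathrm{delay}(C^*)
     && \text{(lower bound on the delay of any circuit).}
\end{align*}
```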
Whereas circuits of minimum depth often display a very regular structure, circuits of minimum delay may look quite irregular even for simple functions. Therefore, apart from having a purely practical motivation, the notion of delay leads to interesting theoretical problems.
In the present paper we prove some fundamental results on the delay. In Section 2, we prove a lower bound and construct circuits of close-to-optimal delay for some classes of functions. In Section 3, we describe circuits solving the prefix problem on n inputs that are of essentially optimal delay and of size O(n log(log n)). Finally, in Section 4, we relate formula size and delay.
A lower bound and simple cases
First we extend a lower bound on the depth due to Winograd [32] .
Proposition 1.
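If $C$ is a circuit of fan-in at most $r$ for some boolean function $f$ depending on the inputs $x_1, x_2, \dots, x_n$ with arrival times $t_1, t_2, \dots, t_n \in \mathbb{N}_0$, then

$$\mathrm{delay}(C) \ge \left\lceil \log_r\left(\sum_{i=1}^{n} r^{t_i}\right) \right\rceil.$$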
Proof. The existence of a circuit $C$ of fan-in at most $r$ and delay $T$ for $f$ implies the existence of a rooted $r$-ary tree with $n$ leaves of depths

It is a simple exercise (leading to an alternative proof of Proposition 2) to show that circuits of minimum delay for functions as in Proposition 2 can also be obtained by a greedy algorithm that iteratively replaces two inputs, say $x_i$ and $x_j$, of smallest arrival times $t_i$ and $t_j$ with a new input of arrival time $\max\{t_i, t_j\} + 1$. It is obvious that Proposition 2 and the greedy procedure generalize to $r > 2$.
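The greedy procedure just described is easy to simulate for fan-in 2. The following sketch is illustrative only: the function names `greedy_delay` and `log_sum_bound` are hypothetical, and the comparison value $\lceil \log_2(\sum_{i=1}^{n} 2^{t_i}) \rceil$ is the quantity that appears in the delay bounds later in the paper, computed here with exact integer arithmetic.

```python
import heapq


def greedy_delay(arrival_times):
    """Delay reached by repeatedly merging the two earliest signals.

    Each merge models one fan-in-2 gate whose output is available
    one time unit after its later input.
    """
    heap = list(arrival_times)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)  # smallest arrival time
        b = heapq.heappop(heap)  # second smallest arrival time
        heapq.heappush(heap, max(a, b) + 1)
    return heap[0]


def log_sum_bound(arrival_times):
    """ceil(log2(sum of 2**t_i)), computed without floating point."""
    w = sum(2 ** t for t in arrival_times)
    return (w - 1).bit_length()


profile = [1, 1, 2, 3, 4]
print(greedy_delay(profile), log_sum_bound(profile))
```

For the profile above both values coincide, which is the behaviour one expects for the commutative and associative functions considered here.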
The next theorem shows that the functions considered in Proposition 2 are essentially the only ones for which we can always achieve a delay as in (1) 
where $y_i$ equals either $x_i$ or $\neg x_i$ for $1 \le i \le n$.
Proof. The 'if'-part of the statement follows easily from Proposition 2 and we proceed to the proof of the 'only if'-part.
Let $f$ have the described property. By assigning arrival times $1, 1, 2, 3, \dots, (n-2), (n-1)$ to the inputs, it follows that for every permutation $\pi \in S_n$ there is a representation of $f$ of the form
such that $g_i$ depends on both of its inputs for $1 \le i \le n-1$. By considering all 10 different boolean functions depending on two inputs and using the relations
where $y_i$ equals $x_i$ or $\neg x_i$ for $1 \le i \le n$ and $\bullet_i \in \{\wedge, \vee, \oplus\}$ for $1 \le i \le n-1$. For contradiction, we assume that (3) for all permutations $\pi$. For each of the four different possibilities this easily implies a contradiction to (3). We leave the details to the reader and give just one example:
which is not true for $x_2$ or $x_3$. This contradicts (3) for permutations $\pi$ with $\pi(1) \in \{2, 3\}$.
Hence we may assume that (
In both cases, changing the value of $x_1$ changes the value of $f(x_1, x_2, x_3)$, which is not true for $x_2$ or $x_3$. Again, this easily implies a contradiction to (3). Now let $n \ge 4$. By substituting appropriate constants for all inputs except $x_i$, $x_{i+1}$ and $x_{i+2}$ we reduce $f$ in (3)
. Clearly, similar arguments as above imply a contradiction and the proof is complete.
The prefix problem
In this section we consider the so-called prefix problem.
Prefix problem Input: An associative operation
The prefix problem lies at the core of many fundamental problems. The hard part in designing a fast adder, for example, is the calculation of the carry bits, which is equivalent to a prefix problem. It is an easy exercise to construct circuits for the prefix problem of depth $\log_2(n) + o(\log(n))$ and size $O(n \log(n))$ or of depth $2 \log_2(n) + o(\log(n))$ and size $O(n)$. With a little more effort Ladner and Fischer [14] construct such circuits with depth $\lceil \log_2(n) \rceil + k$ and size $2n(1 + 1/2^k)$ for each $0 \le k \le \lceil \log_2(n) \rceil$. None of these constructions can accommodate arrival times.
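To make the "easy exercise" concrete, here is a minimal sketch of one classical divide-and-conquer prefix scheme of logarithmic depth and size $O(n \log(n))$. It is not the construction of this paper and, like the circuits mentioned above, it ignores arrival times. The function name `prefix_circuit` and the bookkeeping of depths and gate counts are assumptions made for illustration.

```python
def prefix_circuit(values, op):
    """Compute all prefixes of `values` under the associative operation `op`.

    Returns (outputs, gates), where outputs is a list of (value, depth)
    pairs, depth being the number of gates on the longest path to that
    output, and gates is the total number of op-gates used.
    """

    def rec(items):
        if len(items) == 1:
            return [(items[0], 0)], 0
        mid = len(items) // 2
        left, gates_left = rec(items[:mid])
        right, gates_right = rec(items[mid:])
        # The last prefix of the left half is combined with every prefix
        # of the right half; this is where the large fan-out comes from.
        last_value, last_depth = left[-1]
        combined = [(op(last_value, v), max(last_depth, d) + 1) for v, d in right]
        return left + combined, gates_left + gates_right + len(right)

    return rec(list(values))


outputs, gates = prefix_circuit(range(1, 9), lambda a, b: a + b)
prefix_sums = [value for value, _ in outputs]
depth = max(d for _, d in outputs)
```

For eight inputs this sketch uses 12 gates at depth 3. The operands are always passed to `op` in left-to-right order, so the scheme only relies on associativity, not on commutativity.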
Our main result in this section is the recursive construction of circuits $P(t_1, t_2, \dots, t_n)$ over the basis $\{\bullet\}$ that solve the prefix problem on $n$ inputs $x_1, x_2, \dots, x_n$ with arrival times $t_1, t_2, \dots, t_n \in \mathbb{N}_0$ and that are of close-to-optimal delay and of size $O(n \log(\log(n)))$. Constructions similar to those described by Liu et al. in [15] yield circuits for the prefix problem with close-to-optimal delay but quadratic size.
Construction of $P(t_1, t_2, \dots, t_n)$
For $n = 1$ the circuit $P(t_1)$ consists just of the input vertex $x_1$ having fan-out 1.
For $n \ge 2$ we apply the following steps.
Step 1: Partition the set $\{1, 2, \dots, n\}$ into $l := \lceil \sqrt{n} \rceil$ sets
Step 2: For $1 \le i \le l$ we use the following dynamic programming approach to construct a circuit $C_i$ over the basis $\{\bullet\}$ calculating
In what follows C j 1 ,j 2 will denote a circuit calculating 2 consists just of the corresponding input vertex. If j 1 < j 2 , we recursively construct C(j 1 , j 2 ) using one •-gate joining the outputs of two circuits C(j 1 , l) and
•-gates. It will follow from Lemma 1 below that the computation of y i by C i terminates at time t (y i ) with
Step 3: For $1 \le i \le l$ we recursively construct
and use these circuits to calculate all $(n_i - 1)$ prefixes on the inputs $x_j$ for $j \in V_i \setminus \{n_1 + n_2 + \dots + n_i\}$.
Step 4: We construct $P(t(y_1), t(y_2), \dots, t(y_{l-1}))$ to calculate all $(l-1)$ prefixes on the inputs $y_j$ for $1 \le j \le l-1$ calculated by the circuits constructed in Step 2.
Step 5 $x_k$ of the circuit constructed in Step 3 using one •-gate which
Step 6: Finally, we join the output $(y_1 \bullet y_2 \bullet \dots \bullet y_{l-1})$ of the circuit constructed in Step 4 with the output $y_l$ of the circuit constructed in Step 2 using one •-gate which calculates $\bullet_{i=1}^{n} x_i$.
In Figs. 1 and 2 we illustrate $P(t_1, t_2, \dots, t_n)$ for $n \le 4$ and $P(t_1, t_2, \dots, t_{25})$.
The next lemma proves the claim made in Step 2. For $a, a_1, a_2, \dots, a_n \in \mathbb{N}_0$ let $D : \bigcup_{i \in \mathbb{N}} \mathbb{N}_0^i \to \mathbb{N}_0$ be defined recursively by $D(a) = a$ and

$$D(a_1, a_2, \dots, a_n) = \min_{1 \le l \le n-1} \max\{D(a_1, a_2, \dots, a_l),\, D(a_{l+1}, a_{l+2}, \dots, a_n)\} + 1.$$
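The recursive definition of $D$ translates directly into a dynamic program over intervals, which is also the idea behind Step 2. The sketch below is a straightforward memoized evaluation of the recursion; the function name `min_delay_in_order` is hypothetical, and no attempt is made to be faster than the obvious cubic running time.

```python
from functools import lru_cache


def min_delay_in_order(a):
    """Evaluate D(a_1, ..., a_n) from its recursive definition.

    The inputs must be combined in the given order, so only splits of
    the sequence into a left and a right block are allowed.
    """
    a = tuple(a)

    @lru_cache(maxsize=None)
    def d(i, j):
        # D(a_i, ..., a_j) for 0-based, inclusive indices i <= j.
        if i == j:
            return a[i]
        return 1 + min(max(d(i, l), d(l + 1, j)) for l in range(i, j))

    return d(0, len(a) - 1)


print(min_delay_in_order([0, 0, 3]))  # late input at the end of the sequence
print(min_delay_in_order([0, 3, 0]))  # late input in the middle
```

The two calls use the same multiset of arrival times but give 4 and 5 respectively, which illustrates that the position of a late input matters once the order of the operands is fixed.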
Lemma 1.
Then
Proof. We start with a series of claims. Let $n_A, n_B \in \mathbb{N}$, $l \in \mathbb{N}_0$, $A, A' \in \mathbb{N}_0^{n_A}$ with $A \le A'$ (componentwise) and $B \in \mathbb{N}_0^{n_B}$. In order to simplify our notation we denote the vector $(a_1, a_2, \dots, a_{n_A}, b_1, b_2, \dots, b_{n_B})$ by $(A, B)$ where $A = (a_1, a_2, \dots, a_{n_A})$ and $B = (b_1, b_2, \dots, b_{n_B})$. Furthermore, for $l \ge 1$ let $Z(l)$ denote the vector of $l$ zeros.
Proof of Claim 2.
Again by induction, we obtain $D(Z(2^i)) = i$ for $i \ge 0$, which immediately implies the desired result.
Claim 3. $D(A, l) \le D(A, Z(2^l))$ and $D(l, B) \le D(Z(2^l), B)$.
Proof of Claim 3. We only prove the first inequality. The second follows by symmetry.
For contradiction, we assume that (A, l) is a counterexample of minimum length n A + 1. Z(2 l ) )} + 1 for some non-trivial A 1 and some A 2 with (A 1 , A 2 ) = A, then (4) and the choice of (A, l) imply the contradiction
Therefore, there is some $1 \le r \le 2^l - 1$ such that
By (4), we have
which is a contradiction. Hence D(A) < l and we obtain the contradiction
and the proof of the claim is complete.
Claim 4. $D(A, l, B) \le D(A, Z(2^{l+1}), B)$.
Proof of Claim 4. This can be proved similarly to Claim 3 and we leave the proof to the reader.
Altogether we obtain
and the proof is complete.
From the above construction it is obvious that some gates have fan-out up to $O(\sqrt{n})$. Similarly, the constructions of Ladner and Fischer [14] lead to large fan-outs. For many practical applications, though, a fan-out of $l$ at a gate should contribute a term of order $\log(l)$ to the delay of that gate.
In the present situation we model this by using the basis {•, id}, where id : D → D is the identity function, and the following fan-out conditions.
(i) Input vertices and •-gates have fan-out at most 1.
(ii) id-gates have fan-out at most 2.
Next, we construct circuits $P'(t_1, t_2, \dots, t_n)$ over the basis $\{\bullet, \mathrm{id}\}$ that satisfy Conditions (i) and (ii) and solve the prefix problem on $n \ge 2$ inputs $x_1, x_2, \dots, x_n$ with arrival time $t_i \in \mathbb{N}_0$ for $x_i$ for $1 \le i \le n$.
Construction of $P'(t_1, t_2, \dots, t_n)$
Starting from $P(t_1, t_2, \dots, t_n)$ we apply the following steps.
Step 1: Add one id-gate at input vertices of fan-out 2. (Note that all other input vertices already have fan-out 1.)
Step 2: For $2 \le i \le l-1$ add $(n_i - 1)$ id-gates at the •-gate calculating $y_1 \bullet y_2 \bullet \dots \bullet y_{i-1}$ in such a way that they contribute a delay of $\lceil \log_2(n_i) \rceil$. (Note that this is clearly possible using balanced binary trees.)
Step 3: Add $n_l$ id-gates at the •-gate calculating $y_1 \bullet y_2 \bullet \dots \bullet y_{l-1}$ in such a way that they contribute a delay of $\lceil \log_2(n_l + 1) \rceil$.
Step 4: Recursively apply the above changes to the subfunctions of the form $P(t_1, t_2, \dots, t_l)$ used in $P(t_1, t_2, \dots, t_n)$.
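The gate counts and delays in the steps above come from distributing one signal to several consumers through a balanced binary tree of id-gates. The following sketch simulates this under the fan-out conditions (i) and (ii); the function name `fanout_tree` and the wire bookkeeping are hypothetical and only meant to check the arithmetic.

```python
import heapq


def fanout_tree(copies):
    """Deliver one signal to `copies` consumers using id-gates of fan-out 2.

    Returns (id_gates, extra_delay). The source gate has fan-out 1, so a
    single consumer needs no id-gate at all.
    """
    if copies < 1:
        raise ValueError("at least one consumer is required")
    # Depths (in id-gates) of the wires that are currently free.
    free_wires = [0]
    id_gates = 0
    while len(free_wires) < copies:
        depth = heapq.heappop(free_wires)  # extend the shallowest wire
        id_gates += 1
        heapq.heappush(free_wires, depth + 1)
        heapq.heappush(free_wires, depth + 1)
    return id_gates, max(free_wires)


print(fanout_tree(5))
```

Serving $n_i$ consumers takes $n_i - 1$ id-gates and adds a delay of $\lceil \log_2(n_i) \rceil$, and serving $n_l + 1$ consumers takes $n_l$ id-gates, in line with Steps 2 and 3.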
For $w \ge n \ge 1$ let $\mathrm{size}(n)$ and $\mathrm{delay}(w, n)$ denote the maximum size and the maximum delay of a circuit $P(t_1, t_2, \dots, t_n)$ such that $t_i \in \mathbb{N}_0$ for $1 \le i \le n$ and $w = \sum_{i=1}^{n} 2^{t_i}$. Define $\mathrm{size}'(n)$ and $\mathrm{delay}'(w, n)$ for $P'(t_1, t_2, \dots, t_n)$ similarly. We have the following recursions.
Lemma 2. For w n
Proof. Let $t_1, t_2, \dots, t_n \in \mathbb{N}_0$ be such that $w = \sum_{i=1}^{n} 2^{t_i}$. We use the same notation as during the construction of $P(t_1, t_2, \dots, t_n)$ and $P'(t_1, t_2, \dots, t_n)$. Since $n \ge 3$, we have $2 \le n_1 \le l$.
The circuit $P(t_1, t_2, \dots, t_n)$ contains $(l+1)$ subcircuits of the form $P(t_1, t_2, \dots, t_l)$ on at most $(l-1)$ inputs each. To evaluate $y_i$ for $1 \le i \le l$, a number of $(n_1 - 1) + (n_2 - 1) + \dots + (n_l - 1) = (n - l)$ •-gates are used. Finally, to compute the remaining outputs, $(n_2 - 1) + (n_3 - 1) + \dots + (n_l - 1) + 1 \le (n - l)$ more •-gates are used. This implies the recursion for $\mathrm{size}(n)$.
Since the construction of $P'(t_1, t_2, \dots, t_n)$ from $P(t_1, t_2, \dots, t_n)$ recursively adds
id-gates, the recursion for $\mathrm{size}'(n)$ follows. Now we proceed to $\mathrm{delay}(w, n)$ and $\mathrm{delay}'(w, n)$. We have
As $\mathrm{delay}(w, n)$ is obviously non-decreasing in $w$, the recursion for $\mathrm{delay}(w, n)$ follows. Since the construction of $P'(t_1, t_2, \dots, t_n)$ from $P(t_1, t_2, \dots, t_n)$ recursively increases the delay by $1 + \max\{\lceil \log_2(n_2) \rceil, \dots, \lceil \log_2(n_{l-1}) \rceil, \lceil \log_2(n_l + 1) \rceil\}$,
the recursion for $\mathrm{delay}'(w, n)$ follows.
In the next lemma we solve the above recursions.
Lemma 3. (i)
Let s : N → N, , 0 and n 0 ∈ N be such that for n n 0
Then there is some 0 such that for all n ∈ N s(n) n log 2 (log 2 (n)) + s(1).
(ii) Let d : N 2 → N, , 0 and n 0 ∈ N be such that for w ∈ N and n n 0 − 1 the term d(w, n) − log 2 (w) is bounded and for w ∈ N and n n 0
Then there is some
(iii) Let d : N 2 → N, , 0 and n 0 ∈ N be such that for w ∈ N and n n 0 − 1 the term d(w, n) − log 2 (w) is bounded and for w ∈ N and n n 0
Proof. We just prove (i) and leave the analogous proofs of (ii) and (iii) to the reader. We will prove (5) by induction. Let > . Clearly, there is some n 1 n 0 such that for n n 1 and
Let be such that (5) holds for n n 1 − 1. For n n 1 we obtain, by induction, that
and the proof of (5) is complete.
Combining Lemmata 2 and 3 with the obvious fact that $\log_2\left(\sum_{i=1}^{n} 2^{t_i}\right) \ge \max\{t_i \mid 1 \le i \le n\}$ we obtain the main result of this section.
Theorem 2.
The prefix problem on inputs x 1 , x 2 , . . . , x n with arrival times t 1 , t 2 , . . . , t n ∈ N 0 can be solved by (i) a circuit over the basis {•} with size O(n log(log(n))) and delay
(ii) a circuit over the basis {•, id} satisfying the fan-out conditions (i) and (ii) with size O(n log(log(n))) and delay
Furthermore, both kinds of circuits can be constructed in polynomial time.
Clearly, Theorem 2 is most interesting for considerable arrival time differences. In this case $\log_2(n)$ may be arbitrarily small compared to $\log_2\left(\sum_{i=1}^{n} 2^{t_i}\right)$. Hence, applying well-known methods for fan-out reduction (e.g. [7]) to $P(t_1, t_2, \dots, t_n)$ leads to weaker results than (ii) in Theorem 2.
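The relation between the two logarithmic terms can be made explicit with elementary bounds. The abbreviation $t_{\max} = \max\{t_i \mid 1 \le i \le n\}$ and the example profile with a single late input arriving at time $T$ are introduced only for this illustration.

```latex
\begin{align*}
2^{t_{\max}} \;\le\; \sum_{i=1}^{n} 2^{t_i} \;\le\; n \cdot 2^{t_{\max}}
\quad &\Longrightarrow \quad
t_{\max} \;\le\; \log_2\Bigl(\sum_{i=1}^{n} 2^{t_i}\Bigr) \;\le\; t_{\max} + \log_2(n), \\[4pt]
t_1 = T,\; t_2 = \dots = t_n = 0
\quad &\Longrightarrow \quad
\log_2\Bigl(\sum_{i=1}^{n} 2^{t_i}\Bigr) = \log_2\bigl(2^{T} + n - 1\bigr) \;\ge\; T .
\end{align*}
```

For fixed $n$ and growing $T$ the term $\log_2(n)$ therefore becomes negligible compared to $\log_2\left(\sum_{i=1}^{n} 2^{t_i}\right)$.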
As we mentioned at the beginning of this section, circuits for the prefix problem can be used to construct adders. Given arrival times, say $t_1, t_2, \dots, t_n \in \mathbb{N}_0$ and $t'_1, t'_2, \dots, t'_n \in \mathbb{N}_0$, for the bits of two $n$-bit binary numbers, say $x$ and $y$, and using a well-known construction (cf. e.g. [14, 31]), we obtain a circuit over the basis $\{\vee, \wedge, \neg\}$ of fan-in 2 for ∨- or ∧-gates and fan-in 1 for ¬-gates calculating the sum of $x$ and $y$ with size $O(n \log(\log(n)))$ and delay
In view of Proposition 1, the bounds on the delay given in Theorem 2 are close-to-optimal and the bounds on the size are optimal up to a factor of $O(\log(\log(n)))$. The best known adders are of depth $\log_2(n) + O(\sqrt{\log(n)})$ and size $O(n \log(n))$ [1] or size $O(n)$ [10], respectively. The adder developed in [33], which takes arrival times into account, has size $O(n \log(n))$, but no delay bound has been proved.
Formula size and delay
In this section we extend a well-known type of result relating formula size and depth. The first such result was proved by Spira [28] , whose original idea underwent numerous variations [2, 3, 11, 12, 16, 22, 23] . Most of these can be generalized from depth to delay similarly to the next theorem.
The following proof relies on restructuring a given formula using of the so-
Since most standard cell libraries in VLSI design contain a primitive gate for this function, the proof can easily be turned into a practical strategy to speed up a late signal on a chip by applying the restructuring step to some part of its fan-in cone. Then there is a circuitC for f over such that
Proof. We prove the result by induction over n. Let w = n i=1 r t i and note that /(log r (r
Therefore, there is some ∈ N independent of C and f such that (8) holds for n = 1. We may assume that . Now let n 2. The directed graph underlying C is a rooted tree T whose leaves are the inputs x 1 , x 2 , . . . , x n . For every vertex u of T let w(u) denote the sum of r t i where the sum extends over all i such that x i lies in the subtree of T rooted at u.
Let the vertex $u$ be chosen such that (i) $w(u) > w/(r+1)$, (ii) $w(u)$ is minimum subject to (i), and (iii) $u$ has maximum distance from the root subject to (i) and (ii). It is easy to see that $w(u) < w$. Let $C_u$ denote the subcircuit of $C$ corresponding to the subtree of $T$ rooted at $u$. For $i \in \{0, 1\}$ let $C_i$ denote the circuit that arises from $C$ by replacing the output of $u$ by the constant $i$.
Since C u , C 1 and C 0 are circuits for functions defined on at most (n−1) inputs, we can apply the induction hypothesis to them. This implies the existence of circuitsC u ,C 1 andC 0 over for the same functions whose delay is bounded as in (8) .
Note that if g denotes the function computed at the vertex u, then clearly f = sel(g, f | g=1 , f | g=0 ). Therefore, using C u ,C 1 ,C 0 and the circuit for sel over , we can construct a circuitC for f over such that delay(C) + log r (r + 1) − 1 max{log r (w(u)), log r (w − w(u))} + .
By the choice of u, we have w − w(u) wr/(r + 1). If w(u) wr/(r + 1), then (9) implies delay(C) + log r (r + 1) − 1 log r wr r + 1 + = log r (r + 1) − 1 log r (w) + .
Hence, we may assume that w(u) > wr/(r + 1). In this case u must be a leaf of T and C u has delay log r (w(u)). Therefore, we can strengthen (9) as follows:
delay(C) + max log r (w(u)), log r (r + 1) − 1 log r (w − w(u)) + .
If the right-hand term yields the maximum in (10), then we can proceed as before. Hence, we may assume that the left-hand term yields the maximum in (10) and trivially we obtain delay(C) + log r (w(u)) log r (r + 1) − 1 log r (w) + , which completes the proof.
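The restructuring identity $f = \mathrm{sel}(g, f|_{g=1}, f|_{g=0})$ used in the proof above can be checked by brute force on a small formula. In the sketch below, `sel(a, b, c)` is assumed to return `b` if `a` is true and `c` otherwise, which is the reading suggested by that identity; the functions `g`, `h` and `f` form a made-up example formula in which `g` plays the role of the subformula rooted at $u$.

```python
from itertools import product


def sel(a, b, c):
    """Selector: b if a is true, c otherwise (assumed definition)."""
    return b if a else c


def g(x1, x2):
    # Subformula computed at the chosen vertex u (hypothetical example).
    return x1 and not x2


def h(u, x3, x4):
    # Rest of the formula, with the output of u as an explicit argument.
    return (u or x3) != x4


def f(x1, x2, x3, x4):
    # The original formula.
    return h(g(x1, x2), x3, x4)


def restructured(x1, x2, x3, x4):
    # g and the two restrictions of f are evaluated independently and
    # joined by a single selector at the end.
    return sel(g(x1, x2), h(True, x3, x4), h(False, x3, x4))


identity_holds = all(
    f(*x) == restructured(*x) for x in product([False, True], repeat=4)
)
print(identity_holds)
```

In the restructured form the signal `g` reaches the output through one selector only, which is why applying the step to the fan-in cone of a late signal can reduce its delay.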
Conclusions
Motivated by the use of circuits as a mathematical model in VLSI design we proposed the notion of delay. It naturally extends the notion of depth using information provided for example by static timing analysis. Several engineering publications and industrial trends show that chip designers are becoming aware of the need for such a notion [4, 10, 15, 18-21, 29, 33].
We proved several fundamental results about delay and described algorithms leading to circuits of small delay. The general strategies used in these algorithms can clearly be applied to a variety of problems that are both of theoretical and practical interest.
The definition of delay grew naturally out of a close ongoing cooperation between our own institute and the IBM company that has lasted for more than 18 years. Combined with realistic library-dependent delay estimates, gate sizing, repeater tree construction, placement/legalization and routing, algorithms [25] based on the theoretical results presented here are currently being implemented as part of our Bonn Tools, which are design automation tools developed at our institute for industrial use.
