We present an algorithm that computes a Boolean circuit for an AND-OR path (i.e
Introduction
An AND-OR path is a Boolean formula of type $t_0 \land (t_1 \lor (t_2 \land (\dots t_{m-1}) \dots))$ or $t_0 \lor (t_1 \land (t_2 \lor (\dots t_{m-1}) \dots))$.
We assume that for each Boolean input variable $t_i$ ($i \in \{0, \dots, m-1\}$), an arrival time $a(t_i)$ is given. Our goal is to find a Boolean circuit over the basis {∧, ∨} that computes the Boolean function of a given AND-OR path and minimizes the maximum delay of the inputs. Here, the delay of an input $t_i$ in a Boolean circuit is its arrival time $a(t_i)$ plus the length of a maximum directed path in the circuit that starts at $t_i$. Thus, the concept of delay minimization generalizes the concept of depth minimization. When only depth is considered, one has to assume that all input signals are available at the same time (uniform arrival times), which is not the case on real-world chips. Hence, taking non-uniform arrival times into account leads to a much more realistic problem formulation.
Table 1: Known upper bounds on delay of AND-OR paths with non-uniform arrival times.

Author    | Upper bound on delay                                      | Size                              | Maximum fanout
[9]; [5]  | $1.441 \log_2 W + 2.674$                                  | $O(m)$                            | 2
[8]; [11] | $(1+\varepsilon) \log_2 W + 3/\varepsilon + 5$            | $O(m/\varepsilon)$                | 2
[8]; [11] | $(1+\varepsilon) \log_2 W +$                              |                                   |
[11]      | $\log_2 W + 2\sqrt{2 \log_2 m} + 6$                       | $O(m \log_2 m)$                   | 2
[11]      | $\log_2 W + 2\sqrt{2 \log_2 m} + 6$                       | $O(m)$                            | $2 \sqrt{2 \log_2 m} + 1$
Here      | $\log_2 W + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 5$ | $O(m \log_2 m \log_2 \log_2 m)$   | $\log_2 m + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 4.1$
More specifically, computing fast AND-OR paths is, up to a small additive constant, equivalent to constructing fast adder circuits (see, e.g., Grinchuk [4] ). For k-bit adders, the carry bit computation is essentially the evaluation of an AND-OR path of length 2k. Once all carry bits are known, the sum can be computed with an additional delay of 2. Adders with non-uniform input arrival times occur, e.g., as a part of multiplication units (see [14] ).
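The connection between carry bits and AND-OR paths can be checked on small cases. The following is a minimal sketch, not taken from the paper: it assumes the standard generate/propagate signals $g_i = a_i \land b_i$ and $p_i = a_i \lor b_i$ and no carry-in, so the path has length $2k - 1$ here. The function names `and_or_path` and `carry_out_as_path` are hypothetical.

```python
from itertools import product

def and_or_path(inputs, start_with_and):
    """Evaluate t0 op (t1 op' (t2 op ...)) with alternating AND/OR gates."""
    value = inputs[-1]
    # Walk from the innermost input outwards. The gate at position i is an
    # AND gate iff (i is even) agrees with start_with_and.
    for i in range(len(inputs) - 2, -1, -1):
        is_and = (i % 2 == 0) == start_with_and
        value = (inputs[i] and value) if is_and else (inputs[i] or value)
    return bool(value)

def carry_out_as_path(a_bits, b_bits):
    """Carry-out of a k-bit addition without carry-in; bits are LSB first."""
    k = len(a_bits)
    gen = [a & b for a, b in zip(a_bits, b_bits)]   # generate signals (assumed)
    prop = [a | b for a, b in zip(a_bits, b_bits)]  # propagate signals (assumed)
    # Path g_{k-1} OR (p_{k-1} AND (g_{k-2} OR (... g_0))).
    path = []
    for i in range(k - 1, 0, -1):
        path += [gen[i], prop[i]]
    path.append(gen[0])
    return and_or_path(path, start_with_and=False)
```

Comparing `carry_out_as_path` with the overflow bit of ordinary integer addition for all operands of a few small widths is a quick way to convince oneself of the equivalence.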
Another application of AND-OR path optimization is the comparison of binary numbers since lexicographic comparison can be expressed as an AND-OR path (see, e.g., Grinchuk [4] ).
Previous Work
For AND-OR paths with uniform arrival times (i.e., only depth is considered), Grinchuk [4] proposed an algorithm computing a Boolean circuit with depth $\log_2 m + \log_2 \log_2 m + O(1)$. This is close to the best known lower bounds on depth: Khrapchenko [7] showed that any circuit for an AND-OR path has a depth of at least $\log_2 m + 0.15 \log_2 \log_2 \log_2 m + \Theta(1)$. The result is based on a lower bound of $\Theta\left( \frac{m \log_2 m \log_2 \log_2 m}{\log_2 \log_2 \log_2 \log_2 m} \right)$ on the product of size and depth of a Boolean formula for an AND-OR path (see [2]). For monotone circuits, i.e., circuits without negations (and the circuit built in [4] is a monotone circuit), this lower bound can be improved to $\Theta(m \log_2^2 m)$ (see [1]). This directly implies a lower bound of $\log_2 m + \log_2 \log_2 m + \Theta(1)$ on the depth of a monotone circuit for an AND-OR path.
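The last implication can be spelled out as follows. This is a sketch of the standard argument, with $c > 0$ denoting an assumed constant hidden in the $\Theta(m \log_2^2 m)$ bound: a monotone circuit of depth $d$ over 2-input gates unfolds into a monotone formula of the same depth with fewer than $2^d$ gates.

```latex
\begin{align*}
  c \, m \log_2^2 m \;\le\; \text{size} \cdot \text{depth} \;&\le\; 2^d \cdot d \\
  \Longrightarrow\quad d + \log_2 d \;&\ge\; \log_2 m + 2 \log_2 \log_2 m + \log_2 c .
\end{align*}
% If d >= 2 log_2 m, the claimed bound holds trivially.
% Otherwise log_2 d < 1 + log_2 log_2 m, and therefore
\begin{align*}
  d \;>\; \log_2 m + \log_2 \log_2 m + \log_2 c - 1
    \;=\; \log_2 m + \log_2 \log_2 m + \Theta(1) .
\end{align*}
```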
For AND-OR paths with non-uniform arrival times $a(t_0), \dots, a(t_{m-1})$, the value $\log_2 W$ is a lower bound on the achievable delay, where $W := \sum_{i=0}^{m-1} 2^{a(t_i)}$ (see, e.g., [9]). No stronger lower bounds on delay are known. Rautenbach et al. [9] presented an algorithm computing a Boolean circuit for an AND-OR path with delay at most $1.441 \log_2 W + 3$. This delay bound was improved to $1.441 \log_2 W + 2.674$ by Held and Spirkl [5]. Both of these circuits use so-called 2-input prefix gates, and it can be shown that any AND-OR path realization based on prefix gates has a delay of at least $\log_\varphi \left( \sum_{i=0}^{m-1} \varphi^{a(t_i)} \right)$, where $\varphi = \frac{1+\sqrt{5}}{2} \approx 1.618$ is the golden ratio (see [5]). In particular, this implies that any prefix-based AND-OR path realization has a depth of at least $1.44 \log_2 m - 1$. Without using prefix gates, Rautenbach et al. [8] presented a circuit for an AND-OR path with delay at most $(1+\varepsilon) \log_2 W + c_\varepsilon$ (for any $\varepsilon > 0$), where $c_\varepsilon$ is a number depending on $\varepsilon$ only. The analysis by Spirkl [11] specified the delay bound to $(1+\varepsilon) \log_2 W + \frac{6}{\varepsilon} + 8 + 5\varepsilon$ and improved it to $(1+\varepsilon) \log_2 W + \frac{3}{\varepsilon} + 5$. Moreover, Spirkl [11] described a circuit with a delay of at most $\log_2 W + 2\sqrt{2 \log_2 m} + 6$. Up to now, this was the fastest circuit for AND-OR paths with non-uniform arrival times. Table 1 summarizes these results in comparison with our delay bound. We also state size (i.e., number of gates) and maximum fanout of the constructed circuits. Note that some methods trade off size against fanout and provide two different circuits.
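To get a feeling for the two lower bounds mentioned above, they can be evaluated numerically. The sketch below is illustrative only; the function names are hypothetical, and the formulas are the ones stated in the text, i.e. $\log_2 \sum_i 2^{a(t_i)}$ for arbitrary circuits and $\log_\varphi \sum_i \varphi^{a(t_i)}$ for prefix-based circuits.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio

def weight(arrival_times):
    """W = sum of 2^a(t_i) over all inputs."""
    return sum(2 ** a for a in arrival_times)

def delay_lower_bound(arrival_times):
    """log2 W, a lower bound on the delay of any circuit."""
    return math.log2(weight(arrival_times))

def prefix_lower_bound(arrival_times):
    """log_phi of the sum of phi^a(t_i), valid for prefix-based realizations."""
    return math.log(sum(PHI ** a for a in arrival_times), PHI)
```

For uniform arrival times 0 and $m = 8$, the first bound is 3, while the prefix bound is roughly $1.44 \cdot 3$, which reflects the factor 1.44 in the depth bound for prefix-based circuits.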
Our Contribution
In this paper, we present an algorithm with running time $O(m^2 \log_2 m)$ that computes a monotone Boolean circuit for AND-OR paths (with $m \geq 3$) with a delay of at most $\log_2 W + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 5$, size $O(m \log_2 m \log_2 \log_2 m)$ and maximum fanout $\log_2 m + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 4.1$. For each $m \geq 3$, this yields a better delay bound than the previously best known bound of $\log_2 W + 2\sqrt{2 \log_2 m} + 6$ by Spirkl [11]. The construction of the circuit is based on a recursive approach similar to the algorithm of Grinchuk [4] for uniform arrival times.

The rest of the paper is organized as follows. In Section 2, we introduce basic definitions and results. We give a formal description of the problem (Subsection 2.1), define splitting steps which allow us to partition an instance into smaller sub-instances (Subsection 2.2) and introduce a measure for deciding which instances admit an AND-OR path realization with a given delay (Subsection 2.3). Section 3 classifies these instances, which is the major step of the paper. In Section 4, we deduce how to construct circuits realizing AND-OR paths with a delay of at most $\log_2 W + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 5$ and analyze the size and fanout as well as the runtime needed to compute such circuits.
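The improvement over the previous bound can be illustrated numerically. The following sketch compares only the additive terms on top of $\log_2 W$, ignores any rounding, and assumes that the earlier bound reads $2\sqrt{2 \log_2 m} + 6$; the function names are hypothetical.

```python
import math

def new_delay_overhead(m):
    """log2 log2 m + log2 log2 log2 m + 5, the term added to log2 W."""
    if m < 3:
        # For m = 2, log2 log2 m = 0 and the triple logarithm is undefined.
        raise ValueError("the bound is stated for m >= 3")
    loglog = math.log2(math.log2(m))
    return loglog + math.log2(loglog) + 5

def previous_delay_overhead(m):
    """2 * sqrt(2 log2 m) + 6, the assumed form of the earlier bound."""
    return 2 * math.sqrt(2 * math.log2(m)) + 6
```

For example, for $m = 4$ the two terms evaluate to 6 and 10, and the gap grows with $m$ since the square root dominates the iterated logarithms.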
Preliminaries
Problem Formulation
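We denote the set of natural numbers including zero by $\mathbb{N}$. Our notation regarding Boolean functions and circuits is based on Savage [10]. Given $r \in \mathbb{N}$ and a Boolean function $h : \{0,1\}^r \to \{0,1\}$ with Boolean input variables (shorter, inputs) $x_0, \dots, x_{r-1}$, we write $x = (x_0, \dots, x_{r-1})$ as a shorthand for all inputs with fixed ordering. If $r = 0$, we write $x = ()$.

Definition 1. Let Boolean input variables $t = (t_0, \dots, t_{m-1})$ for some $m \in \mathbb{N}$ with $m > 0$ be given. We call the recursively defined functions

$$g(t) := \begin{cases} t_0 & \text{if } m = 1, \\ t_0 \land g^*((t_1, \dots, t_{m-1})) & \text{if } m > 1 \end{cases} \qquad \text{and} \qquad g^*(t) := \begin{cases} t_0 & \text{if } m = 1, \\ t_0 \lor g((t_1, \dots, t_{m-1})) & \text{if } m > 1 \end{cases}$$

AND-OR paths.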
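The mutual recursion behind AND-OR paths can be written down directly. This is a minimal sketch that assumes the usual reading, namely that $g$ starts with an AND gate, $g^*$ starts with an OR gate, and both reduce to the single input for $m = 1$; inputs are given as tuples of 0/1 values.

```python
def g(t):
    """AND-OR path starting with AND: t0 AND g*(t1, ..., t_{m-1})."""
    if len(t) == 1:
        return bool(t[0])
    return bool(t[0]) and g_star(t[1:])

def g_star(t):
    """AND-OR path starting with OR: t0 OR g(t1, ..., t_{m-1})."""
    if len(t) == 1:
        return bool(t[0])
    return bool(t[0]) or g(t[1:])
```

For instance, `g((1, 0, 1))` evaluates $1 \land (0 \lor 1)$, and negating all inputs and the output turns `g` into `g_star`, which is the duality used later in the paper.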
We assume that each input variable is associated with a prescribed arrival time.
Definition 2. Let r ∈ N and a Boolean function h : {0, 1} r → {0, 1} on Boolean input variables x = (x 0 , . . . , x r−1 ) with arrival times a : {x 0 , . . . , x r−1 } → N be given. Consider a circuit C computing h. For i = 0, . . . , r − 1, the delay of input x i is defined as a(x i ) + l(x i ), where l(x i ) denotes the maximum number of gates of any path in C starting at input x i . The delay of the circuit C is the maximum delay of any input.
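Definition 2 translates into a simple longest-path computation. The sketch below assumes a hypothetical circuit encoding (a dictionary mapping each gate name to its operation and its two predecessors) and propagates arrival times through the gates; the maximum propagated value equals the maximum of $a(x_i) + l(x_i)$ over all inputs.

```python
from functools import lru_cache

def circuit_delay(gates, arrival):
    """Delay of a circuit with 2-input gates.

    gates:   dict gate name -> (op, predecessor1, predecessor2)
    arrival: dict input name -> arrival time
    """
    @lru_cache(maxsize=None)
    def ready(node):
        # An input is ready at its arrival time; a gate is ready one time
        # unit after its later predecessor.
        if node in arrival:
            return arrival[node]
        _, left, right = gates[node]
        return 1 + max(ready(left), ready(right))

    return max(ready(node) for node in list(arrival) + list(gates))

# Example: the AND-OR path t0 AND (t1 OR t2).
example_gates = {"v": ("or", "t1", "t2"), "out": ("and", "t0", "v")}
```

With uniform arrival times 0 the example has delay 2, which is its depth; if `t0` arrives at time 3, the delay becomes 4, while a late `t1` is penalized by both gates on its path.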
We aim at finding circuits computing AND-OR paths with minimum delay.
Recursive Circuit Construction
We construct fast circuits realizing AND-OR paths in a recursive way. Before describing all details of the approach, we explain the idea of the induction step. Assume we want to realize the AND-OR path $g(t) = t_0 \land (t_1 \lor (t_2 \land (\dots t_{m-1}) \dots))$.
We subdivide the inputs $t_0, \dots, t_{m-1}$ into two groups $t_0, \dots, t_{2k}$ and $t_{2k+1}, \dots, t_{m-1}$. Recursively, we compute fast circuits for the AND-OR paths on each of these input sets. These two circuits can be combined into a circuit for the whole AND-OR path, as illustrated by an example with $m = 12$ and $k = 3$ in Figure 1; the general construction is described below. The output of the circuit for the subinstance $t_{2k+1}, \dots, t_{m-1}$ is combined with every second input of $t_0, \dots, t_{2k}$ using only ∨-gates. Just one additional ∧-gate (labeled "G" in the picture) is needed to obtain a circuit for the whole AND-OR path.
It is not too difficult to check that the circuits in Figure 1 (a) and (b) are logically equivalent. Note that while the left input of gate G in the example is the output of an AND-OR path, the right input of G is the output of a function that combines the AND-OR path for $t_{2k+1}, \dots, t_{m-1}$ with a multiple-input OR, which we can optimize as one entity. The occurrence of such functions in our recursion requires generalizing the concept of AND-OR paths.
Definition 3. Given $n, m \in \mathbb{N}$, $m > 0$, and inputs $s_0, \dots, s_{n-1}, t_0, \dots, t_{m-1}$ subdivided into $s = (s_0, \dots, s_{n-1})$ and $t = (t_0, \dots, t_{m-1})$, we define the extended AND-OR paths

$$f(s, t) := s_0 \land \dots \land s_{n-1} \land g(t), \qquad f^*(s, t) := s_0 \lor \dots \lor s_{n-1} \lor g^*(t),$$

where $f(s, t) = g(t)$, $f^*(s, t) = g^*(t)$ in the case that $s = ()$. We call the input variables $s$ symmetric inputs and the input variables $t$ alternating inputs, respectively.
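We shall always assume that the sets of input variables contained in $s$ and $t$ are disjoint and indexed by $s_0, \dots, s_{n-1}$ and $t_0, \dots, t_{m-1}$. Note that expanding the definition yields

$$f(s, t) = s_0 \land \dots \land s_{n-1} \land t_0 \land (t_1 \lor (t_2 \land (\dots t_{m-1}) \dots)),$$
$$f^*(s, t) = s_0 \lor \dots \lor s_{n-1} \lor t_0 \lor (t_1 \land (t_2 \lor (\dots t_{m-1}) \dots)),$$

where, for $m$ odd, the innermost operation of $f(s, t)$ and $f^*(s, t)$ is ∨ and ∧, respectively, and vice versa for $m$ even. Due to the duality principle of Boolean algebra, over the basis {∧, ∨} any realization of $f(s, t)$ yields a realization of $f^*(s, t)$ and vice versa by switching all AND and OR gates. In order to compute fast realizations for $f$ and $f^*$, we will apply well-known methods that allow realizing $f$ and $f^*$ recursively. First, if $n > 0$, the definition of $f(s, t)$, see Definition 3, yields two symmetric splits:

$$f(s, t) = (s_0 \land \dots \land s_{n-1}) \land g(t), \tag{1}$$
$$f(s, t) = (s_0 \land \dots \land s_{n-1} \land t_0) \land g^*((t_1, \dots, t_{m-1})). \tag{2}$$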
In each split, we extract a subset of logically equivalent inputs. We construct a circuit realizing a binary tree on these inputs, while for g(t) and g * (t 1 , . . . , t m−1 ), we obtain a realization recursively. By combining both circuits accordingly, we obtain a circuit for f (s, t).
Definition 4. For $r \in \mathbb{N}$ and inputs $x = (x_0, \dots, x_{2r})$, we write $\widehat{x} := (x_1, x_3, x_5, \dots, x_{2r-1})$.

Secondly, we define the alternating split as

$$f(s, t) = f(s, t') \land f^*(\widehat{t'}, t''), \tag{3}$$

where $t'$ is an odd-length prefix of $t$, i.e., $t' = (t_0, t_1, \dots, t_{2k})$ for some integer $0 \le k < \frac{m-1}{2}$, and $t'' = (t_{2k+1}, \dots, t_{m-1})$. A proof of the fact that $f(s, t)$ and $f(s, t') \land f^*(\widehat{t'}, t'')$ define the same logical function can be found in Grinchuk [4], Lemma 3. Figure 1 shows an example of the alternating split (3) in the case that $s = ()$.
Note that dualizing yields analogous splits for f * .
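The alternating split can be verified by brute force for small instances. The following sketch is not part of the paper's argument; it assumes the reading of the split used above, i.e. an odd-length prefix $(t_0, \dots, t_{2k})$, whose odd-indexed entries become symmetric inputs of the dual path on the remaining suffix. All function names are hypothetical.

```python
from itertools import product

def g(t, use_and=True):
    """AND-OR path on t; the outermost gate is AND iff use_and is True."""
    if len(t) == 1:
        return bool(t[0])
    rest = g(t[1:], not use_and)
    return (bool(t[0]) and rest) if use_and else (bool(t[0]) or rest)

def f(s, t):
    """Extended AND-OR path: AND of all symmetric inputs and g(t)."""
    return all(s) and g(t, True)

def f_star(s, t):
    """Dual extended AND-OR path: OR of all symmetric inputs and g*(t)."""
    return any(s) or g(t, False)

def alternating_split(s, t, k):
    prefix, suffix = t[:2 * k + 1], t[2 * k + 1:]
    odd_entries = prefix[1::2]  # t_1, t_3, ..., t_{2k-1}
    return f(s, prefix) and f_star(odd_entries, suffix)

def split_holds(n, m):
    """Check the split for all input values and all admissible k."""
    ks = [k for k in range(m) if 2 * k + 1 < m]
    for bits in product((0, 1), repeat=n + m):
        s, t = bits[:n], bits[n:]
        for k in ks:
            if alternating_split(s, t, k) != f(s, t):
                return False
    return True
```

Note that $k = 0$ reproduces the symmetric split that extracts $t_0$, since the prefix then has no odd-indexed entries.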
Delay and Weight
We will realize extended AND-OR paths with good delay using the recursive methods defined in Subsection 2.2. For this, we need to classify inputs by their weight.
Definition 5. Given input variables s and t with arrival times a, we define d(s, t, a) to be the minimum delay of any circuit realizing f (s, t) or f * (s, t) over the basis {∧, ∨}.
Definition 6. Given inputs $x = (x_0, \dots, x_{r-1})$ and arrival times $a$, for $i = 0, \dots, r-1$, the weight of input $x_i$ is $W(x_i) := 2^{a(x_i)}$, and the weight of $x$ is $W(x) := \sum_{i=0}^{r-1} W(x_i)$. Since for given inputs $x$, the input arrival times $a$ can be derived from the input weights $W$ by taking logarithms, we shall sometimes only mention $W$ instead of $a$.
For symmetric binary functions, i.e., AND trees or OR trees, the optimum delay achievable for any realization can be derived from the weight of the inputs directly.
Remark 7.
A binary tree on inputs x = (x 0 , . . . , x r−1 ) with weight w := W (x) can be realized with delay d if and only if w ≤ 2 d , see Golumbic [3] . A delay-optimum tree can be computed in runtime O(r log 2 r) via a greedy algorithm based on Huffman coding [6] . A short proof of this can be found in Werber [12] .
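The greedy construction mentioned in Remark 7 can be sketched in a few lines. The version below is an illustration under the assumption of integral arrival times: it repeatedly combines the two inputs that are available earliest, in the style of Huffman coding, and returns the resulting delay; the function name is hypothetical.

```python
import heapq

def optimum_tree_delay(arrival_times):
    """Delay of a greedily built binary tree over the given inputs."""
    heap = list(arrival_times)
    heapq.heapify(heap)
    while len(heap) > 1:
        # Combine the two earliest signals with one 2-input gate.
        first = heapq.heappop(heap)
        second = heapq.heappop(heap)
        heapq.heappush(heap, max(first, second) + 1)
    return heap[0]
```

With a binary heap, the $r - 1$ combination steps take $O(r \log_2 r)$ time in total. For arrival times (3, 0, 0), the weight is $w = 10$, and the computed delay 4 is the smallest $d$ with $w \le 2^d$, in line with the criterion of the remark.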
Note that when t has at most 2 entries, f (s, t) is a symmetric binary AND tree. 
When s = (), we aim at giving an upper bound on d(s, t, a) similar to that in Grinchuk [4]. For the case of uniform input arrival times, Grinchuk fixes a depth bound d and the number n of symmetric inputs s, and determines how many alternating inputs t an AND-OR path may have such that f (s, t) can be realized with depth d. Similarly, given symmetric inputs s with a fixed weight w and a fixed delay bound d, we will determine for which alternating inputs t we have d(s, t, a) ≤ d. Since it is difficult to classify these t exactly, we distinguish different alternating inputs t by their weight only. Note that Definition 3 allows s to be empty, but not t. Thus, W (t)
For larger values of d and w, we will apply the recursive techniques presented in Subsection 2.2 in a specific way in order to bound v(d, w) from below in Section 3. In Section 4, this will allow us to deduce the desired upper bound on the delay of AND-OR paths.
A Lower Bound on v(d, w)
We aim at proving the following statement:
Recall that v(d, w) is well-defined for 0 ≤ w < 2 d , thus in particular for 0 ≤ w < 2 d−1 . For proving Theorem 11, we need to show that for any s with W (s) = w and any t with
, the AND-OR path f (s, t) can be realized with delay d. For this, we would like to proceed by induction on d making use of the restructuring formulas presented in Section 2.2. It turns out that the induction step is not viable (see Remark 22) unless we aim at proving a stronger statement which treats the last two elements of t in a special way.
Definition 12.
Let m ∈ N with m > 0. For inputs t = (t 0 , . . . , t m−1 ) with arrival times a, we denote by Λ t the weight of the last two (or fewer) entries of t, i.e.,
Then, there is a circuit realizing f (s, t) with delay at most d.
Note that since $\Lambda_t \ge 0$, Theorem 13 implies Theorem 11. In the rest of this section, we will prove Theorem 13. First, we observe upper bounds fulfilled in the setting of Theorem 13.

Lemma 14. Assuming the conditions of Theorem 13, the following statements hold:
Proof. Due to requirement (4), we have
d Λ t and multiplying with d yields statement (5). Inequation (6) is implied by
Statement (6) implies that a symmetric tree on s and t is realizable with delay d − 1. We shall prove the stronger statement that f (s, t) is realizable with delay d. We are ready to prove this in the special cases that t has few entries or the given delay bound d is low.
Remark 15. Consider Theorem 13 for the special case that $m \in \{1, 2\}$. Recall that in this case, $f(s, t)$ is a symmetric binary tree. By inequation (6), we know that $W(t) + w < 2^{d-1}$. Hence, by Remark 7, we can realize $f(s, t)$ with delay $d - 1$ using Huffman coding.

Theorem 17. Assume inductively that for some $d \ge 3$ and all $0 \le w < 2^{d-1}$, Theorem 13 holds. Then, for inputs $s$ and $t$ with $w := W(s)$ such that $0 \le w < 2^d$ and
we can realize f (s, t) with delay (d + 1).
As a sub-calculation for the proof of this theorem, we need the following lemma.
Lemma 18. In the situation of Theorem 17, we have
Proof. Using the bound on Λ t implied by inequation (5), we calculate
This is the only ingredient needed to prove Theorem 17 for the case that 2
Lemma 19. Theorem 17 holds for all w satisfying
Proof. The symmetric split (1) yields the realization f (s, t) = (s 0 ∧ . . . ∧ s n−1 ) ∧ g(t). Since w < 2 d , Remark 7 allows the construction of a symmetric tree on inputs s with delay d. In order to show that f ((), t) = g(t) can be realized with delay d, by induction hypothesis, it suffices to show the second inequality in
Subtracting the left-hand side from the right-hand side, we prove this via
In the case 0 ≤ w < 2 d−1 , we need two bounds on the logarithm of consecutive integers.
For d ≥ 3, we have d ≥ ln(2)(d + 1) and thus
Now we will prove Theorem 17 for the case that 0 ≤ w < 2 d−1 .
Lemma 21. Theorem 17 holds for each w satisfying 0 ≤ w < 2 d−1 .
Proof. We prove this lemma via a case distinction. Case 1: Assume that
The symmetric split (2) yields $f(s, t) = (s_0 \land \dots \land s_{n-1} \land t_0) \land f^*((), (t_1, t_2, \dots, t_{m-1}))$.
Due to inequation (6), we have
Hence, by Remark 7, we can realize s 0 ∧ . . . ∧ s n−1 ∧ t 0 as a binary tree with delay (d − 1). Thus, it suffices to check inductively that f * ((), (t 1 , t 2 , . . . , t m−1 )) = g * ((t 1 , t 2 , . . . , t m−1 )) can be realized with delay d. Note that requirement (7) and condition (10) imply
which we claim to be smaller or equal to
This can be shown by
Lem.18
This proves the lemma for the case that W (t 0 ) >
Therefore, we can consider a maximum odd-length prefix $t'$ of $t$ such that
We define $t'' := (t_{2k+1}, \dots, t_{m-1})$, where $k \in \mathbb{N}$ is such that $t' = (t_0, t_1, \dots, t_{2k})$. If $t''$ is empty, there is nothing to show since, by induction hypothesis, we can construct $f(s, t) = f(s, t')$ with a delay of $d < d + 1$ due to $w < 2^{d-1}$. Otherwise, we will realize $f(s, t)$ with delay $d + 1$ using the alternating split (3) for some prefix $t^*$ of $t$ to be determined, i.e.,

$$f(s, t) = f(s, t^*) \land f^*(\widehat{t^*}, t^{**}), \tag{12}$$

where $t^* = (t_0, t_1, \dots, t_{2l})$ for some $l \in \mathbb{N}$, $t^{**} := (t_{2l+1}, \dots, t_{m-1})$, and $\widehat{t^*}$ denotes the odd-indexed entries of $t^*$ as in Definition 4.

Case 2 (i): Assume that $t''$ consists of at most 3 elements. We set $t^* := t'$, thus $t^{**} = t''$. By induction hypothesis and due to $w < 2^{d-1}$, inequation (11) allows realizing $f(s, t')$ with delay $d$. Hence, it suffices to show that $f^*(\widehat{t'}, t'')$ can be realized with delay $d$. If $t''$ has at most 2 elements, by Remark 7, we can realize $f^*(\widehat{t'}, t'')$ as a binary tree with delay $d - 1$ since $W(\widehat{t'}) + W(t'') \le W(t)$, which is not larger than $2^{d-1}$ due to inequation (6). If $t''$ contains exactly 3 elements, we can similarly show that the realization $f^*(\widehat{t'}, t'') = (\widehat{t'} \lor t_{2k+1}) \lor (t_{2k+2} \land t_{2k+3})$ yields delay $d$.
[Figure 2: Case distinction for the choice of the prefix $t^*$ in Case 2 (ii): (a) $t^* := \tilde{t}$; (b) $t^* := t'$.]

Case 2 (ii): Assume that $t''$ contains at least 4 elements. Note that the first two elements $t_{2k+1}$ and $t_{2k+2}$ of $t''$ and the last two elements $t_{m-2}$ and $t_{m-1}$ of $t$ are distinct. Set $\tilde{t} := (t_0, \dots, t_{2k+2})$. We need to find an appropriate prefix $t^*$ of $t$ for realization (12) such that both $f(s, t^*)$ and $f^*(\widehat{t^*}, t^{**})$ can be realized with delay $d$ by induction hypothesis. We choose $t^*$ depending on the weight of $\tilde{t}$:
d Λt, we set t * := t . Note that in this case, we in particular have Figure 2 visualizes the case distinction. In either case, the weight of t * will be of the form
The upper bound on δ allows us to realize f (s, t * ) with delay d by induction hypothesis since w < 2 d−1 . It remains to show that f ( t * , t * * ) can be realized with delay d. Note that since t * does not contain any of the last two elements of t, we have
for d ≥ 2 and thus, by induction hypothesis, it suffices to see that
Due to requirement (7), we have
Since W ( t * ) ≤ W (t * ) and Λ t * * = Λ t , inequation (14) is implied if we prove
For this, we first only bound the summands in claim (15) which depend on W (t * ) or Λ t .
Based on this, the left-hand side of claim (15) can be bounded from below by
, which is required to be non-negative. Multiplying with the denominator, this is proven using the logarithm bounds stated in Remark 20 via
This proves the theorem.
Proof of Theorem 13. Lemmata 16, 19 and 21 together prove Theorem 13.
Remark 22. Now we see why we need the stronger Theorem 13 instead of Theorem 11, which is what distinguishes our proof from Grinchuk's [4] : The essential step in the proof of Lemma 21 is the choice of the prefix t * of t for realization (12) in Case 2 (ii) such that W (t * ) and W (t * * ) allow useful upper bounds. For uniform arrival times, we could choose t * := t since then assumption (11) 
This upper bound on W (t * * ) would suffice to prove that realization (12) yields the required delay.
For non-uniform arrival times, $W(t')$ can be arbitrarily small in comparison to
. Thus, we need Theorem 13 in order to bound the gap.
Constructing Fast Circuits
Based on Theorem 11, we can show that there is a circuit realizing the AND-OR path $t_0 \land (t_1 \lor (t_2 \land (\dots t_{m-1}) \dots))$ with delay at most $\log_2 W + \log_2 \log_2 W + \log_2 \log_2 \log_2 W + 5$. However, we can modify the instance to reduce the dependency on $W$. The modification is based on the observation that we can round up small arrival times to the same value without losing too much in the maximum delay. Moreover, shifting all arrival times by the same number does not change the problem. Both modifications allow us to reduce the problem to instances with a total arrival time weight of at most $2m$.

Theorem 23. Given Boolean input variables $t_0, \dots, t_{m-1}$ with $m \ge 3$ and arrival times $a$, there is a circuit realizing the AND-OR path $t_0 \land (t_1 \lor (t_2 \land (\dots t_{m-1}) \dots))$ with delay at most $\log_2 W + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 5$.
Proof. We compute new arrival timesã : {t 0 , . . . , t m−1 } → N by setting
for all $i \in \{0, \dots, m-1\}$. We define $\widetilde{W} := \sum_{i=0}^{m-1} 2^{\tilde{a}(t_i)}$ and partition the input indices into $I_1 := \{ i \in \{0, \dots, m-1\} \mid \tilde{a}(t_i) = 0 \}$ and $I_2 := \{0, \dots, m-1\} \setminus I_1$. Then, we have
Define $\tilde{d} := \log_2 m + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 4.1$.

Claim: There is a circuit $C$ realizing the AND-OR path $t_0 \land (t_1 \lor (t_2 \land (\dots t_{m-1}) \dots))$ with arrival times $\tilde{a}$ with delay at most $\tilde{d}$.
Proof of the claim: Let $M := 4600$. If $m < M$, we have $1.441 \log_2 \widetilde{W} + 2.674 \le 1.441 \log_2 (2.072 m) + 2.674 \le 1.441 \log_2 m + 4.189 \le \log_2 m + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 4.1$. Since the AND-OR path optimization method presented in [5] computes a circuit with delay at most $1.441 \log_2 \widetilde{W} + 2.674$, this proves the claim for $m < M$.
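The constant in the middle of this chain of inequalities can be checked directly. The short sketch below only verifies that step, i.e. that $1.441 \log_2(2.072 m) + 2.674$ equals $1.441 \log_2 m$ plus a constant of at most 4.189; the variable and function names are hypothetical.

```python
import math

# 1.441 * log2(2.072 * m) + 2.674 = 1.441 * log2(m) + additive_constant
additive_constant = 1.441 * math.log2(2.072) + 2.674

def held_spirkl_bound(total_weight):
    """Delay bound 1.441 * log2(W) + 2.674 of the method from [5]."""
    return 1.441 * math.log2(total_weight) + 2.674
```

Evaluating `additive_constant` gives a value slightly below 4.189, so the constant used in the proof is a valid rounding.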
Hence assume m ≥ M . For proving the claim, by Theorem 11, it is sufficient to show 2.072m ≤ 2d
Note that the mapping x → for m ≥ M , equation (17) is hence valid if
This is equivalent to
which is true for m ≥ M . This proves the claim.
Since we have $a(t_i) \le \tilde{a}(t_i) + (\log_2 W - \log_2 m - 0.1)$ for all $i \in \{0, \dots, m-1\}$, the circuit $C$ has, for the initial arrival times $a : \{t_0, \dots, t_{m-1}\} \to \mathbb{N}$, a delay of at most $\log_2 m + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 4.1 + (\log_2 W - \log_2 m - 0.1) \le \log_2 W + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 5$.
Remark 24. In the proof, we apply the method of [5] for small instances. Without this trick, we would obtain a delay bound of $\log_2 W + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 7$. Moreover, for sufficiently large values of $m$, the delay bound in the previous theorem can be improved slightly to $\log_2 W + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 4 + \varepsilon$ for any constant $\varepsilon > 0$: Note that the factor 1.72 in inequation (18) can be decreased to a value arbitrarily close to 1 if $m$ is sufficiently large. Thus, also the factor

Theorem 26. The circuit computed in Theorem 25 has size at most $18 m \log_2 m \log_2 \log_2 m$ and maximum fanout at most $\log_2 m + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 4.1$.
A proof of Theorems 25 and 26 can be found in Appendix A.
A Circuit Construction
Proof of Theorem 25. As main subroutine, we will use Algorithm 1.

Claim: Given input variables $s = (s_0, \dots, s_{n-1})$ and $t = (t_0, \dots, t_{m-1})$ with arrival times $a : \{t_0, \dots, t_{m-1}, s_0, \dots, s_{n-1}\} \to \mathbb{N}$, Algorithm 1 computes a Boolean circuit realizing $f(s, t)$ with delay at most $d$, where $d$ is the smallest natural number with $w := W(s) < 2^{d-1}$ and
The number of computation steps of Algorithm 1 is $O(m(m+n) \log_2(m+n) + m \log_2 \log_2 W)$, where $W = \sum_{i=0}^{m-1} 2^{a(t_i)} + \sum_{i=0}^{n-1} 2^{a(s_i)}$.

Proof of the claim: We apply the recursive approach described in Algorithm 1, which arises from the proof of Theorem 13: In line 2, we compute the minimum $d \in \mathbb{N}$ such that $W(t) \le$
We have d ∈ O(log 2 (W )), so d can be computed by binary search in O(log 2 log 2 (W )) steps. Note that in line 2, we have w < 2 d−1 since otherwise, we would obtain a contradiction to Λ t ≤ W (t) since
for the maximum possible fanout of C. For each input t i , we can replace the outgoing edges of t i by a delay-optimum buffer tree with maximum fanout 2 for each buffer (compare Remark 7). This increases the size by at most f − 1 and, since we can assume that m ≥ 4600, the delay by at most log 2 f ≤ log 2 log 2 m + 1. This yields the stated properties of the transformed circuit.
