Delay Optimization of Combinational Logic by And-Or Path Restructuring by Brenner, Ulrich & Hermann, Anna
Research Institute for Discrete Mathematics, University of Bonn
{brenner,hermann}@dm.uni-bonn.de
Abstract—We propose a dynamic programming algorithm that con-
structs delay-optimized circuits for alternating AND-OR paths with
prescribed input arrival times. Our algorithm fulfills best-known ap-
proximation guarantees and empirically outperforms earlier methods by
exploring a significantly larger portion of the solution space.
Our algorithm is the core of a new timing optimization framework
that replaces critical paths of arbitrary length by logically equivalent
realizations with less delay. Our framework allows revising early decisions
on the logical structure of the netlist in a late step of an industrial physical
design flow. Experiments demonstrate the effectiveness of our tool on 7nm
real-world instances.
I. INTRODUCTION
In VLSI design, logic synthesis turns the abstract logic specifi-
cation of a chip into a concrete representation in terms of gates.
This happens very early in the design process, and for the following
steps, the logical description typically remains fixed. However, during
physical design, it may turn out that the chosen implementation of
the logic functionality was not the best choice, e.g., with respect to
placement or timing. Now it would be desirable to find a better suited
logically equivalent representation.
We propose an algorithm that improves timing by logic restructur-
ing of critical combinational paths. Optimizing a path boils down
to optimizing an AND-OR path, i.e., a Boolean function of type
$t_0 \land (t_1 \lor (t_2 \land (t_3 \lor (t_4 \land (\dots t_{m-1}) \dots ))))$, see [24].
Besides, AND-OR paths have an important application in the
construction of adder circuits. The carry bit computation in an adder
(which is the critical part) is equivalent to the evaluation of an AND-
OR path. The tasks of AND-OR path and adder optimization are
actually equivalent concerning timing if circuit size is disregarded.
Many efficient adder circuits (e.g., [4], [13], [14]) have been
proposed in the previous decades and could hence be used for
optimizing AND-OR paths. In terms of depth, the best approximation
guarantee for AND-OR path circuits has been proven by [8]. However,
these approaches optimize circuit depth, yielding fast circuits only
if all input signals arrive simultaneously. In our setting on the most
timing-critical path, this will rarely be the case. Instead, we minimize
circuit delay, a generalization of circuit depth that takes into account
individual prescribed input arrival times.
Some algorithms for adder optimization regard input arrival times,
but most lack provable guarantees: For adders with general arrival
times, there are a greedy heuristic [25] and a dynamic program [16],
but for both, no approximation ratio can be shown. In [21], the delay
of adders is evaluated regarding arrival times computed after physical
design, but the optimization goal is depth and not delay.
Algorithms for AND-OR path optimization with input arrival times
that achieve provably good approximation ratios are presented in [3],
[20] and [10]. We will explain their ideas in Section II-B. The method
of [20] is used in [24] to optimize general logic paths.
Our goal is to restructure critical paths of any length with provably
good approximation guarantees. In contrast, many other approaches
synthesize whole netlists and thus arbitrary Boolean functions. Since
finding a logically equivalent implementation of a given circuit with,
say, minimum depth is NP-hard in general, these approaches only
replace sub-circuits of constant size by alternative realizations (see,
e.g., [1], [5], [17], [19], [22]). Here, the new solution is logically
correct by construction, but an extension to larger sub-circuits is
hardly possible.
Our main contributions are:
• We propose a new dynamic program for delay optimization of
AND-OR paths. In fact, the algorithm solves a more general
problem, the optimization of so-called extended AND-OR paths.
We describe how decisions on the structure of sub-solutions
can be postponed until these sub-solutions are combined. This
reduces rounding effects that are inherent in previous algorithms.
• Our algorithm fulfills best known theoretical delay guarantees as
it is a common generalization of all previously best approaches
[3], [10], [20]. Moreover, we demonstrate in experiments that
we improve delay significantly compared to those.
• We compute lower bounds on the best possible delay of AND-
OR paths. On 89% of our test instances, the result of our algo-
rithm matches the lower bound and is thus provably optimum.
• We propose a framework for timing optimization of combina-
tional paths of arbitrary length based on [24] with our AND-OR
path restructuring algorithm as a core routine. The generic delay
model used in our core algorithm allows incorporating physical
locations. Our framework contains several classical timing-
optimization tools and – in contrast to the simple mapping used
in [24] – an evolved technology-mapping method [6].
• Experiments on recent industrial 7nm chips show the efficiency
and effectiveness of our framework. We improve worst slack and
total slack considerably without any impact on other metrics.
The rest of the paper is organized as follows. In Section II,
we define the AND-OR path optimization problem, survey known
approaches, and present our new approximation algorithm. Section III
describes our logic restructuring framework. Experimental results are
shown in Section IV, and Section V contains concluding remarks.
II. AND-OR PATH OPTIMIZATION
Note that in this section, we use a simplified linear delay model
with unit gate delay and zero wire delay. In Section III-A, we will
generalize this model to adapt to our application in physical design.
A. Problem formulation
For us, a circuit C is a connected acyclic digraph whose nodes
can be partitioned into two sets: inputs with no incoming edges
representing Boolean variables, and gates representing an elementary
Boolean function (mostly AND2 or OR2, i.e., AND and OR gates
with fan-in two), where only a single gate out(C) called output has
no outgoing edges. An AND-OR path on inputs t0, . . . , tm−1 is a
Boolean formula of type
$g(t_0, \dots, t_{m-1}) = t_0 \land (t_1 \lor (t_2 \land (t_3 \lor (t_4 \land (\dots t_{m-1}) \dots ))))$ or
$g^*(t_0, \dots, t_{m-1}) = t_0 \lor (t_1 \land (t_2 \lor (t_3 \land (t_4 \lor (\dots t_{m-1}) \dots ))))$.
On the left-hand side of Figure 1, a circuit for the AND-OR path
g(t0, t1, t2, t3, t4) is shown.
[Fig. 1: Two circuits computing the function g(t0, . . . , t4) = t0 ∧ (t1 ∨ (t2 ∧ (t3 ∨ t4))) with input and gate arrival times. In both circuits, the inputs t0, . . . , t4 have arrival times 2, 2, 3, 1, 3. The gate arrival times are 4, 5, 6, 7 in the left circuit and 4, 3, 5, 4, 6 in the right circuit.]
arXiv:2009.08844v1 [cs.DS] 18 Sep 2020
Given individual arrival times a(ti) ∈ R
for each input signal ti, i = 0, . . . ,m − 1, we ask for a Boolean
circuit computing g(t0, . . . , tm−1) that consists of AND2 and OR2
gates only and is timing-wise best possible in the following sense: We
assume that traversing a gate takes 1 time unit, so the gate arrival
time is the maximum of its predecessors’ arrival times plus 1. By
scanning a circuit C from the inputs to the output, we can compute
arrival times at all gates. The delay of a circuit is defined as the arrival
time at out(C). Summarizing, we study the following problem:
AND-OR PATH OPTIMIZATION
Instance: m ∈ N, Boolean input variables t = (t0, . . . , tm−1),
arrival times a(t0), . . . , a(tm−1) ∈ R.
Task: Compute a circuit C using only AND2 and OR2 gates
realizing g(t) or g∗(t) with minimum possible delay.
Figure 1 shows how gate arrival times are computed in two circuits
that both realize the AND-OR path g(t0, t1, t2, t3, t4). The circuits
have a delay of 7 and 6, respectively. Note that in the special case
when all input arrival times are 0, circuit delay is exactly circuit
depth, i.e., the length of a longest directed path.
Given an instance consisting of inputs t0, . . . , tm−1 with arrival
times a(t0), . . . , a(tm−1), we define the weight $W := \sum_{i=0}^{m-1} 2^{a(t_i)}$.
It is not too difficult to see that $\lceil \log_2 W \rceil$ is a lower bound for the
delay of any binary circuit computing an AND-OR path for inputs
t0, . . . , tm−1 with arrival times a(t0), . . . , a(tm−1) ∈ N (this boils
down to Kraft’s inequality [15]; see [20] for a concise proof).
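As a small illustration of this bound, the following Python sketch computes W and the ceiling of its base-2 logarithm with exact integer arithmetic. The function names are our own, and the sketch assumes non-negative integral arrival times and at least one input.

```python
def weight(arrival_times):
    """W = sum of 2**a(t_i) over all inputs (non-negative integers assumed)."""
    return sum(2 ** a for a in arrival_times)

def delay_lower_bound(arrival_times):
    """ceil(log2(W)), computed without floating point."""
    w = weight(arrival_times)
    return (w - 1).bit_length()    # equals ceil(log2(w)) for integers w >= 1

times = [2, 2, 3, 1, 3]            # arrival times from Figure 1
print(weight(times))               # prints: 26
print(delay_lower_bound(times))    # prints: 5
```

For the instance of Figure 1 the bound is 5, one less than the delay of 6 achieved by the right-hand circuit.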
Optimizing g(t) and g∗(t) is equivalently hard: By the duality
principle of Boolean algebra, any circuit for g(t) consisting of AND
and OR gates can be transformed into a circuit for g∗(t) with the
same delay by exchanging AND and OR gates and vice versa.
B. Previous Algorithms
A common approach for AND-OR path optimization is the appli-
cation of recursion formulas that allow reducing the problem to the
construction of circuits for AND-OR paths with fewer inputs.
The algorithm by Rautenbach et al. [20] is based on the following
equation (for λ ∈ N with 2λ < m − 2):
$g(t_0, \dots, t_{m-1}) = g(t_0, \dots, t_{2\lambda-1}) \lor \bigl(t_0 \land t_2 \land t_4 \land \dots \land t_{2\lambda-2} \land g(t_{2\lambda}, \dots, t_{m-1})\bigr)$ (1)
To see the correctness of (1), check that g(t0, . . . , tm−1) is true
exactly in the following two cases:
• g(t0, . . . , t2λ−1) is true (then the other inputs do not matter)
• g(t2λ, . . . , tm−1) is true and the value “true” is propagated to
the output because the inputs t0, t2, t4, . . . , t2λ−2 are all true
See [20] for a detailed proof. Using formula (1), an AND-OR path
circuit on inputs t0, . . . , tm−1 can be constructed by combining AND-
OR path circuits on inputs t0, . . . , t2λ−1 and on inputs t2λ, . . . , tm−1
and a circuit for a multi-input AND on the inputs t0, t2, . . . , t2λ−2.
Using (1) in a dynamic program with running time $O(m^3)$, the
authors of [20] construct AND-OR path circuits with delay at most
$1.441 \log_2(W) + 3$. Held and Spirkl [10] obtain a slightly better delay
bound of $1.441 \log_2(W) + 2.673$ using the dual of the following
equation (for λ with 2λ < m − 1):
$g(t_0, \dots, t_{m-1}) = g(t_0, \dots, t_{2\lambda}) \land \bigl((t_1 \lor t_3 \lor \dots \lor t_{2\lambda+1}) \lor g(t_{2\lambda+2}, \dots, t_{m-1})\bigr)$ (2)
Their algorithm runs in time $O(m \log_2^2 m)$ as they explicitly choose
λ in each recursion step. The proof of (2) is analogous to the proof
of (1), but here one should check in which cases the two formulas
are false. We will use (2) in a slightly different equivalent form (note
that $t_{2\lambda+1} \lor g(t_{2\lambda+2}, \dots, t_{m-1}) = g^*(t_{2\lambda+1}, \dots, t_{m-1})$):
$g(t_0, \dots, t_{m-1}) = g(t_0, \dots, t_{2\lambda}) \land \bigl((t_1 \lor t_3 \lor \dots \lor t_{2\lambda-1}) \lor g^*(t_{2\lambda+1}, \dots, t_{m-1})\bigr)$ (3)
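As a worked instance (our own illustration, not taken from the paper), take m = 5 and λ = 1, which satisfies both 2λ < m − 2 and 2λ < m − 1. Splits (1) and (3) then read as follows, and both right-hand sides reduce to t0 ∧ (t1 ∨ (t2 ∧ (t3 ∨ t4))) by distributivity:

```latex
\begin{align*}
g(t_0,\dots,t_4) &= g(t_0,t_1) \lor \bigl(t_0 \land g(t_2,t_3,t_4)\bigr)
  = (t_0 \land t_1) \lor \bigl(t_0 \land t_2 \land (t_3 \lor t_4)\bigr), \\
g(t_0,\dots,t_4) &= g(t_0,t_1,t_2) \land \bigl(t_1 \lor g^*(t_3,t_4)\bigr)
  = \bigl(t_0 \land (t_1 \lor t_2)\bigr) \land (t_1 \lor t_3 \lor t_4).
\end{align*}
```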
As (1) and (3) contain functions combining a multi-input AND or
OR with an AND-OR path, we define for t = (t0, . . . , tm−1) and
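0 ≤ i ≤ j ≤ k < m with j − i even the extended AND-OR paths
$\varphi_{i,j,k} := t_i \land t_{i+2} \land \dots \land t_{j-4} \land t_{j-2} \land g(t_j, \dots, t_k)$ and
$\varphi^*_{i,j,k} := t_i \lor t_{i+2} \lor \dots \lor t_{j-4} \lor t_{j-2} \lor g^*(t_j, \dots, t_k)$.
The extended AND-OR path φ0,4,12 is depicted in Figure 2(a). From
the splits (1) and (3), using extended AND-OR paths as a more
flexible replacement for sub-functions, we deduce the splits
$\varphi_{0,0,m-1} = \varphi_{0,0,2\lambda-1} \lor \varphi_{0,2\lambda,m-1}$ for $1 \le \lambda \le \frac{m-1}{2}$, (4)
$\varphi_{0,0,m-1} = \varphi_{0,0,2\lambda} \land \varphi^*_{1,2\lambda+1,m-1}$ for $0 \le \lambda \le \frac{m-2}{2}$, (5)
that can be generalized to extended AND-OR paths as in
$\varphi_{i,j,k} = \varphi_{i,j,j+2\lambda-1} \lor \varphi_{i,j+2\lambda,k}$ for $1 \le \lambda \le \frac{k-j}{2}$, (6)
$\varphi_{i,j,k} = \varphi_{i,j,j+2\lambda} \land \varphi^*_{j+1,j+2\lambda+1,k}$ for $0 \le \lambda \le \frac{k-j-1}{2}$. (7)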
Note that in (6) and (7), the functions on the right-hand side depend
on fewer inputs than φi,j,k. Figure 1 shows an example for split (5)
with λ = 1, and Figures 2(a) and 2(b) for split (7) with λ = 2.
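The splits (6) and (7) can be checked by exhaustive enumeration on small instances. The sketch below is our own illustration: the helper names `g`, `phi` and `check_splits` are assumptions, and the code tests every truth assignment for a fixed small m rather than proving the identities.

```python
from itertools import product

def g(t, dual=False):
    """AND-OR path on the tuple t; dual=True gives g* (outermost gate is OR)."""
    if len(t) == 1:
        return t[0]
    rest = g(t[1:], not dual)
    return (t[0] or rest) if dual else (t[0] and rest)

def phi(t, i, j, k, dual=False):
    """Extended AND-OR path phi_{i,j,k} (or its dual phi*_{i,j,k})."""
    prefix = [t[p] for p in range(i, j, 2)]          # t_i, t_{i+2}, ..., t_{j-2}
    core = g(t[j:k + 1], dual)
    return (any(prefix) or core) if dual else (all(prefix) and core)

def check_splits(m):
    """True if splits (6) and (7) hold for all valid i, j, k, lambda on m inputs."""
    for t in product([False, True], repeat=m):
        for j in range(m):
            for i in range(j % 2, j + 1, 2):         # i <= j with j - i even
                for k in range(j, m):
                    f = phi(t, i, j, k)
                    for lam in range(1, (k - j) // 2 + 1):        # split (6)
                        rhs = phi(t, i, j, j + 2 * lam - 1) or phi(t, i, j + 2 * lam, k)
                        if f != rhs:
                            return False
                    for lam in range((k - j - 1) // 2 + 1):       # split (7)
                        rhs = (phi(t, i, j, j + 2 * lam)
                               and phi(t, j + 1, j + 2 * lam + 1, k, dual=True))
                        if f != rhs:
                            return False
    return True

print(check_splits(7))
```

Setting i = j = 0 and k = m − 1 in the same check covers the special cases (4) and (5).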
Using split (7) and its dual, Grinchuk [8] proves the upper bound
$\log_2 m + \log_2 \log_2 m + 3$ on the depth of AND-OR path circuits,
and Brenner and Hermann [3] give an algorithm for arbitrary integer
arrival times with running time $O(m^2 \log_2 m)$ and a delay bound of
$\log_2 W + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 4.3$. (8)
In the special case when k− j ≤ 1, φi,j,k is actually a multi-input
AND, and the function can be realized by a delay-optimum circuit
using a greedy algorithm called Huffman coding:
Theorem 1 (Golumbic [7], based on Huffman [11]). Given inputs
t0, . . . , tm−1 with arrival times a(ti), a delay-optimum circuit for
the Boolean function t0 ∧ . . . ∧ tm−1 (or t0 ∨ . . . ∨ tm−1) can be constructed
in $O(m \log_2 m)$ time. If a(ti) ∈ N for all i = 0, . . . ,m − 1,
then the delay of an optimum circuit is $\lceil \log_2 W \rceil$.
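A minimal sketch of the greedy construction behind Theorem 1, assuming the usual Huffman-style rule of repeatedly combining the two earliest signals with one fan-in-2 gate. Only the resulting delay is computed, and the function name is our own.

```python
import heapq

def huffman_delay(arrival_times):
    """Delay of a greedily built multi-input AND (or OR) using fan-in-2 gates."""
    heap = list(arrival_times)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)                 # earliest signal
        b = heapq.heappop(heap)                 # second earliest signal
        heapq.heappush(heap, max(a, b) + 1)     # one gate with unit delay
    return heap[0]

times = [2, 2, 3, 1, 3]
w = sum(2 ** a for a in times)
print(huffman_delay(times), (w - 1).bit_length())   # prints: 5 5
```

On this integral example the greedy delay coincides with the ceiling of log2(W), as the theorem states. Each of the m − 1 heap steps costs O(log m), which matches the stated running time.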
C. Our Approach
We present an algorithm for AND-OR path optimization with
prescribed input arrival times that generalizes any of the algorithms
in [3], [10], [20]. In particular, on any instance, the delay of our
solution is at least as good as the delay computed with any of the
three algorithms, and on most instances, it is better, cf. Section IV.
To simplify notations, hereafter we assume that all arrival times
are integral. Still, our implementation allows arbitrary arrival times.
Recall that in the AND-OR path optimization problem, we aim
at computing a circuit containing only fan-in-2 gates. However, in
intermediate steps, we allow a larger fan-in for the gate computing
the output of the circuit. This leads to the following definition.
[Fig. 2: A possible way to construct circuit Ci,j,k realizing φi,j,k in Algorithm 2 with i = 0, j = 4, k = 12 and split (7) with λ = 2. In this example, we use naive implementations for C1 and C2. All panels are drawn on the inputs t12, t11, . . . , t4, t2, t0. (a) A simple circuit realizing f. (b) Illustration of split (7) on f, with the gates out(C1), out(C2) and c0 marked. (c) Output of Algorithm 1 on Figure 2(b), with the gate out(Ci,j,k) marked.]
Definition 2. An undetermined circuit is a Boolean circuit C consisting
of AND and OR gates only such that all gates with the possible
exception of out(C) have fan-in two. With given input arrival times,
the weight of C is $\mathrm{weight}(C) := \sum_{i=1}^{k} 2^{d_i}$, where $d_1, \dots, d_k$ are
the arrival times at the predecessors of out(C).
In Figure 1, the weight of the left and right undetermined circuit
is $2^2 + 2^6 = 68$ and $2^5 + 2^4 = 48$, respectively. Figure 2(c) displays
an undetermined circuit with fan-in 5 at the output gate.
For undetermined circuits, we do not yet specify how we realize
the output gate by fan-in-2 gates. This allows greater flexibility when
combining several such circuits to a larger circuit. The following
lemma shows that optimizing the weight of an undetermined circuit
can be used to compute fan-in-2 circuits with small delay.
Lemma 3. Given an undetermined circuit C, we can construct a
Boolean circuit using AND2 and OR2 gates only that computes the
same Boolean function as C with delay at most $\lceil \log_2(\mathrm{weight}(C)) \rceil$.
Proof. Apply Huffman coding with the predecessors of out(C) as
inputs (see Theorem 1).
Algorithm 2 states our overall dynamic programming algorithm for
AND-OR path optimization on inputs t0, . . . , tm−1, which works as
follows: We compute a cubic-size table that contains undetermined
circuits Ai,j,k and Oi,j,k realizing the extended AND-OR path φi,j,k
for all 0 ≤ i ≤ j ≤ k ≤ m− 1 and j− i even, where out(Ai,j,k) =
AND and out(Oi,j,k) = OR. In particular, this computes circuits for
the entire AND-OR path φ0,0,m−1 = g(t0, . . . , tm−1).
Note that when k = j or k = j + 1, the function φi,j,k is a
multiple-input AND, hence, in Line 4, an optimum solution can be
found by Huffman coding (see Theorem 1). To compute undetermined
circuits for φi,j,k with j − i even and k > j + 1, we assume that
we have already computed undetermined circuits for φ for instances
with fewer inputs. Then, in Line 6, we can enumerate all possible
choices of λ in the splits (6) and (7) to recursively compute a
circuit C for φi,j,k from pre-computed solutions (while dualizing
one sub-circuit accordingly in split (7)). Since the combination of
two undetermined circuits is not necessarily an undetermined circuit,
we apply Algorithm 1. Here, in Line 7, we fix the structure of
the undetermined sub-circuit Ci as a circuit C′i over {AND2,OR2}.
Figure 2 shows an example of split (7). In Algorithm 2, the circuit C
is stored in a candidate list C of undetermined circuits for φi,j,k. The
undetermined circuits among C with the best weight with an AND
or OR gate at the output are stored as Ai,j,k in Line 7 and Oi,j,k in
Line 8, respectively.
As final circuit for φ0,0,m−1, we choose the weight-minimum
circuit among A0,0,m−1 and O0,0,m−1 in Line 9, made a circuit
over {AND2,OR2} by Lemma 3.
Algorithm 1: Merging two undetermined circuits.
Input: Undetermined circuits C1 and C2 computing Boolean functions h1 and h2; a gate type ◦ ∈ {AND,OR}.
Output: An undetermined circuit C computing h1 ◦ h2.
1  Add a ◦ gate c0 to the union of the circuits C1 and C2.
2  for i ← 1 to 2 do
3      Let c1, . . . , ck be the predecessors of out(Ci).
4      if out(Ci) is a ◦ gate then
5          Remove out(Ci) and add edges (c1, c0), . . . , (ck, c0).
6      else
7          Use Lemma 3 to construct a circuit C′i from Ci.
8          Add an edge from out(C′i) to c0.
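The effect of Algorithm 1 on arrival times can be sketched as follows. This is a simplification of our own: an undetermined circuit is represented only by the pair (output gate type, arrival times at the predecessors of its output gate), and `huffman_delay` stands in for Lemma 3. No netlist is built.

```python
import heapq

def huffman_delay(times):
    """Lemma 3: delay after realizing the output gate with fan-in-2 gates."""
    heap = list(times)
    heapq.heapify(heap)
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, max(a, b) + 1)
    return heap[0]

def merge(c1, c2, op):
    """Algorithm 1 on (gate type, predecessor arrival times) pairs."""
    preds = []
    for gate, times in (c1, c2):
        if gate == op:
            preds.extend(times)                    # Line 5: predecessors feed c0 directly
        else:
            preds.append(huffman_delay(times))     # Lines 7-8: fix Ci first, then connect
    return (op, preds)

def weight(circuit):
    return sum(2 ** d for d in circuit[1])

c1 = ("AND", [5, 4])
c2 = ("OR", [3, 3])
merged = merge(c1, c2, "AND")
print(merged, weight(merged), huffman_delay(merged[1]))
# prints: ('AND', [5, 4, 4]) 64 6
```

In this made-up example, fixing both sub-circuits before combining them would give delay max(6, 4) + 1 = 7, whereas keeping c1 undetermined yields 6. This is the kind of rounding effect that postponing structural decisions avoids.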
Algorithm 2: AND-OR path optimization.
Input: Boolean variables t0, . . . , tm−1 with arrival times a(t0), . . . , a(tm−1) ∈ N.
Output: A Boolean circuit computing g(t0, . . . , tm−1).
1  for l ← 1 to m do
2      for 0 ≤ i ≤ j ≤ k < m, j − i even s.t. φi,j,k has l inputs do
3          if k ∈ {j, j + 1} then    // φi,j,k multi-input AND
4              Ai,j,k := circuit computed by Huffman coding.
5          else
6              C := list of undetermined circuits for φi,j,k arising from applying split (6) or (7) with any valid λ, followed by a call to Algorithm 1.
7              Ai,j,k := argmin{weight(C′) : C′ ∈ C, out(C′) = AND}.
8              Oi,j,k := argmin{weight(C′) : C′ ∈ C, out(C′) = OR}.
9  C := the circuit of smaller weight among A0,0,m−1 and O0,0,m−1.
10 return Circuit C′ resulting from applying Lemma 3 to C.
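The dynamic program can be sketched in Python if only arrival times are tracked. This is not the authors' implementation. It rests on several assumptions of ours: circuits are reduced to the arrival times at the predecessors of their output gate, multi-input ANDs in the base case are kept undetermined, every stored variant of both sub-circuits is tried in each split, and memoized recursion replaces the loop over l.

```python
import heapq
from functools import lru_cache

DUAL = {"AND": "OR", "OR": "AND"}

def huffman_delay(times):
    """Lemma 3 / Theorem 1: delay after realizing one gate by fan-in-2 gates."""
    heap = list(times)
    heapq.heapify(heap)
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, max(a, b) + 1)
    return heap[0]

def weight(preds):
    return sum(2 ** d for d in preds)

def merge(c1, c2, op):
    """Algorithm 1 on (gate type, predecessor arrival times) pairs."""
    preds = []
    for gate, times in (c1, c2):
        if gate == op:
            preds.extend(times)
        else:
            preds.append(huffman_delay(times))
    return tuple(preds)

def and_or_path_delay(a):
    """Delay found for g(t_0, ..., t_{m-1}); a holds integral arrival times, m >= 1."""
    a = tuple(a)

    @lru_cache(maxsize=None)
    def table(i, j, k):
        """Best predecessor arrival times per output gate type for phi_{i,j,k}."""
        if k <= j + 1:   # multi-input AND, kept undetermined (our assumption)
            return {"AND": tuple(a[p] for p in range(i, j, 2)) + a[j:k + 1]}
        best = {}

        def offer(op, left, right):
            for c1 in left.items():
                for c2 in right.items():
                    preds = merge(c1, c2, op)
                    if op not in best or weight(preds) < weight(best[op]):
                        best[op] = preds

        for lam in range(1, (k - j) // 2 + 1):           # split (6)
            offer("OR", table(i, j, j + 2 * lam - 1), table(i, j + 2 * lam, k))
        for lam in range((k - j - 1) // 2 + 1):          # split (7), second part dualized
            sub = table(j + 1, j + 2 * lam + 1, k)
            dual = {DUAL[gate]: preds for gate, preds in sub.items()}
            offer("AND", table(i, j, j + 2 * lam), dual)
        return best

    final = min(table(0, 0, len(a) - 1).values(), key=weight)
    return huffman_delay(final)

print(and_or_path_delay([2, 2, 3, 1, 3]))   # instance of Figure 1
```

Dualizing a stored circuit only flips its output gate type, because exchanging AND and OR gates leaves all arrival times unchanged.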
Theorem 4. Algorithm 2 computes a circuit with delay at most
$\log_2 W + \log_2 \log_2 m + \log_2 \log_2 \log_2 m + 4.3$
and can be implemented to run in time $O(m^4)$.
Proof. (SKETCH) Algorithm 2 considers, in particular, all recursion
steps from [3]. Using this, one can show that for any sub-instance
φi,j,k, the algorithm computes a solution which is at least as good as
the solution computed by the algorithm from [3] and thus also meets
the delay bound (8). The table has $O(m^3)$ entries, and each entry
considers $O(m)$ values of λ, so the running time is dominated by $O(m^4)$ calls
to Algorithm 1, which can be implemented to run in constant time if
only weights and delays are computed and only the final circuit C′
in Line 10 of Algorithm 2 is actually constructed.
[Fig. 3: Flow chart for our logic optimization framework (cf. Section III) with the path restructuring step in green.]
We conjecture that a stronger theoretical delay bound can be proven
for our algorithm.
In Section IV, we will see that in our practically applied logic op-
timization framework, the running time of Algorithm 2 is negligible.
In order to take care of the circuit size, we can modify Algorithm 2
as follows: For each sub-instance φi,j,k, we store not just one circuit
with the best delay per output gate type, but all non-dominated
circuits. Here, circuit C dominates circuit C′ if both weight and size
of C are at least as good as in C′ and if the gate types of out (C)
and out (C′) coincide. In the end, we choose C to be the smallest
among all weight-optimum circuits. This does not affect the delay of
the circuit (and Theorem 4 still holds), but often reduces its size.
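The dominance rule can be sketched as a Pareto filter over (weight, size) pairs. The function name and the example numbers are our own, and the filter is meant to be applied separately for each output gate type.

```python
def prune_dominated(candidates):
    """Keep the (weight, size) pairs not dominated by another pair (smaller is better)."""
    front = []
    for w, s in sorted(set(candidates)):       # ascending weight, then ascending size
        if not front or s < front[-1][1]:      # strictly smaller than every kept size
            front.append((w, s))
    return front

pairs = [(48, 12), (36, 14), (36, 13), (64, 11), (48, 15)]
print(prune_dominated(pairs))   # prints: [(36, 13), (48, 12), (64, 11)]
```

The first entry of the result is the smallest circuit among the weight-optimum ones.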
III. LOGIC OPTIMIZATION FRAMEWORK
We propose a timing optimization framework (cf. Figure 3) based
on Werber et al. [24] with Algorithm 2 as an essential component that
is used in production in a late pre-routing stage of an industrial physi-
cal design flow. Our framework revises the logical structure of critical
paths using placement and timing information. In Section III-A, we
adapt the delay model used in Algorithm 2 to respect placement,
buffering and gate sizing effects. As we do not fully account for
different kinds of gates or different gate sizes that might be available,
our framework involves a technology mapping step (Section III-B)
and powerful gate sizing and buffering routines (Section III-C).
We iteratively optimize the worst slack of the currently most
timing-critical combinational path until overall worst slack does not
improve significantly anymore. A single iteration works as follows:
Let P denote a most critical path. During a preoptimization step,
we first try to improve the slack of P without changing its logical
structure in order to diminish disruptions. To this end, we apply
detailed optimization to P as described in Section III-C. If a threshold
slack improvement of δmin is exceeded, we keep the changes imposed
by preoptimization and start the next iteration.
Otherwise, we discard the preoptimization’s changes and perform
the path restructuring step (central, green part of Figure 3). This step
works on internal data structures; the netlist is not changed before
detailed optimization (Section III-C). We consider the possibility to
optimize any sub-path S of P up to a maximum length of mmax.
First, we apply a normalization (Section III-A) in order to extract an
AND-OR path S′ from S on which we run Algorithm 2. Then, the
technology mapping routine from [6] (see also Section III-B) locally
modifies S to benefit from all available gate types. After having
optimized all sub-paths of P , we store all restructuring possibilities
in a list L, sorted by decreasing estimated slack gain.
For only the most promising fraction of restructuring options, we
apply the time-consuming detailed optimization (cf. Section III-C).
First, we tentatively apply detailed optimization to the topmost k
candidates in L. If the actual slack gain of the best solution exceeds
δtarget, we choose this solution; otherwise, we iteratively decrease
δtarget by a fixed value and try out the next k candidates in L
until the best actual slack gain reaches δtarget or L is empty. Afterwards, we choose the
restructuring candidate C with best actual slack gain δC for P among
all detailed-optimized solutions. This way, we usually apply detailed
optimization to only a few instances, but still find the overall best
restructuring option. If δC ≥ δmin and if no side path slack has
worsened beyond the initial slack of P , we implement this netlist
change, possibly retaining parts of P needed for side outputs. If the
change is implemented and the slack gain over the last numit iterations
exceeds a threshold δit, we start the next iteration; otherwise, we stop.
Note that this is a simplified flow description. E.g., in practice, we
optimize the second critical path or the most critical latch-to-latch
path when P cannot be further optimized.
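One possible reading of the candidate selection loop is sketched below. Everything here is hypothetical: `detailed_gain` stands for the tentative detailed optimization of one candidate, and the order of relaxing δtarget and fetching the next batch is our interpretation of the description above.

```python
def choose_candidate(candidates, detailed_gain, k, delta_target, relax_step):
    """candidates: sorted by decreasing estimated slack gain.
    detailed_gain: hypothetical callable returning the actual slack gain."""
    best, best_gain = None, float("-inf")
    pos = 0
    while pos < len(candidates):
        for cand in candidates[pos:pos + k]:    # evaluate the next k candidates
            gain = detailed_gain(cand)
            if gain > best_gain:
                best, best_gain = cand, gain
        pos += k
        if best_gain >= delta_target:           # good enough: stop early
            break
        delta_target -= relax_step              # otherwise relax the target
    return best, best_gain

gains = {"S1": 3, "S2": 5, "S3": 9, "S4": 1, "S5": 2}   # made-up actual gains
evaluated = []

def detailed_gain(cand):
    evaluated.append(cand)
    return gains[cand]

result = choose_candidate(["S1", "S2", "S3", "S4", "S5"], detailed_gain,
                          k=2, delta_target=8, relax_step=2)
print(result, evaluated)   # prints: ('S3', 9) ['S1', 'S2', 'S3', 'S4']
```

In the example, the expensive evaluation is skipped for S5 because the relaxed target is already met after the second batch.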
A. Normalization
Our AND-OR path optimization algorithm from Section II expects
as an input an alternating path of AND2 and OR2 gates with
prescribed input arrival times, and assumes that gates have a unit
delay and connections do not impose any delay. However, the most
critical path P contains arbitrary gates with varying delays, and the
physical locations of the path inputs might be far apart, inducing
considerable wire delays even after buffering. A normalization
step thus transforms P into a piece of netlist whose core part is an
AND-OR path with appropriately modified input arrival times.
As we work on the most critical path, the buffering routine applied
in Section III-C will compute delay-optimum solutions. Thus, we can
assume a linear wire delay and estimate the wire delay between two
physical positions p1 and p2 by ddist · ||p1 − p2||1 for a constant
ddist ∈ R. The traversal time through a gate is approximated by a
constant dgate ∈ R. The constants dgate and ddist are chosen based on an
analysis of typical values on the respective design. Since fan-outs
and slews on the critical path are rather low, the delay of gates with
different types and sizes still varies, but not much in comparison to
the differences in arrival times. Hence, assuming a realistic constant
gate delay suffices to determine the logical structure of the circuit.
[Fig. 4: A subpath S of the critical path P before (left) and after
normalization (right), each drawn on the inputs t0, . . . , t5. On the right, the
extracted AND-OR path S′ is colored. Critical wires are drawn in red.]
Since we work on the most timing-critical part of the design, we
place the circuit C computed by Algorithm 2 such that each path is
embedded delay-optimally, implying that each path from an input ti
to out(C) has a wire delay of ddist · ||l(ti) − l(out(C))||1, where l
indicates physical coordinates on the chip. Thus, the delay of C is
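$\max_{Q \colon t_i \rightsquigarrow \mathrm{out}(C)} \bigl\{ a(t_i) + d_{\mathrm{dist}} \cdot \|l(t_i) - l(\mathrm{out}(C))\|_1 + d_{\mathrm{gate}} \cdot |Q| \bigr\}$,
where the maximum ranges over all paths Q in C from any input ti
to out(C). Applying Algorithm 2 with modified arrival times
$a'(t_i) := \frac{1}{d_{\mathrm{gate}}} \bigl( a(t_i) + d_{\mathrm{dist}} \cdot \|l(t_i) - l(\mathrm{out}(C))\|_1 \bigr)$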
hence yields a circuit with optimum wire delay with respect
to physical locations. In fact, we choose a placement that is
netlength-optimum among all delay-optimum placements: We deter-
mine l(out(C)) based on its successors in the netlist and place each
gate at the median position of its predecessors and out(C).
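The modified arrival times can be computed as in the following sketch. The function name, the coordinate tuples and the numeric values are illustrative assumptions. Locations are (x, y) pairs and distances are measured in the ℓ1 norm.

```python
def modified_arrival_times(arrival, locations, out_location, d_gate, d_dist):
    """a'(t_i) = (a(t_i) + d_dist * ||l(t_i) - l(out(C))||_1) / d_gate."""
    ox, oy = out_location
    return [
        (a + d_dist * (abs(x - ox) + abs(y - oy))) / d_gate
        for a, (x, y) in zip(arrival, locations)
    ]

# Made-up values: two inputs, output gate at the origin.
print(modified_arrival_times(
    arrival=[10.0, 4.0],
    locations=[(0.0, 0.0), (3.0, 4.0)],
    out_location=(0.0, 0.0),
    d_gate=2.0,
    d_dist=1.0,
))   # prints: [5.0, 5.5]
```

Dividing by d_gate rescales all times so that one gate corresponds to one time unit, which is the setting assumed in Section II.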
Now, we can describe our normalization. Let x denote the most
critical input of a sub-path S of P . We represent each gate in S using
AND2 and INV gates only. This does not necessarily yield a path, but
we can recover the original critical path by following the signal flow
of x, obtaining a path S′. By applying De Morgan transformations
in reverse topological order, we ensure that S′ contains AND2 and
OR2 gates only, possibly adding inverters at the inputs of S′. We
use Huffman coding (Theorem 1) on chains of AND2 gates (or OR2
gates) in S′ to move less critical gates into S\S′, respecting physical
locations by modifying arrival times as above. This way, S′ becomes
an AND-OR path that – with input arrival times a′ – can be passed
to Algorithm 2. Figure 4 depicts the normalization on a path S (left)
containing inverters (bubbles), NOR, and OAI gates. On the right, we
show S after normalization with the AND-OR path S′ colored.
B. Technology Mapping
The purpose of our technology mapping step is to change the newly
created circuit locally to improve worst slack and the physical area
occupied by gates by making use of all gates available on the design.
We use the dynamic programming algorithm from Elbert [6] which
covers the input circuit by graphs representing the available gate
types. With respect to any fixed tradeoff of arrival time (regarding
our timing model from Section III-A, but with specific estimated
delays per gate type) and number of gates, this algorithm computes
an optimum technology mapping, but the running time grows expo-
nentially in the number l of gates with more than one successor. In our
application, l is usually very small, hence we can afford this running
time (cf. the end of Section IV). For constant l, [6] also provides
a fully polynomial-time approximation scheme. On general circuits,
computing a size-optimum technology mapping is NP-hard [12].
C. Detailed Optimization
Depending on the actual stage of the design, our detailed opti-
mization step invokes buffering, layer assignment and gate sizing
tools. When used in late physical design, we apply Held’s gate sizing
routine [9], followed by the buffering tool with an integrated layer
assignment by Bartoschek et al. [2]. After buffering, we apply gate
sizing again, in particular on newly inserted buffers. As we work on
the most critical fraction of the design, Vt assignment can be done
conveniently by using the fastest gates available.
An incremental placement legalization makes sure that the place-
ment remains legal throughout all netlist changes.
IV. EXPERIMENTAL RESULTS
In a first set of experiments, we examined the AND-OR path
optimization algorithm from Section II separately. To this end, we
created AND-OR path instances with 4 to 28 inputs and random
integral arrival times chosen uniformly from the interval [0,#inputs].
For each number of inputs, we created 1000 instances.
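A test set of this shape could be generated as in the following sketch. The fixed seed and the function name are assumptions of ours. The paper does not state how its random instances were drawn beyond the description above.

```python
import random

def random_instances(min_inputs=4, max_inputs=28, per_size=1000, seed=0):
    """Map each number of inputs m to a list of arrival-time vectors in [0, m]."""
    rng = random.Random(seed)                  # seed chosen only for reproducibility
    return {
        m: [[rng.randint(0, m) for _ in range(m)] for _ in range(per_size)]
        for m in range(min_inputs, max_inputs + 1)
    }

instances = random_instances()
print(len(instances), len(instances[4]), len(instances[28][0]))   # prints: 25 1000 28
```

`random.Random.randint` includes both endpoints, so the arrival times are uniform on the integers 0, . . . , m.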
We compared our results with the previously best methods [3],
[10], and [20]. For each instance, we ran all three algorithms
and compared the best result in terms of delay to our algorithm’s
output. Figure 5 visualizes our results. Instances are grouped by their
numbers of inputs, and colors indicate the absolute delay difference
of computed solutions. Our algorithm covers all recursion options
from [3], [10], [20], so our solutions can never be worse. In fact, on
almost all instances, the delay of our circuit is better, and already for
18 inputs, on every other instance better by 2 or more.
For each instance, we computed a lower bound on delay based
on the following ideas: First, Kraft’s inequality [15] imposes a lower
bound on the delay of any binary circuit; secondly, we enumerate
possible local gate configurations near the output of an AND-OR
path circuit C and recursively compute lower bounds for sub-circuits.
We compared our delay to the resulting lower bound. Among all
our solutions, 89% achieve the lower bound and hence are provably
delay-optimum, and only 0.012% exceed the lower bound by 2.
Figure 6 compares our realization with [10] on an example
instance. In our circuit, the splits (6)∗, (7) and (7)∗ were applied,
and the ability to optimize undetermined circuits was used twice.
This way, our delay of 22 is better than the delay found by [10], and
it is even optimum since the input with arrival time 20 has to traverse
at least 2 gates in any solution. On this instance, we need one more
gate than [10]. In general, the number of gates used by our algorithm
(with our modification for size reduction) is typically higher than in
[3], [10], [20], but mostly in the range of 20%.
In a second set of experiments, we examined our logic optimization
framework as a whole. Table I shows results on recent 7nm pre-
routing designs using the RICE delay model. The ’init’ row displays
the state of the chips as in our application in industry: a timing-driven
[Fig. 5: Delay gain of the solutions computed by Algorithm 2
compared to the best solution among [3], [10], [20] on instances
with random integral input arrival times. Horizontal axis: number of inputs (4 to 28); vertical axis: number of instances (0 to 1000); colors: delay gain (0 to 4).]
[Fig. 6: Three logically equivalent AND-OR path circuits. The circuit on the left has delay 27 and size 9, the circuit in the middle computed
by [10] delay 24 and size 11, and our circuit on the right delay 22 and size 12. In our circuit, we indicate the splits used by Algorithm 2: (7)∗ with λ = 0 (undetermined), (7) with λ = 0 (undetermined), (7) with λ = 2, and (6)∗ with λ = 1.]
Unit  Run   WS [ps]  TS [ns]  # Gates   Area     Netlength  ACE5   T [s]
i1    init  201      15.3     40 636                        85%
      LO    188      15.3     40 629    −0.02%   0.00%      86%    12
i2    init  62       52.2     62 185                        96%
      LO    58       52.3     62 187    +0.02%   +0.04%     96%    11
i3    init  109      192.9    69 049                        107%
      LO    93       189.4    69 066    +0.01%   0.00%      107%   273
i4    init  5        0.1      78 030                        99%
      LO    0        0.0      77 966    −0.06%   −0.07%     99%    59
i5    init  159      345.8    210 828                       94%
      LO    152      343.4    210 852   +0.02%   0.00%      94%    287
i6    init  34       13.0     264 744                       89%
      LO    20       8.5      264 724   0.00%    +0.01%     88%    228
i7    init  92       251.5    272 020                       96%
      LO    77       230.1    272 242   +0.03%   +0.06%     95%    525
i8    init  136      850.1    327 807                       90%
      LO    120      833.1    327 916   +0.01%   +0.02%     90%    249
TABLE I: Performance of our logic restructuring framework on 7nm
real-world instances.
placement has been computed, followed by various timing optimization
steps, among them our buffering and gate sizing subroutines.
The initial netlist cannot be improved any further by classical timing
optimization. The ’LO’ row shows results after applying our logic
optimization flow to this netlist. We see that the worst slack (WS) and the
total sum of negative slacks (TS) mostly improve significantly during
logic optimization. This does not disrupt global objectives such as area,
number of gates, netlength, and routability, which barely change. To
check routability, we use the ACE5 estimate from [23], i.e., the average
congestion of the 5% most congested resources, weighted by usage,
computed by the global router from [18].
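For readers less familiar with the two timing metrics, the following minimal sketch shows how WS and TS are commonly derived from a list of endpoint slacks. It assumes the usual sign convention (a negative slack is a timing violation) and returns signed values; whether Table I lists magnitudes is not stated here, so the sign handling is an assumption. The function names and the example slacks are hypothetical and not taken from our designs.

```python
def worst_slack(slacks):
    """Worst slack (WS): the smallest slack over all timing endpoints."""
    return min(slacks)


def total_negative_slack(slacks):
    """Total sum of negative slacks (TS): only violating endpoints count."""
    return sum(s for s in slacks if s < 0)


# Hypothetical endpoint slacks in picoseconds.
slacks_ps = [-201, -50, 10, 0]
print(worst_slack(slacks_ps))           # -201
print(total_negative_slack(slacks_ps))  # -251
```

WS captures only the single most critical endpoint, whereas TS also reflects how many endpoints are violated and by how much, which is why the two are reported side by side.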
Our program was implemented in C++, and all tests were executed
on a machine with two Intel(R) Xeon(R) CPU E5-2667 v2 processors,
using a single thread. In the last column (T), we show the total
running time of our flow, which is largely dominated by gate sizing
because it performs many expensive queries to the timing engine.
On any design, the total running time of all calls to Algorithm 2
is less than 1 second, and less than 4 seconds for the whole path
restructuring step. Per design, we consider roughly 1500 AND-OR
path restructuring instances with up to 13 inputs.
V. CONCLUSION
We presented a new approximation algorithm for delay optimization
of AND-OR paths and a logic optimization framework using this
algorithm to improve critical paths in late physical design. With respect to
a simple but realistic delay model, our algorithm fulfills the best-known
mathematical guarantees, outperforms the previously best approaches, and
is often optimum. Results on industrial 7nm designs demonstrate that
our logic optimization framework improves timing when traditional
timing optimization tools have reached their limits.
REFERENCES
[1] L. Amarù, M. Soeken, P. Vuillod, J. Luo, A. Mishchenko, P.-E. Gaillardon,
J. Olson, R. Brayton, and G. De Micheli. Enabling exact delay
synthesis. ICCAD, pages 352–359, 2017.
[2] C. Bartoschek, S. Held, D. Rautenbach, and J. Vygen. Fast buffering
for optimizing worst slack and resource consumption in repeater trees.
ISPD, pages 43–50, 2009.
[3] U. Brenner and A. Hermann. Faster carry bit computation for adder
circuits with prescribed arrival times. TALG, 15(4):45:1–45:23, 2019.
[4] R. P. Brent and H.-T. Kung. A regular layout for parallel adders. Trans.
Comput., 31(3):260–264, 1982.
[5] J. Cortadella. Timing-driven logic bi-decomposition. TCAD, 22(6):675–
685, 2003.
[6] L. Elbert. Approximationsalgorithmen im Technology Mapping.
Bachelor’s thesis, University of Bonn, 2017. German.
[7] M. C. Golumbic. Combinatorial merging. Trans. Comput., 25(11):1164–
1167, 1976.
[8] M. I. Grinchuk. Sharpening an upper bound on the adder and comparator
depths. J. Appl. Ind. Math., 3(1):61–67, 2009.
[9] S. Held. Gate sizing for large cell-based designs. DATE, pages 827–832,
2009.
[10] S. Held and S. Spirkl. Fast prefix adders for non-uniform input arrival
times. Algorithmica, 77(1):287–308, 2017.
[11] D. A. Huffman. A method for the construction of minimum-redundancy
codes. Proc. Inst. Radio Eng., 40(9):1098–1101, 1952.
[12] K. Keutzer and D. Richards. Computational complexity of logic
synthesis and optimization. IWLS, 1989.
[13] V. M. Khrapchenko. Asymptotic estimation of the addition time of a
parallel adder. Systems Theory Research, 19:105–122, 1970.
[14] P. M. Kogge and H. S. Stone. A parallel algorithm for the efficient
solution of a general class of recurrence equations. Trans. Comput.,
100(8):786–793, 1973.
[15] L. G. Kraft. A Device for Quantizing, Grouping, and Coding Amplitude-
Modulated Pulses. PhD thesis, MIT, 1949.
[16] J. Liu, S. Zhou, H. Zhu, and C.-K. Cheng. An algorithmic approach for
generic parallel adders. ICCAD, pages 734–740, 2003.
[17] A. Mishchenko, R. Brayton, S. Jang, and V. Kravets. Delay optimization
using SOP balancing. ICCAD, pages 375–382, 2011.
[18] D. Müller, K. Radke, and J. Vygen. Faster min–max resource sharing
in theory and practice. MPC, 3(1):1–35, 2011.
[19] S. M. Plaza, I. L. Markov, and V. Bertacco. Optimizing non-monotonic
interconnect using functional simulation and logic restructuring. ISPD,
pages 92–102, 2008.
[20] D. Rautenbach, C. Szegedy, and J. Werber. Delay optimization of
linear depth Boolean circuits with prescribed input arrival times. JDA,
4(4):526–537, 2006.
[21] S. Roy, M. Choudhury, R. Puri, and D. Z. Pan. Towards optimal
performance-area trade-off in adders by synthesis of parallel prefix
structures. TCAD, pages 1517–1530, 2014.
[22] L. Stok, D. Kung, D. Brand, A. D. Drumm, A. J. Sullivan, L. Reddy,
N. Hieter, D. J. Geiger, H. H. Chao, and P. J. Osler. BooleDozer:
Logic synthesis for ASICs. IBM Journal of Research and Development,
40:407–430, 1996.
[23] Y. Wei, C. Sze, N. Viswanathan, Z. Li, C. J. Alpert, L. Reddy, A. D.
Huber, G. E. Tellez, D. Keller, and S. S. Sapatnekar. Glare: Global and
local wiring aware routability evaluation. DAC, pages 768–773, 2012.
[24] J. Werber, D. Rautenbach, and C. Szegedy. Timing optimization by
restructuring long combinatorial paths. ICCAD, pages 536–543, 2007.
[25] W.-C. Yeh and C.-W. Jen. Generalized earliest-first fast addition
algorithm. Trans. Comput., pages 1233–1242, 2003.
