Timing-driven logic bi-decomposition by Cortadella, Jordi
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 22, NO. 6, JUNE 2003 675
Timing-Driven Logic Bi-Decomposition
Jordi Cortadella, Member, IEEE
Abstract—An approach for logic decomposition that produces
circuits with reduced logic depth is presented. It combines two
strategies: logic bi-decomposition of Boolean functions and
tree-height reduction of Boolean expressions. It is a technology-in-
dependent approach that enables one to find tree-like expressions
with smaller depths than the ones obtained by state-of-the-art
techniques. The approach can also be combined with technology
mapping techniques aiming at timing optimization. Experimental
results show that new points in the area/delay space can be
explored, with tangible delay improvements when compared to
existing techniques.
Index Terms—Bi-decomposition, delay optimization, logic
decomposition, tree-height reduction.
I. INTRODUCTION
DELAY optimization can be tackled at different stages of circuit synthesis, from high-level to layout. This paper focuses on technology-independent logic synthesis techniques for combinational circuits [1] that typically precede technology mapping.
Given the complexity of the problem, delay optimization is usually performed after the size of the Boolean network representing the circuit has been reduced. Numerous multilevel logic synthesis techniques exist for that purpose, based on either algebraic [2], [3] or Boolean methods [4]. Most techniques for technology-independent delay optimization aim at reducing the depth of Boolean networks by restructuring [5]–[7]. Even though the depth of a network is not an accurate estimate of circuit delay, the two are highly correlated. For this reason, network depth is a parameter frequently used in technology-independent optimization techniques.
Reducing the size of a Boolean network often implies the
extraction of common subexpressions that can be shared in
several subnetworks. As a side effect, sharing may also lead to
increasing the depth of the network. Thus, when delay is the
parameter under optimization, sharing logic is not always a
good approach for logic decomposition. Even if we disregard
the delays produced by the fanout capacitances, increasing the
degree of sharing may negatively affect the performance of
a circuit.
Fig. 1 depicts three different circuits implementing the same Boolean function. Each circuit is represented by a directed acyclic graph (DAG) of two-input gates. The bubbles on the arcs represent inverters. The depth of the circuit is calculated as the number of nodes of the longest path in the DAG (inverters are ignored). A DAG can be unfolded so that no multiple-fanout nodes exist, except for the inputs of the circuit. This yields a tree with the same depth. The number annotated to each node indicates the number of paths crossing the node, which corresponds to the number of leaves of the tree version of the DAG. A lower bound on the depth of a DAG $G$ is $\lceil \log_2 P(G) \rceil$, where $P(G)$ is the number of paths of $G$. The bound assumes that $G$ can only be transformed by rules that cannot reduce the number of nodes [8].
Manuscript received September 22, 2002; revised December 27, 2002. This work was supported in part by a grant from the Intel Corporation and by a grant from CICYT TIC2001 2476. This paper was recommended by Guest Editor L. Stok.
J. Cortadella is with the Department of Software, Universitat Politècnica de Catalunya, Barcelona 08034, Spain (e-mail: jordi.cortadella@upc.es).
Digital Object Identifier 10.1109/TCAD.2003.811447
Fig. 1. Three different circuits implementing the same Boolean function.
Even though C1 and C2 have the same number of nodes and
C2 has more levels than C1, the lower bound on their depth is
different due to their sharing degree. Given that the tree version
of C1 has 17 leaves, a lower bound on the depth of C1 is five
levels. Therefore, any restructuring of C1 that does not reduce the number of nodes will never achieve a depth smaller than five. On the other hand, circuit C2 offers more chances for optimization, since it has only 13 paths and the lower bound on its
depth is four levels. Circuit C3 depicts a possible restructuring
of C2, by applying the associative law, that reduces its depth.
This example shows that executing an aggressive area-oriented
algorithm may prevent one from obtaining the desired number
of logic levels during timing optimization.
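The bound used in this example is simple to evaluate. The following minimal Python sketch computes ⌈log2 P⌉ exactly, using integer arithmetic instead of floating point, and reproduces the bounds quoted for C1 and C2. The function name is ours and is only for illustration.

```python
def depth_lower_bound(num_paths: int) -> int:
    """Smallest d with 2**d >= num_paths, i.e. ceil(log2(num_paths)).

    A binary tree of depth d has at most 2**d leaves, so a circuit whose
    tree version has num_paths leaves needs at least this many levels.
    """
    if num_paths < 1:
        raise ValueError("a circuit has at least one path")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)) exactly.
    return (num_paths - 1).bit_length()


if __name__ == "__main__":
    print(depth_lower_bound(17))  # 5  (C1: 17 paths)
    print(depth_lower_bound(13))  # 4  (C2: 13 paths)
```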
In [9], a technique that performs logic decomposition during
technology mapping is proposed. With this approach, the inac-
curacies introduced by splitting these two phases are reduced at
the expense of a high computational cost. More details on this
technique are discussed in Section VII-B.
This paper proposes a different approach that simultaneously
combines functional decomposition [10] and delay optimiza-
tion. In particular, it combines two strategies: tree-based bi-de-
composition of Boolean functions and tree-height reduction of
Boolean expressions.
0278-0070/03$17.00 © 2003 IEEE
The approach aims at finding the minimum-depth tree for a
Boolean function. It builds the tree from root to leaves by using
bi-decomposition techniques [11], [12], and reduces the depth
by means of rewrite rules that apply the associative, commuta-
tive and distributive laws of the Boolean algebra.
The relevance of field programmable gate arrays (FPGAs) based on look-up tables (LUTs) in the last decade has fostered various efforts to find effective methods for decomposing functions [13]–[15]. Since each LUT can realize any function of up to a certain number of inputs, these methods are mostly oriented toward partitioning the support of the components. Our goal, however, is to find efficient decompositions for cell-based designs, in which the functionality of each component is relevant.
The main contributions of the technique presented in this
paper are the following.
• Bi-decomposition and depth reduction are interleaved during the global decomposition of a function (Section V). Existing approaches perform both tasks as clearly separated steps.
• A heuristic search for the application of transformations for tree-height reduction is proposed (Section IV-C).
• A new strategy for bi-decomposition based on function approximations is proposed. This technique subsumes previously existing approaches based on decompositions of binary decision diagrams (BDDs). Moreover, algebraic factorization is also used as an alternative method for bi-decomposition (Section V-A).
The paper is organized as follows. Section II gives an overview of the approach and illustrates it with an example.
Section III introduces the representation of binary DAGs
and the rewrite rules. Section IV proposes algorithms for an
efficient exploration of the transformations for tree-height
reduction. Section V presents the main algorithm for logic
decomposition. Experimental results are reported in Section VI.
Finally, Section VII discusses related work.
II. OVERVIEW
This section illustrates the main paradigm of the approach
for logic decomposition. First, some background on tree-height
reduction is presented. Next, one step of the recursive main algorithm is described with an example.
A. Tree-Height Reduction: An Example
Tree-height reduction [16] was originally proposed in the scope of optimizing compilers for the generation of code in multiprocessor systems. Fig. 2 illustrates an example. The tree in Fig. 2(a) represents a factored form for the Boolean expression
(1)
If we assume zero arrival time for all inputs and unit area and unit delay for each node, the tree is characterized by the pair (area, delay) = (7, 7).
The tree in Fig. 2(b) is the one obtained by SIS after executing the command [6]. This tree is characterized by the pair (9, 5). A more efficient implementation can be found by applying simple transformations (associative and distributive laws) to the original tree. It is shown in Fig. 2(c), with (8, 5). By applying further transformations, the tree in Fig. 2(d) can be obtained, with (9, 4). For this particular example, it would not be difficult to prove that the solutions shown in Fig. 2(a) and (d) are optimal in area and delay, respectively. The tree obtained by SIS is suboptimal, since there are other equivalent trees with the same area and shorter delay [Fig. 2(d)] or with the same delay and smaller area [Fig. 2(c)].
Fig. 2. Equivalent factored forms.
Fig. 3. Area/delay tradeoff for the trees.
Fig. 3 shows a diagram representing the space of feasible designs for expression (1). The points (7, 7), (8, 5), and (9, 4) are optimal in the sense that no other design improves one of the two metrics without worsening the other. However, the point (9, 5) obtained by the SIS command is dominated by (8, 5) and (9, 4), and is therefore suboptimal.
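The notion of optimality used for Fig. 3 is Pareto dominance over (area, delay) pairs. As a small illustration, the following Python code filters the four (area, delay) points discussed above and keeps only the non-dominated ones. It is a sketch, and the helper names are hypothetical.

```python
from typing import Iterable, List, Tuple

Point = Tuple[int, int]  # (area, delay)


def dominates(p: Point, q: Point) -> bool:
    """p dominates q if it is no worse in both coordinates and differs from q."""
    return p[0] <= q[0] and p[1] <= q[1] and p != q


def pareto_front(points: Iterable[Point]) -> List[Point]:
    """Return the non-dominated (area, delay) points, sorted by area."""
    pts = sorted(set(points))
    return [q for q in pts if not any(dominates(p, q) for p in pts)]


if __name__ == "__main__":
    # (area, delay) points of the trees in Fig. 2 / Fig. 3.
    print(pareto_front([(7, 7), (9, 5), (8, 5), (9, 4)]))  # [(7, 7), (8, 5), (9, 4)]
```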
B. Algorithm
Fig. 4 depicts an example of the approach presented in this
work. The boxes represent sums of products in matrix form
(each row is a term). The main algorithm uses recursion to de-
compose a Boolean function from root to leaves. Each call in
the recursion tree consists of the following steps.
1) The Boolean function is decomposed into two sub-
functions and a Boolean operator (bi-decomposition).
The methods for bi-decomposition are discussed in
Section V-A.
2) The two subfunctions are decomposed into a binary tree
by a fast algebraic factorization algorithm [17].
3) The binary tree is heuristically balanced by using tree-height reduction transformations. In the figure, the shaded nodes indicate the points where the distributive law is applied. The tree is further balanced by applying the associative law. The algorithms for tree-height reduction are presented in Section IV.
Fig. 4. Example of timing-driven bi-decomposition.
4) The left and right children of the tree are collapsed and
the process is recursively repeated for each child.
Step 1 uses the full power of Boolean algebra for decomposi-
tion. Steps 2 and 3 are algebraic. For this reason, Step 4 collapses
subtrees in such a way that Boolean decomposition is applied at
each node of the tree.
III. BINARY DAGS AND TREES
Single-output circuits are represented by rooted DAGs. Each internal node has two children and is labeled with a Boolean operator (AND or OR). Leaf nodes are labeled with (possibly complemented) literals. Henceforth, we assume that all DAGs are reduced. That is, they do not contain more than one instance of isomorphic sub-DAGs, where isomorphism takes the commutativity of the children into account. If a DAG is not reduced, isomorphic copies of sub-DAGs can be removed by keeping only one of them and redirecting the arcs accordingly. This transformation must be applied iteratively until no more isomorphic sub-DAGs appear. The DAGs representing circuits as described above will be called binary DAGs (BDAGs).
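Keeping BDAGs reduced is usually done by hash-consing: every node is created through a unique table, so an isomorphic sub-DAG is never built twice. The sketch below is a hypothetical Python illustration of that idea, not the package described in Section III-C. It keeps explicit AND/OR labels, and it treats a complemented literal such as "~a" as just another leaf name. The two children are sorted so that commutative variants map to the same node.

```python
from typing import Dict, Tuple, Union

Child = Union[str, int]  # str: literal such as "a" or "~a"; int: id of an internal node
Key = Tuple[str, Child, Child]


class BDagManager:
    """Hypothetical unique-table manager that keeps every BDAG reduced."""

    def __init__(self) -> None:
        self.unique: Dict[Key, int] = {}  # (op, child, child) -> node id
        self.nodes: Dict[int, Key] = {}   # node id -> (op, child, child)

    def mk(self, op: str, x: Child, y: Child) -> int:
        if op not in ("AND", "OR"):
            raise ValueError(f"unknown operator {op!r}")
        # Canonical child order (commutativity): literals first, then node ids.
        lo, hi = sorted((x, y), key=lambda c: (isinstance(c, int), c))
        key = (op, lo, hi)
        nid = self.unique.get(key)
        if nid is None:  # first time this sub-DAG is seen: create it
            nid = len(self.nodes)
            self.unique[key] = nid
            self.nodes[nid] = key
        return nid


if __name__ == "__main__":
    m = BDagManager()
    n1 = m.mk("AND", "a", "b")
    n2 = m.mk("AND", "b", "a")  # same node as n1
    n3 = m.mk("OR", n1, "c")
    n4 = m.mk("OR", "c", n2)    # same node as n3
    print(n1 == n2, n3 == n4, len(m.nodes))  # True True 2
```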
A BDAG $G$ can be unfolded and uniquely represented by a binary tree (see Fig. 1). This tree is called the tree version of $G$ and is denoted by $\mathcal{T}(G)$. Similarly, a tree can be uniquely represented by a BDAG by sharing all isomorphic subtrees. Given that BDAGs are reduced, there is a one-to-one correspondence between BDAGs and binary trees.
Given a binary tree $T$, we will refer to $T$ as either the root node or the tree itself. The following nomenclature will be used for binary trees:
$\mathrm{left}(T)$, $\mathrm{right}(T)$: left and right children
$\mathrm{CHILDREN}(T) = \{\mathrm{left}(T), \mathrm{right}(T)\}$
$\mathrm{op}(T)$: type of node: AND, OR, or literal
$L(T)$: number of leaves of the tree
$d(T)$: depth of the tree.
We can also represent trees as triples $T = \langle \mathrm{left}(T), \mathrm{op}(T), \mathrm{right}(T) \rangle$.
Note that, for binary trees, $L(T)$ is equal to the number of internal nodes of the tree plus one. The depth of a tree is defined as follows:
$$d(T) = \begin{cases} 0 & \text{if } \mathrm{op}(T) = \text{literal} \\ 1 + \max\{d(\mathrm{left}(T)),\, d(\mathrm{right}(T))\} & \text{otherwise.} \end{cases}$$
The definitions above can easily be extended to BDAGs. The number of leaves of a tree is analogous to the number of paths of a BDAG. The number of paths of a BDAG $G$, denoted by $P(G)$, is defined as follows:
$$P(G) = \begin{cases} 1 & \text{if } \mathrm{op}(G) = \text{literal} \\ P(\mathrm{left}(G)) + P(\mathrm{right}(G)) & \text{otherwise.} \end{cases}$$
Theorem 1: $P(G) = L(\mathcal{T}(G))$.
Proof: Obvious from the one-to-one correspondence between BDAGs and binary trees.
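The recursive definitions of depth and number of paths translate directly into memoized traversals. In this sketch, the small BDAG and its encoding as a Python dictionary are invented for illustration. Memoization (`functools.lru_cache`) ensures that each shared node is evaluated only once. Unfolding to the tree version duplicates shared nodes, so the number of leaves of the tree equals the number of paths of the BDAG, as stated in Theorem 1.

```python
from functools import lru_cache
from typing import Dict, Tuple, Union

Node = Union[str, int]  # str: literal leaf, int: id of an internal node

# Made-up BDAG for illustration: node 0 (a AND b) is shared by nodes 1 and 2.
DAG: Dict[int, Tuple[str, Node, Node]] = {
    0: ("AND", "a", "b"),
    1: ("OR", 0, "c"),
    2: ("AND", 0, "d"),
    3: ("OR", 1, 2),  # root
}


@lru_cache(maxsize=None)
def depth(n: Node) -> int:
    """d(n): 0 for a literal, 1 + max(d(left), d(right)) otherwise."""
    if isinstance(n, str):
        return 0
    _, left, right = DAG[n]
    return 1 + max(depth(left), depth(right))


@lru_cache(maxsize=None)
def paths(n: Node) -> int:
    """P(n): 1 for a literal, P(left) + P(right) otherwise."""
    if isinstance(n, str):
        return 1
    _, left, right = DAG[n]
    return paths(left) + paths(right)


def unfold(n: Node):
    """Tree version of the sub-DAG rooted at n (shared nodes get duplicated)."""
    if isinstance(n, str):
        return n
    op, left, right = DAG[n]
    return (op, unfold(left), unfold(right))


def leaves(t) -> int:
    """L(t) for a tree encoded as nested (op, left, right) tuples."""
    return 1 if isinstance(t, str) else leaves(t[1]) + leaves(t[2])


if __name__ == "__main__":
    root = 3
    print(depth(root), paths(root), leaves(unfold(root)))  # 3 6 6
```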
A. ACD Rewrite Rules
Trees and BDAGs can be transformed by using the commutative (C), associative (A), and distributive (D) laws of Boolean algebra (ACD-rules):
C: $x \circ y = y \circ x$
A: $(x \circ y) \circ z = x \circ (y \circ z)$
D: $x \circ (y \diamond z) = (x \circ y) \diamond (x \circ z)$
where $\circ, \diamond \in \{\text{AND}, \text{OR}\}$ and $\circ \neq \diamond$.
One of the main subproblems in this work is the exploration of different BDAG representations for Boolean expressions. This exploration is done under the assumption that a minimum-size BDAG is given (e.g., point a in Fig. 3). Iteratively applying transformations produces different solutions. These solutions trace the curve of optimal solutions with regard to the area/delay tradeoff.
To make the exploration monotonic, the D-rule is only applied from left to right, i.e., no transformations extracting common factors are used. This strategy prevents a wider exploration of BDAGs. However, it guarantees termination and works reasonably well for multilevel netlists whose size has been reduced by an iterative extraction of divisors.
A side effect of using the D-rule from left to right is that
the number of paths of the transformed graph is never reduced.
Hence, the following theorem holds.
Theorem 2 (Lower bound on depth): Let $G'$ be a BDAG obtained from $G$ by applying the ACD-rules. A lower bound for $d(G')$ is
$$d(G') \ge \lceil \log_2 P(G) \rceil. \qquad (2)$$
Proof: Given that the ACD-rules never reduce the number of paths, we have $P(G') \ge P(G)$. By Theorem 1, we also know that $P(G') = L(\mathcal{T}(G'))$. Since unfolding a BDAG does not change its depth, $d(G') = d(\mathcal{T}(G'))$. The theorem immediately follows from the fact that the depth of a binary tree with $L$ leaves cannot be smaller than $\lceil \log_2 L \rceil$.
Fig. 5. Illustration of the proof of Theorem 3.
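The monotonicity argument behind Theorem 2 can be checked on small cases. The following sketch encodes trees as nested Python tuples, which is an illustrative choice. It applies one left-to-right D-rule at the root and tries both child orders, as the C-rule allows. Since x is duplicated, the number of leaves grows from L(x)+L(y)+L(z) to 2L(x)+L(y)+L(z), so it never decreases.

```python
from typing import Optional, Tuple, Union

Tree = Union[str, Tuple[str, "Tree", "Tree"]]  # literal or (op, left, right)


def leaves(t: Tree) -> int:
    return 1 if isinstance(t, str) else leaves(t[1]) + leaves(t[2])


def distribute(t: Tree) -> Optional[Tree]:
    """One left-to-right D-rule at the root of t, or None if not applicable.

    x o (y * z)  ->  (x o y) * (x o z), where o and * are different
    operators.  Both child orders are tried (C-rule).
    """
    if isinstance(t, str):
        return None
    o1, left, right = t
    for x, n2 in ((left, right), (right, left)):
        if not isinstance(n2, str) and n2[0] != o1:
            o2, y, z = n2
            return (o2, (o1, x, y), (o1, x, z))
    return None


if __name__ == "__main__":
    t = ("AND", "a", ("OR", "b", ("AND", "c", "d")))
    u = distribute(t)
    print(u)
    print(leaves(t), leaves(u))  # 4 5: the literal "a" is duplicated
```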
B. Arrival Times
Arrival times at the primary inputs of a circuit can be taken into account by redefining the depth of a BDAG as follows:
$$d(G) = \begin{cases} a(G) & \text{if } \mathrm{op}(G) = \text{literal} \\ 1 + \max\{d(\mathrm{left}(G)),\, d(\mathrm{right}(G))\} & \text{otherwise} \end{cases}$$
where $a(G)$ is the arrival time of the primary input associated with the leaf node $G$.
The lower bound on the depth of a BDAG can now be recalculated to take arrival times at the primary inputs into account. For that, one can assume that an input with arrival time $a$ is represented as a tree with $2^a$ leaves. This tree mimics the delay of the input.
Theorem 3 (Lower bound on depth): Let $G'$ be a BDAG obtained from $G$ by applying the ACD-rules. A lower bound for $d(G')$ is
$$d(G') \ge \Big\lceil \log_2 \sum_{p \in \mathrm{paths}(G)} 2^{a(p)} \Big\rceil \qquad (3)$$
where $a(p)$ is the arrival time of the primary input associated with the leaf of path $p$ in $G$.
Proof: The proof is similar to that of Theorem 2. We first prove the result for the particular case in which all arrival times are zero except for one, which is $a$ (see Fig. 5). We also assume that $a$ is a natural number. In a tree of depth $d$, the leaf with arrival time $a$ can be placed at depth at most $d - a$. It is then easy to see that a perfectly balanced tree with depth $d$ will have at most $2^d - 2^a + 1$ leaves, which comes from the sum
$$\sum_{i=a}^{d-1} 2^i + 1$$
(the term 1 corresponds to the leaf with late arrival, as shown in Fig. 5). Let $N_0$ denote the number of leaves with arrival time zero, so that $N_0 \le 2^d - 2^a$. The expression above can then be rewritten as
$$N_0 + 2^a \le 2^d,$$
which indicates that a leaf with arrival time $a$ accounts for $2^a$ leaves with arrival time zero. This result can be extended to multiple leaves with nonzero arrival times, which leads to inequality (3). The extension to real-valued arrival times is straightforward.
Note that expression (3) reduces to expression (2) when $a(p) = 0$ for all $p$.
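Bound (3) can be evaluated exactly when arrival times are non-negative integers, as in the proof above. The sketch below makes that assumption. The function name and the input format (one arrival time per path) are ours. Each path arriving at time a contributes 2^a to the sum.

```python
from typing import Sequence


def arrival_lower_bound(arrivals: Sequence[int]) -> int:
    """Bound (3): ceil(log2(sum of 2**a(p) over all paths p)).

    `arrivals` holds one non-negative integer arrival time per path of the
    BDAG (i.e. per leaf of its tree version).  A path arriving at time a
    weighs as much as 2**a paths arriving at time 0.
    """
    if not arrivals or any(a < 0 for a in arrivals):
        raise ValueError("need at least one path and non-negative arrival times")
    total = sum(1 << a for a in arrivals)
    return (total - 1).bit_length()  # exact ceil(log2(total)) for total >= 1


if __name__ == "__main__":
    print(arrival_lower_bound([0] * 5))    # 3: same as bound (2) with 5 paths
    print(arrival_lower_bound([2, 0, 0]))  # 3: 4 + 1 + 1 = 6
    print(arrival_lower_bound([3, 0]))     # 4: 8 + 1 = 9
```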
C. BDAG Representation
The algorithms presented in this paper have been implemented on top of a data structure that represents circuits with two-input gates. This representation is very similar to that of Boolean
Expression Diagrams [18]. A common manager represents all
BDAGs and a single instance of each sub-BDAG in the man-
ager is guaranteed. Internal nodes only represent AND operators
and edges can be complemented to represent negations. For
the sake of simplicity, in this paper we will still distinguish
between AND and OR nodes when depicting circuits.1
The BDAG manager also has cache tables to speed up the re-
cursive algorithms that traverse the circuits from top to bottom
in such a way that operations on reconvergent paths are not re-
peated. The details of the cache management are not shown in
the algorithms presented in this paper.
The circuits in the manager are also organized in equivalence classes. An equivalence class is a list of circuits that are known to be equivalent. For example, assume that $a$, $b$, and $c$ are circuits in the manager and that another circuit $x = a(b + c)$ also exists. This circuit belongs to the equivalence class $E_x$. After applying the distributive law, the circuit $y = ab + ac$ can be obtained. Assume that $y$ already existed in the manager with its own equivalence class $E_y$. Since $x$ and $y$ are now known to be equivalent, both classes are merged so that $E_x = E_y = E_x \cup E_y$.
In practice, equivalence classes are implemented as chained
lists that occupy one pointer in each node. For efficiency rea-
sons, the list is ordered by depth and size of the BDAGs. An
auxiliary table keeps track of all equivalence classes in the man-
ager in such a way that knowing whether two functions are in
the same class takes constant time. Equivalent functions under
complementation are also kept in the same class.
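The following hypothetical Python sketch mimics the bookkeeping described above. A dictionary plays the role of the auxiliary table, and each class keeps its members in a list sorted by (depth, size). It substitutes Python lists for the chained lists of the actual package. It also merges the smaller class into the larger one, which is a common choice but is not stated in the paper.

```python
from typing import Dict, List, Tuple


class EquivalenceClasses:
    """Hypothetical sketch: circuit -> class id table plus sorted member lists."""

    def __init__(self) -> None:
        self.class_of: Dict[str, int] = {}                        # auxiliary table
        self.members: Dict[int, List[Tuple[int, int, str]]] = {}  # (depth, size, circuit)
        self._next_id = 0

    def add(self, circuit: str, depth: int, size: int) -> None:
        if circuit in self.class_of:
            return
        self.class_of[circuit] = self._next_id
        self.members[self._next_id] = [(depth, size, circuit)]
        self._next_id += 1

    def same_class(self, x: str, y: str) -> bool:
        return self.class_of[x] == self.class_of[y]  # dictionary lookups only

    def merge(self, x: str, y: str) -> None:
        """Record that x and y are equivalent (e.g. after applying a D-rule)."""
        cx, cy = self.class_of[x], self.class_of[y]
        if cx == cy:
            return
        if len(self.members[cx]) < len(self.members[cy]):
            cx, cy = cy, cx  # relabel the smaller class
        for entry in self.members.pop(cy):
            self.class_of[entry[2]] = cx
            self.members[cx].append(entry)
        self.members[cx].sort()  # keep the list ordered by depth, then size

    def best(self, x: str) -> str:
        """Shallowest (then smallest) circuit known to be equivalent to x."""
        return self.members[self.class_of[x]][0][2]


if __name__ == "__main__":
    ec = EquivalenceClasses()
    ec.add("a(b+c)", depth=2, size=2)
    ec.add("ab+ac", depth=2, size=3)
    ec.merge("a(b+c)", "ab+ac")
    print(ec.same_class("a(b+c)", "ab+ac"), ec.best("ab+ac"))  # True a(b+c)
```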
IV. ALGORITHMS FOR TIME OPTIMIZATION USING ACD RULES
This section presents algorithms for the exploration of
BDAGs aiming at reducing their depth. First, algorithms for
minimal depth by using only the AC-rules are presented. Next,
an algorithm incorporating the D-rule is proposed.
A. Minimal-Delay Clusters (AC-Rules)
The topmost cluster of a BDAG $G$ is the set of sub-BDAGs closest to the root whose operator differs from $\mathrm{op}(G)$. Formally, the topmost cluster of a BDAG is obtained by the algorithm CLUSTER in Fig. 6.
Given a cluster, a minimum-delay tree can be built by combining the elements of the cluster appropriately, placing the tallest subtrees closer to the root. Baer and Boven [19] proposed an algorithm to build such a tree. It is an iterative algorithm that maintains all elements of the cluster in a priority queue ordered by their depth. At each iteration, the two shortest elements are extracted, and a new tree combining them is built and inserted in the queue. The algorithm terminates when only one element is left in the queue, which is the returned tree. This simple algorithm was proven to be optimal in [20]. It is also the algorithm used in SIS for minimum-delay decomposition of AND and OR gates [6], though no proof of optimality was given.
1This distinction is also maintained in the package by properly keeping track of the complemented edges found in the paths.
Fig. 6. Algorithm for minimum-delay clusters.
Fig. 7. Application of MIN DELAY CLUSTERS.
The algorithm MIN DELAY CLUSTERS to obtain a minimum-
delay BDAG by only using the associative and commutative
laws is shown in Fig. 6. The algorithm was proposed in [20]
and was proven to minimize delay. It is a recursive algorithm
that invokes the algorithm by Baer and Boven to build minimum
delay clusters (the “ ” loop).
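Below is a compact rendition of the AC-only optimization under our own assumptions: trees as nested tuples, integer arrival times given per literal, and unit gate delay. `cluster` collects the topmost cluster of a node. `min_delay` rebuilds each cluster bottom-up with the Baer and Boven rule of repeatedly merging the two shallowest elements. It is a sketch of the idea described in the text, not a transcription of Fig. 6.

```python
import heapq
import itertools
from typing import Dict, List, Tuple, Union

Tree = Union[str, Tuple[str, "Tree", "Tree"]]  # literal or (op, left, right)


def cluster(t: Tree) -> List[Tree]:
    """Topmost cluster of t: the maximal subtrees below the root whose
    operator differs from op(t); literals always close the cluster."""
    op, out, stack = t[0], [], [t[1], t[2]]
    while stack:
        s = stack.pop()
        if not isinstance(s, str) and s[0] == op:
            stack.extend((s[1], s[2]))  # same operator: keep flattening
        else:
            out.append(s)
    return out


def min_delay(t: Tree, arrival: Dict[str, int]) -> Tuple[int, Tree]:
    """Return (depth, tree) after rebuilding every cluster bottom-up.

    Only associativity and commutativity are used, so the number of
    leaves and of gates is unchanged.  Literals missing from `arrival`
    are assumed to arrive at time 0.
    """
    if isinstance(t, str):
        return arrival.get(t, 0), t
    tie = itertools.count()  # tie-breaker so heapq never compares trees
    heap = []
    for sub in cluster(t):
        d, s = min_delay(sub, arrival)
        heapq.heappush(heap, (d, next(tie), s))
    while len(heap) > 1:  # Baer-Boven: merge the two shallowest elements
        d1, _, s1 = heapq.heappop(heap)
        d2, _, s2 = heapq.heappop(heap)
        heapq.heappush(heap, (max(d1, d2) + 1, next(tie), (t[0], s1, s2)))
    d, _, s = heap[0]
    return d, s


if __name__ == "__main__":
    chain = ("AND", "a", ("AND", "b", ("AND", "c", ("AND", "d", "e"))))
    print(min_delay(chain, {})[0])        # 3 (the chain has depth 4)
    print(min_delay(chain, {"a": 2})[0])  # 3: the late input ends up next to the root
```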
Fig. 7 depicts an example of the solution derived by the algorithm. The shaded areas correspond to the clusters visited when traversing the tree. Note that the algorithm produces another tree of the same size, since the associative and commutative laws do not change the size of the tree. This also implies that $P(G)$ remains the same. The size of $G$, however, may increase or decrease if the sharing of reconvergent paths is modified.
B. Distributive Law (D-rule)
The distributive law can only be applied to two nodes of a BDAG, $n_1$ and $n_2$, for which the following condition holds:
$$n_2 \in \mathrm{CHILDREN}(n_1) \;\wedge\; \mathrm{op}(n_2) \notin \{\mathrm{op}(n_1), \text{literal}\}.$$
The transformation is shown in Fig. 8. Writing $n_1 = x \circ n_2$ and $n_2 = y \diamond z$, it rewrites $n_1$ as $(x \circ y) \diamond (x \circ z)$. By itself, the distributive law cannot provide any performance improvement, since the depth of the resulting BDAG is never smaller than the depth of the original BDAG. It can even degrade performance if
$$d(x) > \max\{d(y), d(z)\}.$$
However, the distributive law changes the structure of the clusters and enables the application of AC-rules that can potentially result in shorter depths. The combination of D- and AC-rules is illustrated in the example of Fig. 9. After the application of a D-rule, a minimum-delay tree is obtained by running the MIN DELAY CLUSTERS algorithm (AC-rules).
Fig. 8. Distributive law.
Fig. 9. Application of ACD rules to optimize performance.
C.
The solution in Fig. 9(e) can only be obtained by applying
the D-rule to certain nodes of the tree. One can immediately see
that this solution cannot be obtained if the D-rule is applied to
the root node of Fig. 9(a). Therefore, the order in which rules
are applied is relevant for searching optimal solutions.
Fig. 10 presents an algorithm for speeding up a BDAG by using ACD-rules. It assumes that $G$ is an initial BDAG with a minimal number of nodes, e.g., obtained by area-minimization transformations on a Boolean network. The required time, in terms of number of logic levels, is another parameter. The algorithm implements a dynamic programming approach with memoization. It alternately applies the D-rule to one of the nodes and MIN DELAY CLUSTERS to the BDAG. A solution set collects all the solutions generated in the algorithm.
In order to control the explosion of solutions, a frontier with
limited width is selected at each layer of the search. The width
of the frontier is a factor that can be tuned according to the
exhaustiveness of the search. The selection of “best” solutions
is done by giving priority first to delay (depth of the BDAG) and
second to area (size of the BDAG).
Lower Bound Depth calculates a lower bound on the depth of the circuit, according to expression (3). The algorithm stops when the depth of the best circuit is not larger than the maximum of the required time and the lower bound on depth. The calculation of this bound prunes the exploration significantly. The algorithm also stops when no improvement has been observed during a few iterations. The “improvement” criterion is another tuning parameter of the algorithm.
Fig. 10. Algorithm for speeding-up.
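To make the search strategy concrete, here is a simplified beam-search sketch in the spirit of the speeding-up algorithm. It is not a transcription of Fig. 10. It works on trees rather than shared BDAGs, has no memoization or equivalence classes, and uses the zero-arrival bound (2). Its `width` and `patience` parameters are hypothetical stand-ins for the frontier width and the improvement criterion.

```python
import heapq
import itertools
from typing import Iterator, Tuple, Union

Tree = Union[str, Tuple[str, "Tree", "Tree"]]  # literal or (op, left, right)


def depth(t: Tree) -> int:
    # Unit gate delay, zero arrival time at every literal.
    return 0 if isinstance(t, str) else 1 + max(depth(t[1]), depth(t[2]))


def size(t: Tree) -> int:
    # Number of two-input gates (internal nodes).
    return 0 if isinstance(t, str) else 1 + size(t[1]) + size(t[2])


def leaves(t: Tree) -> int:
    return 1 if isinstance(t, str) else leaves(t[1]) + leaves(t[2])


def rebalance(t: Tree) -> Tree:
    """AC-rules only: rebuild every cluster as a minimum-depth tree (Baer-Boven)."""
    if isinstance(t, str):
        return t
    op, stack, parts = t[0], [t[1], t[2]], []
    while stack:  # collect the topmost cluster, rebalancing each element
        s = stack.pop()
        if not isinstance(s, str) and s[0] == op:
            stack += [s[1], s[2]]
        else:
            parts.append(rebalance(s))
    tie = itertools.count()  # tie-breaker so heapq never compares trees
    heap = [(depth(p), next(tie), p) for p in parts]
    heapq.heapify(heap)
    while len(heap) > 1:  # merge the two shallowest elements
        d1, _, a = heapq.heappop(heap)
        d2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (max(d1, d2) + 1, next(tie), (op, a, b)))
    return heap[0][2]


def distributions(t: Tree) -> Iterator[Tree]:
    """All trees obtained by one left-to-right D-rule application anywhere in t."""
    if isinstance(t, str):
        return
    o1, l, r = t
    for x, n2 in ((l, r), (r, l)):
        if not isinstance(n2, str) and n2[0] != o1:
            yield (n2[0], (o1, x, n2[1]), (o1, x, n2[2]))
    for nl in distributions(l):
        yield (o1, nl, r)
    for nr in distributions(r):
        yield (o1, l, nr)


def acd_speed(t: Tree, req_time: int, width: int = 4, patience: int = 2) -> Tree:
    """Beam search alternating D-rule applications and AC rebalancing."""
    def cost(u: Tree):
        return (depth(u), size(u), repr(u))  # delay first, then area; repr breaks ties

    bound = (leaves(t) - 1).bit_length()  # ceil(log2 #leaves), bound (2)
    best = rebalance(t)
    frontier, seen, stale = [best], {best}, 0
    while depth(best) > max(req_time, bound) and frontier and stale < patience:
        cands = {rebalance(u) for f in frontier for u in distributions(f)} - seen
        seen |= cands
        frontier = sorted(cands, key=cost)[:width]  # limited-width frontier
        if frontier and cost(frontier[0])[:2] < cost(best)[:2]:
            best, stale = frontier[0], 0
        else:
            stale += 1
    return best


if __name__ == "__main__":
    t = ("AND", "a", ("OR", "b", ("AND", "c", ("OR", "d", "e"))))
    r = acd_speed(t, req_time=0)
    print(depth(t), "->", depth(r))  # 4 -> 3
```

In this example, the AC-rules alone cannot go below four levels. One distribution at the root, a(b + ...) into ab + a(...), lets the rebalancing step reach three levels, which equals the lower bound for five leaves, so the search stops.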
V. LOGIC DECOMPOSITION
The decomposition of a Boolean function $F$ is performed recursively from root to leaves by finding an operation op and two functions, $f$ and $g$, such that $F = f \;\mathrm{op}\; g$. This type of decomposition was originally called quasi-algebraic decomposition [21] and is often referred to as bi-decomposition [11], [12], [22]. Each level of recursion defines a logic level of the function.
The main algorithm is shown in Fig. 11. To improve the quality of the search, different bi-decomposition methods can be used in the same framework. The current implementation uses two methods, hidden behind a single bi-decomposition function. One of them is based on finding algebraic factored forms, and the other on finding BDD approximations.
The recursive paradigm behind the ACD DECOMPOSE algorithm interleaves the generation of bi-decompositions with speed optimization by means of ACD SPEED. The bi-decomposition function works in two steps.
1) It finds a bi-decomposition $f \;\mathrm{op}\; g$ of the incompletely specified function defined by (ON, DC), where $f$ and $g$ are now completely specified functions. The bi-decomposition is performed by one of the methods explained in Section V-A.
2) It decomposes $f$ and $g$ into a factored form of two-input operators by using fast methods for algebraic factorization. This step attempts to find a reasonable representation of the functions and to estimate their delay.
After these two steps, the two-input netlist $G$ is optimized for speed by calling ACD SPEED($G$, ReqTime). The parameter ReqTime defines the desired required time for the function. It is measured in logic levels and is decreased each time a new recursion level is invoked.
Fig. 11. Algorithm for logic decomposition.
The main algorithm, ACD DECOMPOSE, chooses the best BDAG obtained from all bi-decompositions. This selection gives priority to speed. If the required time is met by several BDAGs, the one with the smallest area (number of nodes) is selected. At this point, the root node of the selected BDAG determines the operator for a new level of logic. The rest of the netlist is collapsed and prepared for a new level of recursion.
After the decomposition has been done for one of the children, the observability dc is calculated for the other. For example, if the topmost operation is an AND, $F = f \cdot g$, and $f$ has already been decomposed, the dc-set for $g$ is enlarged with the minterms for which $f$ is zero. Since the dc-set of each child depends on the order in which the children are decomposed, the slowest one is always decomposed last. In this way, it has more chances to get a larger dc-set.
The satisfiability and observability dcs calculated at each
node of the tree are propagated down during the recursive
decomposition.
A. Bi-Decomposition Methods
Two bi-decomposition methods are used in the current implementation of the decomposition algorithm.
The first is a factorization based on the search of kernels and
algebraic division [17]. In the current implementation, this fac-
torization is implemented by the function in SIS
[23].
Fig. 12. Conjunctive decomposition by function approximations.
The second approach uses the power of Boolean algebra and is
computationally more expensive. It is based on the calculation
of function approximations. Fig. 12 depicts an example of the
approach for the conjunctive decomposition2 of a function .
The cells with label denote the original ON-set of the function.
The cells with label 1 represent over-approximations. The aim
of the method is to calculate two functions $f$ and $g$ such that $F = f \cdot g$. A necessary condition is that $F \le f$ and $F \le g$, i.e., both conjuncts are over-approximations of $F$.
The method iteratively calculates over-approximations $f$ of $F$ and the associated conjuncts $g$ by Boolean minimization. The minimization uses the observability dc derived from $f$, i.e., the minterms where $f = 0$. The more accurate the approximation, the larger the dc available to minimize $g$. The K-maps in Fig. 12 represent a sequence of approximations starting with an initial exact approximation $f = F$.
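The validity condition and the observability dc used for g can be illustrated with truth tables on a tiny example. The paper uses BDDs; the bitmask encoding and the helper names here are ours. For F = f · g, every ON minterm of F must be covered by f, and g is a don't care wherever f = 0. A more accurate f therefore leaves more freedom for g.

```python
from typing import Callable, Optional, Tuple

N = 3  # variables a, b, c are bits 0, 1, 2 of a minterm index
FULL = (1 << (1 << N)) - 1  # bitmask with one bit per minterm


def tt(fn: Callable[..., object]) -> int:
    """Truth table of fn as a bitmask: bit m is set iff fn holds on minterm m."""
    mask = 0
    for m in range(1 << N):
        if fn(*[(m >> i) & 1 for i in range(N)]):
            mask |= 1 << m
    return mask


def g_interval(on: int, dc: int, f: int) -> Optional[Tuple[int, int]]:
    """Bounds (lower, upper) on g such that f AND g implements (on, dc).

    Returns None if f is not an over-approximation of the ON-set.
    """
    if on & ~f & FULL:
        return None  # an ON minterm with f = 0 can never be produced
    odc_g = (dc | ~f) & FULL  # g is unobservable wherever f = 0
    return on, on | odc_g


def popcount(x: int) -> int:
    return bin(x).count("1")


if __name__ == "__main__":
    on = tt(lambda a, b, c: a and (b or c))
    # A sequence of over-approximations, from exact to trivial.
    for name, f in [("f = F", on), ("f = a", tt(lambda a, b, c: a)), ("f = 1", FULL)]:
        lo, hi = g_interval(on, 0, f)
        print(name, "-> don't cares for g:", popcount(hi & ~lo))
    # f = F -> don't cares for g: 5
    # f = a -> don't cares for g: 4
    # f = 1 -> don't cares for g: 0
```

With f = a, the function g = b + c lies inside the interval, which gives F = a · (b + c). The exact approximation f = F leaves g completely free, but it only yields the trivial decomposition F · 1. This is one reason a cost function is needed to choose among the approximations in the sequence.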
The method presented in this paper uses BDDs to calculate function approximations; it is inspired by the approach presented in [24]. Fig. 13 presents an example for the same function depicted in Fig. 12. The approach consists of remapping some nodes of the BDD of $F$ so that the BDD size is reduced while the number of minterms of the new BDD does not increase too much (a dense over-approximation). In the figure, the approximation $f$ is calculated by remapping one node into the constant 1. The BDD of $f$ has two fewer nodes than that of $F$ and two more minterms. Once $f$ is known, $g$ can be calculated by BDD minimization of $F$, using the minterms where $f = 0$ as don't cares. This process is executed iteratively to generate a sequence of approximations, as in Fig. 12. A cost function based on BDD sizes is used to select one of the approximations $(f, g)$ in the sequence.
2The approach for disjunctive decomposition is similar, but using under-approximations instead.
Fig. 13. BDD-based decomposition.
The BDD-based approach used in this paper is similar to the one in [24], but it considers many more nodes as candidates for replacement (same level, children, and grandchildren).
It is important to notice that the approximation approach
subsumes the conjunctive and disjunctive bi-decompositions
proposed by other authors [22], [25] in which the BDD transfor-
mations can be reduced to remapping some nodes into constants
or other nodes of the same BDD. Only the particular heuristics
used in each approach may lead to different decomposition
results in practice.
Iterative Calculation of Observability Don’t Cares (ODC): The example in Fig. 12 shows that the ODC for $f$ can be recalculated after the minimization of $g$. This process can be repeated until a satisfactory solution is found. The following loop could be executed to improve a conjunctive decomposition $F = f \cdot g$ with a given satisfiability dc, DC:
repeat
    ODC_f := DC + ¬g;
    minimize f using ODC_f;
    ODC_g := DC + ¬f;
    minimize g using ODC_g;
until no improvement.
In practice, the experimental results have shown that the ini-
tial decomposition is rarely improved by this loop.
VI. EXPERIMENTAL RESULTS
The strategy presented in this paper has been implemented in
SIS. The results have been compared with SIS and the method
for bi-decomposition presented in [11]. The experiments have
been run on a subset of small and medium size Microelectronics
Center of North Carolina benchmarks. Table I describes the scripts used for the experiments. The suffix “ 4” indicates that the script has been run four times.3
TABLE I
SCRIPTS USED FOR THE EXPERIMENTAL RESULTS
Fig. 14. Cluster collapsing (cl collapse) of the circuit in Fig. 7.
All the benchmarks were multilevel netlists. Initially, the circuits were collapsed and converted into two-level forms. Among the SIS scripts applied afterwards, the algebraic script derived the best results. The ACD scripts implement the strategy of this paper. ACD-tree derives a tree decomposition (no sharing between isomorphic subtrees).
transforms the tree into a DAG by sharing all isomorphic sub-
trees. This is achieved by algebraic resubstitution.
Finally, also tries to share common subex-
pressions within the final clusters of the DAG. For example, if
one cluster implements the expression and an-
other implements , they will be re-expressed
as and , sharing , even
though the depth of the circuit can be increased by sharing the
common subexpressions. This is achieved by collapsing all clusters in the DAG (command cl collapse, see Fig. 14) and extracting common cube divisors.
Table II reports the results. After logic decomposition, all the
circuits have been mapped into the library , which includes
a rich set of static CMOS gates up to four-input NAND/NOR and
six-input AOI/OAI gates. The column L reports the number of levels of the circuit before technology mapping, counted as the depth of the circuit represented with two-input gates (inverters are ignored).
ACD-tree obtains a 20% delay reduction at the expense of a 35% area increase. If sharing is allowed (the other two ACD scripts), the delay reduction is more moderate (11% and 8%, respectively), but area is significantly better (only 8% and 4% increase, respectively). The delay increase of the sharing scripts with regard to ACD-tree is mainly due to two factors: 1) the capacitive load of the shared nodes and 2) the suboptimality of the tree-mapping algorithm when working on DAGs. In some circuits (e.g., 9symml and frg1), area is drastically reduced thanks to the power of Boolean bi-decomposition.
It is important to emphasize that the ACD scripts could reduce, on average, almost one level of logic with regard to the algebraic script.
3Experimentally, we found this number to be adequate to obtain good-quality
results.
Fig. 15 plots the normalized average results. It is interesting
to see that and produce suboptimal results for the
area/delay tradeoff. It is also important to observe that there is a
potential space of configurations between the points
and . This space can be explored by partially sharing
the isomorphic subtrees produced by ACD-tree (e.g., by sharing
subtrees in noncritical paths only). We believe that results with
delay similar to and area similar to
could be obtained by using DAG covering [26] and gate dupli-
cation techniques [27] during technology mapping.
The overall CPU time of the algorithms is about a factor
of two the CPU time of the script, and compa-
rable to that of the script. In , and
, most of the CPU time is spent on the com-
mand, whereas the scripts evenly distribute the effort be-
tween finding decompositions and balancing them.
The results were all obtained by collapsing the whole net-
work. This brute-force approach cannot be applied for large net-
works. In the future, we foresee combining partial collapsing
and decomposition to manage much larger examples. The two
largest examples that we decomposed were (135 inputs,
99 outputs, and 803 gates) and vda (17 inputs, 39 outputs, and
1237 gates).
Results were also obtained with the script
without collapsing the netlist, i.e., transforming the original
netlist described by the benchmark. The results were, in
general, worse than those obtained by collapsing.
A. Impact of Different Bi-decomposition Methods
The current approach for bi-decomposition combines two methods, algebraic and BDD-based approximations. What is the impact of each method on the quality of the bi-decompositions? To study this impact, the ACD-tree script was run with different bi-decomposition methods. The results are summarized in Table III.
The first two rows report the results obtained by using the algebraic bi-decomposition without and with the BDD-based bi-decomposition, respectively. The contribution of the BDD-based bi-decomposition is reflected in the reduction of the number of levels and of the delay after technology mapping.
It is important to mention that the BDD-based decomposition is chosen only in a small fraction of cases (typically between 5% and 10%, depending on the example). In many cases, the BDD-based decomposition derives the same solution as the algebraic decomposition. The most interesting aspect, however, is that the contribution of the BDD-based decomposition is more important at the topmost nodes of the tree. There, the function is still complex and offers many possibilities for nonalgebraic decompositions. It is at the topmost levels that decisions have the most tangible impact on the final solution. The decompositions close to the leaves of the tree are almost always algebraic, especially when the functions become unate.
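The per-node choice between the two methods can be illustrated with a Python sketch. Each candidate bi-decomposition f = g op h is scored by a lower bound on its depth, 1 + max(ceil(log2 |supp(g)|), ceil(log2 |supp(h)|)). Ties are resolved in favor of the algebraic candidate. The candidate format, this cost, and the tie-breaking rule are assumptions made for illustration only. They are not the selection criterion used by the ACD-tree script.

```python
def clog2(n):
    """Ceiling of log2(n) for n >= 1, using integer arithmetic."""
    return (n - 1).bit_length()

def depth_lower_bound(support_g, support_h):
    # Assuming g and h depend on every variable of their supports: a function of
    # k variables needs >= ceil(log2 k) two-input levels, and the gate combining
    # g and h adds one more level on top of the deeper of the two.
    return 1 + max(clog2(len(support_g)), clog2(len(support_h)))

def choose(candidates):
    """candidates: list of (method, support_g, support_h), method in {'algebraic', 'bdd'}.
    Picks the smallest bound; on ties prefers 'algebraic' (assumed rule)."""
    return min(candidates,
               key=lambda c: (depth_lower_bound(c[1], c[2]), c[0] != "algebraic"))

cands = [
    ("algebraic", {"a", "b", "c", "d", "e"}, {"f"}),  # bound 1 + 3 = 4
    ("bdd", {"a", "b", "c"}, {"d", "e", "f"}),        # bound 1 + 2 = 3
]
print(choose(cands)[0])  # bdd
```

Under such a rule, a BDD-based candidate wins only when it is strictly better balanced. This is consistent with, though not proof of, the small percentage of BDD-based choices reported above.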
Another experiment was performed to measure the contribution of the method proposed in this paper with respect to the approximation method presented in [24], which the proposed method subsumes. The results reported in the third row of Table III show that the new method still yields a small improvement in the number of levels and in delay.
TABLE II
EXPERIMENTAL RESULTS
PI: primary inputs; PO: primary outputs; L: depth (levels of two-input gates); D,A: delay and area (normalized with respect to algebraic). The area in algebraic
has been divided by the area of a NAND2 gate. The average values for delay and area in algebraic have also been normalized to 1.00. The number of levels for the
three ACD methods is the same.
Fig. 15. Summary of results in Table I.
TABLE III
IMPACT OF BI-DECOMPOSITION METHODS USING THE ACD-TREE SCRIPT
VII. RELATED WORK AND DISCUSSION
The approach in [6] uses a strategy of partial collapsing and resynthesis of critical paths. Resynthesis of each node is performed
by extracting kernel-based divisors that reduce the arrival time
at the output.
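Kernels are the algebraic divisors introduced by Brayton and McMullen [17]. The compact Python sketch below shows the classic recursive kernel enumeration. It assumes an SOP is a frozenset of cubes and a cube is a frozenset of literal names. The arrival-time-driven choice among kernels performed in [6] is not shown.

```python
def divide_by_cube(f, cube):
    """Algebraic quotient f / cube: keep cubes containing `cube`, minus its literals."""
    return frozenset(c - cube for c in f if cube <= c)

def is_cube_free(f):
    """Cube-free: at least two cubes and no literal common to all of them."""
    return len(f) > 1 and not frozenset.intersection(*f)

def kernels(f, order=None, start=0):
    """Enumerate the kernels of SOP `f` with the recursive scheme of [17]."""
    if order is None:
        order = sorted(frozenset().union(*f))  # fixed global literal order
    found = set()
    for i in range(start, len(order)):
        with_lit = [c for c in f if order[i] in c]
        if len(with_lit) < 2:
            continue
        common = frozenset.intersection(*with_lit)  # largest cube dividing those terms
        if any(order[j] in common for j in range(i)):
            continue  # same quotient already reached through an earlier literal
        found |= kernels(divide_by_cube(f, common), order, i + 1)
    if is_cube_free(f):
        found.add(f)
    return found

# f = ace + bce + de + g
f = frozenset(map(frozenset, ["ace", "bce", "de", "g"]))
print(sorted(sorted("".join(sorted(c)) for c in k) for k in kernels(f)))
# [['a', 'b'], ['ac', 'bc', 'd'], ['ace', 'bce', 'de', 'g']]
```

The three kernels are a + b, ac + bc + d, and f itself. A timing-driven resynthesis would evaluate such divisors by the arrival time they produce at the output.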
Restructuring by applying the ACD rules works at a finer level of granularity than kernel-based decomposition and may potentially lead to better results, as illustrated by the examples in Section II-A. Additionally, the ACD SPEED algorithm has been designed so that it can keep exploring solutions even when no global improvement has been observed for a few iterations. These two features result in a better exploration of the space of solutions at the expense of a higher computational cost. However, the experimental results show that this cost is still affordable.
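Continuing to explore after a few non-improving iterations can be sketched as a local search with a patience budget. This generic Python sketch is not the ACD SPEED algorithm of Fig. 10. The names `step`, `cost`, and `patience` are hypothetical. A real implementation would apply ACD rewrites and measure network depth.

```python
def search_with_patience(start, step, cost, patience):
    """Follow `step` from `start`, remembering the best solution seen.

    Up to `patience` consecutive steps without improving the best cost are
    tolerated before giving up; patience=0 stops at the first non-improving step.
    `step` returns the next candidate, or None when no move is available.
    """
    best, best_cost = start, cost(start)
    current, stale = start, 0
    while stale <= patience:
        nxt = step(current)
        if nxt is None:
            break
        current = nxt
        c = cost(current)
        if c < best_cost:
            best, best_cost, stale = current, c, 0
        else:
            stale += 1
    return best, best_cost

# Toy landscape: a plateau at cost 4 hides a better solution at index 4.
costs = [5, 4, 4, 4, 2, 3]
step = lambda i: i + 1 if i + 1 < len(costs) else None
print(search_with_patience(0, step, costs.__getitem__, patience=0))  # (1, 4)
print(search_with_patience(0, step, costs.__getitem__, patience=2))  # (4, 2)
```

The extra iterations spent crossing the plateau are the "higher computational cost" traded for a better solution.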
A. Sharing Before Reducing Depth
The experimental results also reveal the problems of speeding up networks that have been highly optimized for area. The results obtained by the rugged script are, on average, inferior to those obtained by the algebraic script.
As an example, we took from the benchmark suite
and compared the networks before executing the
command. Here are the results:
[Table: nodes and logic levels of the networks produced by the algebraic and rugged scripts, before and after speed up]
The algebraic script initially derives a slightly larger netlist (718 nodes, each a two-input gate) than the rugged script (711 nodes). However, the number of logic levels is much higher for the rugged script, due to its more aggressive sharing. This has a tangible impact when trying to speed up the netlist: the result obtained by the rugged script ends up having a larger number of nodes and logic levels. This example in particular, and the average results in Table II in general, illustrate the phenomenon mentioned in the introduction of this paper (Fig. 1).
B. Logic Decomposition During Technology Mapping
In [9], a combined approach for logic decomposition and
technology mapping was proposed. The strategy consists
of generating all possible decompositions of a circuit and representing them compactly in a graph. The decompositions are generated by applying local transformations that
correspond to the ACD rules described in this paper.4
In principle, one might think that the exploration power of both techniques is the same, except for the inaccuracy introduced by heuristics. However, the approach in [9] only uses the D-rule in its factoring direction, i.e., $ab + ac \rightarrow a(b + c)$.
This limitation is critical for reducing the depth of a circuit. As an example, the circuit in Fig. 9(a) would never be changed by the local transformations in [9], i.e., the closure of the circuit under these transformations would be the circuit itself. On the other hand, the application of the D-rule in its expanding form enables the exploration of more efficient solutions, as shown in Fig. 9(b)–(e).
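A small self-made example, not the circuit of Fig. 9, shows why the expanding direction matters. The Python sketch below measures depth in levels of two-input gates for a(b + c(d + ef)). It does the same for the expression obtained by expanding c(d + ef) into cd + c(ef) and regrouping. A truth-table check confirms that the two expressions are equivalent. The tuple encoding of trees is our own assumption.

```python
from itertools import product

def AND(x, y):
    return ("and", x, y)

def OR(x, y):
    return ("or", x, y)

def depth(e):
    """Levels of two-input gates on the longest input-to-output path."""
    if isinstance(e, str):
        return 0
    return 1 + max(depth(e[1]), depth(e[2]))

def evaluate(e, env):
    if isinstance(e, str):
        return env[e]
    left, right = evaluate(e[1], env), evaluate(e[2], env)
    return (left and right) if e[0] == "and" else (left or right)

def support(e):
    return {e} if isinstance(e, str) else support(e[1]) | support(e[2])

def equivalent(e1, e2):
    """Exhaustive truth-table comparison (fine for a handful of variables)."""
    names = sorted(support(e1) | support(e2))
    for bits in product([False, True], repeat=len(names)):
        env = dict(zip(names, bits))
        if evaluate(e1, env) != evaluate(e2, env):
            return False
    return True

# a(b + c(d + ef)): every gate feeds the next one, giving a chain of depth 5.
original = AND("a", OR("b", AND("c", OR("d", AND("e", "f")))))
# Expanding D-rule on c(d + ef) -> cd + c(ef), then grouping b with cd.
expanded = AND("a", OR(OR("b", AND("c", "d")), AND("c", AND("e", "f"))))

print(depth(original), depth(expanded), equivalent(original, expanded))  # 5 4 True
```

The expansion duplicates the literal c, trading one extra gate for one level less. The factoring direction alone can only move in the opposite direction.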
The incorporation of the expanding form of the D-rule increases the exploration space exponentially. Given a known upper bound on the minimum depth of the circuit,5 all trees up to nodes could potentially lead to a solution with depth not larger than , as shown in Theorem 2. This is the main reason why an exhaustive exploration can be computationally expensive and why the heuristic search of the ACD SPEED algorithm (Fig. 10) is proposed.
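The upper bound of footnote 5 and the matching trivial lower bound are easy to compute. The Python sketch below assumes an SOP given as a list of cubes, each a list of literals, with complements written with a `~` prefix. Only two-input AND/OR levels are counted, so inverters are free. Each cube becomes a balanced AND tree, and the cubes are combined by a balanced OR tree. For the lower bound, any function that depends on n variables needs at least ceil(log2 n) levels.

```python
def clog2(n):
    """Ceiling of log2(n) for n >= 1 (integer arithmetic, no floats)."""
    return (n - 1).bit_length()

def sop_depth_upper_bound(cubes):
    """Depth of a trivial two-level implementation with two-input gates:
    a balanced AND tree per cube, then a balanced OR tree over the cubes."""
    return max(clog2(len(c)) for c in cubes) + clog2(len(cubes))

def support_depth_lower_bound(cubes):
    """A circuit of two-input gates with depth d sees at most 2**d inputs, so a
    function depending on n variables needs at least ceil(log2 n) levels
    (assumes every variable appearing in the SOP is essential)."""
    support = {lit.lstrip("~") for c in cubes for lit in c}
    return clog2(len(support))

f = [["a", "b", "c"], ["~a", "d"], ["e"]]
print(support_depth_lower_bound(f), sop_depth_upper_bound(f))  # 3 4
```

Any tree found by the search with depth equal to the lower bound is optimal. Trees deeper than the upper bound can be discarded.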
To emphasize the difference between both approaches, we ran a few small benchmarks with an implementation of the technology mapping algorithm in [9].6 As an example, the results
4The inverter transformations in [9] are naturally covered by the comple-
mented arcs in the BDAGs.
5An upper bound can always be found by taking the depth of a trivial imple-
mentation, e.g., a sum-of-products implemented with two-input gates.
6A prototype was designed by [28]. Only very small examples were executed
due to the complexity of the algorithm and the naive implementation of some
data structures in the first prototype of the algorithm.
for the algebraic and ACD-cluster scripts for with
this technology mapper are the following:
[Table: delay and area after mapping with the graph mapper of [9], for the algebraic and ACD-cluster scripts]
This result confirms that: 1) the algorithm by Lehman and
Watanabe can find better solutions than a conventional tree
mapper (see the corresponding result in Table I) and 2) the
technique presented in this paper is not subsumed by Lehman
and Watanabe’s approach.
VIII. CONCLUSION
This paper has presented an approach for decomposing logic
functions. It aims at reducing the number of logic levels of the
network and succeeds in doing so for many examples, compared with existing techniques. However, many questions remain open: how far are we from optimum solutions? Would it be possible to calculate tight lower/upper bounds on the depth of a circuit implementing a Boolean function? How much area must we pay to reduce one logic level? More research is needed in this direction.
This paper has shown that the area/delay tradeoff can be fur-
ther explored and tangible improvements can still be obtained
with regard to previous techniques.
ACKNOWLEDGMENT
The author wishes to thank the reviewers for their suggestions
to improve the paper.
REFERENCES
[1] M. Fujita and R. Murgai, “Delay estimation and optimization of logic
circuits: a survey,” in Proc. Asia South Pacific Design Automation Conf.,
1997, pp. 25–30.
[2] R. Rudell, “Logic synthesis for VLSI design,” Ph.D. dissertation, Univ.
Calif., Berkeley, Apr. 1989.
[3] J. Vasudevamurthy and J. Rajski, “A method for concurrent decomposi-
tion and factorization of Boolean expressions,” in Proc. Int. Conf. Com-
puter-Aided Design, 1990, pp. 510–513.
[4] K. Bartlett, R. Brayton, G. Hachtel, R. Jacoby, C. Morrison, R. Rudell, A. Sangiovanni-Vincentelli, and A. Wang, “Multi-level logic minimization using implicit don’t cares,” IEEE Trans. Computer-Aided Design, vol. 7, pp. 723–740, June 1988.
[5] K. Chen and S. Muroga, “Timing optimization for multi-level combi-
national circuits,” in Proc. ACM/IEEE Design Automation Conf., 1990,
pp. 339–344.
[6] K. Singh, A. Wang, R. Brayton, and A. Sangiovanni-Vincentelli,
“Timing optimization of combinational logic,” in Proc. Int. Conf.
Computer-Aided Design, Nov. 1988, pp. 282–285.
[7] H. Touati, H. Savoj, and R. Brayton, “Delay optimization of combina-
tional circuits by clustering and partial collapsing,” in Proc. Int. Conf.
Computer-Aided Design, Nov. 1991, pp. 188–191.
[8] A. Gibbons and W. Rytter, Efficient Parallel Algorithms. Cambridge,
U.K.: Cambridge Univ. Press, 1988.
[9] E. Lehman, Y. Watanabe, J. Grodstein, and H. Harkness, “Logic decom-
position during technology mapping,” IEEE Trans. Computer-Aided De-
sign, vol. 16, pp. 813–834, Aug. 1997.
[10] R. Ashenhurst, “The decomposition of switching functions,” in Proc.
Int. Symp. Theory Switching, vol. 29, 1959, pp. 74–116.
[11] A. Mishchenko, B. Steinbach, and M. Perkowski, “An algorithm for
bi-decomposition of logic functions,” in Proc. ACM/IEEE Design Au-
tomation Conf., June 2001, pp. 282–285.
[12] S. Yamashita, H. Sawada, and A. Nagoya, “New methods to find op-
timal nondisjoint bi-decompositions,” in Proc. ACM/IEEE Design Au-
tomation Conf., 1998, pp. 59–68.
[13] S.-C. Chang, M. Marek-Sadowska, and T. Hwang, “Technology map-
ping for TLU FPGA’s based on decomposition of binary decision dia-
grams,” IEEE Trans. Computer-Aided Design, vol. 15, pp. 1226–1235,
Oct. 1996.
[14] T. Sasao, “FPGA design by generalized functional decomposition,” in
Logic Synthesis and Optimization. Norwell, MA: Kluwer, 1993, pp.
233–258.
[15] C. Scholl, Functional Decomposition With Application to FPGA Syn-
thesis. Norwell, MA: Kluwer, 2001.
[16] D. Kuck, The Structure of Computers and Computation. New York:
Wiley, 1978.
[17] R. Brayton and C. McMullen, “The decomposition and factorization of
Boolean expressions,” in Proc. Int. Symp. Circuits Syst., May 1982, pp.
49–54.
[18] H. Andersen and H. Hulgaard, “Boolean expression diagrams,” presented at the IEEE Symp. Logic Comput. Sci. [Online]. Available: citeseer.nj.nec.com/andersen97boolean.html
[19] J. Baer and D. Bovet, “Compilation of arithmetic expressions for parallel
computations,” in Proc. IFIP Congress North-Holland, The Nether-
lands, 1968, pp. 340–346.
[20] J. Beatty, “An axiomatic approach to code optimization for expressions,”
J. Assoc. Comput. Mach., vol. 19, no. 4, pp. 613–640, Oct. 1972.
[21] T. Stanion and C. Sechen, “Quasialgebraic decompositions of switching
functions,” in Proc. 16th Conf. Adv. Res. VLSI, 1995, pp. 358–367.
[22] C. Yang, M. Ciesielski, and V. Singhal, “BDS: a BDD-based logic opti-
mization system,” in Proc. ACM/IEEE Design Automation Conf., June
2000, pp. 92–97.
[23] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, R. K. Brayton, and A. Sangiovanni-Vincentelli, “SIS: A System for Sequential Circuit Synthesis,” Tech. Rep., Univ. Calif., Berkeley, May 1992.
[24] K. Ravi, K. McMillan, T. Shiple, and F. Somenzi, “Approximation and
decomposition of binary decision diagrams,” in Proc. Design Automa-
tion Conf., 1998, pp. 445–450.
[25] Y.-T. Lai, K.-R. Pan, and M. Pedram, “OBDD-based function decomposition: algorithms and implementation,” IEEE Trans. Computer-Aided Design, vol. 15, pp. 977–990, Aug. 1996.
[26] Y. Kukimoto, R. Brayton, and P. Sawkar, “Delay-optimal technology
mapping by DAG covering,” in Proc. Design Automation Conf., 1998,
pp. 348–351.
[27] A. Srivastava, R. Kastner, and M. Sarrafzadeh, “Timing driven gate du-
plication: complexity issues and algorithms,” in Proc. Int. Conf. Com-
puter-Aided Design, Nov. 2000, pp. 447–450.
[28] D. Bañeres, Algorithm for logic decomposition and technology map-
ping, Facultat d’Informàtica de Barcelona, Barcelona, Spain, July 2002.
Jordi Cortadella (S’87–M’88) received the M.S.
and Ph.D. degrees in computer science from the
Universitat Politècnica de Catalunya, Barcelona,
Spain, in 1985 and 1987, respectively.
He is a Professor in the Department of Software,
Universitat Politècnica de Catalunya. In 1988,
he was a Visiting Scholar at the University of
California, Berkeley. His research interests include
formal methods and computer-aided design of very
large scale integration systems with special emphasis
on asynchronous circuits, concurrent systems, and
logic synthesis. He has coauthored over 100 research papers in technical
journals and conferences.
Dr. Cortadella has served on the technical committees of several international
conferences in the field of Design Automation and Concurrent Systems. He
served as a Symposium Co-Chair for the 5th International Symposium on Ad-
vanced Research in Asynchronous Circuits and Systems.
