Advanced Datapath Synthesis using Graph Isomorphism by Yu, Cunxi et al.
Cunxi Yu1, Mihir Choudhury2, Andrew Sullivan2, Maciej Ciesielski1
ECE Department, University of Massachusetts, Amherst1
IBM T.J Watson Research Center2
ycunxi@umass.edu, choudhury@us.ibm.com
Abstract - This paper presents an advanced DAG-based algorithm
for datapath synthesis that targets area minimization using logic-level
resource sharing. The problem of identifying common specification
logic is formulated as an unweighted graph isomorphism problem,
in contrast to the weighted graph isomorphism formulation used with
AIGs. In the context of gate-level datapath circuits, our algorithm
solves the unweighted graph isomorphism problem in linear time. The
experiments are conducted within an industrial synthesis flow that
includes complete high-level synthesis, logic synthesis, and placement
and routing procedures. Experimental results show significant runtime
improvements compared to the existing datapath synthesis algorithms.
Index Terms—Logic synthesis, datapath synthesis, resource sharing,
graph isomorphism
I. INTRODUCTION
Due to a large demand for computing, the complexity of hardware
systems has been increasing significantly, raising the challenges in
design, verification and synthesis to a new level. In the last ten years,
there has been a push to change the optimization algorithms of
EDA tools to improve their performance in terms of timing, area and
power. Particularly affected are datapath modules in microprocessors
and embedded systems, which play an important role in computations
and put new demands on logic synthesis. The traditional datapath
synthesis flow includes extraction of arithmetic operations from RTL
code, high-level synthesis (HLS), logic synthesis, and technology
mapping [1][2]. Datapath synthesis techniques, such as resource
sharing, scheduling and binding, have mainly been discussed in the
context of traditional high-level synthesis research and rely on the
Data Flow Graph (DFG) representation [3][4][5]. Arithmetic operations
such as addition, multiplication, shifting and comparison, and control
logic are extracted and modeled as block modules. At the same time,
methods such as carry-prefix and recoded-partial-product based
techniques are applied for delay optimization [6]. The remaining
part of the design flow produces the technology-mapped netlist using
a standard-cell library.
Even though most of the datapath synthesis effort is spent in
the high-level synthesis stage, there are many unexplored opportunities
in bit-level optimization that could improve the results of high-level
synthesis. Recently, high-level optimization techniques, such as
resource sharing, have been applied in logic synthesis to overcome
some of the limitations of datapath synthesis for standard-cell designs.
Specifically, a Directed Acyclic Graph (DAG) based logic synthesis
technique that targets area minimization of datapath designs was
proposed in [7]. It is a structural optimization technique implemented
using And-Inv-Graphs (AIGs) [8], which offers bit-level resource
sharing. The method includes three steps: 1) identifying sub-circuit
candidates by searching for multiplexer-equivalent AIG structures; 2)
identifying common specification logic using graph isomorphism;
and 3) finalizing the optimization by relocating multiplexers across
the common logic. The most critical part of the technique is step 2,
which solves the problems of identifying common logic and performing
Boolean matching. In fact, finding an isomorphism in AIGs is a weighted
graph isomorphism problem [7]. This is because, to represent an
arbitrary Boolean network using AND nodes, each edge must
represent either an inversion or a wire, which makes an AIG a weighted
graph. Note that solving graph isomorphism in weighted graphs is
much more complex than in unweighted graphs [9].
Although the technique of [7] offers a new direction in datapath
synthesis and promises area reduction, it has some limitations. First,
the general graph isomorphism problem is in NP, but it is not known
whether it is in P or NP-complete. Despite the reduction in
complexity offered by DAGs, solving a weighted DAG isomorphism
problem could still cause memory and runtime explosion. Furthermore,
since that technique is implemented on AIGs, it requires
transformations between the gate-level network and the AIG
representation to produce the technology-mapped netlist. These
transformations could undo optimizations performed by the previous
synthesis procedures.
In this work, we develop new algorithms to overcome these
limitations. Specifically, we make the following contributions:
1) We propose a novel algorithm for identifying common specification
logic that directly supports arbitrary standard-cell netlists,
without using AIGs, which maintains the optimizations performed by
other synthesis techniques.
2) Instead of solving a weighted graph isomorphism problem, the
proposed algorithm formulates the problem as unweighted graph
isomorphism, which significantly reduces the complexity of solving
the problem.
3) The runtime complexity of the AIG-based algorithm [7] and of
the one presented here is compared using illustrative examples
(Section 3.1), and demonstrated using large datapath designs
(Figure 7).
4) The proposed algorithm allows approximate isomorphism classes
to be optimized (Section 3.2).
5) This approach has been evaluated in two complete IBM synthesis
flows, including the complete flow of high-level synthesis, logic
synthesis and place and route (P&R), which allows a meaningful
comparison with other techniques. The experiments were
performed using a 14nm technology library.
II. BACKGROUND
A. Boolean Network
A Boolean network can be represented as a directed acyclic
graph (DAG) with nodes representing logic gates and directed edges
representing wires connecting the gates. If the network is sequential,
the memory elements are assumed to be D flip-flops with known
initial states. In this work, we only consider combinational logic
optimization, which means the flip-flops are treated as primary
inputs (PI) and primary outputs (PO) of the sub-circuits.
In AIGs [8], each node has either zero or two incoming edges.
A node with no incoming edges is a primary input. Primary outputs
are represented using special output nodes without output edges. Each
internal node represents a Boolean AND function. The combinational
logic of an arbitrary Boolean network can be transformed into an AIG
[10], in which the edges can optionally provide inversions. Hence, an
AIG is considered a weighted DAG.
Fig. 1: (a) Gate-level netlist f=bc + a · ad; (b) AIG representation,
f=f2f3, f2=f1a, f3=ad, and f1=bc; (c) the proposed representation,
node1 representing AOI21, and node2 and node3 representing the
two NAND2s.
arXiv:1708.09597v1 [cs.AR] 31 Aug 2017
Alternatively, the Boolean network can be directly represented
using the gate-level netlist. The primary inputs, primary outputs,
and flip-flops are taken from the standard-cell netlist. Each
logic gate is a vertex in the DAG. Logic gates with the same
logic function are considered to be the same vertex type.
This DAG has only one type of edge, i.e. it is an unweighted DAG, and
its typed vertices make isomorphism checking less ambiguous. The
comparison between the AIG and our representation is shown in Figure 1.
The actual gate-level netlist, including one AOI21 and two NAND2 gates,
is shown in Figure 1(a), and its AIG representation is shown in
Figure 1(b). The AIG requires four nodes with four inversion edges
and five non-inversion edges to represent this netlist. In contrast,
the proposed representation in Figure 1(c) has three nodes of two
types, and all edges are identical. The representation shown in
Figure 1(c), which we adopt in our work, has two advantages:
1) it avoids transformations between different Boolean network
representations and preserves the original structure, which maintains
the optimizations done in previous stages; 2) it converts the weighted
graph isomorphism problem into an unweighted graph isomorphism problem,
which improves the runtime for identifying common specification logic.
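The typed-vertex, single-edge-type representation described above can be sketched in a few lines of Python. This is only an illustration: the `Cell` class, the helper names, and the exact wiring of the three cells are assumptions made for the example (the figure's connectivity is not recoverable from the text), not the authors' data structures.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Cell:
    """One standard cell = one DAG vertex; `kind` plays the role of the vertex type."""
    name: str
    kind: str
    fanins: list = field(default_factory=list)  # names of driving cells or primary inputs


def vertex_types(cells):
    """Histogram of vertex types, e.g. how many AOI21 and NAND2 vertices exist."""
    return Counter(cell.kind for cell in cells)


def dag_edges(cells):
    """Cell-to-cell edges. They carry no inversion attribute, so the DAG is unweighted."""
    names = {cell.name for cell in cells}
    return [(src, cell.name) for cell in cells for src in cell.fanins if src in names]


# Hypothetical wiring in the spirit of Fig. 1(c): one AOI21 fed by two NAND2 gates.
netlist = [
    Cell("node2", "NAND2", ["b", "c"]),
    Cell("node3", "NAND2", ["a", "d"]),
    Cell("node1", "AOI21", ["node2", "node3", "a"]),
]
```

With this layout, two vertices are candidates for pairing only if their `kind` fields agree, which is the property the rest of the approach relies on.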
B. Common Specification Logic
Two combinational circuits are considered common specification
logic if they have the same specification [11]. In this work, common
specification logic has to be identified in the following context: given
the output boundaries of two logic cones, find the input boundaries
that result in maximum common logic such that the signals of the
input boundaries match. Most techniques for checking whether two designs
are common specification logic are based on combinational
equivalence checking (CEC). This problem has been addressed using
BDDs [12], SAT [13][14], AIGs [10], etc. However, those methods
cannot be applied in this work for the following reasons: 1) the input
boundaries of the designs are unknown; and 2) even if the input
boundaries are detected, the correspondence (Boolean matching) between
those inputs is unknown. Furthermore, it is well known that functional
methods, such as BDDs and SAT, do not scale to gate-level arithmetic
designs, such as multipliers.
C. Graph Isomorphism
In graph theory, an isomorphism of graphs G and H is a bijection
between the vertex sets V(G) and V(H), f : V(G) → V(H), such
that any two vertices u and v of G are adjacent in G iff f(u) and f(v)
are adjacent in H. Besides the mathematical research on graph
isomorphism, the algorithmic approach to graph isomorphism has been
widely used in computer engineering, e.g. Boolean matching [15] and
program similarity checking [16]. In general, graph isomorphism is
applicable to undirected, unlabeled, unweighted graphs. The problem is
known to be in NP, but it is not known to be either NP-complete or
solvable in polynomial time by a deterministic algorithm. However, in
the context of Boolean networks, this problem can be solved efficiently
by heuristic algorithms. In this work, we propose a novel algorithm
that reduces the number of reordering operations by employing the
fanin-fanout information of each node (i.e. standard cell) when
checking the existence of an isomorphism between two directed acyclic
graphs.
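The definition above can be checked mechanically for a given candidate mapping. The following Python sketch does exactly that for undirected, unlabeled graphs given as edge lists; the function name and the convention that vertices are inferred from the edges (so isolated vertices are ignored) are assumptions of this example. It verifies a mapping and does not search for one.

```python
def is_isomorphism(edges_g, edges_h, mapping):
    """Check that `mapping` (vertex of G -> vertex of H) is a graph isomorphism."""
    verts_g = {v for edge in edges_g for v in edge}
    verts_h = {v for edge in edges_h for v in edge}

    # f must be a bijection between V(G) and V(H).
    if set(mapping) != verts_g or set(mapping.values()) != verts_h:
        return False
    if len(set(mapping.values())) != len(mapping):
        return False

    # u, v adjacent in G  iff  f(u), f(v) adjacent in H.
    mapped = {frozenset((mapping[u], mapping[v])) for u, v in edges_g}
    target = {frozenset(edge) for edge in edges_h}
    return mapped == target


path_g = [("a", "b"), ("b", "c")]
path_h = [("x", "y"), ("y", "z")]
good = {"a": "x", "b": "y", "c": "z"}
bad = {"a": "y", "b": "x", "c": "z"}
```

Here `good` maps the middle vertex to the middle vertex and is accepted, while `bad` maps an end vertex to the middle and is rejected.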
III. APPROACH
The overall methodology of our approach has three steps. A vector
multiplexer is a set of 2-to-1 multiplexers with identical control
signals. First, vector multiplexers are collected by structurally
reverse engineering all the 2-to-1 multiplexers from the gate-level
netlist [7], and then classifying them based on their control signals.
Note that a multiplexer is eliminated from the collection if any of its
data inputs has a fanout. Large multiplexers, such as 64-to-1,
are decomposed into 2-to-1 multiplexers [17]. Second, a set
of sub-circuits is created based on these vector multiplexers. Each
sub-circuit is a combinational logic cone whose primary outputs are
the outputs of all multiplexers in the vector multiplexer. These two
procedures form the pre-processing step. Third, a multiplexer relocation
function is applied to each output of the sub-circuit iteratively. The
order in which multiplexer relocation is applied is determined by the
number of logic gates per multiplexer in the sub-circuit. The original
design is updated if the area of the sub-circuit is improved by
relocating the multiplexers, i.e., moving the multiplexers backward
without changing the functionality of the design. The resulting updated
standard-cell netlist is then subjected to the remaining logic
synthesis steps and eventually to physical design.
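The first pre-processing step, grouping 2-to-1 multiplexers into vector multiplexers, might look roughly as follows. The tuple layout, the function name, and the reading of "has a fanout" as "the data-input net drives more than one sink" are assumptions of this sketch; the structural recognition of multiplexers in the netlist is taken as already done.

```python
from collections import defaultdict


def collect_vector_muxes(muxes, fanout_count):
    """Group 2-to-1 muxes by select signal.

    muxes:        list of (mux_name, select_net, data0_net, data1_net)
    fanout_count: dict net -> number of sinks (nets not listed are assumed to have one sink)
    """
    groups = defaultdict(list)
    for name, select, data0, data1 in muxes:
        # Assumption: a data input "has a fanout" when its net feeds more than this mux.
        if fanout_count.get(data0, 1) > 1 or fanout_count.get(data1, 1) > 1:
            continue
        groups[select].append(name)
    return dict(groups)


muxes = [
    ("m0", "s", "n0", "n1"),
    ("m1", "s", "n2", "n3"),
    ("m2", "t", "n4", "n5"),
    ("m3", "s", "n6", "n7"),  # dropped below: n6 also drives other logic
]
vector_muxes = collect_vector_muxes(muxes, {"n6": 2})
```

Each resulting group corresponds to one vector multiplexer, whose outputs then define the primary outputs of one sub-circuit.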
A. Exact Isomorphism Determination
Even though the multiplexer relocation is applied to a sub-circuit
that includes vector multiplexers at the primary outputs, the actual
relocation is done individually for each multiplexer. The goal of
multiplexer relocation is to maximize the sharing of common
specification logic in the input cones of the multiplexers, by moving
the multiplexers backward. The main challenge is to identify the common
specification logic in the sub-circuits created by the pre-processing
step. Specifically, this requires performing common structure
identification and Boolean matching. Following the definition of graph
isomorphism, the algorithm proposed in [7] determines the isomorphism
boundary between two graphs using breadth-first search. To obtain the
maximum common logic, a look-ahead heuristic is applied when
there are multiple identical choices for constructing the isomorphism.
This could potentially cause exponential runtime and memory explosion,
especially in designs with many reconvergent fanouts. In
this section, we introduce a novel algorithm that improves the runtime
and scalability of identifying common specification logic.
1) Standard-cell based DAG advantages: Compared with the AIG
representation, the standard-cell based representation gives two
advantages: 1) optimizations from other stages of the synthesis
flow, which may disappear during the transformations between AIG and
standard-cell netlist, are maintained; 2) the standard-cell
representation significantly reduces the number of possible choices
when checking the existence of an isomorphism. There are three reasons
for the second advantage: (a) in each topological level, the total
number of possible pairing choices is reduced; (b) the edge type no
longer needs to be considered, which makes the isomorphism problem
unweighted; and (c) utilizing the number of inputs and outputs of each
standard cell reduces the number of possible choices when checking
isomorphism, especially in the representation of logic circuits. We
demonstrate these using an example in Figures 2 and 3.
Fig. 2: Determine graph isomorphism using standard-cell based DAG.
(Gates g0 to g9; multiplexer data inputs data0 and data1 with select s;
primary inputs a, b, c, d, e and a', b', c', d', e'.)
Example 1 (Figure 2) The standard-cell netlist is shown in Figure
2. Signals data0 and data1 are the two inputs to a 2-to-1 multiplexer.
Signals a, b, c, d, and e are the primary inputs. In each logic cone, the
first two levels of logic include one AOI21 and three NAND2 gates.
Each gate is considered as a vertex. The determination process starts
with g0 and g5. Since g0 and g5 are vertices of the same type, two
vectors of vertices are created using breadth-first search: V0={g1, g4}
and V1={g7, g6}. To keep the traversed graphs in the same isomorphism
class, there exists only one pairing choice, i.e. (g1, g6), (g4, g7). The
two vectors are then updated to V0={g2, e, g3} and V1={g8, e′, g9}.
Since e and e′ are primary inputs, they are paired and eliminated from
V0 and V1. Hence, we have two NAND2 vertices in each vector,
which gives two pairing options, i.e. (g2, g8) or (g2, g9). However,
in the standard-cell based DAG, only one option remains. This is
because AOI21 has two types of inputs: two inputs for the AND
and one input for the OR/NOR. To maintain functional equivalence,
g2 must pair with g8, and so g3 must pair with g9. In summary, the
total number of possible attempts for determining isomorphism for
the first two levels of logic is one.
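The role of pin classes in Example 1 can be illustrated with a short Python sketch. It pairs the fanins of two already matched cells per pin class and reports when the pairing is forced. The dictionary layout, the function name, and the assignment of g2/e to the AND pins and g3 to the OR pin are assumptions for illustration; the figure's exact pin wiring is not given in the text.

```python
def pair_by_pin_class(fanins0, fanins1, primary_inputs):
    """Pair fanin gates of two matched cells, one pin class at a time.

    fanins0/fanins1: dict pin_class -> list of driver names.
    Returns the forced gate pairs, or None when the cells do not match or
    when a pin class still holds several gates (a look-ahead would be needed).
    """
    pairs = []
    for pin_class, drivers0 in fanins0.items():
        drivers1 = fanins1.get(pin_class, [])
        gates0 = [d for d in drivers0 if d not in primary_inputs]
        gates1 = [d for d in drivers1 if d not in primary_inputs]
        if len(drivers0) != len(drivers1) or len(gates0) != len(gates1):
            return None  # the two cells cannot be matched at this pin class
        if len(gates0) > 1:
            return None  # ambiguous: more than one gate competes for this class
        pairs.extend(zip(gates0, gates1))
    return pairs


pis = {"e", "e'"}
# Hypothetical AOI21 pin assignment for g1 and g6.
g1 = {"and": ["g2", "e"], "or": ["g3"]}
g6 = {"and": ["e'", "g8"], "or": ["g9"]}
# Same fanins with the pin classes thrown away.
g1_flat = {"any": ["g2", "e", "g3"]}
g6_flat = {"any": ["g8", "e'", "g9"]}
```

With pin classes the pairing is forced to (g2, g8) and (g3, g9); with the classes discarded the same call reports an ambiguity, which is the situation that otherwise requires a look-ahead.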
Fig. 3: Determining graph isomorphism using AIG.
(AIG nodes 1 to 14; multiplexer data inputs data1 and data2;
primary inputs a, b, c, d, e and a', b', c', d', e'.)
However, determining the maximum isomorphism requires much more
effort when using the AIG representation. The AIG
representation of this design is shown in Figure 3. According to
the algorithm proposed in [7], the first level of logic has two options
for pairing, i.e. node 2 with node 9, or node 2 with node 10. The
algorithm solves this problem using a look-ahead heuristic, which
traverses three levels deeper and picks the pairing that gives more
common logic. This situation also occurs while checking (node 4
with node 7, and node 11 with node 13), and (node 6 with node 7,
and node 13 with node 14). This means that three look-ahead checks
and a total of eight attempts are required to identify the same
common logic as the one shown in Figure 2.
2) Including side fanout information: Based on the observation
shown in Example 1, we can see that providing various types of
vertices at each logic level can significantly reduce the total number
of pairing attempts for isomorphism determination. Thus, we preserve
the fanout information of the standard cells in the vertices. This can
significantly improve the runtime for a large design that includes
many reconvergent fanouts, such as the optimized multipliers.
Fig. 4: Illustrative example of utilizing fanout information of each
vertex. (Gates g0 to g9; nets n0 to n7; multiplexer data inputs data0
and data1.)
Fig. 5: Approximate isomorphism determination by ignoring inverters.
a) original design; b) optimized design using extra XOR2 gate.
Example 2 (Figure 4) Assume that each logic cone of a 2-to-1
multiplexer includes one XOR4 and four NAND2 gates in the first
two levels. Let the number of side fanouts of nets {n0, n1, n2, n3} be
{3,2,1,0}, and the number of side fanouts of nets {n4, n5, n6, n7} be
{1,3,2,0}. Without the fanout information, the total number
of possible pairings is 4! = 24, since the four vertices in the second
level are identical. However, if we pair the vertices according to the
number of side fanouts, there is only one pairing choice, i.e. (g1,
g7), (g2, g8), (g3, g6), and (g4, g9). Although the fanout information
can significantly reduce the number of pairings, such a case may not
always exist. When it does not, our approach falls back to the
look-ahead heuristic pairing process.
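The numbers in Example 2 can be reproduced with a small Python sketch. It assumes, for illustration only, that gates g1 to g4 drive nets n0 to n3 and gates g6 to g9 drive nets n4 to n7 (this matches the pairing stated in the example); the function name is hypothetical.

```python
from itertools import permutations


def unique_fanout_pairs(side_fanout0, side_fanout1):
    """Pair gates whose side-fanout count occurs exactly once in each cone."""
    by_count0, by_count1 = {}, {}
    for gate, count in side_fanout0.items():
        by_count0.setdefault(count, []).append(gate)
    for gate, count in side_fanout1.items():
        by_count1.setdefault(count, []).append(gate)

    pairs = []
    for count, gates0 in by_count0.items():
        gates1 = by_count1.get(count, [])
        if len(gates0) == 1 and len(gates1) == 1:
            pairs.append((gates0[0], gates1[0]))
    return pairs


# Side fanouts of the nets driven by each second-level NAND2 (assumed gate-to-net mapping).
cone0 = {"g1": 3, "g2": 2, "g3": 1, "g4": 0}
cone1 = {"g6": 1, "g7": 3, "g8": 2, "g9": 0}

blind_choices = len(list(permutations(cone1)))  # pairings to consider without fanout info
forced_pairs = unique_fanout_pairs(cone0, cone1)
```

Without fanout information there are 4! candidate pairings; with it, every gate has a unique signature and the pairing is determined in a single pass over the two vectors.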
B. Approximate Isomorphism Determination
In addition to treating exactly isomorphic graphs as common
specification logic, a novel approximate isomorphism determination
approach is developed in this work. One observation is that much
more common logic exists if inversions are ignored. For example, in
the case of a 2-to-1 multiplexer that selects between a less-than
operator and a less-than-or-equal operator, no common logic can be
identified with either representation while considering inversions.
Thus, we propose an approximate isomorphism method to overcome this
limitation. Specifically, in the process of identifying common logic,
the inverters are replaced by 2-input XORs, with the extra input coming
from the control signal of the multiplexer, or its complement.
Example 3 (Figure 5) The original netlist is shown in Figure
5(a). Using the approach described in Section 3.1, there will be only
one gate in each instance of the common logic, namely g0 and g2.
However, we can see that the two logic cones connected to the 2-to-1
multiplexer are identical if the inverter is not considered. Hence, we
continue searching for the common logic by skipping the inverters.
In this example, the common logic includes two NAND2 gates and one
inverter. To maintain the original function of f, the inverter is
replaced by an XOR2, whose extra input is the control signal s. In
Figure 5(b), signal s determines whether the XOR2 acts as an inverter
or a wire, i.e. when s = 1, the XOR2 is an inverter; and when s = 0,
the XOR2 is a buffer.
Fig. 6: A complete example of multiplexer relocation using the proposed approach.
(a) Original circuit. (b) Circuit optimized with our approach.
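The XOR2 replacement can be checked exhaustively on a toy circuit. The Python sketch below uses a hypothetical pair of cones, each a single NAND2, where only the cone selected by s = 1 ends in an inverter; it is not the netlist of Figure 5, and all function names are illustrative.

```python
from itertools import product


def nand2(a, b):
    return 1 - (a & b)


def mux(s, d0, d1):
    """2-to-1 multiplexer: returns d1 when s = 1, d0 when s = 0."""
    return d1 if s else d0


def original(s, a, b, x, y):
    # Cone selected by s = 0: NAND2(a, b). Cone selected by s = 1: NOT NAND2(x, y).
    return mux(s, nand2(a, b), 1 - nand2(x, y))


def relocated(s, a, b, x, y):
    # One shared NAND2 behind two new muxes; the inverter becomes XOR2(., s),
    # which inverts when s = 1 and passes the value through when s = 0.
    shared = nand2(mux(s, a, x), mux(s, b, y))
    return shared ^ s


all_equal = all(
    original(*bits) == relocated(*bits) for bits in product((0, 1), repeat=5)
)
```

The check covers all 32 input combinations, so the two circuits are functionally identical even though the cones differ by one inverter.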
C. Implementation
The implementation of single multiplexer relocation is shown in
Algorithm 2. The multiplexer relocation function of a sub-circuit with
a vector multiplexer at the primary outputs (line 5 in Algorithm 1)
applies the single relocation function iteratively on each output
bit. The input of Algorithm 2 is a sub-circuit with a single output bit
that is generated by a 2-to-1 multiplexer. Algorithm 2 operates in
three steps:
Algorithm 2 Single Multiplexer Relocation
Input: Pre-processed sub-circuit C
Output: An optimized standard-cell netlist

Single Mux Relocate(C)
  1: B = RelocationBoundray(PO)
  2: C ← relocate multiplexer to level B, w/o considering inverters
  3: P = inv2xorPosition(PO, B)
  4: C ← insert XORs to P based on its location
  5: return C

RelocationBoundray(PO)
  1: m ← levels(PO) − 1; an inverter is considered as 0 level.
  2: while m ≥ 0 do
  3:   L0_m ← the gates in (s = 0) logic at level m
  4:   L1_m ← the gates in (s = 1) logic at level m
  5:   if uniqueFanoutPairs(L0_m, L1_m) then
  6:     U0_m, U1_m ← uniqueFanoutPairs(L0_m, L1_m)
  7:     L0_m ← L0_m − U0_m; L1_m ← L1_m − U1_m
  8:     L0_{m+1}, L1_{m+1} ← (U0_m, U1_m) + isomorphism(L0_m, L1_m)
  9:   else
 10:     if isomorphism(L0_m, L1_m) then
 11:       L0_{m+1}, L1_{m+1} ← isomorphism(L0_m, L1_m)
 12:     else
 13:       exit
 14:     end if
 15:   end if
 16: end while
 17: return (level(PO) − 1 − m), (L0_{m−1}, L1_{m−1})

inv2xorPosition(PO, boundary)
  1: P0 ← the positions of all inverters in (s = 0) logic up to boundary level
  2: P1 ← the positions of all inverters in (s = 1) logic up to boundary level
  3: return P0 ∩ P1
a) The key function of this approach is identifying the maximum
common specification logic connected to the multiplexer. This is
done by the function RelocationBoundray in Algorithm 2.
Specifically, our algorithm identifies the boundary logic cut where the
isomorphism between the two logic cones ends. This function also returns
the pairings of the boundary signals that maintain the isomorphism
class, which are used for creating the new multiplexers.
We traverse the graph backward from the two inputs of the 2-to-1
multiplexer, level by level (lines 1 - 2). The gates at level m are
stored in two vectors (lines 3 - 4), depending on their select signal.
As mentioned in Section 3.1.2, our approach benefits significantly from
the fanout information. Hence, we first check whether there exist unique
fanout pairs. If so, we eliminate those pairs from the two vectors
that store the gates. The remaining gates in the two vectors
undergo a regular isomorphism check, with a 3-depth look-ahead search
[7]. For example, in Figure 6, there are two NAND2 gates in each
vector, (a1, a2) and (b1, b2). There are two pairing choices at this
level, i.e., (a1, b1) and (a2, b2), or (a1, b2) and (a2, b1). Using the
fanout information, there is only one feasible pairing, i.e., (a1,
b1) and (a2, b2). This is because a2 and b2 have two fanouts, and a1
and b1 have only one fanout.
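A heavily simplified version of this level-by-level boundary search is sketched below in Python. It is not the authors' implementation: the cone layout (a dict from gate name to cell type, fanout count and fanin list), the function names, and the example circuit are assumptions, and the sketch further assumes tree-like cones, interchangeable pins on every cell, no inverter skipping, and pin-order pairing in place of the 3-depth look-ahead.

```python
def signature(cone, name):
    """(cell type, fanout count) for a gate; anything outside the cone is an input."""
    return cone[name][:2] if name in cone else ("INPUT", None)


def pair_fanins(cone0, cone1, fanins0, fanins1):
    """Pair fanins of two matched gates; unique signatures force the pairing."""
    sig0 = [signature(cone0, n) for n in fanins0]
    sig1 = [signature(cone1, n) for n in fanins1]
    pairs, used1 = [None] * len(fanins0), set()
    for i, sig in enumerate(sig0):
        if sig0.count(sig) == 1 and sig1.count(sig) == 1:
            j = sig1.index(sig)
            pairs[i] = (fanins0[i], fanins1[j])
            used1.add(j)
    # Leftovers are paired in pin order (the paper applies a look-ahead search here).
    rest1 = [j for j in range(len(fanins1)) if j not in used1]
    for i in range(len(fanins0)):
        if pairs[i] is None:
            pairs[i] = (fanins0[i], fanins1[rest1.pop(0)])
    return pairs


def relocation_boundary(cone0, cone1, root0, root1):
    """Return (matched gate pairs, boundary signal pairs) for two cones."""
    matched, boundary = [], []
    frontier = [(root0, root1)]
    while frontier:
        next_frontier = []
        for a, b in frontier:
            if a in cone0 and b in cone1 and signature(cone0, a) == signature(cone1, b):
                matched.append((a, b))
                next_frontier.extend(pair_fanins(cone0, cone1, cone0[a][2], cone1[b][2]))
            else:
                boundary.append((a, b))  # isomorphism ends: a new mux selects a or b
        frontier = next_frontier
    return matched, boundary


# Hypothetical cones: gate -> (cell type, fanout count, fanins).
cone_a = {"a0": ("NOR2", 1, ["a1", "a2"]),
          "a1": ("NAND2", 1, ["x1", "x2"]),
          "a2": ("NAND2", 2, ["x3", "x4"])}
cone_b = {"b0": ("NOR2", 1, ["b2", "b1"]),
          "b1": ("NAND2", 1, ["y1", "y2"]),
          "b2": ("NAND2", 2, ["y3", "y4"])}
matched, boundary = relocation_boundary(cone_a, cone_b, "a0", "b0")
```

Even though the fanins of b0 are listed in the opposite order, the fanout counts force a1 to pair with b1 and a2 with b2, and the returned boundary pairs are the signals that the new multiplexers would select between.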
b) Relocate the multiplexer across the common specification logic,
up to the boundary cut returned by the previous step. The two logic
cones between the boundary and the multiplexer output have a common
specification (they are not functionally equivalent), and are denoted
cone_{s=0} and cone_{s=1}, depending on the select signal of the
multiplexer. To relocate the multiplexer, we disconnect all the pins of
cone_{s=1} and create a set of multiplexers that select between the
input signals of the two logic cones. For example, in Figure 6,
$m_i = x_i s + y_i \bar{s}$, $i \in \{1,2,3,4,5\}$.
Then, the inputs of cone_{s=1} are replaced by the outputs of the
new multiplexers. In this case, x_i is replaced by m_i. Finally, the
output F is reconnected to the output of cone_{s=1}.
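The functional correctness of moving a multiplexer backward across shared logic can be confirmed by exhaustive simulation. In the Python sketch below, `cone` is a hypothetical three-input structure (a NAND2 feeding a NOR2) standing in for the common specification logic; only the multiplexer equation m_i = x_i s + y_i s̄ is taken from the text.

```python
from itertools import product


def cone(v):
    """Hypothetical shared structure: NOR2(NAND2(v0, v1), v2)."""
    nand = 1 - (v[0] & v[1])
    return 1 - (nand | v[2])


def mux(s, x, y):
    """m = x*s + y*(not s): selects x when s = 1 and y when s = 0."""
    return (x & s) | (y & (1 - s))


def before(s, x, y):
    # Two copies of the cone, one multiplexer at the output.
    return mux(s, cone(x), cone(y))


def after(s, x, y):
    # One copy of the cone, one multiplexer per boundary signal.
    return cone([mux(s, xi, yi) for xi, yi in zip(x, y)])


vectors = list(product((0, 1), repeat=3))
equivalent = all(
    before(s, x, y) == after(s, x, y) for s in (0, 1) for x in vectors for y in vectors
)
```

The relocated version trades one copy of the cone for additional multiplexers at the boundary, which is why the design is only updated when the area of the sub-circuit actually improves.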
c) In the function RelocationBoundray, we do not consider an
inverter as a gate, or a node in the DAG. This enables the approximate
isomorphism determination (Section 3.2). As mentioned earlier, this
allows us to identify larger common logic. For example, if we
considered inverters as nodes in the graph, the common logic would
consist of only two NOR2 gates, a0 in cone_{s=1} and b0 in cone_{s=0}.
To maintain the functionality of the design, we need to insert XOR2
gates with extra input s or s̄, depending on which cone the inverter
belongs to. We first record the locations of all inverters in
cone_{s=0} and cone_{s=1}, denoted as P0 and P1, up to the boundary
cut. The locations that require an XOR2 replacement are included in the
result of P0 ∩ P1. This is why the inverters connected to gates a4 and
b4 do not require XOR2 insertion, since they keep the two cones
in the same isomorphism class (Figure 6). The inverter connected to b0
requires an XOR2 insertion, and it belongs to cone_{s=0}. Hence, an
XOR2 with extra input s̄ is inserted to replace i0 in Figure 6.
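The choice of XOR2 positions can be sketched as a small set computation. The text writes the result as P0 ∩ P1, while the a4/b4 discussion indicates that an inverter present at the same position in both cones needs no XOR2. The Python sketch below follows that discussion and assumes an XOR2 is needed exactly where an inverter appears in only one of the two cones; this is an interpretation, not something the text states, and the position labels are hypothetical.

```python
def xor_insertions(p0, p1):
    """Map each position needing an XOR2 to the extra input that drives it.

    p0: inverter positions in the cone selected by s = 0
    p1: inverter positions in the cone selected by s = 1
    Assumption: positions present in both sets keep their plain inverter.
    """
    plan = {}
    for position in sorted(p0 ^ p1):  # inverter in exactly one cone
        # An inverter only in the s = 0 cone must invert when s = 0, hence input "not s".
        plan[position] = "not s" if position in p0 else "s"
    return plan


# Hypothetical labels: position "i4" has an inverter in both cones, "i0" only in the s = 0 cone.
plan = xor_insertions({"i0", "i4"}, {"i4"})
```

Under this reading, the shared inverter at "i4" is left alone and the inverter at "i0" becomes an XOR2 driven by the complement of s, which agrees with the outcome described for Figure 6.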
IV. EXPERIMENTAL RESULTS
The approach proposed in Section 3 was implemented in C++
and integrated with the IBM logic synthesis flow [18], and further
evaluated with the IBM high-level synthesis flow and Place and Route
(P&R) flow. Our approach is performed before technology mapping
within the logic synthesis flow. The program was tested on a number
of datapath designs in SystemC. The datapath designs include large
arithmetic operators, such as 64-bit multipliers. All the experimental
results are collected at the end of the complete production design
flow. This demonstrates that our approach successfully overcomes
the limitations of the existing logic synthesis and high-level synthesis
techniques reviewed in Section 1. All of our experimental results
are obtained using a high-performance 14nm technology library. To
demonstrate the runtime improvement compared to the work of [7],
we examine the runtime using a set of designs, including multiplier
circuits of up to 64 bits. Our experiments were conducted on a machine
with Intel(R) Xeon CPU 7560 v6 2.20 GHz x32 with 4 TB memory.

(n-bit) Operators        | IBM flow      | IBM flow with AIG Opt | IBM flow with our approach
                         | Area    Lev   | Area     Lev          | Area    Lev
(64), A<B:A<C            | 2280    11    | 2124     13           | 1855    15
(64), A+B, A+C           | 10162   17    | 9333     15           | 5787    20
(64), A+B:A-C            | 8697    19    | 8104     25           | 8062    21
(64) A<B:A<=B            | 2464    12    | 2126     13           | 2198    12
*(64) A×B:A×C            | 182917  83    | 482811   211          | 91245   89
A×B/C[7:0]:A×B/C[15:8]   | 3626    26    | 5606     26           | 1760    27
(32) A×B+C:B×C+A         | 52943   58    | 108402   120          | 26709   58
(6) dec(A):dec(B)        | 1319    5     | 667      5            | 549     7
Average                  | 1       +0 lev | 1.106   +1.16 lev     | 0.658   +2 lev
TABLE I: Results of arithmetic test cases using the original IBM synthesis flow, the IBM synthesis flow with AIG optimization, and the original
IBM synthesis flow with the proposed approach. (*This design is not used for comparison.)

Benchmarks | Flow1            | Flow1 with our approach | Flow2            | Flow2 with our approach
           | Area    Delay    | Area    Delay           | Area    Delay    | Area    Delay
ibm1       | 3622    216.45   | 2223    255.81          | 4587    295.16   | 2235    255.81
ibm2       | 5454    314.84   | 3361    354.19          | 6879    432.90   | 3366    383.71
ibm3       | 9115    501.77   | 5526    501.77          | 11463   688.71   | 5610    541.13
ibm4       | 12782   678.87   | 7874    649.35          | 16047   924.84   | 7854    688.71
ibm5       | 18323   924.84   | 11121   787.10          | 22923   1200.32  | 12342   875.65
ibm6       | 27435   1170.81  | 16843   983.87          | 34383   1505.32  | 16803   1023.23
ibm7       | 31069   1288.87  | 19083   1082.26         | 38967   1603.71  | 19074   1082.26
Average    | 1       1        | 0.613   0.970           | 1       1        | 0.487   0.767
TABLE II: Evaluation of our approach in the complete production flow using industry designs in 14nm technology. Flow1 is the IBM
synthesis flow with AIG optimization; Flow2 is the original IBM synthesis flow.
Fig. 7: Evaluation of CPU runtime using designs with multipliers
compared to [7]. (x-axis: number of standard cells in multipliers,
10000 to 50000; y-axis: runtime of identifying common specification
logic, logarithmic scale from 0.01 to 10000; series: this work and [7].)
We first evaluate our approach using a set of arithmetic designs in
which two arithmetic operators are selected by control signals.
The results are shown in Table 1. The first column indicates the
bit-width of the arithmetic operators and the type of the two operators.
These designs are implemented in SystemC using "if-then-else"
statements. The second and third columns show the area and logic
level results produced by the original IBM synthesis flow. The fourth
and fifth columns show the results produced by the original flow with
combinational AIG optimization [10]. The last two columns show the
results produced by the original flow with our approach. The last row
shows the average gain or loss. Specifically, the area is reported
as a fraction of the area of the original flow, and
the change of logic level is measured in the number of levels. Based
on Table 1, we can see that: 1) our approach gives on average a
34% area reduction relative to the original flow, whereas the flow
with AIG optimization averages 10.6% more area than the original flow.
Note that the flows include complete high-level and logic-level
optimization techniques; and 2) our approach can handle large complex
arithmetic operators, such as datapaths with large multipliers. With
approximate isomorphism determination, we can optimize designs with
various combinations of two different operators.
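The area averages in the last row of Table 1 can be recomputed from the table entries. The Python sketch below assumes that the reported average is the arithmetic mean of the per-design area ratios relative to the original flow, with the starred 64-bit multiplier design excluded as the caption states; the dictionary and function names are illustrative.

```python
# Area per design: (original IBM flow, flow with AIG optimization, flow with our approach).
# The starred (64) AxB:AxC design is excluded, as noted in the table caption.
TABLE_1_AREA = {
    "(64) A<B:A<C": (2280, 2124, 1855),
    "(64) A+B, A+C": (10162, 9333, 5787),
    "(64) A+B:A-C": (8697, 8104, 8062),
    "(64) A<B:A<=B": (2464, 2126, 2198),
    "AxB/C[7:0]:AxB/C[15:8]": (3626, 5606, 1760),
    "(32) AxB+C:BxC+A": (52943, 108402, 26709),
    "(6) dec(A):dec(B)": (1319, 667, 549),
}


def mean_normalized_area(rows, column):
    """Mean over designs of area[column] / area[original flow]."""
    ratios = [areas[column] / areas[0] for areas in rows.values()]
    return sum(ratios) / len(ratios)


aig_average = mean_normalized_area(TABLE_1_AREA, 1)
ours_average = mean_normalized_area(TABLE_1_AREA, 2)
```

Rounded to three decimals, the two values reproduce the 1.106 and 0.658 entries of the table, so the 34% figure corresponds to 1 − 0.658 measured against the original flow.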
We then evaluate our approach using seven industrial designs
implemented in SystemC. Two synthesis flows are used for the
experiments: Flow1 is the IBM synthesis flow with AIG optimization;
Flow2 is the original IBM synthesis flow. The results are shown in
Table 2. The second and third columns show the results produced by
Flow1, and the fourth and fifth columns those produced by Flow1 with
our approach. The sixth and seventh columns show the results produced
by Flow2, and the last two columns those produced by Flow2 with our
approach. The average improvements in area and delay are given in
the last row. We can see that both area and delay are improved
in these experiments. Specifically, with Flow1 the area is reduced by
39% and the delay by 3% on average, while with Flow2 the area is
reduced by 51% and the delay by 23% on average.
Note that the delay improvements are not provided directly by our
approach. The delays are improved because our approach enables
other optimization techniques. Specifically, for those benchmarks, an
adder optimization technique [6] implemented in the IBM synthesis
flow is enabled after relocating the multiplexers and significantly
improves the delay.
TABLE III: Comparing the P&R results with multiplexer relocation
with the original flow.
Benchmarks | Route Length | Power | Worst-case delay
ibm1       | 0.73         | 0.45  | 0.95
ibm2       | 0.79         | 0.61  | 0.97
ibm4       | 0.92         | 0.71  | 1.06
ibm6       | 1.23         | 0.78  | 1.10
Additionally, we evaluate our approach using four designs, ibm1,
ibm2, ibm4, and ibm6, with placement and routing (P&R). The inputs of
the P&R process are the designs produced by Flow1 with AIG optimization
(4th and 5th columns in Table 2). The routing length, power and
worst-case delay are included in Table III. The area improvements
after placing the standard cells remain the same as shown in Table
2 at the same density. The P&R results of ibm2 are shown in Figure
8. We can see that, except for ibm6, the designs are improved
successfully using our approach without delay overhead. In particular,
we observe that the power is significantly improved compared to the
original designs. Moreover, we can see that the improvements of ibm4
and ibm6 gained after P&R are smaller than in the other two designs.
The possible reasons for that are: 1) there are large (≥32) fanout
signals generated by multiplexer relocation in those two designs; and
2) a large number of the extra multiplexers have been placed tightly,
which decreases routability.
Fig. 8: Comparing the P&R results using design ibm2 with and without our approach.
(a) P&R result of design ibm2 without multiplexer relocation.
(b) P&R result of design ibm2 with multiplexer relocation.
We did not compare our approach to the work of [7] in the
experiments shown in Table 1 and Table 2 for the following reasons: 1)
that algorithm could not be successfully applied to all of the designs
within eight hours; and 2) for the designs on which the algorithm runs
successfully, the results are worse, e.g., the 3rd and 4th designs in
Table 1. To demonstrate that our approach significantly improves the
CPU runtime compared to the existing algorithm for datapaths
with multipliers, experimental results are provided in Figure 7.
The designs used for the experimental results shown in Figure 7 vary
from 4-bit to 64-bit. In Figure 7, the x-axis represents the number
of standard cells in the design, and the y-axis represents the CPU
runtime of the multiplexer relocation algorithm in logarithmic scale.
It is clear that our algorithm performs much faster than the AIG-based
algorithm [7].
V. CONCLUSION
This paper presents an advanced DAG-based algorithm that targets
area minimization using logic-level resource sharing. The common
specification logic identification is formulated as an unweighted graph
isomorphism problem. In addition, an approximate isomorphism
algorithm is proposed in this paper to identify extra common logic.
The proposed approach is shown to significantly reduce
area, and potentially reduce delay, on industrial designs within a
complete design flow. The runtime has been reduced from exponential
to linear compared to the existing algorithms. Future work will focus
on improving the function that identifies common specification logic.
REFERENCES
[1] L. Stok, “Data path synthesis,” Integration, the VLSI journal, vol. 18,
no. 1, pp. 1–71, 1994.
[2] G. D. Micheli, Synthesis and Optimization of Digital Circuits. McGraw-
Hill Higher Education, 1994.
[3] M. Potkonjak and J. Rabaey, “Optimizing Resource Utilization using
Transformations,” Computer-Aided Design of Integrated Circuits and
Systems, IEEE Transactions on, vol. 13, no. 3, pp. 277–292, 1994.
[4] M. B. Srivastava and M. Potkonjak, “Optimum and Heuristic Trans-
formation Techniques for Simultaneous Optimization of Latency and
Throughput,” Very Large Scale Integration (VLSI) Systems, IEEE Trans-
actions on, vol. 3, no. 1, pp. 2–19, 1995.
[5] J. Cong and J. Xu, “Simultaneous FU and Register Binding-based on
Network Flow Method,” in Design, Automation and Test in Europe,
2008. DATE’08. IEEE, 2008, pp. 1057–1062.
[6] S. Roy, M. Choudhury, R. Puri, and D. Z. Pan, “Towards optimal
performance-area trade-off in adders by synthesis of parallel prefix
structures,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 33, no. 10, pp. 1517–1530, 2014.
[7] C. Yu, M. J. Ciesielski, M. Choudhury, and A. Sullivan, “Dag-aware
logic synthesis of datapaths,” in Proceedings of the 53rd Annual Design
Automation Conference, DAC 2016, Austin, TX, USA, June 5-9, 2016,
2016, pp. 135:1–135:6.
[8] A. Mishchenko, S. Chatterjee, and R. Brayton, “DAG-aware AIG
Rewriting: A Fresh Look at Combinational Logic Synthesis,” in 43rd
DAC. ACM, 2006, pp. 532–535.
[9] S. Umeyama, “An eigendecomposition approach to weighted graph
matching problems,” IEEE transactions on pattern analysis and machine
intelligence, vol. 10, no. 5, pp. 695–703, 1988.
[10] A. Mishchenko et al., “ABC: A System for Sequential Synthesis and
Verification,” URL http://www.eecs.berkeley.edu/~alanmi/abc, 2010.
[11] E. Goldberg, “Equivalence Checking of Dissimilar Circuits II,”
Tech. Rep., 2004.
[12] R. E. Bryant, “Graph-based Algorithms for Boolean Function Manipu-
lation,” Computers, IEEE Transactions on, vol. 100, no. 8, pp. 677–691,
1986.
[13] A. Kuehlmann and F. Krohm, “Equivalence checking using cuts and
heaps,” in Proceedings of the 34th annual Design Automation Confer-
ence. ACM, 1997, pp. 263–268.
[14] E. Goldberg, M. Prasad, and R. Brayton, “Using sat for combinational
equivalence checking,” in Proceedings of the conference on Design,
automation and test in Europe. IEEE Press, 2001, pp. 114–121.
[15] M. Soeken, B. Sterin, R. Drechsler, and R. Brayton, “Simulation graphs
for reverse engineering,” in Proceedings of 15th FMCAD. FMCAD,
2015, pp. 152–159.
[16] W. Li, H. Saidi, H. Sanchez, M. Schäf, and P. Schweitzer, “Detecting
similar programs via the Weisfeiler-Leman graph kernel,” in International
Conference on Software Reuse. Springer, 2016, pp. 315–330.
[17] S. Mitra, L. J. Avra, and E. J. McCluskey, “Efficient multiplexer
synthesis techniques,” IEEE Design & Test of Computers, vol. 17, no. 4,
pp. 90–97, 2000.
[18] L. Stok, D. Kung, et al., “BooleDozer: Logic Synthesis for ASICs,”
IBM Journal of Research and Development, vol. 40, no. 4, pp. 407–430,
1996.
