Developing Synthesis Flows without Human Knowledge by Yu, Cunxi et al.
Developing Synthesis Flows Without Human Knowledge
Cunxi Yu
Integrated Systems Laboratory, EPFL
Lausanne, Switzerland
cunxi.yu@epfl.ch
Houping Xiao
SUNY Buffalo
Buffalo, NY, USA
houpingx@buffalo.edu
Giovanni De Micheli
Integrated Systems Laboratory, EPFL
Lausanne, Switzerland
giovanni.demicheli@epfl.ch
ABSTRACT
Design flows are explicit combinations of design transformations, primarily
involved in synthesis, placement, and routing, that accomplish the design of
Integrated Circuits (ICs) and Systems-on-Chip (SoCs). Most flows are developed
based on expert knowledge. However, due to the large search space of design
flows and increasing design complexity, developing Intellectual Property
(IP)-specific synthesis flows that provide high Quality of Result (QoR) is
extremely challenging. This work presents a fully autonomous framework that
artificially produces design-specific synthesis flows without human guidance
or baseline flows, using a Convolutional Neural Network (CNN). The framework is
demonstrated by successfully designing logic synthesis flows for three
large-scale designs.
1 INTRODUCTION
Electronic Design Automation (EDA) involves a diverse set of software
algorithms and applications that are required for the design of complex
electronic systems. Given the deep design challenges that designers face,
developing high-quality and efficient design flows is crucial. A
well-developed design flow can reduce time-to-market by enabling
manufacturability and addressing timing closure, power consumption, etc. In
general, EDA vendors provide reference design flows along with their EDA
tools. However, such design flows may not perform well for many designs.
There are two major reasons. First, the performance of a design flow varies
with the Intellectual Property (IP) of the design. To achieve the design
objectives, design flows need to be customized for the given IP. Such flows
are called IP-specific or design-specific flows. This becomes more important
as new types of designs emerge, e.g., design methods for neuromorphic chips
[1]. Second, design flows are mostly developed by EDA developers and users
based on their knowledge and experience, with many testing iterations and
intensive supervision. However, due to the large number of available flows,
finding the best design flows in the entire search space by human testing is
impossible. It is particularly difficult to find the best flows for recently
developed transformations [2]. For example, given 50 synthesis
transformations that can each be processed independently, the total number of
available design flows is $50! \approx 3 \cdot 10^{64}$. The
DAC ’18, June 24–29, 2018, San Francisco, CA, USA
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5700-5/18/06. . . $15.00
https://doi.org/10.1145/3195970.3196026
search space of general flows is formally defined in Section 2.1.
Despite the significant effort spent on providing high-quality design flows,
techniques that systematically generate IP-specific synthesis flows have been
lagging. Similar problems exist in designing Systems-on-Chip (SoCs). In
Section 2 (Figure 1), two motivating examples show the need for such a
technique.
Design flows are considered iterative flows, since the transformations are
applied to the designs iteratively. Machine learning techniques have been
leveraged in flow optimization, such as iterative flow optimization for
compilers using Markov chains [3]. Regarding synthesis flow optimization, Liu
et al. recently introduced an area optimization approach for Look-up-table
(LUT) mapping, in which the logic transformations are guided using the Markov
Chain Monte Carlo (MCMC) method [4]. However, a Markov chain model is not
sufficient for autonomously designing synthesis flows. The main reason is that
a synthesis transformation may not affect the next transformation but may
affect a transformation several iterations later, which violates the Markov
property [5]. In this work, we formulate the problem of artificially
developing synthesis flows as a Multiclass Classification problem and solve it
using deep learning [6]. Deep learning has shown considerable success in tasks
like image recognition [7] and natural language processing [8]. Several
advances mitigate the deficiencies of traditional multilayer perceptrons
(MLPs), e.g., CNNs have made it possible to robustly and automatically extract
learned features, and over-fitting is mitigated in fully connected layers
using the random regularization called dropout [9].
Specifically, this paper includes the following contributions: a) The search
space of artificially developing synthesis flows is formally defined in
Section 2. b) We introduce a flow-classification model (Section 3.1) combined
with the one-hot modeling of flows (Section 3.2), such that the problem can be
modeled as a Multiclass Classification problem. c) We develop a fully
autonomous framework for developing synthesis flows based on a Convolutional
Neural Network (CNN). This framework takes HDL as input and outputs two sets
of synthesis flows, namely angel-flows and devil-flows, that provide the best
and worst QoRs, respectively¹. d) Our framework is demonstrated by
successfully developing delay-driven and area-driven angel/devil-flows for a
64-bit Montgomery multiplier, a 128-bit AES core, and a 64-bit ALU.
Evaluations of the CNN architecture and training process for classifying
synthesis flows are also provided. e) The datasets and demos are released
publicly².
¹ Devil-flows could provide information for improving the synthesis
transformations.
² https://github.com/ycunxi/FLowGen-CNNs-DAC18.git
[Figure 1 panels: (a) 2-D QoR distribution of AES; (b) 3-D QoR distribution of AES; (c) 2-D QoR distribution of ALU; (d) 3-D QoR distribution of ALU. Axes: area (μm²) and delay (ps); the 3-D panels add the number of designs.]
Figure 1: Delay and area results of the 50,000 random ABC synthesis flows of 128-bit AES core and 64-bit ALU.
2 BACKGROUND
2.1 Notations and Search Space
Definition 1 Non-repetition Synthesis Flow: Given a set of unique synthesis
transformations S={p0, p1, ..., pn}, a synthesis flow F is a permutation of
pi ∈ S performed iteratively.
Example 1: Let S={p0, p1, p2}, where the pi are transformations in the
synthesis tool that can be processed independently. Then there are six flows
available in total:
F0: p0 → p1 → p2    F1: p0 → p2 → p1
F2: p1 → p0 → p2    F3: p1 → p2 → p0
F4: p2 → p0 → p1    F5: p2 → p1 → p0
Remark 1: Let N be the number of all available flows, where S includes n
elements. Then N ≤ n!.
The upper bound of N is reached iff all elements in S can be processed
independently. In practice, some constraints may have to be satisfied when
processing these transformations. In this case, N will be smaller than n!.
For example, given the constraint that p1 has to be processed before p2, the
available flows include only F0, F2, and F3.
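Example 1 and the constrained case in Remark 1 can be reproduced by brute-force enumeration. The following is a minimal sketch; the function name `enumerate_flows` and the `before` argument (a list of precedence pairs) are illustrative assumptions, not part of the original framework.

```python
from itertools import permutations


def enumerate_flows(transformations, before=()):
    """Return all non-repetition flows that satisfy the precedence pairs.

    `before` is a hypothetical list of (a, b) pairs meaning that
    transformation a must be processed before transformation b.
    """
    flows = []
    for flow in permutations(transformations):
        position = {p: i for i, p in enumerate(flow)}
        if all(position[a] < position[b] for a, b in before):
            flows.append(flow)
    return flows


S = ["p0", "p1", "p2"]
all_flows = enumerate_flows(S)                            # n! = 6 flows
constrained = enumerate_flows(S, before=[("p1", "p2")])   # p1 before p2

for flow in constrained:
    print(" -> ".join(flow))
```

Enumeration is only practical for very small S; it is shown here to make the definitions concrete, not as a search strategy.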
Definition 2 m-repetition Synthesis Flow (m ≥ 2): Given a set of unique
synthesis transformations S={p0, p1, ..., pn}, a synthesis flow with
m-repetition, Fm, is a permutation of pi ∈ S^m, where S^m contains m copies
of S.
Example 2: Let S={p0, p1}. Each pi can be processed independently. For
developing 2-repetition synthesis flows, S^2={p0, p1, p0, p1}. The available
flows include:
F0: p0 → p0 → p1 → p1    F1: p1 → p1 → p0 → p0
F2: p0 → p1 → p0 → p1    F3: p1 → p0 → p1 → p0
F4: p0 → p1 → p1 → p0    F5: p1 → p0 → p0 → p1
Remark 2: Let L be the length of a synthesis flow. Given an m-repetition flow
Fm with n transformations in S, L = n × m.
Remark 3: Let the function f(n, L, m) be the number of available
m-repetition flows with n elements in S. f(n, L, m) uniquely satisfies the
following recursive formula:
$$f(n, L+1, m) = n\,f(n, L, m) - n \binom{L}{m} f(n-1, L-m, m)$$
Counting the available m-repetition flows with n synthesis transformations is
the same as counting L-permutations of n objects with limited repetition. The
proof of the recursive formula is similar to [10] and is not included in this
paper. The bounds are $n! < f(n, L, m) < n^L$. We can see that f(n, L, m)
becomes dramatically larger than n! (non-repetition flows) as m increases.
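The recursive formula of Remark 3 can be evaluated directly. The sketch below assumes the base cases f(n, 0, m) = 1 (the empty flow) and f(n, L, m) = 0 for n ≤ 0 and L > 0, which the text does not state. The closed form in `full_length_flows` is the standard multinomial count for flows in which every transformation appears exactly m times, and is included only as a cross-check.

```python
from functools import lru_cache
from math import comb, factorial


@lru_cache(maxsize=None)
def count_flows(n, length, m):
    """Count sequences of the given length over n symbols, each used at most m times."""
    if length == 0:
        return 1  # assumed base case: the empty flow
    if n <= 0:
        return 0  # assumed base case: no symbols left to place
    prev = length - 1
    total = n * count_flows(n, prev, m)
    if prev >= m:
        # Remove the sequences whose last symbol would appear m + 1 times.
        total -= n * comb(prev, m) * count_flows(n - 1, prev - m, m)
    return total


def full_length_flows(n, m):
    """Closed form for L = n * m: every symbol appears exactly m times."""
    return factorial(n * m) // factorial(m) ** n


# Example 2: n = 2, m = 2, L = 4 yields the six flows F0..F5.
print(count_flows(2, 4, 2), full_length_flows(2, 2))
```

Memoization keeps the recursion cheap, since each (n, L, m) triple is evaluated once.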
2.2 Motivating Example
We provide two motivating examples, shown in Figure 1, using the open-source
logic synthesis framework ABC [11]. The setups are as follows:
• S={balance, restructure, rewrite, refactor, rewrite -z, refactor -z} (n=6);
the elements in S are logic transformations in ABC³ that can be processed
independently.
• 50,000 unique 4-repetition flows are generated by random permutations of
S^4 (m=4, n=6, L=24).
• Input designs: 128-bit Advanced Encryption Standard (AES) core and 64-bit
ALU, taken from OpenCores [12].
• Delay and area of these flows are obtained after technology mapping using a
14nm standard-cell library.
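The random generation of unique 4-repetition flows described in the setup can be sketched as follows. The function names and the fixed seed are illustrative assumptions. The sketch only produces the flows and does not invoke ABC.

```python
import random

TRANSFORMATIONS = ["balance", "restructure", "rewrite",
                   "refactor", "rewrite -z", "refactor -z"]


def random_flow(transformations, m, rng):
    """Shuffle m copies of every transformation into one m-repetition flow."""
    flow = list(transformations) * m
    rng.shuffle(flow)
    return tuple(flow)


def sample_unique_flows(transformations, m, count, seed=0):
    """Draw `count` distinct random m-repetition flows (seed is an assumption)."""
    rng = random.Random(seed)
    flows = set()
    while len(flows) < count:
        flows.add(random_flow(transformations, m, rng))
    return sorted(flows)


flows = sample_unique_flows(TRANSFORMATIONS, m=4, count=5)
print("; ".join(flows[0]))  # one flow written as a command sequence
```

Collecting the flows in a set guarantees uniqueness, and with a search space of this size collisions are rare.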
The QoR distributions of the AES and ALU designs using the 50,000 random
flows are shown in Figure 1-(a, b) and (c, d). Several important observations
based on Figure 1 show the main motivations of this work:
• Given the same set of synthesis transformations, the QoR differs widely
between flows. For example, the delay and area of the AES design produced by
the 50,000 flows differ by up to 40% and 90%, respectively.
• The search space of the synthesis flows is large. According to Remark 3, the
total number of available 4-repetition flows with n = 6 independent synthesis
transformations is more than $10^{16}$. Discovering high-quality synthesis
flows in the entire search space by human testing is unlikely to succeed.
• The same set of flows performs differently on different designs. For
example, in Figure 1, the QoR distributions of AES and ALU are statistically
different. This means that the high-quality flows for the AES design could
perform poorly for the ALU. Therefore, synthesis flows need to be customized
for the specific IP or design.
3 APPROACH
3.1 Overview
This section presents our framework that artificially develops synthesis
flows for a given design. Our framework takes the HDL as input and outputs
two sets of synthesis flows, namely angel-flows and devil-flows, which provide
the best and worst QoR according to the design objectives. This problem is
formulated as Multiclass Classification and solved using a CNN classifier. The
main idea of our approach is to train a CNN classifier with a small set of labeled
³ The names of these transformations are the same as the commands in ABC.
[Figure 2 diagram. Blocks: HDL, training flows, synthesis tool, training dataset (updated every 500 flows), CNN classifier, sample flows, angel-flows and devil-flows. Device annotations: CPU, CPU/GPU, CPU/GPU. Steps are numbered (1) to (3).]
Figure 2: Overview of the proposed framework, performing in sequence (1) → (2) → (3).
random flows. The classes (or labels) of the synthesis flows are assigned
based on one or multiple QoR metrics, such as delay, area, and power. The
trained classifier is used to predict the classes of a large number of
unlabeled random flows. Finally, angel-flows and devil-flows are generated by
sorting the prediction confidence, i.e., the probability of being in a certain
class (Section 3.3). This framework is a generic model for designing synthesis
flows in many stages, such as high-level synthesis and logic synthesis. The
demonstration is made by designing logic synthesis flows using ABC [11], as
shown in Section 4. The flow of our framework is shown in Figure 2 and
includes three main components:
(1) Generate training datasets. In this work, the training dataset is a set
of labeled synthesis flows, namely training flows. However, the training flows
are originally unlabeled. The first step of our approach is labeling a set of
random flows. This requires applying these synthesis flows to the input design
and collecting the QoR result at the end of each flow. Note that applying a
synthesis flow to a large design can be time-consuming. Hence, our framework
operates incrementally. The CNN training (component (2)) starts after 1000
labeled flows have been collected, and the CNN is re-trained after every 500
newly collected labeled flows. In this way, our framework can produce
intermediate results during the training process.
These flows are labeled according to the classification model shown in Table
1. This model can be changed according to the design objectives, using either
a single-metric or a multi-metric model. For example, if the design objective
is area optimization, a single-metric model is selected where r is the area
metric. If the design objective is minimizing delay within a given area
budget, a multi-metric model is selected. Note that the number of classes
(n+1) is a fixed input of the proposed framework, and the definition (QoR
range) of each class is decided using a general model. For example, defining
seven classes (n=6) in a single-metric model requires six determinators,
{x0, x1, ..., x5}. We define the six determinators using the
{5%, 15%, 40%, 65%, 90%, 95%} QoR results of the collected labeled synthesis
flows. For example, assuming 1000 labeled flows have been collected, x0 is
the 50th smallest value of the selected metric and x5 is the 50th largest
value. Since the training dataset is updated incrementally, the definitions of
the classes may change dynamically. Angel-flows and devil-flows are subsets of
the flows corresponding to classes 0 and n.
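The single-metric labeling rule can be sketched as follows. This is one plausible reading of the percentile rule, not the authors' implementation: the exact rank chosen for each percentage and the handling of ties are assumptions, and `determinators` and `label` are hypothetical helper names.

```python
from bisect import bisect_left

PERCENTS = (5, 15, 40, 65, 90, 95)  # percentages listed in the text


def determinators(qor_values, percents=PERCENTS):
    """Pick class boundaries as order statistics of the labeled QoR values."""
    ordered = sorted(qor_values)
    count = len(ordered)
    # Assumed rank rule: p% of `count`, 1-based, so 5% of 1000 is the 50th smallest.
    return [ordered[max(count * p // 100 - 1, 0)] for p in percents]


def label(qor, bounds):
    """Class 0 if qor <= bounds[0]; class k if bounds[k-1] < qor <= bounds[k];
    the last class if qor > bounds[-1]."""
    return bisect_left(bounds, qor)


# Synthetic QoR values 1..1000 stand in for measured area or delay results.
qor_results = list(range(1, 1001))
bounds = determinators(qor_results)
labels = [label(q, bounds) for q in qor_results]
print(bounds, labels.count(0), labels.count(6))
```

Because the boundaries are recomputed from the collected flows, calling `determinators` again after each batch of newly labeled flows reflects the dynamically changing class definitions.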
(2) Design and train CNN Classifier. The second component is training a CNN
classifier that predicts the classes of unlabeled flows. To train a CNN
classifier, the training data, i.e., labeled synthesis flows, need to be
represented as matrices. We present a
Table 1: Labeling the training flows based on synthesis QoR. r and ri are the
QoR metrics, such as delay, area, and power.
Single-metric      Multi-metric                          Class/Label
r ≤ x0             r0 ≤ x0, r1 ≤ y0                      0
x0 < r ≤ x1        x0 < r0 ≤ x1, y0 < r1 ≤ y1            1
x1 < r ≤ x2        x1 < r0 ≤ x2, y1 < r1 ≤ y2            2
...                ...                                   ...
r > x_{n-1}        r0 > x_{n-1}, r1 > y_{n-1}            n
one-hot modeling that represents a synthesis flow as a binary matrix. This
model and the CNN architecture are introduced in Section 3.2.
(3) Output Angel-flows and Devil-flows. The trained classifier is used to
predict the classes of a large number of untested sample flows. Although we
are only interested in the flows in classes 0 and n, the classifier may label
many flows with these two classes. However, from the synthesis perspective,
selecting a small set of flows is sufficient. In this work, the angel-flows
and devil-flows are selected from the flows labeled 0 and n with the highest
prediction confidence. The details are included in Section 3.3.
3.2 CNN Classifier
3.2.1 One-hot Representation of Synthesis Flow. In this section, the one-hot
representation model of a synthesis flow is introduced for m-repetition
flows. Non-repetition flows can be represented using the same model.
Let M be the binary matrix of an m-repetition flow F with S={p0, p1, ..., pn}
(see Definition 2). The number of transformations in F equals the length of
the flow, L = n × m (see Remark 2). Let the jth synthesis transformation in F
be pi, with j ≤ L and i ≤ n. Its n-by-1 binary vector representation is Vj,
where the ith element is 1 and the other elements are 0. M is an L-by-n matrix
whose jth row is the transpose of Vj.
Example 3: We illustrate the one-hot representation model using flow F0 shown
in Example 2, such that S={p0, p1} and F = p0 → p0 → p1 → p1. M is a 4-by-2
matrix:
$$M = \begin{bmatrix} 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{bmatrix}$$
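The one-hot model can be written in a few lines. The sketch below is illustrative (the function name `one_hot_flow` is an assumption) and reproduces the matrix of Example 3 using plain Python lists.

```python
def one_hot_flow(flow, transformations):
    """Return the L-by-n binary matrix of a flow as a list of rows."""
    index = {p: i for i, p in enumerate(transformations)}
    matrix = []
    for step in flow:
        row = [0] * len(transformations)
        row[index[step]] = 1  # exactly one non-zero element per row
        matrix.append(row)
    return matrix


S = ["p0", "p1"]
F0 = ["p0", "p0", "p1", "p1"]
M = one_hot_flow(F0, S)
for row in M:
    print(row)
```

Each row carries a single 1, which is the property the later discussion of kernel sizes relies on.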
[Figure 3 diagram: flow matrix M (L-by-n) → Conv (3x6) → Max Pool (2x2) → Conv (3x6) → Max Pool (2x2) → Local → Dense → Dropout.]
Figure 3: CNN architecture used for synthesis flow classification.
3.2.2 CNN Architecture and Training. The inputs of the CNN are L-by-n binary
matrices representing the synthesis flows. The CNN includes convolution,
pooling, locally connected, dense, and dropout layers. The kernel sizes of the
convolutional and pooling layers are shown in Figure 3. The dropout rate in
the dropout layer is 0.4 to prevent over-fitting [9]. Since our inputs are in
one-hot representation, the loss function is computed using the sparse softmax
cross entropy function. The output of the network comes from the softmax
function. The number of kernels (filters) in each convolutional layer is 200.
The stride size of the convolutional and pooling layers is 1 × 1.
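To make the loss concrete, the following sketch computes softmax probabilities and the sparse softmax cross entropy for a single example, where "sparse" means the target is an integer class index rather than a one-hot vector. It is a plain-Python illustration that is independent of TensorFlow; the logit values are made up.

```python
import math


def softmax(logits):
    """Convert raw scores into class probabilities (numerically stable)."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_softmax_cross_entropy(logits, label):
    """Loss for one example whose target is an integer class index."""
    return -math.log(softmax(logits)[label])


# Hypothetical network outputs for one flow with seven classes.
logits = [2.0, 0.5, 0.1, -1.0, -0.3, 0.4, -2.0]
probs = softmax(logits)
loss = sparse_softmax_cross_entropy(logits, label=0)
print(probs.index(max(probs)), round(loss, 4))
```

The loss is small when the probability assigned to the true class is close to 1 and grows as that probability shrinks.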
Regarding the CNN architecture, two parameters have significant impacts on
the prediction performance (accuracy): a) the kernel size of the
convolutional layers and b) the activation functions of the convolutional and
dense layers. Unlike in most CNN classification applications, an n-by-n
kernel does not perform well in classifying synthesis flows. We use an
n × 2n kernel in this work. The reason is that there is only one non-zero
element in each row of M. Using an n × 2n kernel can avoid computations over
all-zero sub-matrices. The results of comparing the accuracy of the CNN
classifier using 3×6, 6×6, and 6×12 kernels are shown in Section 4, Figure 6.
The activation function of a node in the neural network defines the output
of the node for a given set of inputs. In artificial neural networks, this
function is also called the transfer function. The activation operations
should provide different types of nonlinearities in the neural network to
solve Multiclass Classification problems. In general, there are two types of
activation functions: smooth nonlinear functions, such as Sigmoid, Tanh,
Exponential Linear Units (ELU) [13], and Scaled Exponential Linear Units
(SELU) [14], and non-smooth continuous functions, such as the Rectified Linear
Unit (ReLU) [15] and Concatenated Rectified Linear Units (CReLU) [16]. We find
that for classifying synthesis flows, the smooth nonlinear activation
functions, such as SELU and Tanh, perform better. The activation functions
ReLU, ReLU6, ELU, SELU, Softplus, Softsign, Sigmoid, and Tanh are compared in
Section 4, Figure 7.
Regarding the training process, the CNN classifier is trained specifically
for each design, as described in Section 3.1. Since the training data are
collected incrementally, the CNN is re-trained after every 500 newly collected
data points. The Mini-Batch [17] training strategy is applied in this work
with batch size 5, i.e., five training examples are evaluated simultaneously
in each iteration. In this work, we have evaluated five different gradient
descent algorithms: Stochastic gradient descent (SGD), Momentum [18], AdaGrad
[19], RMSProp [20], and Follow the regularised leader (FTRL) [21]. The
comparison result is included in Section 4, Figures 4 and 5.
3.3 Angel-Flows and Devil-Flows
In this work, the outputs of the proposed framework are 200 angel-flows and
200 devil-flows. There are two steps for generating these flows. First, the
trained CNN classifier is used to predict the classes of a large number of
random flows. According to the classification rule (Table 1), the angel-flows
and devil-flows are selected from the class-0 flows and class-n flows. The
predicted class of a random flow is the class with the highest probability in
the softmax output of the CNN classifier. For example, assuming the output of
the classifier (# classes = 7) is {p0 = 0.47, p1 = 0.13, p2 = 0.22, p3 = 0.02,
p4 = 0.03, p5 = 0.12, p6 = 0.01}, where pi is the probability of a flow being
in class-i, the predicted class is class-0. Second, to minimize the errors in
selecting the angel (devil) flows, our framework selects the flows with the
highest p0 (pn) among the class-0 (class-n) flows.
Example 4: Let the prediction results in Table 2 be the outputs of the CNN
classifier for five synthesis flows. If two angel-flows are required, F0 and
F1 are selected and F4 is eliminated.
Table 2: Example of finalizing the angel-flows.
Flow  p0    p1    p2    p3    p4    p5    p6
F0    0.47  0.13  0.22  0.02  0.03  0.12  0.01
F1    0.51  0.12  0.01  0.09  0.17  0.08  0.02
F2    0.02  0.45  0.14  0.12  0.11  0.10  0.06
F3    0.12  0.03  0.17  0.62  0.01  0.02  0.03
F4    0.35  0.23  0.09  0.02  0.13  0.17  0.01
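The two-step selection can be sketched as follows, using the probabilities of Table 2. The function name `select_flows` is an illustrative assumption: a flow is a candidate only if the target class has its highest probability, and candidates are ranked by that probability.

```python
PREDICTIONS = {
    "F0": [0.47, 0.13, 0.22, 0.02, 0.03, 0.12, 0.01],
    "F1": [0.51, 0.12, 0.01, 0.09, 0.17, 0.08, 0.02],
    "F2": [0.02, 0.45, 0.14, 0.12, 0.11, 0.10, 0.06],
    "F3": [0.12, 0.03, 0.17, 0.62, 0.01, 0.02, 0.03],
    "F4": [0.35, 0.23, 0.09, 0.02, 0.13, 0.17, 0.01],
}


def predicted_class(probs):
    """Index of the highest softmax probability."""
    return max(range(len(probs)), key=probs.__getitem__)


def select_flows(predictions, target_class, count):
    """Keep flows predicted as target_class, ranked by confidence."""
    candidates = [
        (probs[target_class], name)
        for name, probs in predictions.items()
        if predicted_class(probs) == target_class
    ]
    candidates.sort(reverse=True)
    return [name for _, name in candidates[:count]]


angel = select_flows(PREDICTIONS, target_class=0, count=2)
devil = select_flows(PREDICTIONS, target_class=6, count=2)
print(angel, devil)  # ['F1', 'F0'] []
```

F4 is also predicted as class-0 but has the lowest p0 of the three candidates, so it is dropped when only two angel-flows are requested. No flow in Table 2 is predicted as class-6, so the devil selection is empty.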
4 EXPERIMENTAL RESULTS
We demonstrate the proposed framework by designing logic synthesis flows for
the open-source synthesis framework ABC [11]. Our framework is implemented in
C++. The CNN classifier is implemented using TensorFlow r1.3 [22] through its
C++ API. The demonstration is made with three designs: a 64-bit Montgomery
multiplier, a 128-bit AES core [12], and a 64-bit ALU [12]. The goal is to
generate 200 angel-flows and 200 devil-flows for area or delay optimization.
We use the same setups as in the motivating example (Section 2.2). Thus, the
synthesis flows are 4-repetition flows with six ABC synthesis
transformations, S={balance, restructure, rewrite, refactor, rewrite -z,
refactor -z}. The inputs of the CNN classifier are 24-by-6 matrices
representing the synthesis flows using the one-hot modeling. These matrices
are reshaped to 12-by-12 matrices for use with two convolutional layers.
For generating the area- or delay-driven flows, we use the single-metric
classification model (Table 1), where r is the area/delay of the design. The
number of classes is seven. The six determinators are defined using
{5%, 15%, 40%, 65%, 90%, 95%} of the area/delay results of the training flows.
The area and delay results are obtained after technology mapping with a 14nm
standard-cell library. The number of training flows is 10,000 and the number
of sample flows for generating the final flows is 100,000. The experimental
results are obtained using a machine with Intel Xeon 2x12 cores @ 2.5 GHz,
256GB RAM, 2x240GB SSD, and 2 Nvidia Titan X GPUs.
The result section includes two parts. The first part contains the
experimental results of training the CNN classifier. It consists of the
evaluations of different gradient descent algorithms, various convolutional
kernel sizes, and activation functions. Based on these results, we find the
best settings for the CNN architecture and training strategy. In the second
part, using these settings, we generate the angel-flows and devil-flows and
evaluate their quality. To evaluate the accuracy of the CNN classifier and the
generated flows, we have explicitly collected the area and delay results by
applying the 100,000 flows to the three designs. Hence, the true classes of
the 100,000 sample flows are available for evaluation.
4.1 Results of Training CNN Classifier
4.1.1 Gradient Descent Algorithms. The results of training the CNN classifier
using different gradient descent algorithms are shown in Figures 4 and 5.
Figure 4 includes the results of training for generating area-driven flows
using five different algorithms: Stochastic gradient descent (SGD), Momentum
[18], AdaGrad [19], RMSProp [20], and Follow the regularised leader (FTRL)
[21]. Figure 5 includes the results for generating delay-driven flows. The
learning rate is 0.0001 and the number of training steps is 100,000. The
kernel size of the convolutional layers is 6-by-12. In Figures 4 and 5, the
y-axis represents the accuracy of prediction. Let $N_{angel}$ be the number of
generated angel-flows whose true class is class-0; let $N_{devil}$ be the
number of generated devil-flows whose true class
[Figure 4 panels: (a) Montgomery Multiplier; (b) AES Core; (c) ALU. Each panel plots accuracy against training time (hours) for SGD, Momentum, AdaGrad, RMSProp, and Ftrl.]
Figure 4: Evaluation of different gradient descent algorithms for generating the area-driven angel/devil flows.
[Figure 5 panels: (a) Montgomery Multiplier; (b) AES Core; (c) ALU. Each panel plots accuracy against training time (hours) for SGD, Momentum, AdaGrad, RMSProp, and Ftrl.]
Figure 5: Evaluation of different gradient descent algorithms for generating the delay-driven angel/devil flows.
is class-6. The accuracy is defined as follows:
$$\mathit{accuracy} = \{(N_{angel} + N_{devil}) / 2 \times 200\} / 2$$
The x-axis represents the training time of our framework. Note that the
training process of the 64-bit Montgomery multiplier is 2× faster than that of
the other two designs. The reason is that collecting the training dataset
takes most of the runtime, and applying one synthesis flow to the Montgomery
multiplier is about 2× faster than applying it to the other two designs. The
actual runtime for training the CNN classifier is about 3-5% of the entire
training time. As shown in Figures 4 and 5, RMSProp [20] outperforms the other
algorithms in classifying synthesis flows. The accuracy of the classifier in
these six experiments reaches 95% after 24 hours.
4.1.2 Choice of Convolutional Kernel Size. As mentioned in Section 3.2, the
kernel size of the convolutional layers has a significant impact on the CNN
classifier. In Figure 6, three kernel sizes, 3×6, 6×6, and 6×12, have been
tested using the RMSProp algorithm [20], where the learning rate is 0.0001 and
the number of training steps is 100,000. The number of kernels at each
convolutional layer is 200. The input design is the 128-bit AES core, and the
objective is generating delay-driven flows. We can see that the kernels of
size n×2n (3×6, 6×12) perform much better than the n×n kernel (6×6).
[Figure 6 plot: accuracy against training time (hours) for Kernel 3-by-6, Kernel 6-by-6, and Kernel 6-by-12.]
Figure 6: Evaluation of three convolutional kernels. Test case: generating delay-driven flows for the 128-bit AES core.
4.1.3 Evaluation of Activation Functions. For evaluating the performance of
classifying synthesis flows using different activation functions, we set the
learning rate to 0.0001, the number of learning steps to 100,000, and the
convolutional kernel size to 6×12, and use RMSProp to minimize the loss
function. Figure 7 includes the comparison of eight different activation
functions: ReLU, ReLU6, ELU [13], SELU [14], Softplus, Softsign, Sigmoid, and
Tanh. We can see that the ELU, SELU, Softsign, and Tanh functions outperform
the others, and SELU offers the best accuracy for generating delay-driven
flows for the 128-bit AES core. Note that the accuracy of different activation
functions varies across datasets. In this work, SELU provides the most
reliable performance.
[Figure 7 plot: accuracy for each activation function: ReLU, ReLU6, ELU, SELU, Softplus, Softsign, Sigmoid, Tanh.]
Figure 7: Evaluation of different activation functions. Test case: generating delay-driven flows for the 128-bit AES core.
4.2 Quality of Generated Flows
Finally, we evaluate the quality of the generated angel-flows and
devil-flows. The results shown in Figure 8 are obtained using the following
settings: number of training flows is 10,000; number of sample flows is
100,000; learning rate is 0.0001; number of learning steps is 100,000;
activation function is SELU; gradient descent algorithm is RMSProp;
convolutional kernel size is 6×12. The four types of points shown in Figure 8
represent the area-delay results of area-angel-flows, area-devil-flows,
delay-angel-flows, and delay-devil-flows. The y-axis represents delay and the
x-axis represents area. The background of each
sub-figure in Figure 8 is the 2-D distribution of the 100,000 sample flows⁴.
We can see that the generated area (delay) angel-flows provide the best
results in terms of area (delay), and the devil-flows provide the worst
results, among the 100,000 sample flows. For example, the data points of the
area-angel-flows of these three designs are clearly bounded by a certain area
value. The total runtime for generating these flows is 3-4 days. This
demonstrates that our framework can successfully develop angel-flows and
devil-flows.
[Figure 8 panels: (a) flows generated for 64-bit Montgomery multiplier; (b) flows generated for 128-bit AES core; (c) flows generated for 64-bit ALU. Each panel plots delay (ps) against area (μm²) with four point types: Area:Angel-Flows, Delay:Angel-Flows, Area:Devil-Flows, Delay:Devil-Flows.]
Figure 8: Quality of the generated ABC synthesis flows for 64-bit Montgomery multiplier, 128-bit AES and 64-bit ALU.
⁴ The 2-D distribution is presented in the same way as in Section 2.2,
Figure 1, but with 100,000 data points.
5 CONCLUSIONS
This work presents a fully autonomous framework that artificially produces
design-specific synthesis flows without human guidance or baseline flows. We
introduce a general approach for flow optimization problems by modeling them
as Multiclass Classification. The one-hot modeling of iterative flows is
proposed such that any flow can be represented as a binary matrix. This
approach is demonstrated by generating the best and worst synthesis flows for
three large designs with a 14nm technology. Future work will focus on
artificially developing cross-layer synthesis flows to find the missing
correlations between logic and physical designs [23].
6 ACKNOWLEDGEMENT
This project is funded by ERC-2014-AdG 669354 grant.
REFERENCES
[1] F. Akopyan, J. Sawada et al., “Truenorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
[2] C. Yu, M. J. Ciesielski, M. Choudhury, and A. Sullivan, “Dag-aware logic synthesis of datapaths,” in Proceedings of the 53rd Annual Design Automation Conference, DAC 2016, Austin, TX, USA, June 5-9, 2016, pp. 135:1–135:6.
[3] F. Agakov, E. Bonilla, et al., “Using machine learning to focus iterative optimization,” in Proceedings of the International Symposium on Code Generation and Optimization. IEEE, 2006, pp. 295–305.
[4] G. Liu and Z. Zhang, “A parallelized iterative improvement approach to area optimization for lut-based technology mapping,” FPGA’17, 2017.
[5] R. Durrett, Probability: theory and examples. Cambridge university press, 2010.
[6] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[8] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 8, pp. 1915–1929, 2013.
[9] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
[10] H. Mendelson, “On permutations with limited repetition,” Journal of Combinatorial Theory, Series A, vol. 30, no. 3, pp. 351–353, 1981.
[11] A. Mishchenko et al., “ABC: A System for Sequential Synthesis and Verification,” URL http://www.eecs.berkeley.edu/alanmi/abc.
[12] “OpenCores,” URL https://opencores.org.
[13] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.
[14] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” arXiv preprint arXiv:1706.02515, 2017.
[15] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
[16] W. Shang, K. Sohn, D. Almeida, and H. Lee, “Understanding and improving convolutional neural networks via concatenated rectified linear units,” in International Conference on Machine Learning, 2016, pp. 2217–2225.
[17] G. B. Orr and K.-R. Müller, Neural networks: tricks of the trade. Springer, 2003.
[18] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural networks, vol. 12, no. 1, pp. 145–151, 1999.
[19] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research, vol. 12, no. Jul, pp. 2121–2159, 2011.
[20] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
[21] H. B. McMahan, Holt et al., “Ad click prediction: a view from the trenches,” in KDD’13. ACM, 2013, pp. 1222–1230.
[22] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
[23] C. Yu, M. Choudhury, A. Sullivan, and M. J. Ciesielski, “Advanced datapath synthesis using graph isomorphism,” in ICCAD 2017, Irvine, CA, USA, November 13-16, 2017, pp. 424–429.
