Performance Estimation of Synthesis Flows cross Technologies using LSTMs
  and Transfer Learning by Yu, Cunxi & Zhou, Wang
Cunxi Yu Member, IEEE, and Wang Zhou
Abstract—Due to the increasing complexity of Integrated Circuits (ICs)
and Systems-on-Chip (SoCs), developing high-quality synthesis flows
within a short time-to-market is becoming more challenging. We propose a
general approach that precisely estimates the Quality-of-Result (QoR),
such as delay and area, of unseen synthesis flows for specific designs.
The main idea is training a Recurrent Neural Network (RNN) regressor,
where the flows are inputs and QoRs are ground truth. The RNN
regressor is constructed with Long Short-Term Memory (LSTM) and
fully-connected layers. This approach is demonstrated with 1.2 million
data points collected using 14nm, 7nm regular-voltage (RVT), and 7nm
low-voltage (LVT) FinFET technologies with twelve IC designs. The
accuracy of predicting the QoRs (delay and area) within one technology
is ≥98.0% over ∼240,000 test points. To enable accurate predictions
across different technologies and different IC designs, we propose a
transfer-learning approach that utilizes the model pre-trained with 14nm
datasets. Our transfer learning approach obtains estimation accuracy
≥96.3% over ∼960,000 test points, using only 100 data points for
training.
Index Terms—Synthesis, deep learning, performance estimation,
recurrent neural network
1 INTRODUCTION
Electronic Design Automation (EDA) is important due to the
increasing complexity of the designs and technologies. Demands
for designing electronic systems in novel application domains,
such as neuromorphic chips [1] and deep learning chips [2], raise
the challenges to a new level. For example, due to the lack of predictability
of EDA techniques, expensive design iterations with
extensive human supervision are unavoidable. Developing and
tuning design flows for specific designs is very time-consuming. In
addition, most design flows are currently developed based on the
knowledge of designers with iterative testing. However, because
of the huge search space of flows, it is difficult to know how good
the developed flows are. Hence, predictive flow-level modeling
and prediction, and design-specific tuning have very high value
[3] [4].
• C. Yu is with the Department of Electrical and Computer Engineering,
Cornell University, Ithaca, New York, 14850, USA. W. Zhou is with IBM
Thomas J. Watson Research Center, Yorktown Heights, New York, 10598,
USA.
E-mail: cunxi.yu@cornell.edu, wang.zhou@ibm.com
Deep learning has shown considerable success in a broad range
of applications using various deep neural network architectures,
such as Convolutional Neural Network (CNN) and Recurrent
Neural Network (RNN). A CNN, which is one type of feed-
forward artificial neural networks constructed with convolutional
and pooling layers, has shown great performance in analyzing
images [5] [6]. An RNN is an artificial neural network where
the connections between units form a directed graph along a
sequence. This allows it to exhibit dynamic temporal behavior for
a timed sequence. Unlike feed-forward neural networks, RNNs
can use their internal memories to process sequences of inputs.
This makes them applicable to tasks such as speech recognition
[7], language translation [8], and generating textual descriptions
[9]. Deep learning has also been used for EDA techniques. For
example, ResNet [3] has been used for lithography modeling
optimization [10], and CNN has been used as a flow classifier
for generating design-specific synthesis flows [11].
In this paper, an RNN regression model is proposed to estimate
the performance of flows. The approach is demonstrated by
predicting the Quality-of-Results of logic synthesis flows with
three different technologies. Furthermore, this approach can be
used for performance estimation of flows in different domains,
such as physical design flows, compiling flows, etc. The main
contributions of this paper are:
• A closed formula is introduced to represent the search space
for arbitrary types of flows. A synthesis embedding model
that represents flows as discrete sequences using a 2-D matrix
is introduced.
• An LSTM based RNN regression architecture is proposed.
The inputs are flows in the timed-model matrix, and the ground
truth is the delay and area collected after technology mapping.
• We propose a transfer-learning approach that adapts the
model learned from one technology node to another technol-
ogy node. This offers the ability to estimate the performance
for next/future technology nodes.
• The approach has been demonstrated with ∼1.2 million data
points with 14nm, regular-voltage (RVT) 7nm, and low-
voltage (LVT) 7nm FinFET technologies, collected with 12
different IC designs. We achieve testing accuracy ≥98.0%
for a specific design and technology, and testing accuracy across
technologies and designs is ≥96.3% after transfer learning.
• We demonstrate that the LSTM based approach with transfer
learning outperforms the CNN based approach [11] with all
the datasets using only 25 training points.
arXiv:1811.06017v1 [cs.LG] 14 Nov 2018
2 BACKGROUND
2.1 Synthesis flows and Search Space
Synthesis flows are a set of synthesis transformations that apply
iteratively to the input designs. The synthesis transformations are
mainly involved in three stages of the design flow: high-level
synthesis (HLS), logic synthesis (LS) and placement and route
(PnR). For different types of electronic designs, the flows need to
be changed accordingly.
In general, there are two types of flows, none-repetition flows
and m-repetition flows [11]. Given n unique transformations,
a flow developed with these transformations is called none-
repetition flow if each transformation appears only once. The
length of none-repetition flows is n. For m-repetition flows,
each transformation appears m times. The length of m-repetition
flows is m·n. In [11], the upper bounds of the search space for
both types of flows are discussed. For none-repetition flows, the
closed form n! is the upper bound of the search space.
An iterative formula is used to describe the search space of m-
repetition flows. However, that upper bound was given as a range without
a closed formula representation.
A closed formula is introduced to describe the search space
of repetition flows. The search space for m-repetition flows is a
multiset permutation problem. Specifically, for m-repetition flows
with n unique transformations, the search space is shown in
Equation 1.
$$\frac{(n \cdot m)!}{(m!)^{n}} \tag{1}$$
Using the multiset permutation concept, we generalize the
formula to describe the search space for any type of flow. Let n
be the number of unique transformations, and let an M-repetition flow
be defined by M = {m_1, m_2, ..., m_n}, where m_i is the number of repetitions
of the i-th transformation. The total number of possible flows is
shown in Equation 2.
$$S(m, n) = \frac{(m_1 + m_2 + \cdots + m_n)!}{(m_1!)(m_2!) \cdots (m_n!)} \tag{2}$$
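The two search-space formulas can be evaluated directly. The following is a minimal sketch, not code from this work; the function names search_space_size and m_repetition_size are hypothetical and chosen only for illustration.

```python
from math import factorial, prod

def search_space_size(repetitions):
    """Number of distinct flows for a multiset of transformations (Equation 2).

    repetitions[i] is m_i, the number of times the i-th transformation appears.
    """
    total = sum(repetitions)
    return factorial(total) // prod(factorial(m) for m in repetitions)

def m_repetition_size(n, m):
    """Closed form for m-repetition flows with n unique transformations (Equation 1)."""
    return factorial(n * m) // factorial(m) ** n

# Six unique transformations, each repeated four times.
print(f"{m_repetition_size(6, 4):.3e} possible flows")
```

With all m_i equal to 1 the generalized formula reduces to n!, and with all m_i equal to m it reduces to Equation 1. For six transformations repeated four times each, the count exceeds 10^15, which is why exhaustive evaluation of flows is impractical.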
2.2 Quality of Result of Digital ICs
Quality of Results (QoR) of Integrated Circuits (IC) is a term used
in evaluating technological processes. Mostly, it is represented as
a vector of values that describes the performance of the designs
and design process. For example, a QoR could include the critical
path delay (chip frequency = 1/Delay), power, area, e.g., {1 ns
(1 GHz), 2.5 W, 1 mm^2}. Given a design specification, the QoR
of the synthesized designs can be very different using
different synthesis flows. An illustrative example of mapping a 1-bit
Full Adder using a 7nm FinFET technology library is shown in
Figure 1. The specification of a 1-bit Full Adder is a + b + c =
2C+S, where a, b, c, C, S are binary signals. Using two different
synthesis flows, two gate-level netlists are produced. In Figure
1, each node represents one logic gate and the type label of the
node is the type of the logic gate. Note that the delay and area of
different logic gates differ and are defined by the
technology library. For example, the delay and area of XOR2 are
15.95 ps and 2.8 um^2; the delay and area of NAND2 are 11.90 ps and
1.39 um^2. After applying the two synthesis flows, we observe that the QoRs
of the two netlists are different. The design in Figure 1a has about 37%
lower delay than the design in Figure 1b, and requires 12% more area. Note
that the QoR can only be obtained after the entire technological
design process. For large designs, this process is extremely time-
consuming.
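The delay and area comparison between the two Full Adder netlists follows from the values reported in the caption of Figure 1. A short sketch of the arithmetic is given below; the helper relative_change is a hypothetical name used only here.

```python
def relative_change(new, reference):
    """Signed relative change of `new` with respect to `reference`, in percent."""
    return 100.0 * (new - reference) / reference

# QoR values reported for the two Full Adder netlists in Figure 1.
delay_a_ps, area_a_um2 = 30.1, 9.40   # netlist (a)
delay_b_ps, area_b_um2 = 47.6, 8.38   # netlist (b)

delay_change = relative_change(delay_a_ps, delay_b_ps)  # negative: (a) has lower delay
area_change = relative_change(area_a_um2, area_b_um2)   # positive: (a) uses more area

print(f"delay of (a) vs (b): {delay_change:+.1f}%")
print(f"area of (a) vs (b):  {area_change:+.1f}%")
```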
Depending on the application, the design objectives can be
very different. For instance, for high-performance image process-
ing designs, the designs are mostly designed to be as fast as pos-
sible, i.e., delay to be as small as possible. For Internet-of-Things
(IoT) designs, the power and area are required to be minimized.
The massive search space of synthesis flows (Equations 1 and 2)
and the time-consuming technological design process, along with
the various design objectives, are the main motivations of this
work.
[Figure 1 graphic: two gate-level netlists with inputs a, b, c and outputs C, S. (a) nodes 6: inv1, 7: nand2, 8: xnor2a, 9: oai21, 10: xnor2a. (b) nodes 6: and2, 7: xor2, 8: and2, 9: or2, 10: xor2.]
Fig. 1: 1-bit Full Adder (F = 2C + S = a + b + c) gate-
level netlists produced by two different synthesis flows using 7nm
FinFET technology library. (a) Delay = 30.1 ps, Area = 9.40 um^2;
(b) Delay = 47.6 ps, Area = 8.38 um^2.
2.3 Recurrent neural network
Recurrent Neural Network (RNN) is a class of artificial neural
network with a chain of units in a directed graph sequence. RNNs
perform the same computations for each unit in a sequence, and
output states depend on the previous states. In theory, RNNs
can make use of information in arbitrarily long sequences, but
in practice, they are limited to looking back only a few steps,
a limitation known as the long-term dependency problem [12]. Long Short-
Term Memories (LSTMs) [13] are explicitly designed to address
the long-term dependency problem by adding control gates to the
recurrent units. Such controlled states are referred to as a gated
state or gated memory, which have been implemented as part
of LSTMs and Gated Recurrent Units (GRUs) [14]. An RNN
composed of LSTM units is often called an LSTM network. A
common LSTM unit is composed of a cell, an input gate, an output
gate and a forget gate. The cell is responsible for “remembering”
values over arbitrary time intervals.
The LSTM has the ability to remove or add information to the
cell state, carefully regulated by structures called gates. Gates are
a way to optionally let information through. They are composed
of a sigmoid neural net layer and a pointwise multiplication
operation. The sigmoid gate (layer) outputs numbers in the range
of [0, 1], which determine the fraction of the previous
component's output that passes into the next one.
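The gating mechanism described above (a sigmoid followed by a pointwise multiplication) can be illustrated with a few lines of code. This is a simplified sketch of a single gate, not a full LSTM cell; the function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(x):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def apply_gate(gate_preactivations, values):
    """Scale each value by a sigmoid gate (pointwise multiplication)."""
    return [sigmoid(g) * v for g, v in zip(gate_preactivations, values)]

# Hypothetical numbers chosen only for illustration.
cell_state = [2.0, -1.0, 0.5]
gate_inputs = [10.0, 0.0, -10.0]   # nearly open, half open, nearly closed

gated = apply_gate(gate_inputs, cell_state)
print(gated)
```

A large positive gate input passes the value almost unchanged, a gate input of zero passes half of it, and a large negative gate input suppresses it almost entirely.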
2.4 Related work
Yu et al. [11] presented a deep learning based approach for
generating design-specific synthesis flows for ABC. The main idea
of this approach is formulating the flow optimization problem
as a Multiclass Classification problem. The authors proposed to
use Convolutional Neural Network (CNN) based classifier that
includes two Convolution+MaxPool layers and three Dense layers.
It shows that the classifier can successfully distinguish the best and
worst flows given the design objectives. However, there are two
main limitations:
• The classifier can only classify the flows into different performance
classes; it cannot distinguish the performance
of different flows within the same class.
• The prediction accuracy heavily relies on the labeling rules
since the labels of the flows are post-created based on the
Quality-of-Result (QoR).
It is obvious that the first limitation comes from the idea of
flow classification. Therefore, we focus on illustrating the second
limitation. The single-metric and multi-metric rules are introduced
in [11] that label the synthesis flows based on single QoR or
multiple QoR metrics, such as area, delay, etc. The labeling
rule for seven-class labeling requires six QoR delimiters. For
example, let the six delimiters be the data points at the 7%, 20%, 40%,
65%, 80%, and 93% positions of the training dataset (assuming the
training set is sorted from best to worst QoR), namely Labeling
rule 1 (Figure 2a). Alternatively, let the six delimiters be the
data points at the 5%, 15%, 40%, 65%, 90%, and 95% positions of the
training dataset, namely Labeling rule 2 (Figure 2b). We compare
the classification performance of two labeling rules using the
CNN architecture and the 64-bit Montgomery Multiplier dataset
proposed in [11]. The training and testing of CNN Classifiers are
done with Keras using Tensorflow as backend. The results are
shown in the confusion matrices in Figure 2.
(a) Labeling rule 1. (b) Labeling rule 2.
Fig. 2: Confusion matrices of two classifiers trained with two
different labeling rules.
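To make the two labeling rules concrete, the sketch below assigns seven classes from six delimiter positions. It is an illustration under stated assumptions rather than the labeling code of [11]: it assumes that a smaller QoR value is better and that a flow's class is the number of delimiter positions at or below its rank fraction. The names RULE_1, RULE_2 and label_flows are hypothetical.

```python
from bisect import bisect_right

# Delimiter positions (fractions of the sorted training set) from the text.
RULE_1 = [0.07, 0.20, 0.40, 0.65, 0.80, 0.93]
RULE_2 = [0.05, 0.15, 0.40, 0.65, 0.90, 0.95]

def label_flows(qor_values, positions):
    """Assign class 0 (best) .. class 6 (worst) to each QoR value.

    Assumes smaller QoR is better (e.g. delay or area) and that a flow's class
    is the number of delimiter positions at or below its rank fraction.
    """
    n = len(qor_values)
    order = sorted(range(n), key=lambda i: qor_values[i])  # best to worst
    labels = [0] * n
    for rank, idx in enumerate(order):
        labels[idx] = bisect_right(positions, rank / n)
    return labels

# Hypothetical QoR values: 100 flows with distinct delays.
delays = [float(d) for d in range(100)]
labels_rule_1 = label_flows(delays, RULE_1)
labels_rule_2 = label_flows(delays, RULE_2)
print([labels_rule_1.count(c) for c in range(7)])
print([labels_rule_2.count(c) for c in range(7)])
```

Under this sketch the same set of flows produces different class sizes for the two rules (for example, class 0 holds the best 7% of flows under rule 1 and the best 5% under rule 2), so a flow can change class purely because the rule changed.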
As mentioned in [11], the main tasks are searching for the best
(class 0) and worst (class 6) flows for a given design objective.
The results show that labeling rule 1 provides 98% accuracy for
predicting class 0, and 86% accuracy for class 6; labeling rule
2 provides only 46% accuracy for class 0, and 89% for class 6.
We can see that with different labeling rules, the performance of
the classifiers can be very different. Note that the labeling rule is an
input to this approach and must be defined by the user. These
two limitations offer the main motivation for our regression based
approach.
As described previously, a synthesis flow is a sequence of
transformations. Mostly in IC design, QoR of a large number of
synthesized designs is a continuous variable. To achieve accurate
estimation of a continuous variable that is related to sequential
behaviors, we explore LSTM network regressor in this work. In the
result section, we also compare the performance with a regressor
built using the proposed CNN model [11].
3 MODELING
This section introduces the inputs of the neural network and
ground truth for the regression model. The inputs are the synthesis
flows that are represented by a 2-D matrix using a novel Timed-
Model of flows. The ground truth includes delay and area results
after technology mapping.
3.1 Inputs: Timed-Model of flows
TABLE 1: Illustration of timed-model of synthesis flows using
ABC default synthesis flow resyn. Synthesis transformation at
each time spot is shown above the visualized binary matrix.
Time:            t1       t2   t3   t4       t5   t6
Transformation:  balance  rw   rwz  balance  rwz  balance
As mentioned earlier, any flow includes a set of transformations
that are applied iteratively. We illustrate the concept of
timed-model using one common logic synthesis flow provided in
ABC [15], resyn, which includes six transformations: balance (b),
rewrite (rw), rewrite -z (rwz), b, rwz, b.
The time-line of applying resyn to designs is shown in Table
1. Each transformation in this flow is applied to the design at
each time frame. For example, for time in range (0,t1), the
transformation balance is applied and it finishes at t1. Then,
the second transformation rewrite starts and finishes at t2. The
whole flow finishes at t6. Note that the runtime of different
transformations can be very different, and the runtime of the same
transformation at different stages could be different as well. In
this work, the runtime of each transformation is not included in
the modeling. This means that the timed-model of the flows is
considered as a discrete sequence. Using one-hot encoding for the
three transformations in resyn, let balance=[1 0 0], rw = [0 1 0],
and rwz = [0 0 1]. The resulting timed-model of resyn in binary
matrix is shown in Table 1.
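The one-hot timed-model described above can be sketched as follows. This is an illustrative encoder, not the authors' implementation; encode_flow is a hypothetical name, and the row order (one row per transformation, one column per time frame) is an assumption chosen to match the matrix shape used later in the text.

```python
def encode_flow(flow, transformations):
    """One-hot encode a flow as a matrix with one row per transformation
    and one column per time frame."""
    index = {name: row for row, name in enumerate(transformations)}
    matrix = [[0] * len(flow) for _ in transformations]
    for t, name in enumerate(flow):
        matrix[index[name]][t] = 1
    return matrix

# ABC's resyn flow: balance (b), rewrite (rw), rewrite -z (rwz), b, rwz, b.
resyn = ["b", "rw", "rwz", "b", "rwz", "b"]
matrix = encode_flow(resyn, ["b", "rw", "rwz"])

for name, row in zip(["b", "rw", "rwz"], matrix):
    print(f"{name:>3}: {row}")
```

Each column of the resulting matrix is the one-hot vector of the transformation applied in that time frame, so the first column is [1 0 0] for balance and the second is [0 1 0] for rw.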
A more complex example is shown in Table 2. The input syn-
thesis flow is an ABC synthesis flow including six transformations,
balance (b), restructure (rs), rewrite (rw), refactor (rf), rewrite -z
(rwz), refactor -z (rfz), each of which is repeated four
times. The length of this synthesis flow is 24, so it requires
24 time frames to complete. With six transformations in total, the
final model of this synthesis flow is a matrix of shape (6,24)
(Table 2). Matrices of this type are the input to the neural
network for training and inference.
TABLE 2: Example of a 24 transformations long synthesis flow using the presented timed-model. The flow is a 4-repetition synthesis
flow with six unique transformations, {rw, rwz, b, rs, rfz, rf}, and each transformation repeats four times. The complete synthesis flow
is shown in the second row and synthesis transformation at each time spot is shown above the visualized binary matrix.
t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15 t16 t17 t18 t19 t20 t21 t22 t23 t24
b rf rwz rw rs rfz b rw rwz rf rs rfz b rw rwz rw rs rfz b rf rwz rfz rs rf
3.2 Ground truth
In the context of machine learning, the ground truth is a mea-
surement of the target variable(s) for the training and testing data
points. In other words, the ground truth defines the objective(s)
of the learning model. In the scenario of training a regressor for
synthesis, taking synthesis flows as inputs, the ground truth could
be synthesis runtime, critical path delay, total logic area, XOR
counts, etc. Similarly, this can be extended to other flow performance
estimation problems such as placement and route, with the
ground truth being worst negative slack, total negative slack,
routing length, etc. In the results of this paper, the demonstration
and evaluation of the proposed approach specifically target
synthesis flows of the open source logic synthesis framework ABC
[15]. The ground truth includes critical path delay (delay for short)
and logic area (area for short).
4 APPROACH
This section presents the implementation of LSTM based RNN
regressor, training setup and the summary of datasets.
4.1 LSTM network architecture
The RNN regressor architecture is presented in Table 3. The regressor
is designed with LSTM×2, Batch Normalization (BN)×4,
Dropout×1, and Dense layers×3. The first column shows each
layer and its type in top-down order. The second column
presents the output shape of the layer, and the last column
shows the number of parameters in each layer. The activation
function of the Dense1 and Dense2 layers is ReLU. The output
layer is implemented with a dense layer where the number of
units equals the ground truth dimension, dim. In this work,
the ground truth dimension is one, i.e., either area or delay. The
activation function for the last layer is linear.
• LSTM Layer (Layers 1 and 3): The core of the model
consists of two LSTM layers. Both LSTM layers include 128
hidden units. The inner recurrent activation applied to the input,
forget, and output gates is the hard sigmoid, i.e., a piecewise-linear
approximation of the sigmoid function. The activation for the
hidden state and output hidden state is the hyperbolic tangent
function (tanh).
• Batch Normalization (Layers 2,4,6,8): Batch Normaliza-
tion (BN) [16] in general helps the training in speed and
accuracy. The basic idea of batch normalization is similar to
data normalization in training data pre-processing. Instead of
applying normalization to the training data only, BN applies
normalization over the hidden layers.
• Dropout Layer (Layer 9): Dropout is a regularization
technique that aims to reduce the complexity of the model
in order to prevent overfitting. A dropout layer
randomly sets a fraction r of its input units to 0 at each
update during training [17]. The units that are kept are
scaled by 1/(1− r), where r is the dropout rate, so that the
expected sum is unchanged between training and inference. In
this paper, the dropout rate is 0.4.
• Dense Layers (Layers 5,7,10): A dense layer applies a linear
operation in which every input is connected to every output
by a weight, generally followed by a non-linear activation
function to add nonlinearity to the model. Specifically, the
dense layer in this work performs activation(multiply(input,
kernel)), where multiply is a matrix product, activation is the
element-wise activation function, and kernel is a weight matrix
created by the layer.
The values in the kernel matrix are trainable parameters
that are updated during back-propagation. The activation
function of the Dense1 and Dense2 layers is ReLU, and the
activation of the last layer is linear.
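The scaling rule described for the dropout layer can be checked numerically. The sketch below is a plain-Python illustration of inverted dropout, not the Keras implementation; the function name dropout and the unit values are hypothetical, and the seed is fixed only for reproducibility.

```python
import random

def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during training
    and scale the kept units by 1 / (1 - rate). At inference, pass through."""
    if not training:
        return list(values)
    scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * scale for v in values]

rng = random.Random(0)          # fixed seed, for reproducibility only
units = [1.0] * 100_000
dropped = dropout(units, rate=0.4, rng=rng)

print(sum(units) / len(units))       # mean at inference
print(sum(dropped) / len(dropped))   # mean during training, close to the above
```

Because kept units are scaled up by 1/(1 - 0.4), the average activation during training stays close to the average at inference, which is the property the scaling is meant to preserve.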
TABLE 3: LSTM based RNN model architecture, including the
output shape and number of parameters of each layer.
Layer : Type    Output Shape       # Param
1:LSTM1         (None, 24, 128)    68608
2:BN1           (None, 24, 128)    512
3:LSTM2         (None, 128)        131584
4:BN2           (None, 128)        512
5:Dense1        (None, 30)         3870
6:BN3           (None, 30)         120
7:Dense2        (None, 30)         930
8:BN4           (None, 30)         120
9:Dropout       (None, 30)         0
10:Dense3       (None, dim)        31×dim
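The parameter counts of the dense and batch-normalization layers in Table 3 can be cross-checked from the listed output shapes. The sketch below assumes the usual conventions of a dense layer with one bias per unit and a batch-normalization layer with four values per feature (scale, shift, moving mean, moving variance); the helper names are hypothetical, and the LSTM layers are left out for brevity.

```python
def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return inputs * units + units

def batch_norm_params(features):
    """Scale, shift, moving mean and moving variance per feature."""
    return 4 * features

# Counts implied by the output shapes listed in Table 3.
counts = {
    "BN1": batch_norm_params(128),
    "BN2": batch_norm_params(128),
    "Dense1": dense_params(128, 30),
    "BN3": batch_norm_params(30),
    "Dense2": dense_params(30, 30),
    "BN4": batch_norm_params(30),
    "Dense3": dense_params(30, 1),   # dim = 1 (delay or area)
}
for name, value in counts.items():
    print(name, value)
```

Under these assumptions the computed values agree with the corresponding rows of Table 3.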
4.2 Datasets
The datasets are generated by logic synthesis tool ABC [15], with
100,000 random flows generated. All 100,000 flows are applied
to three different designs, 64-bit Montgomery Multiplier, 64-bit
ALU and 128-bit AES core, using 14nm, 7nm RVT and 7nm
LVT technologies. To explore the transferability across designs
and technologies, we apply the first 20,000 random flows to nine
more designs with different IPs (intellectual property) (Table 4),
including cryptographic hash SHA, RISC (Reduced Instruction
Set Computer) architecture Open RISC 1200 OR1200, etc. These
designs are obtained from OpenCore [18]. Note that some of the
random flows fail because of internal ABC crashes (segmentation
fault reported). There are three failure cases observed while
applying the flows to the Montgomery multiplier, and 263 failure cases for
the AES core. There are ∼300,000 data points generated using
14nm technology with 3 designs, and ∼960,000 data points using
7nm technologies with 12 designs. The summary of the datasets
is shown in Table 4.
Random flows: Each random flow includes six different transformations,
and each transformation is repeated four times (example
shown in Table 2), resulting in a total of twenty-four transformations
in each flow. The 100,000 random flows are generated by
randomly permuting these twenty-four transformations.
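Random flow generation as described above amounts to shuffling a multiset of 24 transformations. The following sketch illustrates one way to do this; it is not the authors' generator. The names random_flow and random_flows are hypothetical, and re-drawing duplicate flows is an assumption, since the text does not state how duplicates were handled.

```python
import random

TRANSFORMATIONS = ["b", "rs", "rw", "rf", "rwz", "rfz"]

def random_flow(rng, repetitions=4):
    """Return one random permutation of the multiset in which every
    transformation appears `repetitions` times."""
    flow = TRANSFORMATIONS * repetitions
    rng.shuffle(flow)
    return flow

def random_flows(count, seed=0):
    """Generate `count` distinct random flows (duplicates are re-drawn)."""
    rng = random.Random(seed)
    flows = set()
    while len(flows) < count:
        flows.add(tuple(random_flow(rng)))
    return [list(f) for f in flows]

flows = random_flows(5)
for flow in flows:
    print("; ".join(flow))
```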
Inputs and Labels: The inputs of the neural network are the
flows using the timed-model matrix representation with shape
(6,24). The labels are the delay or area results collected by
applying the random flow followed by ABC technology mapping
(command: map -v).
TABLE 4: Summary of datasets. For the first three designs, data points
are generated with 100,000 random flows using three different
technology libraries. For the rest of the designs, the data
points are collected with 20,000 random flows using the 7nm RVT
and LVT FinFET technologies. The ground truth is the QoR
(delay or area) collected after technology mapping. RVT
= Regular Voltage Transistor; LVT = Low Voltage Transistor.
Design               14nm      7nm RVT   7nm LVT
64-bit Montgomery    99,997    99,997    99,997
64-bit ALU           100,000   100,000   100,000
128-bit AES core     99,737    99,737    99,737
LU8PEEng             -         20,000    20,000
Stereovision0        -         20,000    20,000
Stereovision1        -         20,000    20,000
SHA                  -         20,000    20,000
raygentop            -         20,000    20,000
OR1200               -         20,000    20,000
Boundtop             -         20,000    20,000
blob_merge           -         20,000    20,000
bgm                  -         20,000    20,000
Inputs: Flow (6, 24)           Ground truth: Delay/Area (1,1)
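The approximate dataset sizes quoted in the text follow from the per-design counts in Table 4. The short calculation below reproduces them; the variable names are hypothetical.

```python
# Per-design data points for the three designs evaluated with all libraries.
full = {"64-bit Montgomery": 99_997, "64-bit ALU": 100_000, "128-bit AES core": 99_737}
extra_designs = 9          # designs evaluated with 7nm libraries only
extra_flows = 20_000       # random flows applied to each of those designs
libraries_7nm = 2          # RVT and LVT

points_14nm = sum(full.values())
points_7nm = libraries_7nm * (sum(full.values()) + extra_designs * extra_flows)

print(points_14nm, points_7nm, points_14nm + points_7nm)
```

The totals correspond to the ∼300,000 (14nm), ∼960,000 (7nm), and ∼1.2 million (overall) data points referred to in the text.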
4.3 Training setups and pre-processing
Training setups: The loss function is the mean squared error
(MSE) and is optimized with Adam optimizer [19] with learning
rate=0.001, β1=0.9, β2=0.999. The batch size used in this work is
256 and models are trained for 1000 epochs.
Pre-processing (Data normalization): The training data points
are normalized before model training. Specifically, the label vectors
are normalized by subtracting their mean and dividing by their range.
The mean and range are used to reconstruct the ground truth at
testing.
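The label normalization and its inverse can be sketched as follows. This is an illustration under the assumption that "range" means the difference between the maximum and minimum training label; the function names and the delay values are hypothetical.

```python
def fit_normalizer(labels):
    """Return (mean, range) of the training labels."""
    mean = sum(labels) / len(labels)
    value_range = max(labels) - min(labels)
    return mean, value_range

def normalize(labels, mean, value_range):
    """Subtract the mean and divide by the range."""
    return [(y - mean) / value_range for y in labels]

def denormalize(predictions, mean, value_range):
    """Map model outputs back to the original unit (e.g. ps for delay)."""
    return [p * value_range + mean for p in predictions]

# Hypothetical delay labels in ps, for illustration only.
train_delays = [3200.0, 3350.0, 3100.0, 3500.0]
mean, value_range = fit_normalizer(train_delays)
scaled = normalize(train_delays, mean, value_range)
restored = denormalize(scaled, mean, value_range)
print(scaled)
print(restored)
```

The statistics are computed on the training labels only and reused at test time, so that predictions can be mapped back to physical units before the error is measured.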
5 TRANSFERABILITY CROSS DESIGNS AND TECHNOLOGIES
We explore the transferability over different technologies and
designs using the approach shown in Figure 3. The main idea
is to take the model pre-trained with 14nm data points and
update it with a small number of data points to predict for the unseen
7nm technologies and designs shown in Table 4. In this work, we
restrict the number of data points for transfer learning to be ≤100.
Specifically, the results of transfer learning using {10, 25, 50, 100}
data points are included in the result section. The evaluations are
made using the rest of the 7nm datasets.
5.1 Transfer Learning Strategies
Two transfer learning approaches are implemented.
[Figure 3 diagram: 14nm datasets (20,000 × 3 for training, ∼80,000 × 3 for testing) → Model (14nm) → weights updated with ≤100 data points from the 7nm LVT/RVT datasets (transfer learning) → Updated Model → testing on ∼960,000 7nm data points.]
Fig. 3: Transfer-learning cross different technologies. Initial model
is trained using 20% of the 14nm datasets, i.e., 20,000 data
points of each design. The pre-trained model is updated using
≤100 new data points, which are produced using unseen 7nm
technologies and IC designs. The rest of the 7nm datasets are used
for evaluating the transfer-learning approach, including 960,000
data points.
Updating Dense Layers: This approach takes the pre-trained
model and makes the LSTM layers non-trainable, i.e., layers
1-4 shown in Table 3. The main intuitions of this approach are that
1) the sequential behavior of the synthesis flows could be similar
over different designs and technologies, and 2) the sequential
features have been learned mostly in the LSTM layers. In this case,
there are only about 5,000 parameters in the pre-trained model that
need to be updated during transfer learning.
Updating All Layers: However, if the sequential behaviors of the
synthesis flows are different over different designs and technolo-
gies, the model could fail to converge without updating the LSTM
layers. Hence, this approach updates the parameters of all layers
in the pre-trained model.
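The number of parameters updated by each strategy can be estimated from Table 3. The sketch below assumes that half of each batch-normalization layer's parameters (scale and shift) are trainable while the moving statistics are not; the names LAYERS and trainable_count are hypothetical.

```python
# (name, total parameters, trainable parameters) per layer, following Table 3.
# Assumption: half of each batch-normalization layer's parameters (scale and
# shift) are trainable; the moving mean and variance are not.
LAYERS = [
    ("LSTM1", 68608, 68608), ("BN1", 512, 256),
    ("LSTM2", 131584, 131584), ("BN2", 512, 256),
    ("Dense1", 3870, 3870), ("BN3", 120, 60),
    ("Dense2", 930, 930), ("BN4", 120, 60),
    ("Dropout", 0, 0), ("Dense3", 31, 31),
]

def trainable_count(layers, frozen):
    """Sum the trainable parameters of all layers that are not frozen."""
    return sum(trainable for name, _, trainable in layers if name not in frozen)

dense_only = trainable_count(LAYERS, frozen={"LSTM1", "BN1", "LSTM2", "BN2"})
all_layers = trainable_count(LAYERS, frozen=set())
print(dense_only, all_layers)
```

With layers 1-4 frozen, the count is consistent with the figure of about 5,000 parameters given for the dense-only strategy, whereas updating all layers touches roughly forty times as many parameters.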
6 RESULT
First, the pre-trained model is evaluated on the 14nm datasets by
measuring the delay and area prediction accuracy, with 20,000
data points for training and ∼80,000 for evaluation for each design
(Table 4). Second, we evaluate our transfer learning approaches
on the delay and area of the 7nm RVT/LVT datasets. The training
and testing are conducted on a server with 28 Intel Xeon CPU
E5-2690 v4 processors and 256 GB memory. The experiments
are implemented in Python3 using deep learning framework Keras
[20] with Tensorflow [21] as backend.
6.1 Evaluation within 14nm datasets
The results in this section use the training setups provided in
Section 4.3, with epochs=1000 and 20% of the training data used for
validation. The training results using the 14nm datasets (delay) of
64-bit Montgomery multiplier are shown in Figure 5. Specifically,
Figure 5a shows the training loss and the validation loss (y-axis)
with respect to the number of epochs (x-axis). Figure 5b shows
the prediction results with the flows of the training data points as
inputs. The x-axis represents the true delay of the flow, and the
y-axis represents the predicted delay, in ps. The training
data points fit closely after 1000 epochs, with an average prediction
error of 0.255%.
The model is then tested using the remaining ∼80,000 14nm
data points. The testing results are shown in Figure 4. The testing
dataset is randomly split into four sub-sets, with each sub-set
including ∼20,000 points. This is used to demonstrate that the
prediction accuracy does not differ much across different
inputs. The average delay prediction errors are 0.370%, 0.369%,
0.367%, and 0.371%, with an overall average error of 0.369%.
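The relation between the reported errors and accuracies can be illustrated with a short calculation. The sketch assumes that the prediction error is the mean absolute percentage error and that accuracy is defined as 100% minus that error; both definitions are assumptions, and the function name is hypothetical.

```python
def mean_abs_pct_error(true_values, predicted_values):
    """Average of |prediction - truth| / |truth|, in percent."""
    errors = [abs(p - t) / abs(t) for t, p in zip(true_values, predicted_values)]
    return 100.0 * sum(errors) / len(errors)

# Average errors of the four test sub-sets reported above, in percent.
subset_errors = [0.370, 0.369, 0.367, 0.371]
overall_error = sum(subset_errors) / len(subset_errors)
accuracy = 100.0 - overall_error   # assumed definition of accuracy

print(f"overall error = {overall_error:.3f}%, accuracy = {accuracy:.1f}%")
```

The simple mean treats the four sub-sets as equally sized, which is a close approximation here since they differ by only three data points.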
[Figure 4 panels: (a) Sub-set 1: avg error = 0.370%; (b) Sub-set 2: avg error = 0.369%; (c) Sub-set 3: avg error = 0.367%; (d) Sub-set 4: avg error = 0.371%]
Fig. 4: Visualization of delay prediction with the remaining ∼80,000 data points of 14nm Montgomery dataset as testing inputs. Each
sub-set includes 20,000 test results (last sub-set includes 19,997). Overall prediction accuracy over 79,997 test points is 99.6%.
[Figure 5 panels: (a) Training loss vs. validation loss (x-axis: Epoch, 0 to 1000; y-axis: Loss, log scale; series: Train_Loss, Val_Loss); (b) Predictions using the training data.]
Fig. 5: Training results with 20,000 14nm 64-bit Montgomery
multiplier data points as inputs.
[Figure 6 bar charts, y-axis: Accuracy (%), 85 to 100.
(a) Results of delay prediction (train / test accuracy): 64-bit-Mont 99.7% / 99.6%; 64-bit-ALU 99.4% / 98.4%; 128-bit-AES 99.3% / 98.4%.
(b) Results of area prediction (train / test accuracy): 64-bit-Mont 99.9% / 99.6%; 64-bit-ALU 99.6% / 98.7%; 128-bit-AES 98.6% / 98.0%.]
Fig. 6: Evaluation of delay and area prediction using 14nm
datasets.
Similarly, we evaluate our approach using the 14nm datasets for
both delay and area, shown in Figure 6. The y-axis is the training
and testing accuracy for delay and area of three designs. The delay
testing accuracies are 99.6%, 98.4%, and 98.4% (Figure 6a), and
the area testing accuracies are 99.6%, 98.7%, and 98.0% (Figure 6b).
In summary, over ∼80,000 test data points per design, the prediction
accuracy for delay and area for all three designs is ≥ 98.0%.
6.2 Transfer Learning: cross designs and technologies
This section presents the evaluation results of transfer learning
using the approaches shown in Figure 3. The testing datasets are
the 7nm datasets shown in Table 4. We first show the complete
delay prediction results using 64-bit GF multiplier design. The
initial model is pre-trained with 20,000 14nm data points. Then,
we update the weights of all layers of this model with 100
7nm data points from the same design. The remaining data points for
each technology are used for testing. The prediction accuracy
is ≥99.5% for both 7nm technologies. Note that these results show the
transferability across technologies only.
[Figure 7 panels: (a) 100k 7nm RVT predictions, average accuracy = 99.5%; (b) 100k 7nm LVT predictions, average accuracy = 99.6%]
Fig. 7: Visualization of 7nm delay predictions using transfer
learning with prediction accuracy ≥99.5%. Initial model is trained
with 64-bit GF multiplier 14nm datasets, and is updated with 100
7nm data points.
To explore the transferability cross both designs and technolo-
gies, we apply transfer learning to all 12 designs with the initial
model pre-trained with 14nm GF delay dataset. Note that industrial
studies indicate that machine learning based electronic design
systems require a minimum of 95% accuracy for performance
estimation [22]. More importantly, these systems are required to be
stable (i.e., have similar accuracy) for different types of designs.
The results of two approaches, updating dense layers only and
updating all layers, are included in Figure 8. To demonstrate the
advantages of transfer learning, the results of training a new model
from scratch using the same amount of data points, without pre-
training on 14nm data first, are included as a baseline. It shows that
transfer learning by updating all layers provides the best results
over all designs, for delay and area estimations. Our approach
obtains ≥96.3% accuracy for delay and area over all designs,
using 100 7nm data points. In comparison to transfer learning
approaches, training a new model without transfer learning yields
much worse accuracy due to insufficient training data. This indi-
cates that transfer learning is helpful when the size of available
training data is limited.
We also compare our LSTM network with the CNN based
approach. We modify the model released in [11] by 1) changing
the output layer from multi-channel softmax output to single-channel
linear output and 2) adding BN following the convolutional
layers. For a fair comparison, we only compare the delay/area
estimation accuracy for the GF, AES, and ALU designs that were
used in that work. The results show that with 100 training data points,
the CNN regressor performs much worse than the LSTM regressor
(Figure 8). The transfer-learning results using the CNN approach are
not included since they perform worse than training a new CNN
model.
[Figure 8 bar charts: (a) Delay; (b) Area. x-axis: the 12 designs (LU8PEEng, stereovision0, stereovision1, SHA, raygentop, OR1200, boundtop, blob_merge, bgm, GF64, AES-128, ALU); y-axis: Accuracy (%), 70 to 100. Series: Dense-Only 100/50/25, All-layers 100/50/25, New Model 100/50/25, CNN New 100.]
Fig. 8: Evaluation of two transfer learning approaches using
25/50/100 data points. CNN New 100 results are generated using
technique in [11] with 100 data points for transfer learning.
Finally, we look for the minimum number of data points that
transfer learning needs to achieve reasonable accuracy. Specifically,
we choose the approach of updating all layers and set the number of
training data points to 5/10/25. The results are shown in Figure
9. For both delay and area, the estimation accuracy decreases
significantly with ≤10 data points. This suggests that at least 25
data points are needed for the proposed transfer learning approach
to achieve stable estimation accuracy across different designs and
technologies.
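The search for a minimum training-set size can be sketched as a sweep over candidate sizes that keeps the smallest one whose worst-case design accuracy clears a target. This is an illustrative sketch only: the function name `min_training_points`, the 95% default threshold, and the input layout (size → design → accuracy) are assumptions, and any numbers passed to it would come from the user's own measurements.

```python
from typing import Dict, Optional


def min_training_points(
    acc: Dict[int, Dict[str, float]], threshold: float = 95.0
) -> Optional[int]:
    """Return the smallest training-set size that is accurate on every design.

    `acc` maps a training-set size (e.g. 5, 10, 25) to a mapping from design
    name to accuracy in percent. A size qualifies only if its lowest
    per-design accuracy is at least `threshold` (hypothetical default).
    """
    for n in sorted(acc):
        if min(acc[n].values()) >= threshold:
            return n
    return None  # no candidate size met the threshold on all designs
```

For example, with made-up accuracies `{5: {"d1": 80.0, "d2": 91.0}, 10: {"d1": 88.0, "d2": 96.0}, 25: {"d1": 96.5, "d2": 97.0}}`, the function returns 25, because 10 points still leave one design below the threshold.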
[Figure 9: bar chart. Y-axis: Accuracy (%). X-axis: designs
(LU8PEEng, two stereovision designs, SHA, raygentop, OR1200,
boundtop, blob_merge, bgm, GF64, AES-128, ALU). Series: Delay
25/10/5, Area 25/10/5.]
Fig. 9: Evaluation of transfer learning (updating all layers) for
delay and area estimation using 5/10/25 data points.
7 CONCLUSION
This paper presents an RNN-regression-based approach that
precisely estimates the delay and area of synthesis flows. The
proposed RNN regressor is constructed using an LSTM network
with batch normalization and dense layers. To enable accurate
predictions for future technologies and different designs, we
propose a transfer-learning approach that utilizes the pre-trained
model and requires much less training data. The demonstrations
are made with the logic synthesis tool ABC using 14nm and 7nm
FinFET technologies, and the models are tested over 1.2 million
data points. The results show that the prediction accuracy of delay
and area is ≥98.0% within a single technology, and the prediction
accuracy after transfer learning across designs and technologies is
≥96.3% with only 100 new data points. This demonstrates that the
proposed transfer learning approach can effectively learn to
estimate QoR for unseen technologies and designs. Future work will
focus on performance estimation at the physical layout level (e.g.,
silicon routing congestion) and on 5nm technologies.
REFERENCES
[1] F. Akopyan, J. Sawada et al., “Truenorth: Design and tool flow of
a 65 mw 1 million neuron programmable neurosynaptic chip,” IEEE
Transactions on Computer-Aided Design of Integrated Circuits and
Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
[2] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa,
S. Bates, S. Bhatia, N. Boden, A. Borchers et al., “In-datacenter
performance analysis of a tensor processing unit,” in Proceedings of the
44th Annual International Symposium on Computer Architecture. ACM,
2017, pp. 1–12.
[3] A. B. Kahng, “New Directions for Learning-based IC Design Tools
and Methodologies,” in Proceedings of the 23rd Asia and South Pacific
Design Automation Conference. IEEE Press, 2018, pp. 405–410.
[4] J. Burns, “Keynote I: Designing heterogeneous systems in the AI era:
Challenges and opportunities,” in Design Automation Conference
(ASP-DAC), 2018 23rd Asia and South Pacific. IEEE, 2018, pp. 26–27.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural
information processing systems, 2012, pp. 1097–1105.
[6] R. Girshick, “Fast r-cnn,” arXiv preprint arXiv:1504.08083, 2015.
[7] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with
deep recurrent neural networks,” in Acoustics, speech and signal
processing (icassp), 2013 ieee international conference on. IEEE, 2013,
pp. 6645–6649.
[8] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
jointly learning to align and translate,” arXiv preprint arXiv:1409.0473,
2014.
[9] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural
image caption generator,” in Computer Vision and Pattern Recognition
(CVPR), 2015 IEEE Conference on. IEEE, 2015, pp. 3156–3164.
[10] Y. Lin, Y. Watanabe, T. Kimura, T. Matsunawa, S. Nojima, M. Li, and
D. Z. Pan, “Data efficient lithography modeling with residual neural
networks and transfer learning,” in Proceedings of the 2018 International
Symposium on Physical Design. ACM, 2018, pp. 82–89.
[11] C. Yu, H. Xiao, and G. De Micheli, “Developing synthesis flows without
human knowledge,” in 2018 Design Automation Conference (DAC’18),
2018.
[12] S. Hochreiter, “Untersuchungen zu dynamischen neuronalen netzen,”
Diploma, Technische Universität München, vol. 91, p. 1, 1991.
[13] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[14] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares,
H. Schwenk, and Y. Bengio, “Learning phrase representations using
rnn encoder-decoder for statistical machine translation,” arXiv preprint
arXiv:1406.1078, 2014.
[15] A. Mishchenko et al., “ABC: A System for Sequential Synthesis and
Verification,” URL http://www.eecs.berkeley.edu/~alanmi/abc, 2018.
[16] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift,” arXiv preprint
arXiv:1502.03167, 2015.
[17] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from
overfitting,” Journal of machine learning research, vol. 15, no. 1,
pp. 1929–1958, 2014.
[18] OpenCores, “Open Source Gateware IP Cores,” URL https://opencores.org.
[19] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv preprint arXiv:1412.6980, 2014.
[20] F. Chollet et al., “Keras,” 2015.
[21] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale
machine learning on heterogeneous distributed systems,” arXiv preprint
arXiv:1603.04467, 2016.
[22] J. Dyck, “Mentor, A Siemens Business: Production Ready Machine
Learning for EDA,” in Design Automation Conference (DAC’18), 2018.
