CSM-NN: Current Source Model Based Logic Circuit Simulation -- A Neural
  Network Approach by Abrishami, Mohammad Saeed et al.
CSM-NN: Current Source Model Based Logic
Circuit Simulation - A Neural Network Approach
Mohammad Saeed Abrishami, Massoud Pedram, and Shahin Nazarian
Ming Hsieh Department of Electrical and Computer Engineering
Viterbi School of Engineering, University of Southern California
Los Angeles, CA 90089
{abri442, pedram, shahin.nazarian}@usc.edu
Abstract—The miniaturization of transistors down to 5nm
and beyond, plus the increasing complexity of integrated cir-
cuits, significantly aggravate short channel effects, and demand
analysis and optimization of more design corners and modes.
Simulators need to model output variables related to circuit
timing, power, noise, etc., which exhibit nonlinear behavior. The
existing simulation and sign-off tools, based on a combination of
closed-form expressions and lookup tables, are either inaccurate
or slow when dealing with circuits with billions
of transistors. In this work, we present CSM-NN, a scalable
simulation framework with optimized neural network structures
and processing algorithms. CSM-NN is aimed at optimizing the
simulation time by accounting for the latency of the required
memory query and computation, given the underlying CPU and
GPU parallel processing capabilities.
Experimental results show that CSM-NN reduces the simu-
lation time by up to 6× compared to a state-of-the-art current
source model based simulator running on a CPU. This speedup
improves by up to 15× when running on a GPU. CSM-NN also
provides high accuracy levels, with less than 2% error, compared
to HSPICE.
Index Terms—Current Source Model (CSM), Logic Circuit
Simulation, Process Variation, Neural Network, L-BFGS Opti-
mization
I. INTRODUCTION
The down-scaling of transistor geometries has drasti-
cally increased the complexity of short channel effects and
process-voltage-temperature (PVT) variations. Consequently,
application-specific integrated circuit (ASIC) design flow
techniques, such as multi-corner multi-mode (MCMM) and
parametric on-chip variation (POCV) depend on increasingly
more complex analysis, transformation, and verification itera-
tions, to ensure the ASIC system functions correctly and meets
design demands such as those related to performance, power
and signal integrity. In these methods, the design is tested
in different process-voltage-temperature (PVT) corners and
operating modes such as low-power (LP), high-performance
(HP), etc. Accurate simulation, such as that used for timing analysis
during placement, clock network synthesis, and routing,
is crucial: it lowers the number of design iterations,
speeds up convergence, and plays a major role in the turnaround
time of complex designs such as systems-on-chip (SoCs) [1].
SPICE simulations are accurate but very slow for timing,
power, thermal analysis, and optimization of modern ASIC designs
with billions or trillions of transistors [2], [3]. Therefore,
higher levels of circuit abstraction using approximation have
been used to speed up simulation steps. Abstraction models are
generally based on look-up-tables (LUTs), closed-form formu-
lations, factors or their combinations. The traditional models,
namely nonlinear delay model (NLDM), nonlinear power
model (NLPM), effective current source model (ECSM [4]),
and composite current source model (CCSM [5]) utilize LUTs
for storing delay, noise or power as nonlinear functions w.r.t.
physical, structural, and environmental parameters, and depend
on voltage modeling more than current modeling. We refer to
NLDM, ECSM, and CCSM models as voltage-LUT (V-LUT)
throughout this paper. The V-LUT models are intuitively better
choices than simple closed-form formulations of
nonlinear functions; however, they tend to become increasingly
inaccurate in capturing signal integrity and short channel effects with
the down-scaling of technologies [6].
Alternatively, current source models (CSMs) [7]–[15]
use voltage-dependent current sources and possibly voltage-
dependent capacitances to model logic cells. In addition to
higher accuracy, another advantage of CSM over V-LUT mod-
els is the ability to simulate realistic waveforms for arbitrary
input signals and provide the output waveforms.
The number of CSM component values that should be
stored in memory grows exponentially with the number of
inputs and internal nodes in the logic cell. For example,
6-dimensional LUTs are required for modeling a 3-input
NAND gate (NAND3). While V-LUT models are stored in
smaller/faster memories such as L1-cache, relatively bigger
tables in CSM-LUT can only fit into bigger/slower ones, like
DRAM. Therefore a fundamental idea to shorten simulation
time would be to replace some of the memorization with
computation aiming for optimal space/time efficiency.
In [16], a Semi-Analytical CSM (SA-CSM) was presented
which uses small-size LUTs combined with nonlinear analyt-
ical equations to simultaneously achieve high modeling accu-
racy and space/time efficiency. However, developing analytical
equations for complex circuits is a tedious process.
In this work, we propose CSM-NN, a circuit simulation
framework that fully replaces LUTs with neural networks
(NNs). This eliminates the long memory access latency of
LUTs, hence significantly shortens the simulation time, es-
pecially when CSM-NN computations can take advantage of
parallelism offered by graphical processing units (GPUs) [17].
arXiv:2002.05291v1 [cs.LG] 13 Feb 2020
[Figure 1, panel (a): CSM for single input (INV) logic gate, with components Ci(V̅), CM(V̅), CO(V̅), Io(V̅) between nodes Vi and Vo, where V̅ = [VI, VO]. Panel (b): CSM for two-input (NAND2) logic gate, with components CA(V̅), CB(V̅), CMA(V̅), CMB(V̅), CO(V̅), CN(V̅), IO(V̅), IN(V̅) and nodes VA, VB, VO, where V̅ = [VA, VB, VN, VO].]
Fig. 1: CSM examples for one and two input logic cells [11],
[14].
The major contributions of our work are as follows:
• We developed a framework for simulating nonlinear be-
havior of complex integrated circuits using optimized NN
structures as well as training and inference algorithms,
according to the underlying CPU or GPU computational
capabilities.
• Our framework is scalable and technology-independent,
i.e., it can efficiently handle increasingly complex tech-
nologies with high PVT variations while maintaining the
accuracy and improving the simulation latency.
The remainder of our paper is organized as follows. Section II
presents a short background on CSM and process variation
issues. Sections III and IV elaborate our CSM-NN framework
and experimental results, respectively. Section V concludes the
paper.
II. BACKGROUND
In this section, we briefly touch upon the basics of CSM
and latency issues related to CSM-LUT memory access.
Each logic gate can be modeled using voltage-dependent
current sources as well as (Miller and output) capacitance
components [7]. The values of these components can be characterized
using HSPICE simulations. The CSM components
of a logic cell can be stored in LUTs and utilized for noise,
timing and power analysis of VLSI circuits [11], [14], [15],
[18]. Fig. 1 illustrates CSMs for single-input (INV) and multi-
input (NAND2) logic cells.
TABLE I: CSM for simple logic cells. Number of LUT
dimensions (#Dim), i.e., the count of inputs, outputs and
internal voltage nodes; voltage-dependent capacitances (C:
CM, Co, Ci) and current sources (I: ID, IN) required to
model the cell; and the total size to be stored in memory
(LUT size). All CSM components are considered to be
represented with 32-bit (4-byte) floating points (FP). Characterization
resolution is assumed to be 10 points per dimension.

Gate     #Dim.   Variables      Table Size
INV      2       3×C, 1×I       4×10^2 FPs = 1.6KB
NAND2    4       6×C, 2×I       8×10^4 FPs = 320KB
NOR2     4       6×C, 2×I       8×10^4 FPs = 320KB
AOI      6       9×C, 3×I       12×10^6 FPs = 48MB
NAND3    6       9×C, 3×I       12×10^6 FPs = 48MB
NOR3     6       9×C, 3×I       12×10^6 FPs = 48MB
XOR2     8       12×C, 4×I      16×10^8 FPs = 6.4GB
Given a large number of simulation runs needed during the
ASIC design and verification flow, and the corresponding long
memory retrieval time, it is desirable to keep the number of
dimensions and size of LUTs very small. Table I lists the size
of CSM LUTs for a simple library of basic gates.
The size of CSM-LUTs for simple logic cells (cf. Table I)
is an exponential function of logic cell complexity. As an
example, NOR2 LUTs are 200 times larger than those for
INV, and XOR2 LUTs are 20,000 times larger than NOR2
ones. Note that in practical research or industrial standard
cell libraries, there may be many logic cells of various sizes
and complexities, some of which could be more complex than
simple logic cells in Table I.
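The sizes in Table I follow directly from the caption's assumptions (10 points per dimension, 4 bytes per value). A minimal sketch of that arithmetic is given below; the function name `csm_lut_size_bytes` and the dictionary layout are illustrative choices, not part of the original framework.

```python
def csm_lut_size_bytes(num_dims, num_components, points_per_dim=10, bytes_per_value=4):
    """Bytes needed to store every CSM LUT of one logic cell.

    Each component (capacitance or current source) is one table with
    points_per_dim ** num_dims entries.
    """
    entries_per_table = points_per_dim ** num_dims
    return num_components * entries_per_table * bytes_per_value


# (number of LUT dimensions, number of CSM components), taken from Table I
cells = {"INV": (2, 4), "NAND2": (4, 8), "NAND3": (6, 12), "XOR2": (8, 16)}
sizes = {name: csm_lut_size_bytes(d, c) for name, (d, c) in cells.items()}

for name, size in sizes.items():
    print(f"{name}: {size:,} bytes")
```

Dividing the entries reproduces the ratios quoted in the text, for example 320KB / 1.6KB = 200 for NAND2 (or NOR2) versus INV.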
Comparing the memory hierarchy details of the Intel Broadwell
micro-architecture [19] in Table II with the
sizes in Table I shows that, except for the simplest cells such as INV,
CSM LUTs do not fit in the L1 or L2 caches, and the LUTs of cells with
six or more dimensions exceed the L3 cache as well. Such LUTs
must be stored in the main memory (DRAM)
and written into cache in parts. The latency of memory access
in DRAM is about 2 orders of magnitude higher than that
of L1 cache. This difference shows the extent of longer
simulation latencies for CSM-LUT, compared to V-LUT.
In the following two sections, we present how our CSM-
NN eliminates the need for LUTs, and instead utilizes NNs to
compute the CSM data.
III. CSM-NN FRAMEWORK
The description of our CSM-NN, including NN architecture
and optimization algorithms for training is as follows.
A. NN Architecture and Computation
To avoid the large LUTs with long query latencies in CSM-LUT,
our CSM-NN embeds parametric nonlinear models, in the form of
fully-connected NNs that can be trained to represent nonlinear
functions.
We believe CSM-NN can benefit from the following ML
developments: (1) the evolution of novel ML algorithms can be
utilized towards improving the accuracy and efficiency of
CSM-NN; and, more importantly, (2) the exponential increase in
computational capabilities, especially with recent advances
in the design of GPUs [20], significantly helps improve the
performance of CSM-NN.

TABLE II: Latency values for information retrieval from
different hierarchy levels of memory and hardware specifications
of Intel Xeon E5-2699 v4 server processor with Intel
Broadwell micro-architecture. The computational capability of
the processor is given in Giga floating point operations per
second (GFLOPs).

Intel Broadwell micro-architecture
Memory           Size (KByte)   Latency (Clock Cycle)
L1 Data Cache    32             4-5
L2 Cache         256            11-12
L3 Cache         20,480         38-42
DRAM             -              ≈ 250

Intel Xeon Processor E5-2699 v4
Cores               22
Base Frequency      2.2 GHz
Single Precision    774.4 GFLOPs
Double Precision    1548.8 GFLOPs
CSM-NN substitutes memory retrieval with computation;
thus, it is necessary to analyze and optimize the number
and latency of the operations required by different CSM-NN
structures on different hardware platforms.
There are two steps for CSM-NN: (1) simulation using
a feed-forward pass that calculates the output of the model
based on trained parameters and input values, and (2) back-
propagation step, which modifies the parameters of the model
based on the error, i.e. the difference between the expected
values of the training data and the estimated output from
the model. Since the training process is done only once,
computation during back-propagation is not a concern. Our
objective is to improve the circuit simulation time. We there-
fore focus mainly on the inference process, i.e., we optimize
the computation steps of the feed-forward pass.
To choose the best NN architecture for our CSM-NN, we
note that the number of hidden layers and the number of
neurons in the hidden layer(s) determine the total number of
parameters in the input-output function and the flexibility of
the model. Increasing the number of hidden layers beyond one
(i.e., making the model deeper) instead of increasing the num-
ber of neurons in a single layer (i.e., making the layer wider)
can also be considered. In deep neural networks (DNNs),
the sequence of nonlinear activation layers enables the input-
output dependency to have a higher degree of nonlinearity with
more flexibility. Although there are still unanswered questions
on profound results of DNNs [21], the belief is that multiple
layers perform better at generalizing as they learn the interme-
diate features between the raw input-data and the high-level
output [21], [22]. As an example, thanks to the availability
of data and computation resources in the past few years,
the state-of-the-art solutions for challenging ML problems,
such as image classification in the fields of computer vision,
are made possible by creating models with more than a hundred
layers [23], [24]. On the other hand, shallow networks do not
generalize well but are very powerful in memorization [21]. In
addition, training deeper models requires more data and time
for training and also needs more computational resources for
the feed-forward pass.
In conclusion, despite the recent emergence of the DNN
solutions and applications and potential improvement of ac-
curacy of circuit simulation for complex timing, noise, and
power analysis, we do not believe DNN is a feasible choice
for the architecture of CSM-NN.
In the mathematical theory of artificial neural networks
(ANNs), the universal approximation theorem [25] affirms
that a single-hidden-layer NN can approximate continuous
functions with a finite number of neurons, under assumptions
over the nonlinear activation function and availability of
sufficient data for training. Consequently, if a shallow wide
network is trained with every possible input value, it could
eventually memorize the corresponding output. The following
characteristics of our problem further suggest that shallow
wide networks with one hidden layer are more plausible
solutions:
• There is no discontinuity in CSM component values.
• While in practical applications the training data is limited
or expensive to generate, in CSM-NN it is straightforward
to generate training data with HSPICE simulations
during the characterization process.
• The number of inputs to the neural network is relatively
small, even for complex logic cells, and when considering
PVT parameters (Table I). This implies that we are
modeling a low dimensional function.
Based on these features and considering the impact on
the inference step during circuit simulations, CSM-NN adopts a
simple NN architecture with a single hidden layer to model the
nonlinear behavior of CSM components. The architecture
and input-output function are shown in Fig. 2 and Eq. 1.
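\[
\begin{aligned}
\mathbf{x} &= [x_0 = 1,\ x_1,\ x_2,\ \ldots,\ x_D] \\
g_i &= \sum_{d=0}^{D} w^{l=1}_{id}\, x_d = \mathbf{w}^{l=1}_{i}\, \mathbf{x} \\
a_i &= \sigma(g_i) = \frac{1}{1 + e^{-g_i}} \;\longrightarrow\; \mathbf{a} = [1,\ a_1,\ \ldots,\ a_H] \\
y &= \sum_{h=0}^{H} w^{l=2}_{h}\, a_h = \mathbf{w}^{l=2}\, \mathbf{a}
   = w^{l=2}_{0} + \sum_{h=1}^{H} w^{l=2}_{h}\, \sigma\!\Big(\sum_{d=0}^{D} w^{l=1}_{hd}\, x_d\Big)
\end{aligned}
\tag{1}
\]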
The number of MUL operations in feed-forward pass is
equal to the number of model-parameters as calculated in
Eq. 2. It is very important to note that there are no depen-
dencies among MUL steps in a specific layer, therefore they
can be completely parallelized.
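As an illustration of the feed-forward pass in Eq. 1, the following minimal sketch evaluates a single-hidden-layer network in plain Python. It assumes the sigmoid activation of Eq. 1 and a bias-first weight layout; the function names and the numeric weights are hypothetical and are not taken from a trained CSM-NN model.

```python
import math


def sigmoid(g):
    """Logistic activation used in Eq. 1."""
    return 1.0 / (1.0 + math.exp(-g))


def forward(x, w1, w2):
    """Single-hidden-layer feed-forward pass.

    x  : list of D input values (e.g., node voltages)
    w1 : H rows, each with D + 1 weights (bias weight first)
    w2 : H + 1 output weights (bias weight first)
    """
    x_ext = [1.0] + list(x)  # prepend x_0 = 1 for the bias
    # Each hidden neuron is independent, so these sums can run in parallel.
    a = [1.0] + [sigmoid(sum(w * xd for w, xd in zip(row, x_ext))) for row in w1]
    return sum(w * ah for w, ah in zip(w2, a))


# Illustrative weights only (D = 2 inputs, H = 2 hidden neurons).
w1 = [[0.0, 1.0, -1.0], [0.5, -2.0, 0.3]]
w2 = [0.1, 2.0, -1.0]
y = forward([0.4, 0.4], w1, w2)
print(y)
```

The inner sums over the inputs and the final sum over the hidden activations are the operations counted in the cost analysis of this section.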
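[Figure 2: fully-connected network diagram with inputs x1, ..., xi, ..., xD, hidden neurons N1, ..., Nj, ..., NH, output neuron Ny producing y, first-layer weights wij, and second-layer weights wjy. Legend: D: Input size, H: Hidden layer size.]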
Fig. 2: One layer NN architecture used in CSM-NN. x, y,
w, and N are the inputs, output, weights, and the neurons
respectively. The number of inputs (i.e., the dimension) and
the width of the hidden layer are represented with D and H ,
respectively.
#MUL = D ·H +H = (D + 1) ·H (2)
Considering notation used in Eq. 1, there are H summations
of D values in the hidden layer. These H summations also
can be parallelized completely. To calculate the output, the
summation of H values is required. This summation can
be efficiently parallelized by using tree-structures. The total
number of ADD operations and the latency of tree-structure
summations are calculated in Eq. 3 and Eq. 4.
#ADD = D ·H +H = (D + 1) ·H (3)
Latency = ⌈log2 D⌉ + ⌈log2 H⌉ (4)
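The operation counts of Eq. 2 and Eq. 3 and the tree-summation latency of Eq. 4 can be evaluated for any layer size with a few lines of code. The sketch below is only a convenience for checking these formulas; the function name `feed_forward_cost` is an assumption, and the example sizes are picked from the ranges reported later in the paper.

```python
import math


def feed_forward_cost(D, H):
    """Return (#MUL, #ADD, tree-summation latency) for D inputs and H hidden neurons."""
    muls = (D + 1) * H  # Eq. 2
    adds = (D + 1) * H  # Eq. 3
    # Eq. 4: depth of the adder tree in the hidden layer plus the one at the output
    latency = math.ceil(math.log2(D)) + math.ceil(math.log2(H))
    return muls, adds, latency


# Example: a 4-dimensional cell (such as NAND2) with 30 hidden neurons
print(feed_forward_cost(4, 30))
# Example: a 2-dimensional cell (such as INV) with 16 hidden neurons
print(feed_forward_cost(2, 16))
```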
CSM-NN accounts for the availability of resources when
applying parallelization. NNs can be trained and utilized on two
different hardware platforms, namely CPUs and GPUs. The
evolution of GPUs and CPUs in terms of the number of floating-point
operations per second (FLOPS) is shown in Fig. 3.
1) CPU: There are two phases of CSM-NN simulation
computation when using CPUs: first, the weights of the NNs
are loaded from the memory; and second, MUL and ADD
operations are performed by arithmetic logic units (ALUs).
As later described in Section IV, the number of CSM-NN
parameters is sufficiently small. Therefore, they can fit into
the cache (L1) of a CPU, and are accessible by the ALU in
the order of a few CPU clock cycles.
2) GPU: The computational capabilities of GPUs have
increased dramatically in the past decade. This has made GPUs
a good choice of hardware platform for NN computation [20].
There are two levels of parallelized processing units in
GPUs: several multiprocessors (MPs), and several stream
processors (SPs, also referred to as cores) that run the actual
computation for each multiprocessor. Each core is equipped
with ADD and MUL arithmetic units and designated register
files. By implementing a trained NN (fixed parameters) on a
GPU, the weights of each operation can be stored in register
files; therefore, retrieval of information from memory is not
required. We will show in Section IV that the NNs of our CSM-NN
framework can fit into a typical GPU. As an example, the
hardware specifications of an NVIDIA GPU equipped with
CUDA [26] cores are shown in Table III.

[Figure 3: plot of GFLOPs (axis ticks from 100 to 5000) versus Year (2007 to 2018) for two series, Intel Xeon CPUs and NVIDIA Tesla GPUs.]
Fig. 3: Theoretical peak FLOPs with single precision.
TABLE III: NVIDIA Tesla P100 GPU Specifications.

Streaming Multiprocessors (SM)           56
32-bit FP CUDA cores (per SM / total)    64 / 3584
64-bit FP CUDA cores (per SM / total)    32 / 1792
Register file per SM                     256 KB
Shared memory per SM                     96 KB
Register file per CUDA core              4 KB
Total L1 cache                           64 KB
Base clock frequency                     1328 MHz
Single Precision GFLOPs                  9519
It is worth noting that LUT-based models such as CSM-
LUT and V-LUT models are only dependent on memory
queries, thus using GPUs will not improve their simulation
time. Therefore, considering relatively stronger parallelization
capabilities of GPUs over CPUs, the speed advantage of CSM-
NN over CSM-LUT and V-LUT improves, when running on
GPUs.
B. Training Process
We have adopted L-BFGS as the optimization technique for
training the NNs of our CSM-NN framework. The following
provides our justification. There are several gradient descent
based optimization algorithm candidates such as stochastic
gradient descent (SGD), Nesterov, Adagrad, and ADAM [27]
to be considered for the training of neural regression models.
SGD and its derivatives, such as ADAM, are by far
the most popular algorithms to optimize NNs [28]. Their advantages
over other techniques include parallelization, fast computation,
and the use of minibatch training techniques for better
generalization, especially in DNNs. The effectiveness of these
methods depends on the appropriate tuning of training
hyper-parameters. On the other hand, Quasi-Newton
methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS),
can be orders of magnitude faster than SGD. These methods
are based on measuring the curvature of the objective function
to select the length and direction of the steps. The main
shortcoming of BFGS is that it requires high computation and
memory resources when calculating the inverse of Hessian ma-
trix for large datasets. Limited memory BFGS (L-BFGS) [29]
is an optimization algorithm in the family of quasi-Newton
methods that approximates the BFGS algorithm using a limited
amount of memory.
The experimental results for low dimensional problems
in [30] show that L-BFGS produces highly competitive or
sometimes superior models compared to SGD methods. Another
important advantage of L-BFGS is that it requires tuning
no hyper-parameters (and, in advanced modified versions of L-BFGS, only
a few). For example, unlike SGD,
the learning rate (step-size) of L-BFGS is tuned internally. We
should also note that while several mini-batch versions of L-
BFGS have been suggested very recently in the literature [31],
L-BFGS is generally considered as a batch algorithm and thus
no batch-size adjustment is required. Considering these speci-
fications, we chose L-BFGS as our optimization technique for
training the NNs in the CSM-NN method.
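To make the training setup concrete, the sketch below fits a single-hidden-layer regressor with the L-BFGS solver of Scikit-learn's `MLPRegressor`. It is only an illustration under stated assumptions: the training data is a synthetic smooth function of two voltages rather than HSPICE characterization data, and the hidden layer size (16), iteration limit, and random seed are arbitrary choices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for characterization data (assumption, not HSPICE output):
# a smooth function of (V_I, V_O) sampled on a regular grid.
vi, vo = np.meshgrid(np.linspace(0.0, 0.9, 19), np.linspace(0.0, 0.9, 19))
X = np.column_stack([vi.ravel(), vo.ravel()])
y = np.tanh(4.0 * (vi.ravel() - 0.45)) * (0.9 - vo.ravel())

# One hidden layer trained with L-BFGS; no learning rate or batch size to tune.
model = MLPRegressor(
    hidden_layer_sizes=(16,),
    activation="tanh",
    solver="lbfgs",
    max_iter=2000,
    random_state=0,
)
model.fit(X, y)

r2 = model.score(X, y)  # coefficient of determination on the training grid
print(f"R^2 on training grid: {r2:.4f}")
```

Because L-BFGS is a batch method, the whole grid is used at every iteration, which is affordable here since the characterization data of one component is small.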
The common approach in supervised learning is to verify
the generalization of the trained model by utilizing a validation
(test) dataset which is completely separate from the training
dataset. This process helps detect possible over-fitting of
the model. Therefore, we can randomly select samples from
the characterization data and test the accuracy of the model.
It is very important to note that while accuracy of NNs in
predicting CSM component values is important, the accuracy
should ultimately be measured based on the quality of the
output signal waveforms. Even the measurement of the propa-
gation delay of the gate is not sufficient to confirm the accuracy
of a CSM simulator. Therefore, similar to [11], [32], we used
expected waveform similarity (Esim) as a figure of merit for the
measurement of the accuracy of our CSM simulations. In this
work, Esim is defined as the mean of the absolute difference
between precise HSPICE and CSM-NN simulations relative to
the supply voltage value of the technology as shown in Eq. 5.
\[
E_{sim} = \frac{1}{T \times V_{DD}} \int_{0}^{T} \left| V_{SPICE}(t) - V_{CSM\text{-}NN}(t) \right| \, dt
\tag{5}
\]
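For sampled waveforms, the integral in Eq. 5 can be approximated numerically. The following sketch uses the trapezoidal rule and assumes both waveforms are sampled at the same time points; the function name and the sample values are hypothetical. The trapezoidal rule is slightly pessimistic when the error changes sign between two samples.

```python
def expected_waveform_similarity(t, v_ref, v_model, vdd):
    """Trapezoidal approximation of Eq. 5.

    t       : increasing sample times shared by both waveforms
    v_ref   : reference (e.g., HSPICE) voltages at those times
    v_model : model (e.g., CSM-NN) voltages at those times
    vdd     : supply voltage used for normalization
    """
    if not (len(t) == len(v_ref) == len(v_model)) or len(t) < 2:
        raise ValueError("waveforms must share at least two sample points")
    area = 0.0
    for k in range(1, len(t)):
        e_prev = abs(v_ref[k - 1] - v_model[k - 1])
        e_curr = abs(v_ref[k] - v_model[k])
        area += 0.5 * (e_prev + e_curr) * (t[k] - t[k - 1])
    return area / ((t[-1] - t[0]) * vdd)


# Illustrative data: the model waveform is offset by 9 mV everywhere.
t = [0.0, 1.0, 2.0]
v_ref = [0.0, 0.5, 0.9]
v_model = [v + 0.009 for v in v_ref]
e_sim = expected_waveform_similarity(t, v_ref, v_model, vdd=0.9)
print(f"E_sim = {100 * e_sim:.2f}%")
```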
C. CSM-NN Flow
Technology information and standard cell libraries at the
transistor level are provided by semiconductor manufacturers
and design parties. Each of the cells in the standard library
should be characterized separately for every PVT corner and
mode settings. The number of different MCMM settings depends on the
technology and product design policy. The characterization
process is usually very time intensive, and can be
done at different resolutions. While higher resolutions result in
higher accuracy, they require longer characterization times. It
should be mentioned that more data needs a larger memory in
CSM-LUT and possibly a longer training process in CSM-NN.
Therefore, choosing an appropriate resolution is an important
step in both CSM-LUT and CSM-NN flows. While our results
in section IV are technology specific, they suggest a range of
acceptable characterization resolutions. Up to this point of the
flow, CSM-NN steps coincide with those of CSM-LUT.
The next step is to train the NNs, one for every CSM
component (e.g., Io) of a logic cell (e.g., INV) in a specific
PVT corner (e.g., fast-fast and high temperature (FFHT)). The
inputs of the NNs are the voltages of the terminal and internal
nodes (e.g., VI, VO), and the target output is the value of the CSM
component at these voltage points (e.g., CM(VI, VO)).
The training data collected through characterization should
first be preprocessed and then used for training. As explained
in Section III-A, a wider network can result in a more accurate
model, but requires more computation. Hence, we need to find
an appropriate layer size. We choose the smallest number of
neurons such that the network passes a pre-defined accuracy
threshold in terms of Esim.
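The width selection described above amounts to a linear search over candidate sizes. A minimal sketch is shown below, assuming a caller-supplied function that trains a network of a given width and returns its Esim; the helper name `smallest_passing_width`, the callback, and the error values in the example are all hypothetical.

```python
def smallest_passing_width(candidate_widths, train_and_score, threshold):
    """Return the smallest hidden-layer width whose Esim is below threshold.

    train_and_score(width) is assumed to train a network with that many
    hidden neurons and return its Esim as a fraction (0.01 means 1%).
    Returns None if no candidate passes.
    """
    for width in sorted(candidate_widths):
        if train_and_score(width) < threshold:
            return width
    return None


# Made-up Esim values standing in for real training runs.
fake_esim = {10: 0.031, 14: 0.018, 18: 0.012, 22: 0.008, 26: 0.007}
chosen = smallest_passing_width(fake_esim.keys(), fake_esim.get, threshold=0.01)
print(chosen)
```

Stopping at the first passing width keeps the number of training runs small, at the cost of assuming that accuracy does not degrade again for the skipped larger widths.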
In the following section, we will show that this optimal set
of NN parameters can fit into the cache (L1) of a typical CPU
or the register files of a typical GPU. To simulate a circuit in
a specific MCMM setup, the corresponding NN models of all
logic cells in the standard library are loaded.
IV. EXPERIMENTS AND SIMULATION RESULTS
We implemented the simulator and the flow of our CSM-
NN framework in Python. Our implementation is technology
independent and can characterize cells and create NN models, with
a flexible, configurable setup, for any given combinational circuit
netlist. NN implementation and training are based on the
Scikit-learn [33] package.
The CPU and GPU devices introduced in Table II and Table III
are used for comparison between the two platforms, as both
products were introduced in the same year (2016) and their
current retail prices are of the same order (about 5,000
USD). In the following, we discuss our experiments, including
challenges regarding our specific problem setup.
A. Selected Technologies
In this work, for better evaluation of our CSM-NN in-
cluding its technology independent characteristics, we per-
formed our experiments on both MOSFET (16nm) and Fin-
FET (20nm) device technologies from Predictive Technology
Model (PTM) [34] packages. Two device types namely low-
standby power (LP) and high performance (HP) are used in
our experiments [35].
As technology scales down, a growing number of physical
and fitting parameters are needed to model PVT variations.
However as pointed out in [36]–[39], only a few of them are
dominant, i.e., developing simulation models that account for
those dominant parameters while ignoring the rest, provides
sufficiently high accuracy levels. Following these studies, we
considered the most important process variation factors for
defining a limited number of process corners. There is no
process variation distribution information available for PTM
technologies. Therefore, we followed the same approach used
in [40] which studied the same devices as this work to define
PVT corners.
All distributions but temperature are considered normal
(Gaussian) and reported as N(µ, σ) with (µ) and (σ), repre-
senting mean and standard deviation respectively. The typical
temperature value is considered as 27°C and the highest
temperature (+3σ variation) as 125°C. The information of the
distribution for process variation parameters and the defined
process corners for experiments are provided in Table IV.
TABLE IV: Process (P), Voltage (V), and Temperature (T)
variation distributions of technologies used in experiments.
The values of process attributes are reported for NMOS/NFET
devices. t is representing oxide thickness (tox) for MOSFET
and Fin thickness (tfin) for FinFET. The distributions are all
Normal(µ, σ) and represented as (µ, σ).

PVT Variation Distribution
Technology   Vdd (V)       ΦM (eV)      NA (1e20)   t (nm)
Fin-LP       0.9, 0.05     4.6, 0.23    -           15, 0.5
Fin-HP       0.9, 0.05     4.4, 0.22    -           15, 0.5
MOS-LP       0.9, 0.05     4.6, 0.23    2, 0.1      1.2, 0.04
MOS-HP       0.7, 0.035    4.4, 0.23    2, 0.1      0.95, 0.03

PVT variation in pre-defined corners
Corner   Vdd (V)   T (°C)   ΦM (eV)   NA     t (nm)
FF       +3σ       0σ       −3σ       +3σ    −3σ
SS       −3σ       0σ       +3σ       −3σ    +3σ
FFHT     +3σ       +3σ      −3σ       +3σ    −3σ
SSHT     −3σ       +3σ      +3σ       −3σ    +3σ
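The corner definitions in Table IV translate into concrete parameter values as mean plus a signed multiple of the standard deviation. The sketch below works this out for two of the listed technologies; the dictionary names and keys are illustrative, and temperature is left out because it is not modeled as a normal distribution.

```python
# (mean, sigma) pairs copied from Table IV; key names are illustrative.
DISTRIBUTIONS = {
    "Fin-LP": {"vdd": (0.9, 0.05), "phi_m": (4.6, 0.23), "t": (15.0, 0.5)},
    "MOS-HP": {"vdd": (0.7, 0.035), "phi_m": (4.4, 0.23),
               "n_a": (2.0, 0.1), "t": (0.95, 0.03)},
}

# Signed sigma multiples per corner, from the lower half of Table IV.
CORNER_SIGMAS = {
    "FF": {"vdd": +3, "phi_m": -3, "n_a": +3, "t": -3},
    "SS": {"vdd": -3, "phi_m": +3, "n_a": -3, "t": +3},
}


def corner_values(technology, corner):
    """Parameter values of one technology at one process corner."""
    shifts = CORNER_SIGMAS[corner]
    return {name: mu + shifts[name] * sigma
            for name, (mu, sigma) in DISTRIBUTIONS[technology].items()}


ff = corner_values("Fin-LP", "FF")
ss = corner_values("MOS-HP", "SS")
print(ff)
print(ss)
```

The FFHT and SSHT corners use the same process and voltage shifts as FF and SS and differ only in temperature.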
B. Characterization
The resolution of characterization process is a key factor
in determining the accuracy of both CSM-LUT and CSM-NN
simulations. While more data points increase the accuracy of
both simulators, it comes with the cost of longer characteri-
zation process, larger tables in CSM-LUT, and longer training
time in our CSM-NN. We therefore evaluate our CSM-NN
framework under different resolutions. The results can also be
later used towards suggesting a baseline for other technologies.
It should be mentioned that CSM-components exhibit dif-
ferent sensitivity levels to different voltage-node variables. For
example, CO seems to be more sensitive to VO than VI in
INVX1, and it can be characterized with lower resolution for
VI than VO. Moreover, the sensitivity to characterization
resolution of one CSM component is not necessarily
the same as that of another. For example, during a single INVX1
transition Io spans several orders of magnitude (from µA
to nA), while CO changes by only about 50%. The resolution
can also vary based on the range of the voltage-node variable,
e.g., higher resolutions for the noisy parts of the waveform
(with higher frequencies of change) and lower resolutions for
smooth parts of the waveforms.
However, for the sake of simplicity, we considered all
voltage-node resolutions as similar. As the units for different
dimensions are different, we defined three different resolution
setups as explained in Table V. By comparing the preliminary
results, normal setup was found to be an appropriate resolution
and the experiments were continued with this setup.
TABLE V: Characterization resolution settings used in our
experiments.

Setup             S: Soft   N: Normal   C: Coarse
Resolution (V)    0.01      0.05        0.1
C. Preprocessing and Loss Function Modification
Mean Square Error (MSE, also referred to as L2-norm error)
is a commonly used regression loss function. It is simply
the average of squared distances between the targets (yi) and
the predicted values (ŷi). The loss function can also accommodate
a regularization term, added in order to
prevent overfitting by shrinking the model parameters. The
values of CSM components vary over a large range. For example,
in INV, with VI, VO as variables, the DC current ID is
on the microampere scale when both transistors are on, but on the
nanoampere scale when one of them is off and the cell is leaking.
The MSE loss is a function of absolute error. Thus, with
this loss, errors in small-magnitude values matter less
than errors in large-magnitude values. To address this, we can
log-transform the output, so that the relative error is used for
the loss calculation of the regression model, as shown in Eq. 6. An
issue with such an adjustment is that some of the values are
negative, which complicates the log-transform.
We resolved this issue with a simple shift of the data
toward positive values, subtracting the overall minimum ymin
from all data points.
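\[
\begin{aligned}
MSE &= \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N} \\
MSE_{log} &= \frac{\sum_{i=1}^{N} \big(\log(y_i - y_{min}) - \log(\hat{y}_i - y_{min})\big)^2}{N}
\end{aligned}
\tag{6}
\]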
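A small numerical example shows why the log-transformed loss of Eq. 6 balances values of different magnitudes. The sketch below is illustrative only: the function names are hypothetical, the sample currents are made up, and the small `eps` clamp is an added assumption to keep the logarithm defined when a shifted value is zero (as happens at the minimum itself).

```python
import math


def mse(y_true, y_pred):
    """Plain mean square error (first line of Eq. 6)."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def mse_log(y_true, y_pred, y_min, eps=1e-12):
    """Log-transformed MSE (second line of Eq. 6) after shifting by y_min.

    eps is an assumption of this sketch: it clamps shifted values so that
    log() stays defined when a value equals y_min.
    """
    total = 0.0
    for a, b in zip(y_true, y_pred):
        log_true = math.log(max(a - y_min, eps))
        log_pred = math.log(max(b - y_min, eps))
        total += (log_true - log_pred) ** 2
    return total / len(y_true)


# Made-up currents: one on the microampere scale, one on the nanoampere scale,
# each predicted 10% too high. y_min = 0.0 is a placeholder floor.
y_true = [2e-6, 2e-9]
y_pred = [2.2e-6, 2.2e-9]
plain = mse(y_true, y_pred)
logged = mse_log(y_true, y_pred, y_min=0.0)
print(plain, logged)
```

In the plain MSE the nanoampere sample contributes a factor of about a million less than the microampere sample, whereas both contribute equally to the log-transformed loss.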
TABLE VI: Choice of NN hidden layer size for single and
two input logic cells.

                   TT   FF   SS   FFHT   SSHT
MOSFET-HP 16nm
  INV              14   16   18   16     18
  NAND2            24   28   30   28     30
MOSFET-LP 16nm
  INV              20   20   24   22     26
  NAND2            28   32   30   32     32
FinFET-HP 20nm
  INV              20   20   20   26     24
  NAND2            34   30   34   36     36
FinFET-LP 20nm
  INV              20   20   20   26     20
  NAND2            30   36   36   40     38
TABLE VII: CSM simulation results of a full adder circuit in both FinFET and MOSFET technologies. The simulation time
improvements (A) is the ratio of the time required for CSM-LUT simulation over the one for CSM-NN. While CSM-LUT
results would not improve if they were run on a GPU (instead of a CPU), the improvement results for the CPU and GPU
implementation of CSM-NN are reported as ACPU and AGPU respectively. The hardware platform's specs are reported in
Table II and Table III. Esim is the measure of accuracy introduced in Eq. 5.

           MOSFET-HP 16nm       MOSFET-LP 16nm       FinFET-HP 20nm       FinFET-LP 20nm
Corner     Esim  ACPU  AGPU     Esim  ACPU  AGPU     Esim  ACPU  AGPU     Esim  ACPU  AGPU
Nominal    <2%   9.3   16.8     <2%   9.3   16.8     <2%   6.9   15.1     <2%   8.6   16.8
FF         <2%   8.6   16.8     <1%   7.4   15.1     <2%   8.6   16.8     <2%   6.8   15.1
SS         <2%   8.6   16.8     <2%   6.9   15.1     <1%   6.9   15.1     <1%   6.8   15.1
FFHT       <2%   8.6   16.8     <1%   7.4   15.1     <1%   6.8   15.1     <2%   6.6   15.1
SSHT       <2%   8.6   16.8     <1%   7.4   15.1     <1%   6.8   15.1     <1%   6.6   15.1
The normalization of data in regression problems would
help the solvers with faster convergence and better numer-
ical stability. Hence, normalization of inputs and outputs is
typically implemented inside the solver, such as that in the
Scikit-learn package [33] used in our implementation.
D. NN Size and Training for Logic Cells
To select the size of the hidden layer for each model, we
repeated the training process for various neuron numbers in
the range of 10 − 50. Preliminary results in our experiments
showed that the tanh nonlinear function provides better out-
comes compared to other functions such as sigmoid and ReLU.
As mentioned in Section III-B, L-BFGS optimization requires no
hyper-parameter tuning, e.g., no learning-rate or mini-batch size
adjustment.
The total number of generated data points is 500 per gate.
We trained the NN with 90% of this data (5-fold cross-
validation, 360 for training and 40 for validation) and then
tested on the other 10%. The split between training, validation,
and test datasets was done at random.
Next, we applied a few noisy input samples to the cell and
measured Esim. The minimum size of the hidden layer that
met Esim < 1% was chosen as the CSM-NN architecture for the
logic cell in the specific MCMM setup. The complete results
for the choice of architecture for INV and NAND2 are given
in Table VI for different MCMM setups.
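The selection rule described above can be sketched as a search over candidate sizes in increasing order. The evaluate_esim callback stands for the whole train-and-measure step and is hypothetical; the mock_esim function below is a placeholder with made-up values, not measured data.

```python
def select_hidden_size(candidate_sizes, evaluate_esim, threshold=0.01):
    """Return (size, esim) for the smallest size with esim below threshold, else None."""
    for size in sorted(candidate_sizes):
        esim = evaluate_esim(size)   # train a model of this size, then measure Esim
        if esim < threshold:
            return size, esim
    return None

def mock_esim(size):
    """Placeholder error curve (NOT measured data): error shrinks as the layer grows."""
    return 0.25 / size

# Candidate sizes within the 10 to 50 range used in the text.
choice = select_hidden_size(range(10, 51, 10), mock_esim)
```

With this placeholder curve the search stops at 30 neurons, the first candidate whose mock error falls below 1%. In practice evaluate_esim would be repeated per logic cell and per MCMM setup.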
E. Circuit Simulation
We evaluated our CSM-NN framework by
simulating a full-adder circuit (schematic shown in Fig. 4).
For the sake of a fair comparison, the HSPICE characterization
setup is the same for both CSM-NN and CSM-LUT. We
measured Esim by comparing output waveforms of HSPICE
as the baseline with those of CSM-NN simulations. The CPU
and GPU devices used in our experiments are introduced in
Table II and Table III, respectively. CSM-LUT is considered
to be computed on the CPU platform as it does not benefit
from GPU parallelization. The required computation resources
and latencies are calculated using the equations in Section III-A.
The results confirm that CSM-NN output waveforms match
those of HSPICE in regard to propagation delay with error
values limited to 0.1%. To better confirm the high accuracy of
CSM-NN, we compared its waveform similarity to HSPICE,
by measuring Esim. As listed in Table VII, Esim is limited
to 2%.
Fig. 4: Gate level schematic of the full adder circuit used in
our experiments (inputs A, B, Cin; outputs Cout, S).
V. CONCLUSIONS AND FUTURE WORK
CSM-NN, a scalable, technology-independent circuit
simulation framework, is proposed. CSM-NN aims to address
the efficiency concerns of existing tools that depend on
querying lookup tables stored in memory. Given the
underlying CPU and GPU parallel processing capabilities, our
framework replaces memorization with computation, utilizing
a set of optimized NN structures and training and inference
processing steps. The simulation latency of CSM-NN was
evaluated in multiple MOSFET and FinFET technologies
based on predictive technology models in various PVT corners
and modes. The results confirm that CSM-NN improves the
simulation speed by up to 6× on CPU platforms, compared
to a CSM-LUT baseline. CSM-NN further benefits from the
parallelization capabilities of GPUs, improving the simulation
speed by up to 15× when run on a GPU. CSM-NN
also provides high accuracy, keeping the waveform
similarity error within 2% compared to HSPICE. We believe
that applying CSM-NN in future simulation tools, such as
those for sign-off and MCMM analysis and optimization of
advanced VLSI circuits, can significantly improve
simulation accuracy and speed.
As part of our future work, we plan to investigate CSM-
NN on industrial circuits using accurate foundry technology
information including PVT variations. We also plan to enhance
our NNs to account for PVT corner parameters as inputs, to be
able to train NNs once for all modes and corners and evaluate
the cost vs speed and accuracy trade-off.
ACKNOWLEDGEMENT
This research was sponsored in part by a grant from the
Software and Hardware Foundations (SHF) program of the
National Science Foundation. The authors would also like to
thank Soheil Nazar Shahsavani and Mahdi Nazemi (of the
University of Southern California) for helpful discussions.
REFERENCES
[1] A. B. Kahng, U. Mallappa, and L. Saul, “Using machine learning to
predict path-based slack from graph-based timing analysis,” in Interna-
tional Conference on Computer Design (ICCD), 2018, pp. 603–612.
[2] M. Pedram and S. Nazarian, “Thermal modeling, analysis, and man-
agement in VLSI circuits: Principles and methods,” Proceedings of the
IEEE, vol. 94, no. 8, pp. 1487–1501, Aug 2006.
[3] L. Benini, A. Bogliolo, and G. De Micheli, “A survey of design
techniques for system-level dynamic power management,” IEEE Trans-
actions on Very Large Scale Integration (VLSI) Systems, vol. 8, no. 3,
pp. 299–316, June 2000.
[4] Cadence Inc., San Jose, California, U.S. Cadence
Encounter library characterization datasheet. [Online].
Available: https://www.cadence.com/content/cadence-www/en_US/documents/Archive/Archive1/library_characterizer_ds.pdf
[5] Synopsys Inc., Mountain View, California, U.S. PrimeTime datasheet.
[Online]. Available: https://www.synopsys.com/content/dam/synopsys/implementation&signoff/datasheets/primetime-ds.pdf
[6] A. Goel and S. Vrudhula, “Current source based standard cell model for
accurate signal integrity and timing analysis,” in Design, Automation and
Test in Europe (DATE), 2008, pp. 574–579.
[7] J. F. Croix and D. F. Wong, “Blade and razor: cell and interconnect delay
analysis using current-based models,” in Design Automation Conference
(DAC), 2003, pp. 386–389.
[8] R. Goyal and N. Kumar, “Current based delay models: A must for
nanometer timing,” Cadence Live Conference (CDNLive), 2005.
[9] I. Keller, K. Tseng, and N. Verghese, “A robust cell-level crosstalk
delay change analysis,” in International Conference on Computer-Aided
Design (ICCAD), 2004, pp. 147–154.
[10] A. Goel and S. Vrudhula, “Statistical waveform and current source based
standard cell models for accurate timing analysis,” in Design Automation
Conference (DAC), 2008, pp. 227–230.
[11] B. Amelifard, S. Hatami, H. Fatemi, and M. Pedram, “A current source
model for CMOS logic cells considering multiple input switching and
stack effect,” in Design, Automation and Test in Europe (DATE), 2008,
pp. 568–573.
[12] C. Knoth, H. Jedda, and U. Schlichtmann, “Current source modeling
for power and timing analysis at different supply voltages,” in Design,
Automation Test in Europe (DATE), 2012, pp. 923–928.
[13] S. Nazarian, H. Fatemi, and M. Pedram, “Accurate timing and noise
analysis of combinational and sequential logic cells using current source
modeling,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 19, no. 1, pp. 92–103, Jan 2011.
[14] H. Fatemi, S. Nazarian, and M. Pedram, “Statistical logic cell delay
analysis using a current-based model,” in Design Automation Conference
(DAC), 2006, pp. 253–256.
[15] H. Fatemi, S. Nazarian, and M. Pedram, “A current-based method for
short circuit power calculation under noisy input waveforms,” in Asia
and South Pacific Design Automation Conference (ASP-DAC), 2007, pp.
774–779.
[16] T. Cui, Y. Wang, X. Lin, S. Nazarian, and M. Pedram, “Semi-analytical
current source modeling of FinFET devices operating in near/sub-
threshold regime with independent gate control and considering process
variation,” in Asia and South Pacific Design Automation Conference
(ASP-DAC), 2014, pp. 167–172.
[17] S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco,
“GPUs and the future of parallel computing,” IEEE Micro, vol. 31, no. 5,
pp. 7–17, Sep. 2011.
[18] S. Hatami and M. Pedram, “Efficient representation, stratification, and
compression of variational CSM library waveforms using robust prin-
ciple component analysis,” in Design, Automation and Test in Europe
(DATE), 2010, pp. 1285–1290.
[19] Intel Broadwell (2018) CPU micro-architecture specification. [Online].
Available: https://www.7-cpu.com/cpu/Broadwell.html
[20] R. Raina, A. Madhavan, and A. Y. Ng, “Large-scale deep unsupervised
learning using graphics processors,” in International Conference on
Machine Learning (ICML), 2009, pp. 873–880.
[21] H. Mhaskar, Q. Liao, and T. Poggio, “When and why are deep networks
better than shallow ones?” in AAAI Conference on Artificial Intelligence,
2017, pp. 2343–2349.
[22] T. Wiatowski and H. Bölcskei, “A mathematical theory of deep
convolutional neural networks for feature extraction,” IEEE Transactions on
Information Theory, vol. 64, no. 3, pp. 1845–1866, March 2018.
[23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with
convolutions,” in Computer Vision and Pattern Recognition (CVPR),
2015, pp. 1–9.
[24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Computer Vision and Pattern Recognition (CVPR), 2016.
[25] B. C. Csáji, “Approximation with artificial neural networks,” Master’s
thesis, Faculty of Sciences, Eötvös Loránd University, Hungary, 2001.
[26] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel
programming with CUDA,” Queue, vol. 6, no. 2, pp. 40–53, Mar. 2008.
[Online]. Available: http://doi.acm.org/10.1145/1365490.1365500
[27] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,
2016, http://www.deeplearningbook.org.
[28] S. Ruder, “An overview of gradient descent optimization algorithms,”
arXiv, vol. abs/1609.04747, 2016.
[29] D. C. Liu and J. Nocedal, “On the limited memory BFGS method for
large scale optimization,” Mathematical Programming, vol. 45, no. 1,
pp. 503–528, Aug 1989.
[30] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y.
Ng, “On optimization methods for deep learning,” in International
Conference on Machine Learning (ICML), 2011, pp. 265–272.
[31] R. Bollapragada, D. Mudigere, J. Nocedal, H.-J. M. Shi, and P. T. P.
Tang, “A progressive batching L-BFGS method for machine learning,”
in International Conference on Machine Learning (ICML), 2016.
[32] D. Sinha, V. Zolotov, S. K. Raghunathan, M. H. Wood, and K. Kalafala,
“Practical statistical static timing analysis with current source models,”
in Design Automation Conference (DAC), 2016, pp. 113:1–113:6.
[33] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg,
J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and
E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of
Machine Learning Research, vol. 12, pp. 2825–2830, Nov. 2011.
[34] “Predictive Technology Model from Arizona State University,”
http://ptm.asu.edu/, accessed: 2019-05-20.
[35] M. S. Abrishami, A. Shafaei, Y. Wang, and M. Pedram, “Optimal
choice of FinFET devices for energy minimization in deeply-scaled
technologies,” in International Symposium on Quality Electronic Design
(ISQED), 2015, pp. 234–238.
[36] X. Zhang, D. Connelly, P. Zheng, H. Takeuchi, M. Hytha, R. J. Mears,
and T. K. Liu, “Analysis of 7/8-nm Bulk-Si FinFET technologies for
6T-SRAM scaling,” IEEE Transactions on Electron Devices, vol. 63,
no. 4, pp. 1502–1507, April 2016.
[37] H. Dadgour, V. De, and K. Banerjee, “Statistical modeling of
metal-gate work-function variability in emerging device technologies
and implications for circuit design,” in International Conference on
Computer-Aided Design (ICCAD), 2008, pp. 270–277.
[38] T. Matsukawa, S. O’uchi, K. Endo, Y. Ishikawa, H. Yamauchi, Y. X. Liu,
J. Tsukada, K. Sakamoto, and M. Masahara, “Comprehensive analysis
of variability sources of FinFET characteristics,” in Symposium on VLSI
Technology, 2009, pp. 118–119.
[39] X. Zhang, J. Li, M. Grubbs, M. Deal, B. Magyari-Köpe, B. M.
Clemens, and Y. Nishi, “Physical model of the impact of metal grain
work function variability on emerging dual metal gate MOSFETs and
its implication for SRAM reliability,” in International Electron Devices
Meeting (IEDM), 2009, pp. 1–4.
[40] Y. Li, C. Hwang, T. Li, and M. Han, “Process-variation effect, metal-gate
work-function fluctuation, and random-dopant fluctuation in emerging
CMOS technologies,” IEEE Transactions on Electron Devices, vol. 57,
no. 2, pp. 437–447, Feb 2010.
