An 8-bit In Resistive Memory Computing Core with Regulated Passive
  Neuron and Bit Line Weight Mapping by Zhang, Yewei et al.
Yewei Zhang, Student Member, IEEE, Kejie Huang, Senior Member, IEEE, Rui Xiao, Student Member, IEEE,
and Haibin Shen
Abstract—The rapid development of Artificial Intelligence (AI) and the Internet of Things (IoT) increases the demand for edge computing devices with low power and relatively high processing speed. Computing-In-Memory (CIM) schemes based on emerging resistive Non-Volatile Memory (NVM) show great potential for reducing the power consumption of AI computing. However, device inconsistency in non-volatile memory may significantly degrade the performance of the neural network. In this paper, we propose a low power Resistive RAM (RRAM) based CIM core that not only achieves high computing efficiency but also greatly enhances robustness through a bit line regulator and a bit line weight mapping algorithm. Simulation results show that the power consumption of our proposed 8-bit CIM core is only 3.61 mW (256×256). The SFDR and SNDR of the CIM core reach 59.13 dB and 46.13 dB, respectively. The proposed bit line weight mapping scheme improves the top-1 accuracy by 2.46% and 3.47% for AlexNet and VGG16, respectively, on the ImageNet Large Scale Visual Recognition Competition 2012 (ILSVRC 2012) in 8-bit mode.
Index Terms—In-memory computing, Non-volatile memory, Neuromorphic chip, Resistance inconsistency, Weight quantization and mapping
I. INTRODUCTION
In the past decade, with the Internet of Things, cloud computing, computer vision, and artificial intelligence becoming increasingly connected for perception, cognition, decision, and interaction, sensing devices in intelligent products are becoming the key interfaces to the real world. However, communication, storage, information retrieval, computation, and recognition face great challenges due to the extremely large amount of sensing data. Because data acquisition, processing, and analysis are separated, conventional intelligent systems suffer from high construction cost, high power consumption, low efficiency, and long latency [1]. To address these issues, the majority of AI computations will be moved to lightweight IoT devices [2]. Nevertheless, Moore's Law is coming to an end, and processor performance will benefit little from further scaling of Complementary Metal Oxide Semiconductor (CMOS) technology nodes. Therefore, we need to design new hardware architectures and software algorithms to meet the requirements of perception, computation, and storage on end devices with limited computation capability and storage resources.

Authors: K. Huang and H. Shen are with the College of Information Science & Electronic Engineering, Zhejiang University, 38 Zheda Road, Hangzhou, China, 310027, and also with Zhejiang Lab, Building 10, China Artificial Intelligence Town, 1818 Wenyi West Road, Hangzhou City, Zhejiang Province, China, email: huangkejie@zju.edu.cn; shen hb@zju.edu.cn. Y. Zhang and R. Xiao are with the College of Information Science & Electronic Engineering, Zhejiang University, 38 Zheda Road, Hangzhou, China, 310027, email: yeweizhang@zju.edu.cn; xiaor@zju.edu.cn
High-density, low-power emerging resistive Non-Volatile Memory (NVM) [3]–[11], which enables massively parallel Computing In-Memory (CIM), is a promising candidate for solving the above-mentioned issues [12]–[14]. Most works use the multilevel resistance of resistive memory for both storage and computation [15]–[18]. For example, Hewlett Packard Laboratories (HPL) proposed a Dot Product Engine (DPE) with inverting amplifiers [19], and [20] designed In-Situ Analog Arithmetic in Crossbars (ISAAC), which uses eight 4-level RRAM cells to represent a 16-bit weight. Although resistive NVM offers a potential solution as the CIM unit, its non-ideal properties greatly degrade the reliability of the system. Well-known non-idealities of resistive NVM include resistance that varies non-linearly with the bias voltage, level-to-level resistance variation, and cell-to-cell resistance variation, all of which cause significant quantization errors and thus accuracy loss in the network. To reduce the mapping errors and improve the linearity of the CIM system, a more reliable design is needed, possibly at the cost of increased computing energy.
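To make the multi-cell weight representation concrete, the sketch below splits a 16-bit unsigned weight into eight 2-bit slices (one per 4-level cell, as in the ISAAC example) and rebuilds it by shift-and-add. The function names are hypothetical, signed weights and device non-idealities are ignored, and this is an illustration of bit slicing in general rather than ISAAC's actual encoding.

```python
def slice_weight(weight: int, n_cells: int = 8, bits_per_cell: int = 2) -> list[int]:
    """Split an unsigned weight into per-cell levels, least significant slice first."""
    levels = 1 << bits_per_cell              # 4 resistance levels per cell
    if not 0 <= weight < levels ** n_cells:
        raise ValueError("weight does not fit in the given cells")
    slices = []
    for _ in range(n_cells):
        slices.append(weight % levels)       # level programmed into this cell
        weight //= levels
    return slices


def rebuild_weight(slices: list[int], bits_per_cell: int = 2) -> int:
    """Recombine cell levels: slice k is weighted by 2**(bits_per_cell * k)."""
    return sum(level << (bits_per_cell * k) for k, level in enumerate(slices))


if __name__ == "__main__":
    cells = slice_weight(0xBEEF)             # hypothetical 16-bit weight
    print(cells, rebuild_weight(cells) == 0xBEEF)
```

The same shift-and-add recombination underlies the binary (1 bit per cell) schemes discussed next; only `bits_per_cell` changes.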
[21], [22] proposed Serial-Input Non-Weighted Product (SINWP), whose inputs are modulated in time instead of by analog voltage, which addresses the non-linearity caused by the bias voltage. However, the digital-to-time converter greatly increases the computing time at high data widths. [23] proposed a Multiple Binary RRAM with Active Integrator (MBRAI) CIM core architecture, in which multiple binary RRAM cells store one weight. MBRAI saves considerable power because binary codes, rather than time signals, are applied at the input, so it requires only n CIM computations instead of 2^n. The n-bit input data are computed sequentially by the CIM core and weighted at the output neurons, which greatly improves linearity because the input voltage is identical for every bit. However, the power consumption of this scheme is dominated by the operational amplifiers (>95%) [23], which limits the computing efficiency of the CIM core to 0.61 Tera-Multiply-Accumulates per Second per Watt (TMACs/s/W). Moreover, the accuracy is still affected by quantization and by the inconsistency of the resistive NVM cells. To reduce power consumption, this paper proposes a CIM core with regulated passive integrators. A pseudo-binary weight quantization and bit line weight mapping method that addresses the resistance inconsistency is also introduced. The simulation results show that the power consumption of the proposed 256×256 CIM core in 8-bit mode is reduced by 98.2% compared with MBRAI.

arXiv:2008.11669v1 [cs.AR] 26 Aug 2020
The rest of this paper is organized in the following manner.
Section II introduces the background of CIM with resistive
NVMs. Section III shows the proposed circuit, problems
brought by resistance inconsistency, and corresponding opti-
mization. Finally, simulation results are presented in Section
IV with the conclusion in Section V.
II. BACKGROUND AND RELATED WORKS
The majority of computations in a neural network are matrix multiplication and accumulation operations, which can be implemented efficiently by the crossbar architecture shown in Fig. 1.
The processing units shown as black dots multiply the input
from word lines by the stored weight. The neuron represented
by the triangle accumulates the multiplication results at the
same bit line.
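As a minimal sketch of the crossbar in Fig. 1, assuming ideal linear devices and no wire resistance, each bit line collects the current $I_j = \sum_i V_i G_{ij}$: every cell applies Ohm's law and the bit line sums the cell currents by Kirchhoff's current law, which is exactly a vector-matrix product. The voltages and conductances below are hypothetical.

```python
def crossbar_currents(v_in: list[float], g: list[list[float]]) -> list[float]:
    """Ideal crossbar read: g[i][j] is the conductance (S) between word line i and bit line j.

    Each cell contributes v_in[i] * g[i][j] (Ohm's law) and each bit line sums the
    contributions (Kirchhoff's current law), giving the product v_in^T G in amperes.
    """
    if len(v_in) != len(g):
        raise ValueError("need one input voltage per word line")
    currents = [0.0] * len(g[0])
    for v, row in zip(v_in, g):
        for j, conductance in enumerate(row):
            currents[j] += v * conductance
    return currents


# Hypothetical example: 2 word lines, 2 bit lines
print(crossbar_currents([0.1, 0.2], [[1e-5, 2e-5], [3e-5, 4e-5]]))
```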
 
Fig. 1. Microarchitecture of a CIM core. The word lines receive input data from the preceding neurons, while the bit lines accumulate data at the neurons (triangles). The black dots are the computing units.
In conventional schemes, the weights of the neural network are stored in SRAM. For example, [24] proposed a 7-bit-input, 1-bit-weight MAC using a 10T SRAM cell. However, its precision is limited by the 1-bit weight, and its area is large due to the SRAM array. Emerging NVMs, which offer high density and a simpler structure as memory units, can greatly improve weight precision and reduce core area in CIM applications.

Fig. 2. Basic architecture of HPL's Dot-Product Engine.

Fig. 3. Simplified integration circuit with OPA.

A CIM core design with NVM, the DPE, is shown in Fig. 2. It employs a memristor crossbar for matrix multiplication, where each memristor stores a weight as its resistance. Once the input is converted to an analog voltage by the DAC, the output voltage is determined by the conductances of the memristors as

$$V_{out} = \sum V_{in} G R_f$$

where $R_f$ is the feedback resistance and $G$ is the conductance of the cross-point memristor device. The output voltage is then digitized by the ADC for data transmission. The DAC and ADC are power-hungry components, but they are necessary to resist noise and signal deformation during data transmission.
MBRAI, proposed in [23], moves the input DACs to the output and shares the ADC to lower power consumption. RRAM is chosen as the storage unit for its reconfigurability, high density, and low power consumption. However, precisely controlling the resistance of RRAM is very challenging. Therefore, MBRAI uses n RRAM cells with binary resistance states, where the high resistance state (HRS) represents 0 and the low resistance state (LRS) represents 1, to represent an n-bit weight and thus achieve a high Effective Number of Bits (ENOB) for weight mapping. Fig. 3 shows the simplified integration circuit of MBRAI. The input is applied bit by bit from the Least Significant Bit (LSB) to the Most Significant Bit (MSB), which makes the computation more reliable because every bit is processed identically. The significance of each bit of the input data and of the network weights is applied by charge redistribution at the neurons.

Although MBRAI achieves better computing reliability, two critical issues remain. First, the amplifiers in the active integrators account for more than 95% of the energy cost of the whole CIM core. Second, the resistance of the RRAM cells has a wide distribution, resulting in significant quantization errors when the weights of the neural network are mapped into the RRAM array. To address the first issue, a passive integrator is proposed, together with a regulator that improves the linearity of the integration results; the details are given in Section III.A. To address the second issue, a pseudo-binary quantization and bit line weight mapping method is proposed to reduce the impact of resistance inconsistency; the details are given in Section III.B.
Fig. 4. Passive integrator with amplifier removed and its integration process.
III. PROPOSED CIM CORE AND MAPPING METHOD
Although a passive integrator can significantly reduce power consumption by removing the amplifier, it suffers from serious nonlinearity. Fig. 4 shows the passive integrator circuit and its integration process, where the current decreases as the integrating voltage $V_C$ decreases. To improve the linearity of the circuit, we design the optimized n-bit integral multiplier shown in Fig. 5:
1) We swap the positions of the RRAM and the transistor in the 1T1R cell so that the read voltage on the RRAM cell is mainly determined by the gate and threshold voltages of the transistor. To distinguish it from the conventional structure, the new structure is named 1R1T.
2) The saturation current of the transistor in 1R1T can be affected by changes in the integrating voltage because of the channel length modulation effect. To minimize this impact, we add NMOS T0 on the bit line to isolate the integrating voltage from the drain voltage of 1R1T, which reduces the variation of the integrating current.
3) Because the load of the bit line depends on the number of active input lines and on the weight values, the linearity of the circuit is still affected by changes in the source voltage of T0 (the drain voltage of the 1R1T). Therefore, a regulator is added at T0 to stabilize the drain voltage of 1R1T.
4) Besides the nonlinearity in the bit line voltage, cell-to-cell variation makes the integrating currents of the devices inconsistent, which decreases the robustness of the system. To improve reliability, we propose a pseudo-binary quantization and bit line weight mapping method, with a corresponding circuit, that exploits the uncertainty of resistive NVM to reduce quantization error.
A. CIM Core with Regulated Passive Integrator
1) Core Design: Assuming the n-bit input sequence is $X_1, X_2, \ldots, X_l$ and the weights are $W_1, W_2, \ldots, W_l$, the multiply-and-accumulate (MAC) operation can be expressed as

$$Y = \sum_{i=1}^{l} X_i W_i = \sum_{i=1}^{l}\sum_{j=0}^{n-1} 2^j x_{i,j} W_i = \sum_{i=1}^{l}\sum_{j=0}^{n-1} 2^j \sum_{k=0}^{n-1} 2^k x_{i,j} w_{i,k} \tag{1}$$

where $x_{i,n-1}x_{i,n-2}\ldots x_{i,0}$ and $w_{i,n-1}w_{i,n-2}\ldots w_{i,0}$ are the binary representations of $X_i$ and $W_i$, respectively ($x_{i,j}, w_{i,k} \in \{0, 1\}$). Eq. 1 shows three nested accumulations. The proposed CIM core uses n integrator cells to obtain $\sum_{i=1}^{l} x_{i,j} w_{i,k}$ by charge integration, and the results are stored in the regulated passive neuron, composed of the capacitor array in Fig. 5, where charge redistribution yields $\sum_{i=1}^{l} x_{i,j} W_i$. The terms $\sum_{i=1}^{l} x_{i,j} W_i$ are in turn added up by charge redistribution to obtain $\sum_{i=1}^{l} X_i W_i$. All resistances on the same bit line are integrated simultaneously by a shared integrator, so the MAC is performed in parallel, which reduces the core area and increases computing speed. During the integration phase, multiple neurons are enabled at a time, and each input is split into n cycles processed from LSB to MSB. After integration and charge redistribution, the data conversion phase starts, in which the neurons convert the analog results into digital outputs.
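The triple sum in Eq. 1 can be checked numerically. The sketch below, with hypothetical random 8-bit operands, accumulates the binary partial sums $\sum_i x_{i,j} w_{i,k}$ (what one bit line integrates in one input cycle) with weights $2^{j+k}$ and compares the result with the direct dot product. It models only the arithmetic, not the analog circuit; the function names are illustrative.

```python
import random


def to_bits(value: int, n: int) -> list[int]:
    """Binary digits of value, LSB first, so to_bits(X_i, n)[j] is x_{i,j}."""
    return [(value >> j) & 1 for j in range(n)]


def mac_bitwise(xs: list[int], ws: list[int], n: int) -> int:
    """Right-hand side of Eq. 1: weight each binary partial sum by 2**(j + k)."""
    x_bits = [to_bits(x, n) for x in xs]
    w_bits = [to_bits(w, n) for w in ws]
    total = 0
    for j in range(n):              # input bit j, applied in cycle j (LSB first)
        for k in range(n):          # weight bit k, stored on bit line k
            # sum_i x_{i,j} * w_{i,k}: what one bit line integrates in one cycle
            partial = sum(xb[j] * wb[k] for xb, wb in zip(x_bits, w_bits))
            total += partial << (j + k)
    return total


rng = random.Random(0)
xs = [rng.randrange(256) for _ in range(16)]   # hypothetical 8-bit inputs
ws = [rng.randrange(256) for _ in range(16)]   # hypothetical 8-bit weights
print(mac_bitwise(xs, ws, 8) == sum(x * w for x, w in zip(xs, ws)))
```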
2) Integral Multiplier: The word line inputs shown in Fig. 5 are applied one bit at a time from LSB to MSB. Multiplication in the integral multiplier consists of an integration phase and a charge redistribution phase. In the integration phase, S2 is closed and S1, S3, and S4 are open. After integration, the charge is redistributed in the charge redistribution phase with S1 and S2 open and S3 and S4 closed. Taking $C_{n-1}$ as an example, the integrating voltage after the integration phase is

$$V_{c,n-1} = V^-_{c,n-1} - \frac{V_{D2} T}{C_f}\sum_{i=0}^{l-1}\frac{D_i}{R_i} \tag{2}$$

where $V^-_{c,n-1}$ is the initial voltage of $C_{n-1}$, $D_i$ is one input bit of the ith input line, l is the number of input lines, T is the integration time, $R_i$ is the equivalent resistance of the 1R1T unit of the ith input line, and $V_{D2}$ is the drain voltage of the 1R1T unit. The capacitances satisfy the constraint

$$C_f = 2C_{n-1} = 2^2 C_{n-2} = 2^3 C_{n-3} = \cdots = 2^n C_0 \tag{3}$$

For one input bit cycle, with the initial integrating voltage set to $V_C^-$, the integrating voltage $V_S$ after one step of charge redistribution is

$$\begin{aligned} V_S &= \frac{V_{c,n-1}C_{n-1} + V_{c,n-2}C_{n-2} + \cdots + V_{c,0}C_0 + V_C^- C_0}{C_{n-1} + C_{n-2} + \cdots + 2C_0} \\ &= V_C^- - \frac{V_{D2}T}{C_f}\left(2^{-1}\sum_{i=0}^{l-1}\frac{D_{i,n-1}}{R_{i,n-1}} + 2^{-2}\sum_{i=0}^{l-1}\frac{D_{i,n-2}}{R_{i,n-2}} + \cdots + 2^{-n}\sum_{i=0}^{l-1}\frac{D_{i,0}}{R_{i,0}}\right) \end{aligned} \tag{4}$$
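A behavioral sketch of Eqs. 2 to 4 follows, assuming ideal switches and capacitors, a common initial voltage $V_C^-$ on every capacitor, the same input bit $D_i$ on all bit lines of row i, and hypothetical component values. Consistent with Eq. 2, each integrator k is assumed to integrate on a total capacitance $C_f$; then only its share $C_k = C_f/2^{n-k}$ from Eq. 3, plus the extra $C_0$ held at $V_C^-$, takes part in redistribution. The charge-sharing result is compared with the closed form of Eq. 4.

```python
import random


def integrate(v_init, v_d2, t_int, c_f, inputs, resistances):
    """Eq. 2: voltage of one integrator after discharging through the active 1R1T units."""
    current = sum(v_d2 / r for d, r in zip(inputs, resistances) if d)
    return v_init - current * t_int / c_f


def redistribute(v_c, v_init, c_f):
    """Charge sharing of C_{n-1}..C_0 (at v_c[k]) with one extra C_0 held at v_init.

    v_c[k] is the voltage of integrator k, with k = 0 the LSB bit line.
    """
    n = len(v_c)
    caps = [c_f / 2 ** (n - k) for k in range(n)]         # Eq. 3: C_k = C_f / 2^(n-k)
    charge = sum(v * c for v, c in zip(v_c, caps)) + v_init * caps[0]
    return charge / (sum(caps) + caps[0])                  # total capacitance equals C_f


def eq4(v_init, v_d2, t_int, c_f, inputs, r_cols):
    """Closed form of Eq. 4; r_cols[k][i] is the resistance of row i on bit line k."""
    n = len(r_cols)
    acc = sum(2.0 ** -(n - k) * sum(d / r for d, r in zip(inputs, r_cols[k]))
              for k in range(n))
    return v_init - v_d2 * t_int / c_f * acc


if __name__ == "__main__":
    rng = random.Random(1)
    inputs = [1, 0, 1, 1]                                  # one input bit per word line
    r_cols = [[rng.uniform(40e3, 60e3) for _ in inputs] for _ in range(8)]
    v_c = [integrate(1.0, 0.2, 20e-9, 1e-12, inputs, col) for col in r_cols]
    print(redistribute(v_c, 1.0, 1e-12), eq4(1.0, 0.2, 20e-9, 1e-12, inputs, r_cols))
```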
Fig. 5. Integration circuit without amplifier (labeled blocks: 1, 1R1T unit; 2, T0; 3, regulator and its bias; 4, MUX for bit line weight mapping; regulated passive neurons; sampling capacitance inside the ADC).
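The ADC's sampling capacitance $C_S$, which is connected to the integrator's capacitor array, is used at the same time to add up the n partial products. Letting $C_S = C_f$, the new $V_{out}$ is

$$V_{out} = 2^{-1}\left(V_S + V^-_{out}\right) \tag{5}$$

where $V^-_{out}$ is the previous voltage of $C_S$. Assuming the initial voltage of $C_S$ is $V_{init}$, after n steps of charge redistribution the voltage change is

$$\begin{aligned} \Delta V_{out} &= V_{init} - \left(2^{-n}V_{init} + 2^{-n}V_{S,0} + \cdots + 2^{-1}V_{S,n-1}\right) \\ &= 2^{-n}\left[\left(V_{init} - V_{S,0}\right) + \cdots + 2^{n-1}\left(V_{init} - V_{S,n-1}\right)\right] \\ &= 2^{-n}\sum_{j=0}^{n-1} 2^j \Delta V_{S,j} \end{aligned} \tag{6}$$

where $V_{S,j}$ is the value of $V_S$ after the jth integration (so $V_{S,n-1}$ is the last one) and $\Delta V_{S,j} = V_{init} - V_{S,j}$ is the corresponding voltage change. As long as $\Delta V_{S,j}$ is designed to represent $\sum_{i=1}^{l} x_{i,j} W_i$, Eq. 6 gives $2^{-n}\sum_{i=1}^{l}\sum_{j=0}^{n-1} 2^j x_{i,j} W_i$ (j = 0, ..., 7 in 8-bit mode), i.e., the MAC result of Eq. 1 up to the constant scale factor $2^{-n}$.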
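Eq. 6 can be verified by iterating Eq. 5. The sketch below, with hypothetical $V_{S,j}$ values and ideal charge sharing between equal capacitors ($C_S = C_f$), applies $V_{out} \leftarrow (V_S + V^-_{out})/2$ for j = 0, ..., n-1 (LSB first) and compares the total change with $2^{-n}\sum_j 2^j \Delta V_{S,j}$, taking $\Delta V_{S,j} = V_{init} - V_{S,j}$.

```python
def accumulate(v_init: float, v_s: list[float]) -> float:
    """Iterate Eq. 5 with C_S = C_f: each step halves the charge already on C_S."""
    v_out = v_init
    for v in v_s:                    # v_s[j] is V_{S,j}, LSB cycle first
        v_out = 0.5 * (v + v_out)
    return v_out


def eq6(v_init: float, v_s: list[float]) -> float:
    """Closed form of Eq. 6 for the total change of V_out."""
    n = len(v_s)
    return 2.0 ** -n * sum(2 ** j * (v_init - v) for j, v in enumerate(v_s))


v_init = 1.0
v_s = [0.95, 0.80, 0.91, 0.70, 0.88, 0.76, 0.99, 0.83]   # hypothetical V_{S,j}
print(v_init - accumulate(v_init, v_s), eq6(v_init, v_s))
```

Because each halving step shrinks earlier contributions by another factor of two, the last (MSB) cycle ends up with weight $2^{-1}$ and the first (LSB) cycle with $2^{-n}$, which is why the inputs are applied from LSB to MSB.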
3) Regulated Passive Integrator: By switching the 1T1R cell of Fig. 3 to the 1R1T unit of Fig. 5, we obtain the following equations

$$I = \frac{1}{2}K_2\left(V_{G2} - V_R - V_{th2}\right)^2 \tag{7}$$

$$I = \frac{V_R}{R} \tag{8}$$

where $K_2$ is the device parameter of T2, $V_{th2}$ is its threshold voltage, R is the resistance of the RRAM device, $V_R$ is the read voltage across the RRAM, and I is the integrating current passing through the 1R1T unit. From Eqs. 7 and 8, we obtain

$$V_R = V_{G2} - V_{th2} - \frac{\sqrt{2K_2 R\left(V_{G2} - V_{th2}\right) + 1} - 1}{K_2 R} \tag{9}$$
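To check Eq. 9, the sketch below computes $V_R$ from the closed form for a few resistances and confirms that the square-law current of Eq. 7 equals the Ohmic current of Eq. 8. The device values (`v_g2`, `v_th2`, `k2`) are hypothetical, chosen only so that T2 stays in saturation.

```python
import math


def read_voltage(v_g2: float, v_th2: float, k2: float, r: float) -> float:
    """Eq. 9: read voltage V_R across the RRAM of a 1R1T unit."""
    a = v_g2 - v_th2                        # the (V_G2 - V_th2) term of Eq. 9
    return a - (math.sqrt(2 * k2 * r * a + 1) - 1) / (k2 * r)


def check_point(v_g2: float, v_th2: float, k2: float, r: float):
    """Return V_R and the currents predicted by Eq. 7 (T2 saturated) and Eq. 8 (Ohm's law)."""
    v_r = read_voltage(v_g2, v_th2, k2, r)
    i_sat = 0.5 * k2 * (v_g2 - v_r - v_th2) ** 2
    i_ohm = v_r / r
    return v_r, i_sat, i_ohm


for r in (10e3, 50e3, 200e3):               # hypothetical LRS values in ohms
    v_r, i_sat, i_ohm = check_point(v_g2=0.9, v_th2=0.4, k2=1e-3, r=r)
    print(f"R={r:.0f} ohm  V_R={v_r:.4f} V  I_sat={i_sat:.3e} A  I_ohm={i_ohm:.3e} A")
```

Eq. 9 is simply the positive root of the quadratic obtained by equating Eqs. 7 and 8, so a closed form is used here instead of an iterative solver.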
The drain voltage of T2 ($V_{D2}$), which is isolated from the integrating voltage by T0, satisfies

$$I_b = \frac{1}{2}K_0\left(V_{G0} - V_{D2} - V_{th0}\right)^2 \tag{10}$$

where $I_b$ is the integrating current of the bit line and $K_0$ is the device parameter of T0. The proposed regulator circuit shown in Fig. 5 stabilizes $V_{D2}$ of the 1R1T units through negative feedback. T1 operates in the saturation region, which gives

$$I_{ref} = \frac{1}{2}K_1\left(V_{D2} - V_{th1}\right)^2 \tag{11}$$

where $K_1$ is the device parameter of T1 and $V_{th1}$ is its threshold voltage. From Eqs. 10 and 11, we get

$$V_{G0} = V_{th1} + \sqrt{\frac{2I_{ref}}{K_1}} + V_{th0} + \sqrt{\frac{2I_b}{K_0}} \tag{12}$$

$$V_{D2} = V_{th1} + \sqrt{\frac{2I_{ref}}{K_1}} \tag{13}$$

Since $I_{ref}$ is constant, the drain voltage $V_{D2}$ of the 1R1T unit is stabilized by the regulator: a change in the bit line current $I_b$ is absorbed by $V_{G0}$ rather than by $V_{D2}$.
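The sketch below solves Eqs. 10 and 11 directly (both transistors in saturation, with the 1/2 factor as written) for a few bit line currents, showing that $V_{D2}$ depends only on $I_{ref}$ while $V_{G0}$ tracks $I_b$. It is a static operating-point calculation under hypothetical device values, not a model of the feedback loop's dynamics; `regulator_point` is an illustrative name.

```python
import math


def regulator_point(i_ref, i_b, k0, k1, v_th0, v_th1):
    """Static operating point from Eqs. 10 and 11 (T0 and T1 in saturation)."""
    v_d2 = v_th1 + math.sqrt(2 * i_ref / k1)         # Eq. 11: set by I_ref only
    v_g0 = v_d2 + v_th0 + math.sqrt(2 * i_b / k0)    # Eq. 10: absorbs changes in I_b
    return v_d2, v_g0


params = dict(i_ref=5e-6, k0=2e-3, k1=1e-3, v_th0=0.4, v_th1=0.4)   # hypothetical
for i_b in (1e-6, 10e-6, 50e-6):                     # light to heavy bit line load
    v_d2, v_g0 = regulator_point(i_b=i_b, **params)
    print(f"I_b={i_b:.0e} A  V_D2={v_d2:.4f} V  V_G0={v_g0:.4f} V")
```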
B. Pseudo-binary Quantization and Bit Line Weight Mapping Method
Because the weights of the neural network are quantized to n bits rather than kept as continuous values, the quantization errors introduced when mapping the weights into the CIM system affect inference accuracy. Moreover, the resistance distribution of resistive NVM can make the quantization error worse. It is therefore necessary to discuss the quantization method and the corresponding errors in this section. To reduce the quantization error caused by cell-to-cell variation, a pseudo-binary quantization and bit line weight mapping method is proposed.
1) Quantization Error with NVM: Quantization is an important method for compressing neural networks and accelerating computation, and uniform quantization is its most basic form. A typical uniform quantizer can be expressed as

$$Q(x) = \Delta \cdot \left\lfloor \frac{x}{\Delta} + \frac{1}{2} \right\rfloor \tag{14}$$

where Δ is the quantization step size and x is the value to be quantized. When Δ is small relative to the variation of the signal being quantized, the rounding error is approximately uniformly distributed on [−Δ/2, Δ/2], and it is straightforward to show that the mean squared error produced by this rounding operation, also called the quantization noise power, is Δ²/12. By symmetry, the calculation needs only half of the interval:

$$QE = \int_{0}^{\Delta/2} \frac{x^2}{\Delta/2}\,dx = \frac{\Delta^2}{12} \tag{15}$$

The maximum ($w_{max}$) and minimum ($w_{min}$) of the data range and the number of quantization bits n determine the step size, since they usually satisfy

$$\Delta \times 2^n = w_{max} - w_{min} \tag{16}$$
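A quick numerical check of Eqs. 14 to 16: the sketch below quantizes evenly spaced inputs over a hypothetical range [−1, 1] with 4 bits and compares the empirical mean squared error with Δ²/12. Evenly spaced samples stand in for a signal that varies much more than one step; the function names are illustrative.

```python
import math


def quantize(x: float, step: float) -> float:
    """Uniform rounding quantizer of Eq. 14."""
    return step * math.floor(x / step + 0.5)


def step_size(w_min: float, w_max: float, n_bits: int) -> float:
    """Step size from Eq. 16: step * 2**n_bits = w_max - w_min."""
    return (w_max - w_min) / 2 ** n_bits


def quantization_mse(w_min: float, w_max: float, n_bits: int, samples: int = 100_000):
    """Empirical MSE over evenly spaced inputs, next to the Delta**2 / 12 prediction of Eq. 15."""
    step = step_size(w_min, w_max, n_bits)
    xs = (w_min + (w_max - w_min) * (k + 0.5) / samples for k in range(samples))
    mse = sum((x - quantize(x, step)) ** 2 for x in xs) / samples
    return mse, step ** 2 / 12


print(quantization_mse(-1.0, 1.0, 4))
```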
Considering the resistance distribution, the practical quantizer becomes nonlinear, with unequal step sizes $\Delta_i$:

$$\sum_{i=1}^{2^n} \Delta_i = w_{max} - w_{min} \tag{17}$$

Assume the resistance follows a general normal distribution

$$f\left(x \mid \mu, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{18}$$

The Probability Density Function (PDF) of the quantization error is then the noncentral chi-squared distribution with one degree of freedom (k = 1), where λ is the noncentrality parameter. The mean of the quantization error is

$$\mu = k + \frac{\lambda}{12} = 1 + \frac{\mu^2}{12} = 1 + \frac{\Delta^2}{12} = 1 + \frac{(w_{max} - w_{min})^2}{12 \times 2^{2n}} \tag{19}$$

and its variance is

$$\sigma^2 = 2(k + 2\lambda) = 2 + 4\mu^2 = 2 + 4\Delta^2 = 2 + \frac{(w_{max} - w_{min})^2}{2^{2(n-1)}} \tag{20}$$

As Eqs. 19 and 20 show, the quantization error increases greatly when the resistance is distributed rather than fixed. Since quantization errors accumulate through the network, the accuracy can be greatly reduced.
2) Resistance Measurement: The proposed quantization and mapping method requires the LRS resistance values of the RRAM array, so we first set all memory units to LRS and then read the resistances of the array through the ADC in a resistance reading phase. The reading process consists of an integration phase and a charge redistribution phase, with switch settings that differ from those used for multiplication. For example, when reading the resistance unit $R_{n-1,m-1}$ in Fig. 5, only the switches on the same bit line as $R_{n-1,m-1}$ are used, while the others stay open. In the integration phase, S1, S3, and S4 are open, S2 is closed, and the input of the (m−1)th word line is 1 while all others are 0. The integration result is a special case of Eq. 2 with l = 1 and D = 1. When the ADC reads the integrating voltage during charge redistribution, S2, S3, and S4 are closed and S1 is open. The voltage read by the ADC is

$$V_{out} = \frac{V_{init} + V_S}{2} \tag{21}$$

where $V_{init}$ is the initial voltage of both the sampling and integration capacitances, and $V_S$ is the integration result. The integration process satisfies

$$V_{init} - V_S = \frac{IT}{C_f} = \frac{V_{D2} T}{R C_f} \tag{22}$$

where I is the integrating current passing through the 1R1T unit, T is the integration time, R is the resistance of the 1R1T unit, and $V_{D2}$ is the read voltage of the bit line. Combining Eqs. 21 and 22, we get

$$R = \frac{V_{D2} T}{2C_f\left(V_{init} - V_{out}\right)} \tag{23}$$
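The readout can be sanity-checked as a round trip: a forward model built from Eqs. 21 and 22 produces the sampled voltage, and Eq. 23 recovers the resistance. The sketch assumes ideal capacitors and a constant bit line read voltage $V_{D2}$; the 110 ns integration time echoes the resistance-reading simulation in Section IV, while $C_f$ = 1 pF and $V_{D2}$ = 0.2 V are hypothetical.

```python
def read_out(r, v_init, v_d2, t_int, c_f):
    """Forward model: Eq. 22 gives V_S, then Eq. 21 gives the voltage sampled by the ADC."""
    v_s = v_init - v_d2 * t_int / (r * c_f)
    return (v_init + v_s) / 2


def resistance_from_vout(v_out, v_init, v_d2, t_int, c_f):
    """Eq. 23: invert the readout to recover the 1R1T resistance."""
    return v_d2 * t_int / (2 * c_f * (v_init - v_out))


for r in (50e3, 100e3, 200e3):                      # hypothetical LRS values in ohms
    v = read_out(r, 1.0, 0.2, 110e-9, 1e-12)
    print(f"R={r:.0f}  V_out={v:.4f} V  recovered={resistance_from_vout(v, 1.0, 0.2, 110e-9, 1e-12):.0f}")
```

Note that the sensitivity drops for large R, since $V_{init} - V_{out}$ shrinks as $1/R$; this is why a long integration time is used when reading a single cell.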
3) Quantization and Mapping Method: Since the normalized resistance in LRS is not exactly a digital 1, our mapping scheme uses a pseudo-binary code in which the significance of the bits from MSB to LSB is the same as in a conventional binary code. The main difference is that the value of the pseudo-binary code depends on the resistances of the memory units:

$$\hat{w} = r_{n-1} \times 2^{n-1} + r_{n-2} \times 2^{n-2} + \cdots + r_0 \times 2^0 \tag{24}$$

where, for a cell in LRS, $r_i$ is the normalized resistance of the ith bit of the weight (with mean value 1). For a cell in HRS, since its resistance can be much larger than the low resistance, $r_i$ is set to 0, and the uncertainty of the high resistance is ignored in this paper. Weights are quantized from MSB to LSB, and the ith cell is set to LRS (q = 1) according to the condition

$$q = \lnot\left[\left(r_i \times m_i - w_{res} > 0.5\right) \lor \left(r_i \le 0.5\right) \lor \left(r_i \times m_i > 2 \times w_{res}\right)\right] \tag{25}$$

where $r_i$ is the normalized memory resistance of the ith bit, $m_i = 2^i$ is the importance of that bit as in Eq. 24, and $w_{res}$ is the remaining weight after partial quantization. The term $(r_i \le 0.5)$ discards devices whose LRS resistance is too large. The term $r_i \times m_i - w_{res} > 0.5$ checks whether the product of the bit importance and the memory resistance exceeds the remaining weight by more than 0.5. The term $r_i \times m_i > 2 \times w_{res}$ minimizes the quantization error of the LSB: because of the resistance distribution, $|r_0 \times m_0 - w_{res}|$ could be larger than $w_{res}$, so the memory cell should be left in HRS whenever $|r_0 \times m_0 - w_{res}| > w_{res}$ to minimize the quantization error.
However, the initial order of the memory cells may not be the best one for minimizing the quantization error. For example, assume w = 13.4 and four memory units with normalized resistances 1.05, 1.1, 1.125, and 0.93 (from MSB to LSB) are used to quantize the weight, with the quantization error defined as w − ŵ. The conventional binary code gives a quantization error of 0.4 (4'b1101). With the given order, the resistance states of the four cells are low, low, high, and low, which reduces the quantization error to -0.33. Furthermore, if we swap the third cell with the MSB cell, the resistances of the four cells become 1.125, 1.1, 1.05, and 0.93. In this order, the resistance states can be set to low, low, high, and high, which reduces the quantization error to 0. This example shows that the order of the memory units is very important for minimizing the quantization error.
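The MSB-to-LSB rule can be sketched as follows and reproduces both outcomes of the example above. A cell is set to LRS unless one of the three skip conditions holds: its contribution r·m exceeds the remaining weight by more than 0.5, r ≤ 0.5, or r·m > 2·w_res, where m = 2^i is the bit's importance. Only the condition is taken from the text; `pseudo_binary` and its argument layout (cells listed MSB first) are illustrative choices.

```python
def pseudo_binary(w: float, r_msb_first: list[float]) -> tuple[list[int], float]:
    """Greedy MSB-to-LSB pseudo-binary quantization with the skip conditions of Eq. 25.

    r_msb_first[k] is the normalized LRS value of the cell used for bit n-1-k.
    Returns the cell states (1 = LRS, 0 = HRS) and the residual w - w_hat.
    """
    n = len(r_msb_first)
    w_res = w
    states = []
    for k, r in enumerate(r_msb_first):
        m = 2 ** (n - 1 - k)                    # importance of this bit
        skip = (r * m - w_res > 0.5) or (r <= 0.5) or (r * m > 2 * w_res)
        q = 0 if skip else 1
        states.append(q)
        w_res -= q * r * m                      # HRS cells contribute 0 (Eq. 24)
    return states, w_res


print(pseudo_binary(13.4, [1.05, 1.1, 1.125, 0.93]))   # original order
print(pseudo_binary(13.4, [1.125, 1.1, 1.05, 0.93]))   # third cell swapped with MSB
```

The first call yields states [1, 1, 0, 1] (low, low, high, low) with a residual of about -0.33, and the second yields [1, 1, 0, 0] with a residual of about 0, matching the example.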
A traversal algorithm can search all possible orders and pick the one with the minimal quantization error for configuration in the chip. However, it may require a long search time, since its computational complexity is O(n!). Moreover, the cells on the same bit line must occupy the same position in the order. Assuming the weight matrix to be quantized has size R×1 and the RRAM array has size R×C, where C means the weight is quantized to C bits, we propose a greedy mapping algorithm whose loss when quantizing the ith bit is defined as

$$loss = \max_{j \in R}\left(w_{i-1,j} - \hat{w}_{i,j}\right) \cdot \sum_{j=1}^{R}\left(w_{i-1,j} - \hat{w}_{i,j}\right)^2 \tag{26}$$

where $w_{i-1,j}$ is the remaining value of the jth weight after partial quantization and $\hat{w}_{i,j}$ is the value quantized by the pseudo-binary method in the ith bit of the jth row. Eq. 26 takes both the worst case and the average case of the search results into account. Bit line selection proceeds from the MSB, which influences the mapping result most, to the LSB. At each step, the algorithm traverses the remaining bit lines and chooses the one with the minimum loss as the ith bit. To apply the algorithm in the circuit, the n* MUXn in Fig. 5 switches the connections between the bit lines of the RRAM array and the integrators. When mapping the weights to the core, all RRAMs are first set to LRS; then, according to the proposed mapping method, RRAMs with value 0 are set to HRS. With this bit line weight mapping method, the computational complexity is reduced from O(n!) to O(n²).
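A minimal sketch of the greedy bit line selection follows. For each bit position from MSB to LSB, it tries every unused bit line (column), applies the pseudo-binary decision of Eq. 25 to every row, and keeps the column with the smallest loss, so about n(n+1)/2 column evaluations are needed instead of n! orders. Because the extracted form of Eq. 26 is ambiguous, the loss here is an assumption: the maximum absolute residual multiplied by the sum of squared residuals. All names are hypothetical.

```python
def pseudo_bit(w_res: float, r: float, m: float) -> int:
    """Bit decision of Eq. 25 for one cell: 1 (LRS) unless a skip condition holds."""
    skip = (r * m - w_res > 0.5) or (r <= 0.5) or (r * m > 2 * w_res)
    return 0 if skip else 1


def greedy_bitline_order(weights: list[float], r: list[list[float]]):
    """Greedy bit line mapping.

    weights[j] is the target weight of row j; r[j][col] is the normalized LRS value of
    the cell at (row j, column col). Returns the column chosen for each bit (MSB first)
    and the final residuals. Assumed loss: max |residual| * sum(residual**2).
    """
    n_bits = len(r[0])
    remaining = list(range(n_bits))
    residual = list(weights)
    order = []
    for pos in range(n_bits):
        m = 2 ** (n_bits - 1 - pos)                    # importance of the bit being filled
        best = None
        for col in remaining:                          # try every unused bit line
            new_res = [res - pseudo_bit(res, row[col], m) * row[col] * m
                       for res, row in zip(residual, r)]
            loss = max(abs(e) for e in new_res) * sum(e * e for e in new_res)
            if best is None or loss < best[0]:         # ties keep the first column
                best = (loss, col, new_res)
        _, col, residual = best
        order.append(col)
        remaining.remove(col)
    return order, residual


if __name__ == "__main__":
    # Single-row version of the 13.4 example: columns hold 1.05, 1.1, 1.125, 0.93
    print(greedy_bitline_order([13.4], [[1.05, 1.1, 1.125, 0.93]]))
```

On the single-row example, the greedy search places the 1.125 cell at the MSB and the 1.1 cell next, reaching a residual of about 0, the same order found by hand above. With many rows, one column order is shared by all rows, which is the constraint that the cells on one bit line occupy the same position.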
IV. SIMULATION RESULTS
In this section, we first verify the functionality of the multiplication and resistance reading processes. We then evaluate the dynamic performance and energy cost of the circuit, and compare it with other CIM schemes at the core level and the network level. Finally, we present the robustness of the circuit. The circuit simulations are done in Cadence Analog Mixed Signal (AMS) with a 45nm generic Process Design Kit (PDK), and the network simulations are done on the Caffe platform.

Fig. 6. Transient simulation results of (a) the integration phase and charge redistribution phase for one input bit, and (b) the core's multiplication process with 8'b10111010 as input and 8'b11101100 as weight.

Fig. 7. Process of resistance reading, with the states of the controlling switches.
A. Functional Verification
1) Multiplication Process Verification: We simulate the computing process shown in Fig. 6 to check the correctness of the proposed circuit in 8-bit mode. Fig. 6(a) presents the integration phase in one integrator: the integrating voltage $V_C$ shown in Fig. 5 is reset to 1 V at 384 ns, and the integration phase starts at 393 ns. After 20 ns, the integration phase is completed and $V_C$ has decreased to 745.2 mV. Charge redistribution then starts at 415 ns. When charge redistribution is done, the 8 integrating voltages are converted to $V_{out}$. After that, $V_C$ is reset to 1 V for the next integration. Fig. 6(b) shows the whole multiplication process of an 8-bit input (8'b10111010) and an 8-bit weight (8'b11101100). The input sequence is applied from LSB to MSB, and after 8 cycles of integration and charge redistribution, the output voltage $V_{out}$ is 831.6 mV, which the ADC converts to the digital result 8'b10101011. The theoretical output voltage and digital result are 831.5 mV and 8'b10101011, respectively. Therefore, the design meets its functional requirement.

TABLE I
CIM CORE PERFORMANCE COMPARISON BETWEEN MBRAI AND THE PROPOSED CORE

| | MBRAI [23] | Proposed |
|---|---|---|
| Supply Voltage | 1.1 V | 1 V |
| Computing Speed | 1.85 M/s | 1.85 M/s |
| SFDR | 67.42 dB | 59.13 dB |
| SNDR | 45.48 dB | 46.13 dB |
| ENOB | 7.26 bit | 7.37 bit |

TABLE II
ENERGY COST COMPARISON BETWEEN MBRAI AND PROPOSED CIM CORE

| | MBRAI [23] | Proposed |
|---|---|---|
| Technology | 45 nm | 45 nm |
| Supply Voltage | 1.1 V | 1 V |
| System Clock | 16.7 MHz | 16.7 MHz |
| Read-voltage circuit | Integral amplifier: 0.22 mW | Regulator circuit: 1.11 µW |
| Core (256×256) | 199.68 mW | 3.61 mW |
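As a quick arithmetic check of the Fig. 6(b) example, the sketch below assumes (our reading; the text does not state the output scaling) that the 8-bit digital result corresponds to the eight most significant bits of the 16-bit product. Under that assumption, 186 × 236 = 43896 and 43896 >> 8 = 171 = 8'b10101011, which agrees with the reported digital result. The helper name `top_bits_of_product` is hypothetical.

```python
def top_bits_of_product(x: int, w: int, n: int = 8) -> int:
    """Assumed output scaling: keep the n most significant bits of the 2n-bit product."""
    return (x * w) >> n


x, w = 0b10111010, 0b11101100            # input and weight from Fig. 6(b)
print(x * w, format(top_bits_of_product(x, w), "08b"))
```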
2) Resistance Measurement Verification: Fig. 7 presents the resistance measuring process of one 1R1T unit, including the simulated states of the switches on the same bit line. The output voltage is first set to 1 V, and the integration phase starts at 190 ns. Since only one 1R1T unit is active, the integrating current is small, so the integration time is set to 110 ns, much longer than the 20 ns used for the MAC operation. After integration, the sampling phase (i.e., the charge redistribution phase) starts at 440 ns and the output voltage is 0.994 V, which the ADC then converts to a digital output.
B. Performance Evaluation
1) Circuit Level Performance: Table I compares the dynamic performance of MBRAI and the proposed core. The computing speed, SFDR, SNDR, and ENOB of the proposed CIM core are 1.85 M/s, 59.13 dB, 46.13 dB, and 7.37 bit, respectively, close to those of MBRAI. Table II compares the power cost of the proposed scheme and MBRAI. MBRAI consumes 0.22 mW on amplifiers to keep the read voltage stable, whereas the proposed circuit consumes only 1.11 µW on the regulator circuit, and the total power consumption of the 256×256 core is reduced by 98.2%.
2) Core Level Comparison: Table III compares the proposed scheme with other CIM core schemes at the core level. The simulation results show that the proposed design achieves energy efficiencies of 553.01 TMACs/s/W with 2-bit inputs and 2-bit weights, 205.30 TMACs/s/W with 4-bit inputs and 4-bit weights, and 33.63 TMACs/s/W with 8-bit inputs and 8-bit weights. MBRAI achieves 77.76 TMACs/s/W with 1-bit inputs and 3-bit weights, 38.8 TMACs/s/W with 2-bit inputs and 3-bit weights, and 0.61 TMACs/s/W with 8-bit inputs and 8-bit weights, so the proposed scheme is much more energy efficient (55.13 times higher with 8-bit inputs and 8-bit weights). Although [27] achieves a low average power consumption in the fixed-4 input and fixed-4 weight pattern, the throughput of that core is limited by its rate coding scheme. Moreover, the power consumption in [27] increases with the input value, so it may be much higher in practice. Compared with other CIM schemes, the proposed CIM core achieves better energy efficiency.
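The headline ratios can be reproduced from the reported figures. This is plain arithmetic on the numbers in Table II and in the text above, not a new measurement.

```python
mbrai_core_mw, proposed_core_mw = 199.68, 3.61     # Table II, 256x256 core power
mbrai_eff, proposed_eff = 0.61, 33.63              # TMACs/s/W, 8-bit input and weight

power_reduction = 1 - proposed_core_mw / mbrai_core_mw
efficiency_gain = proposed_eff / mbrai_eff
print(f"power reduction: {power_reduction:.1%}")   # reported as 98.2%
print(f"efficiency gain: {efficiency_gain:.2f}x")  # reported as 55.13 times
```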
3) Network Level Comparison: Table IV compares the accuracy and estimated energy of the proposed scheme with those of other RRAM-based schemes. The binary CIM scheme performs well on small-scale networks. On large-scale networks, however, its 1-bit quantization makes it much less accurate than the multibit schemes. In terms of energy, for LeNet on MNIST our scheme reduces the inference energy per image by 99.81% compared with the binary CIM scheme and by 98.17% compared with MBRAI. For AlexNet on ILSVRC 2012, the reductions are 99.69% compared with the binary CIM scheme and 98.64% compared with MBRAI. Therefore, by eliminating the amplifiers, the proposed scheme achieves a much lower inference energy cost.
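The savings in Table IV are relative reductions, 1 - proposed/baseline. The 98.2% core power reduction versus MBRAI quoted earlier is computed the same way, here using the 8-bit power figures from Table III. The minimal sketch below uses the Table IV energies; the names are illustrative.

```python
def saving_percent(proposed: float, baseline: float) -> float:
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - proposed / baseline)


# Inference energy per image in uJ, copied from Table IV.
ENERGY_UJ = {
    "LeNet/MNIST": {"BNN+ADCs [25]": 6.68, "MBRAI [23]": 0.71, "Proposed": 0.013},
    "AlexNet/ILSVRC 2012": {"BNN+ADCs [25]": 5.42e3, "MBRAI [23]": 1.23e3, "Proposed": 16.65},
}

for net, rows in ENERGY_UJ.items():
    for name, energy in rows.items():
        if name != "Proposed":
            print(f"{net}: {saving_percent(rows['Proposed'], energy):.2f}% saving vs {name}")

# Core-level power reduction versus MBRAI in 8-bit mode (Table III powers in mW).
print(f"{saving_percent(3.61, 199.68):.1f}%")  # 98.2%
```

For AlexNet versus MBRAI the formula gives about 98.65% rather than 98.64%. This last-digit difference is consistent with the three-significant-figure rounding of the tabulated energy (1.23E+03 uJ).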
C. Robustness Analysis
1) Linearity Analysis: Fig. 8(a), Fig. 8(b), and Fig. 8(c) show the linearity of the integration results under different initial integrating voltages (0.7∼1 V). Panel (a) is for the integrator without 1T1R unit position switching, panel (b) for the integrator without T0, and panel (c) for the integrator with T0. Differential Nonlinearity (DNL) and Integral Nonlinearity (INL) are used to evaluate the performance. The INL/DNL values are:
- (-1.66∼0.89)/(-2.19∼1.95) LSB for the integrator without 1T1R unit position switching;
- (-0.63∼0.95)/(-1.24∼1.35) LSB for the integrator without T0;
- (-0.40∼0.60)/(-0.79∼0.87) LSB for the integrator with T0.
These results confirm that 1T1R unit position switching and T0 greatly improve the linearity of the integration process.

Fig. 9(a) and Fig. 9(b) present the linearity of the proposed integral multiplier for different inputs and weights, evaluated by code density measurement. The circuit achieves an INL/DNL of (-0.51∼0.36)/(-0.35∼0.28) LSB with respect to the input value and (-0.60∼0.001)/(-0.14∼0.17) LSB with respect to the weight.

Fig. 9(c) and Fig. 9(d) compare the linearity of the integral multiplier under different numbers of input lines with and without the regulator, respectively. The INL/DNL is (-2.01∼-0.38)/(0.01∼0.01) LSB for the circuit with the regulator and (-12.7∼4.26)/(-0.2∼0.62) LSB for the circuit without the regulator. The regulator therefore significantly improves the linearity with respect to the number of input lines, because it provides a relatively stable drain voltage for the 1R1T unit when the bit line load changes.
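Code density (histogram) testing estimates DNL from how often each output code occurs under a uniformly distributed stimulus. The text does not specify its exact INL convention. The sketch below is a minimal Python version that assumes a uniform stimulus and takes INL as the running sum of DNL, which is one common convention; a best-fit-line INL would give slightly different numbers. The function name is illustrative.

```python
import numpy as np


def code_density_dnl_inl(codes, n_bits):
    """DNL and INL (in LSB) of the inner codes from a code-density test.

    Assumes the input stimulus is uniform over full scale, so an ideal
    converter hits every code equally often. The two end codes are dropped
    because they also collect out-of-range samples.
    """
    n_codes = 2 ** n_bits
    codes = np.clip(np.asarray(codes, dtype=int), 0, n_codes - 1)
    hist = np.bincount(codes, minlength=n_codes).astype(float)
    inner = hist[1:-1]
    dnl = inner / inner.mean() - 1.0  # measured code width / ideal width - 1
    inl = np.cumsum(dnl)              # accumulated deviation (running-sum convention)
    return dnl, inl


# Example: a 4-bit ramp in which code 5 is 1.5 LSB wide and code 6 only 0.5 LSB.
counts = np.full(16, 100)
counts[5], counts[6] = 150, 50
dnl, inl = code_density_dnl_inl(np.repeat(np.arange(16), counts), n_bits=4)
print(dnl.min(), dnl.max())  # -0.5 0.5
print(inl.max())             # 0.5
```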
2) PVT Simulation: To verify the robustness of the circuit, different combinations of process, voltage, and temperature are simulated, and ENOB is used to evaluate the performance of the core. As shown in Table V, the ENOB exceeds 7 bits in all of these PVT combinations. This indicates that the proposed circuit is reliable under process, voltage, and temperature variations.

TABLE III
CORE LEVEL COMPARISON BETWEEN THE PROPOSED SCHEME AND OTHER CIM SCHEMES
| Structure | Technology | Crossbar Size | Weight/Input Bits | Throughput (GMACs) | Power (mW) | Efficiency (TMACs/s/W) |
| SINWP [21] [22] | 55nm | 256×512 | fixed-3/fixed-1 | N/A | N/A | 53.17 |
|  |  |  | fixed-3/fixed-2 | N/A | N/A | 21.9 |
| MBRAI [23] | 45nm | 256×256 | fixed-3/fixed-1 | 1524 | 19.6 | 77.76 |
|  |  |  | fixed-3/fixed-2 | 1040 | 26.8 | 38.8 |
|  |  |  | fixed-8/fixed-8 | 121.4 | 199.68 | 0.61 |
| A 22nm 2Mb ReRAM CIM Macro [26] | 22nm | 512×512 | fixed-2/fixed-1 | N/A | N/A | 121.38 |
|  |  |  | fixed-4/fixed-2 | N/A | N/A | 45.52 |
|  |  |  | fixed-4/fixed-4 | N/A | N/A | 28.93 |
| Proposed | 45nm | 256×256 | fixed-2/fixed-2 | 1092.2 | 1.975 | 553.01 |
|  |  |  | fixed-4/fixed-4 | 546.1 | 2.66 | 205.30 |
|  |  |  | fixed-8/fixed-8 | 121.4 | 3.61 | 33.63 |
| A CIM SRAM Macro in 7nm FinFET CMOS [27] | 7nm | 4kb | fixed-4/fixed-4 | 186.2 | 1.06 | 175.5 |

TABLE IV
ACCURACY AND ENERGY ESTIMATION OF DIFFERENT RRAM-BASED SCHEMES
| Network | Number of Operations | Structure | System Frequency | Data Bit | Crossbar Size | Top-1 Error Rate | Energy (µJ/img) | Saving of Proposed (%) |
| LeNet on MNIST | 0.42M | BNN+ADCs [25] | 100MHz | 1 | 128×128 | 1.40% | 6.68 | 99.81% |
|  |  | MBRAI [23] | 25MHz | 8 | 256×256 | 0.97% | 0.71 | 98.17% |
|  |  | Proposed | 16.7MHz | 8 | 256×256 | 0.90% | 0.013 | N/A |
| AlexNet on ILSVRC 2012 | 720M | BNN+ADCs [25] | 100MHz | 1 | 128×128 | 73.90% | 5.42E+03 | 99.69% |
|  |  | MBRAI [23] | 25MHz | 8 | 256×256 | 44.16% | 1.23E+03 | 98.64% |
|  |  | Proposed | 16.7MHz | 8 | 256×256 | 43.60% | 16.65 | N/A |

TABLE V
PVT SIMULATION ON ENOB
| Process | Temperature (°C) | Voltage (V) | ENOB (bit) |
| ff | -40 | 0.9 | 7.36 |
| ff | -40 | 1.1 | 7.3 |
| ff | 80 | 0.9 | 7.25 |
| ff | 80 | 1.1 | 7.1 |
| ss | -40 | 0.9 | 7.35 |
| ss | -40 | 1.1 | 7.27 |
| ss | 80 | 0.9 | 7.05 |
| ss | 80 | 1.1 | 7.03 |
| tt | 27 | 1 | 7.37 |

Fig. 8. The INL/DNL comparison of integration results under different integrating voltages (0.7∼1 V) between (a) the integrator without 1T1R unit position switching, (b) the integrator without T0, and (c) the integrator with T0. [Each panel includes the core schematic: input buffer and timing control, 1R1T units, T0, regulator and regulator bias, MUX for bit line weight mapping, regulated passive neurons, and an ADC with charge redistribution and sampling capacitance.]
3) Quantization and Mapping Methods Comparison: To incorporate the impact of the resistance distribution into the weights of the neural network, a resistance reading phase is needed in which the ADC reads the resistance of the RRAM array. For simplicity, the ADC reading a 1R1T circuit with a fixed resistance is first simulated with 1400 Monte Carlo runs to evaluate the impact of transistor variation. The resistance inconsistency is then evaluated by adding a normalized Gaussian distribution. Fig. 10 shows the normalized distribution of the RRAM array read by the ADC, where the standard deviation of the normalized Gaussian distribution is 0.2.

Fig. 11 compares the computation errors of normal mapping and bit line weight mapping over 1400 Monte Carlo simulations, for an input of 180, a weight of 75, and 128 input lines. The mean and standard deviation of the error are 0.124 LSB and 1.744 LSB for the normal mapping method. For the bit line weight mapping method they are 0.013 LSB and 0.104 LSB. These results indicate that the bit line weight mapping method significantly improves the robustness of our CIM core to device inconsistency.
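The Monte Carlo procedure above can be illustrated with a deliberately simplified, hypothetical model. It does not reproduce the proposed bit line weight mapping or the circuit simulated in Fig. 11.

In this model, bit k of the weight is stored in one cell whose normalized conductance is b_k(1 + σz_k), with z_k ~ N(0, 1). HRS cells are ideal zeros, and the bit line contributions are combined with plain binary place values 2^k. Perturbing conductance directly, rather than resistance, is a first-order approximation for small deviations. Errors are in weight LSB, so the numbers are not comparable to the output-LSB figures above. All names are illustrative.

```python
import numpy as np


def mc_binary_mapping_error(weight, n_bits=8, sigma=0.2, trials=1400, seed=0):
    """Monte Carlo mean and SD of the weight error under a hypothetical binary mapping.

    Hypothetical model: bit k is one cell with normalized conductance
    b_k * (1 + sigma * z_k), z_k ~ N(0, 1) drawn per cell and per trial;
    HRS cells are ideal zeros; contributions are weighted by 2**k.
    """
    rng = np.random.default_rng(seed)
    bits = np.array([(weight >> k) & 1 for k in range(n_bits)], dtype=float)
    place = 2.0 ** np.arange(n_bits)                # binary place values 1, 2, 4, ...
    z = rng.standard_normal((trials, n_bits))       # independent per-cell variation
    effective = (bits * (1.0 + sigma * z)) @ place  # weight seen in each trial
    err = effective - weight
    return err.mean(), err.std()


def predicted_error_sd(weight, n_bits=8, sigma=0.2):
    """Closed-form SD of the same model: sigma * sqrt(sum_k b_k * 4**k)."""
    bits = [(weight >> k) & 1 for k in range(n_bits)]
    return sigma * np.sqrt(sum(b * 4.0 ** k for k, b in enumerate(bits)))


mean, sd = mc_binary_mapping_error(75)
print(f"mean = {mean:.3f}, SD = {sd:.3f}, predicted SD = {predicted_error_sd(75):.3f} (weight LSB)")
```

In this toy model, the cell holding the highest set bit of 75 (2^6) contributes almost all of the variance. This shows how plain binary weighting amplifies variation on high-order cells.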
Fig. 9. The evaluation of linearity in terms of (a) different inputs (0-255) and (b) different weights (0-255), and the INL/DNL comparison between (c) the integral multiplier with the regulator and (d) the integral multiplier without the regulator under different numbers of input lines (1-256).

Fig. 10. Normalized resistance distribution read by the ADC, where the standard deviation of the normalized Gaussian distribution is 0.2. [Histogram; x-axis: normalized resistance (0 to 2.0); y-axis: number of resistances.]

Fig. 11. Comparison of 1400 Monte Carlo simulations of the computation error for input 180, weight 75, and 128 input lines between (a) normal mapping (mean = 0.124 LSB, SD = 1.744 LSB) and (b) the bit line weight mapping method (mean = 0.013 LSB, SD = 0.104 LSB). [Histograms; x-axis: mapping error (LSB); y-axis: count.]

To test the effect of the bit line weight mapping method at the network level, three quantization and mapping methods are simulated:
- Normal binary quantization and mapping, which quantizes each weight to an 8-bit digital value and sets each cell to HRS/LRS according to the corresponding digital bit 0/1.
- Resistance based quantization and mapping, which quantizes the weights according to Eq. 25.
- Resistance based quantization and bit line weight mapping, proposed in this paper.

Fig. 12(a) compares the three methods on AlexNet and ILSVRC 2012 in terms of the quantization error ratio and the loss of top-1 accuracy. The quantization error ratio is the sum of the absolute values of the quantization error divided by the sum of the absolute values of the weights. Both quantities are plotted for different numbers of quantization bits (with a normalized resistance deviation of 0.2) and for different standard deviations of the normalized resistance distribution (with 8 quantization bits). Fig. 12(b) presents the same comparison on VGG16 and ILSVRC 2012.

As shown in Fig. 12, the optimized quantization and bit line weight mapping method reduces the quantization error and improves the inference accuracy on both AlexNet and VGG16. For example, in 8-bit mode with a deviation of 0.2, the accuracy loss on AlexNet is 2.97% for the normal binary mapping method and 0.51% for the bit line weight mapping method. On VGG16 the corresponding losses are 2.70% and 0.39%. Moreover, the benefit of the optimization becomes more evident as the uncertainty of the resistance increases.
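The quantization error ratio used in Fig. 12 can be computed directly from its definition above. In the minimal Python sketch below, the quantizer is a hypothetical stand-in, a symmetric uniform quantizer that loosely resembles normal binary quantization. It is not the resistance based quantization of Eq. 25, and the synthetic Gaussian weights are not AlexNet or VGG16 weights. Function names are illustrative.

```python
import numpy as np


def quantize_symmetric(w, n_bits):
    """Hypothetical stand-in quantizer: symmetric uniform, signed n_bits levels."""
    if n_bits < 2:
        raise ValueError("n_bits must be at least 2")
    w = np.asarray(w, dtype=float)
    step = np.max(np.abs(w)) / (2 ** (n_bits - 1) - 1)  # one LSB in weight units
    return np.round(w / step) * step


def quantization_error_ratio(w, w_q):
    """Sum of |quantization error| divided by sum of |weight|, as defined in the text."""
    w = np.asarray(w, dtype=float)
    return np.sum(np.abs(np.asarray(w_q, dtype=float) - w)) / np.sum(np.abs(w))


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=10_000)  # synthetic weights for illustration only
for bits in (4, 6, 8):
    ratio = quantization_error_ratio(weights, quantize_symmetric(weights, bits))
    print(bits, round(float(ratio), 5))
```

With a uniform quantizer, the ratio shrinks roughly in proportion to the step size as bits are added. This matches the general downward trend of the quantization-error panels in Fig. 12.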
Fig. 12. The accuracy comparison between A: resistance based quantization and bit line weight mapping method, B: normal binary quantization and mapping method, and C: resistance based quantization and normal mapping method.
(a) AlexNet on ILSVRC 2012. The left two figures show the quantization error ratio (sum of absolute values of quantization error/sum of absolute values of weight) and the loss of top-1 accuracy for different quantization bits (with a normalized resistance deviation of 0.2). The right two show the same quantities for different standard deviations of the normalized resistance distribution (with 8 quantization bits).
(b) VGG16 on ILSVRC 2012, with the same four panels: quantization error ratio and loss of top-1 accuracy for different quantization bits (with a normalized resistance deviation of 0.2), and for different standard deviations of the normalized resistance distribution (with 8 quantization bits).
[Plots of methods A, B, and C against quantization bits (4 to 8) or normalized standard deviation (0.025 to 0.200).]
V. CONCLUSION
In this paper, an 8-bit RRAM based CIM core with a regulated passive neuron and a bit line weight mapping method has been proposed. The nonlinearity introduced by the passive integrator, and the errors caused by quantization and cell-to-cell variation, have been discussed. To address these issues, the regulated integral multiplier and the bit line weight mapping method have been presented in detail.

Circuit level simulation has shown that the proposed CIM core consumes only 3.61 mW for a 256×256 array in 8-bit input and 8-bit weight mode, 98.2% less than MBRAI. The SFDR and SNDR of the core reach 59.13 dB and 46.13 dB, respectively. Network level simulation has shown that the CIM core achieves a 0.90% top-1 error rate with 0.013 µJ/img on LeNet, and a 43.60% top-1 error rate with 16.65 µJ/img on AlexNet. Both results are better than those of the other RRAM-based schemes compared. Linearity and PVT simulations have verified the robustness of the circuit. Finally, simulation of the mapping methods has shown that the proposed bit line weight mapping scheme outperforms the normal mapping method, improving the top-1 accuracy by 2.46% and 3.47% for AlexNet and VGG16 on ILSVRC 2012 in 8-bit mode.
ACKNOWLEDGMENT
This work was supported by the Major Scientific Research Project of Zhejiang Lab (No. 2019KC0AD02).
REFERENCES
[1] K. Chang and M. Chiang. Design of data reduction approach for AIoT on embedded edge node. In 2019 IEEE 8th Global Conference on Consumer Electronics (GCCE), pages 899–900, 2019.
[2] H. Pham, M. Nguyen, and C. Sun. AIoT solution survey and comparison in machine learning on low-cost microcontroller. In 2019 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), pages 1–2, 2019.
[3] G. W. Burr, R. M. Shelby, C. di Nolfo, J. W. Jang, R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, B. Kurdi, and H. Hwang. Experimental demonstration and tolerancing of a large-scale neural network (165,000 synapses), using phase-change memory as the synaptic weight element. In 2014 IEEE International Electron Devices Meeting, pages 29.5.1–29.5.4, 2014.
[4] K. Huang, Y. Ha, R. Zhao, A. Kumar, and Y. Lian. A low active leakage and high reliability phase change memory (PCM) based non-volatile FPGA storage element. IEEE Transactions on Circuits and Systems I: Regular Papers, 61(9):2605–2613, 2014.
[5] L. Zhang, W. Kang, H. Cai, P. Ouyang, L. Torres, Y. Zhang, A. Todri-Sanial, and W. Zhao. A robust dual reference computing-in-memory implementation and design space exploration within STT-MRAM. In 2018 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pages 275–280, 2018.
[6] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan. Computing in memory with spin-transfer torque magnetic RAM. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(3):470–483, 2018.
[7] Y. Pan, P. Ouyang, Y. Zhao, W. Kang, S. Yin, Y. Zhang, W. Zhao, and S. Wei. A MLC STT-MRAM based computing in-memory architecture for binary neural network. In 2018 IEEE International Magnetics Conference (INTERMAG), pages 1–1, 2018.
[8] S. Jain, A. Ranjan, K. Roy, and A. Raghunathan. Computing in memory with spin-transfer torque magnetic RAM. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(3):470–483, 2018.
[9] H.-S. P. Wong, H. Lee, S. Yu, Y. Chen, Y. Wu, P. Chen, B. Lee, F. T. Chen, and M. Tsai. Metal-oxide RRAM. Proceedings of the IEEE, 100(6):1951–1970, 2012.
[10] W. Wan, R. Kubendran, S. B. Eryilmaz, W. Zhang, Y. Liao, D. Wu, S. Deiss, B. Gao, P. Raina, S. Joshi, H. Wu, G. Cauwenberghs, and H.-S. P. Wong. 33.1 A 74 TMACS/W CMOS-RRAM neurosynaptic core with dynamically reconfigurable dataflow and in-situ transposable weights for probabilistic graphical models. In 2020 IEEE International Solid-State Circuits Conference (ISSCC), pages 498–500, 2020.
[11] Z. Yang and L. Wei. Logic circuit and memory design for in-memory computing applications using bipolar RRAMs. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–5, 2019.
[12] Z. Liu, E. Ren, F. Qiao, Q. Wei, X. Liu, L. Luo, H. Zhao, and H. Yang. NS-CIM: A current-mode computation-in-memory architecture enabling near-sensor processing for intelligent IoT vision nodes. IEEE Transactions on Circuits and Systems I: Regular Papers, pages 1–14, 2020.
[13] C. Xue and M. Chang. Challenges in circuit designs of nonvolatile-memory based computing-in-memory for AI edge devices. In 2019 International SoC Design Conference (ISOCC), pages 164–165, 2019.
[14] W. Chen, W. Khwa, J. Li, W. Lin, H. Lin, Y. Liu, Y. Wang, H. Wu, H. Yang, and M. Chang. Circuit design for beyond von Neumann applications using emerging memory: From nonvolatile logics to neuromorphic computing. In 2017 18th International Symposium on Quality Electronic Design (ISQED), pages 23–28, 2017.
[15] G. W. Burr, P. Narayanan, R. M. Shelby, S. Sidler, I. Boybat, C. di Nolfo, and Y. Leblebici. Large-scale neural networks implemented with non-volatile memory as the synaptic weight element: Comparative performance analysis (accuracy, speed, and power). In 2015 IEEE International Electron Devices Meeting (IEDM), pages 4.4.1–4.4.4, 2015.
[16] J. Jang, S. Park, G. W. Burr, H. Hwang, and Y. Jeong. Optimization of conductance change in Pr1-xCaxMnO3-based synaptic devices for neuromorphic systems. IEEE Electron Device Letters, 36(5):457–459, 2015.
[17] A. Fumarola, P. Narayanan, L. L. Sanches, S. Sidler, J. Jang, K. Moon, R. M. Shelby, H. Hwang, and G. W. Burr. Accelerating machine learning with non-volatile memory: Exploring device and circuit tradeoffs. In 2016 IEEE International Conference on Rebooting Computing (ICRC), pages 1–8, 2016.
[18] E. Giacomin, T. Greenberg-Toledo, S. Kvatinsky, and P. Gaillardon. A robust digital RRAM-based convolutional block for low-power image processing and learning applications. IEEE Transactions on Circuits and Systems I: Regular Papers, 66(2):643–654, 2019.
[19] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams. Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication. In 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–6, 2016.
[20] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 14–26, 2016.
[21] C. Xue, W. Chen, J. Liu, J. Li, W. Lin, W. Lin, J. Wang, W. Wei, T. Chang, T. Chang, T. Huang, H. Kao, S. Wei, Y. Chiu, C. Lee, C. Lo, Y. King, C. Lin, R. Liu, C. Hsieh, K. Tang, and M. Chang. 24.1 A 1Mb multibit ReRAM computing-in-memory macro with 14.6ns parallel MAC computing time for CNN based AI edge processors. In 2019 IEEE International Solid-State Circuits Conference (ISSCC), pages 388–390, 2019.
[22] C. Xue, W. Chen, J. Liu, J. Li, W. Lin, W. Lin, J. Wang, W. Wei, T. Huang, T. Chang, T. Chang, H. Kao, Y. Chiu, C. Lee, Y. King, C. Lin, R. Liu, C. Hsieh, K. Tang, and M. Chang. Embedded 1-Mb ReRAM-based computing-in-memory macro with multibit input and weight for CNN-based AI edge processors. IEEE Journal of Solid-State Circuits, 55(1):203–215, 2020.
[23] S. Zhang, K. Huang, and H. Shen. A robust 8-bit non-volatile computing-in-memory core for low-power parallel MAC operations. IEEE Transactions on Circuits and Systems I: Regular Papers, 67(6):1867–1880, 2020.
[24] A. Biswas and A. P. Chandrakasan. Conv-RAM: An energy-efficient SRAM with embedded convolution computation for low-power CNN-based machine learning applications. In 2018 IEEE International Solid-State Circuits Conference (ISSCC), pages 488–490, 2018.
[25] T. Tang, L. Xia, B. Li, Y. Wang, and H. Yang. Binary convolutional neural network on RRAM. In 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pages 782–787, 2017.
[26] C. Xue, T. Huang, J. Liu, T. Chang, H. Kao, J. Wang, T. Liu, S. Wei, S. Huang, W. Wei, Y. Chen, T. Hsu, Y. Chen, Y. Lo, T. Wen, C. Lo, R. Liu, C. Hsieh, K. Tang, and M. Chang. 15.4 A 22nm 2Mb ReRAM compute-in-memory macro with 121-28TOPS/W for multibit MAC computing for tiny AI edge devices. In 2020 IEEE International Solid-State Circuits Conference (ISSCC), pages 244–246, 2020.
[27] Q. Dong, M. E. Sinangil, B. Erbagci, D. Sun, W. Khwa, H. Liao, Y. Wang, and J. Chang. 15.3 A 351TOPS/W and 372.4GOPS compute-in-memory SRAM macro in 7nm FinFET CMOS for machine-learning applications. In 2020 IEEE International Solid-State Circuits Conference (ISSCC), pages 242–244, 2020.
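Personal Resume
Name: Yewei Zhang (章烨炜)  Gender: Male  Date of birth: January 8, 1996
Ethnicity: Han  Degree: Master's  Major: Electronic Science and Technology
Mailing address: Room 328, Dormitory Building 6, Yuquan Campus, Zhejiang University, Hangzhou, Zhejiang Province
Postal code: 310027  Phone: 17816855041
Education
2011 to 2014: Xinchang High School, Shaoxing, Zhejiang Province.
2014 to 2018: Electronic Science and Technology, College of Information Science & Electronic Engineering, Zhejiang University.
2019 to 2021: Master's degree in Electronic Science and Technology, College of Information Science & Electronic Engineering, Zhejiang University.
Awards: University third-class scholarship, November 2016.
Project Experience
Participated in a chip design project that implements computing-in-memory with non-volatile memory and deploys a neural network for voice wake-up on the chip. Mainly responsible for the design and simulation of the in-memory multiply-accumulate circuit.
Internship Experience
None
Campus Activities
None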
Yewei Zhang (Student Member, IEEE) received the bachelor's degree from the College of Information Science & Electronic Engineering, Zhejiang University, in 2018. He is currently pursuing the master's degree at the College of Information Science & Electronic Engineering, Zhejiang University. His research interests include in-memory computing and non-volatile memories.
Kejie Huang (Senior Member, IEEE) received the
Ph.D. degree from the Department of Electrical En-
gineering, National University of Singapore (NUS),
Singapore, in 2014. He has been a Principal Investigator with the College of Information Science and Electronic Engineering, Zhejiang University (ZJU), since 2016. Before joining ZJU, he spent five years in the IC design industry, including Samsung and Xilinx, two years at the Data Storage Institute, Agency for Science, Technology and Research (A*STAR), and another three years at the Singapore University of Technology and Design (SUTD), Singapore. He has authored
or coauthored more than 30 scientific articles in international peer-reviewed
journals and conference proceedings. He holds four granted international
patents, and another eight pending ones. His research interests include low
power circuits and systems design using emerging non-volatile memories,
architecture and circuit optimization for reconfigurable computing systems
and neuromorphic systems, machine learning, and deep learning chip design.
He currently serves as the Associate Editor of the IEEE TRANSACTIONS
ON CIRCUITS AND SYSTEMS-PART II: EXPRESS BRIEFS.
Rui Xiao (Student Member, IEEE) received her bachelor’s degree from the School of 
Information Science and Electronic Engineering, Zhejiang University in 2019. She is currently 
working for her Ph.D. degree in the School of Information Science and Electronic Engineering, 
Zhejiang University. Her research interests include in-memory computing, non-volatile 
memories, and neuromorphic systems. 
Haibin Shen is currently a Professor with Zhejiang University, a second-level member of the 151 Talents Project of Zhejiang Province, and a member of the Key Team of Zhejiang Science and Technology Innovation. His research interests include learning algorithms, processor architecture, and modeling. His research results have been adopted by many authoritative organizations. He has published more than 100 papers in academic journals and has been granted more than 30 invention patents. He received the First Prize of the Electronic Information Science and Technology Award from the Chinese Institute of Electronics and has won a second prize at the provincial level.
