Processing-In-Memory Acceleration of Convolutional Neural Networks for
  Energy-Efficiency, and Power-Intermittency Resilience by Roohi, Arman et al.
Arman Roohi∗, Shaahin Angizi∗, Deliang Fan, and Ronald F DeMara
Department of Electrical and Computer Engineering, University of Central Florida, Orlando, 32816 USA
∗The first two authors contributed equally
Abstract—Herein, a bit-wise Convolutional Neural Network
(CNN) in-memory accelerator is implemented using Spin-Orbit
Torque Magnetic Random Access Memory (SOT-MRAM) com-
putational sub-arrays. It utilizes a novel AND-Accumulation
method capable of significantly-reduced energy consumption
within convolutional layers and performs various low bit-
width CNN inference operations entirely within MRAM. Power-
intermittence resiliency is also enhanced by retaining the partial
state information needed to maintain computational forward-
progress, which is advantageous for battery-less IoT nodes.
Simulation results indicate ∼5.4× higher energy-efficiency and
9× speedup over ReRAM-based acceleration, and ∼9.7×
higher energy-efficiency and 13.5× speedup over recent CMOS-only
approaches, while maintaining inference accuracy comparable
to baseline designs.
I. INTRODUCTION
Due to their impressive performance on image recognition
tasks, deep Convolutional Neural Networks (CNNs) offer
significant potential advantages for use on large-scale data-sets.
However, the processing demands of high-depth CNNs
spanning hundreds of layers pose serious challenges to their
tractability in terms of memory and computational resources.
This so-called "CNN power and memory wall" has been motivating
the development of alternative approaches to improve
CNN efficiency at both software and hardware levels [1].
In algorithm-based approaches, the use of shallower CNN models,
parameter quantization [2], and network binarization [3]
have been explored extensively. Recently, utilizing low bit-width
weights and activations has been shown to reduce both model size
and computing complexity. For instance, performing bit-wise
convolution between the inputs and low bit-width weights
has been demonstrated in [2] by converting conventional
Multiplication-And-Accumulate (MAC) operations into their
corresponding AND-bitcount operations. However, such a conversion
cannot necessarily guarantee high-efficiency operation
in a hardware implementation, which may engage various
aspects of instruction encoding and operand access. As an
extreme form of quantization, the Binary Convolutional Neural Network
(BCNN) has achieved acceptable accuracy on both small [4]
and large datasets [3] by relaxing the demands for some
high-precision calculations. Instead, it binarizes weights and inputs
while processing the forward path, providing a promising
solution to mitigate the aforementioned bottlenecks in storage and
computational components [5].
Figure 1: Proportional relationship for execution time of a
CNN on both CPU and GPU [11].
From the hardware point of view, the underlying operations
should be realized using efficient mechanisms. However,
within conventional isolated computing units and memory
elements interconnected via buses, there are serious challenges,
such as limited memory bandwidth channels, long
memory access latency, significant congestion at I/O chokepoints,
and high leakage power consumption [6], [7]. In-memory
processing paradigms built on top of non-volatile
devices, such as Resistive Random Access Memory (ReRAM)
[6], [8], Spin-Transfer Torque Magnetic RAM (STT-MRAM)
[9], and the recent Spin-Orbit Torque MRAM (SOT-MRAM)
[10], have been introduced to address the aforementioned concerns. Due
to their features such as non-volatility, near-zero
standby power, high integration density, compatibility with
the CMOS fabrication process, and radiation-hardness, these NV-based
systems offer promising attributes for in-memory
processing implementations.
CNNs realize machine learning classifiers that are capable
of taking an image as an input and then computing the probability
that the image belongs to each designated output class.
Typically, a CNN consists of several convolutional layers including
convolution, non-linearity, normalization, and pooling
steps, followed by a flatten layer connected to fully-connected
layers. For feature extraction, each convolutional layer receives
a set of features organized into multiple channels referred to as
feature maps. It applies feature detectors (filters) by performing
high-dimensional convolutions. To increase the non-linearity
of the pooled feature map, a non-linear activation function,
e.g. the rectified linear unit (ReLU), is applied to the results.
The convolutional layer occupies the largest portion of running
time and consumes significant computational resources in both
GPU and CPU implementations, as depicted in Figure 1.
arXiv:1904.07864v1 [cs.LG] 16 Apr 2019
This motivates us to propose an optimized bit-wise CNN
in-memory accelerator based on SOT-MRAM computational
sub-arrays. In particular, the bit-wise CNN based on AND-
bitcount operations presented in [2] can be further accelerated
by modifying the algorithm rather than a direct module-by-
module mapping such as IMCE [12].
The remainder of this paper is organized as follows.
Section II presents the proposed accelerator, which is designed to be partially
power-failure resilient and uses a novel hardware-conforming
AND-Accumulation method to further accelerate the convolutional
layers of a CNN. The non-volatile SOT-MRAM based
structures provide the power-failure resiliency feature and
are also leveraged to develop the bit-wise CNN in-memory
accelerator. Extensive simulation results and detailed analysis,
including inference accuracy, energy consumption, and memory storage,
are summarized in Section III. Finally, Section
IV concludes this paper by highlighting the features and
advantages of the proposed in-memory accelerator.
II. INTERMITTENT RESILIENT CNN ACCELERATOR
As mentioned in the previous section, most of the CNN
run-time is spent performing MACs. Therefore, reducing the
computation complexity of MAC operations is vital for
resource-constrained systems such as IoT devices. In order
to accelerate MAC operations in convolutional layers, three
main processes, namely AND operation, bitcount, and bitshift,
are leveraged to realize a bit-wise convolution. A crude
in-memory implementation of such bit-wise operations can
be found in [12], where bitcount and bitshift are directly
implemented using serial counter and shifter units. We believe
such module-by-module mapping not only degrades the bit-wise
convolution performance in hardware, but also imposes
a large in-memory data transfer due to its intrinsically serial
operations. Hence, we propose a hardware-optimized method
inspired by DoReFa-Net [2] to mitigate these drawbacks, in
addition to addressing the power-failure issue:
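$$ I * W = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} 2^{m+n}\, \mathrm{CMP}\big(\mathrm{AND}(C_n(W), C_m(I))\big) $$
$$ \mathrm{CMP}(X) = \sum_{i=1}^{n} x_i, \quad \text{where } X = x_n x_{n-1} \ldots x_2 x_1 \qquad (1) $$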
where I is the input and W is the weight. The convolution can be
implemented by AND, CMP (rather than bitcount), and parallel
bitshift operations. A general overview of our proposed CNN
accelerator is shown in Figure 2a. This architecture mainly
consists of an Image Bank, a Kernel Bank, computational sub-arrays,
and an Extra Processing Unit (EPU) including three
ancillary units, namely the Quantizer, Activation Function (Activate), and
Batch Normalization (BN) units. Each computational sub-array is
equipped with three components: CMP as a Compressor unit,
ASR as an Adaptive Shift Register, and NV-FA as a Non-
Volatile Full Adder. As discussed earlier, the convolutional
layer contributes the largest proportion of computation time
and complexity to CNNs. Thus, we mainly focus on this layer.
However, the proposed system architecture can be leveraged
to implement other CNN layers including batch normalization
and pooling layers. We assume both feature maps (I) and
feature detectors (W ) are initially loaded in two sub-banks of
memory. In our approach, inputs are quantized by the EPU's
Quantizer before being mapped to the accelerator. Due to the page
limit of this paper, the EPU's units are not discussed in detail. In the following,
we elaborate on the two main processing phases of the accelerator.
Figure 2: General overview of the proposed intermittent resilient CNN accelerator. [Blocks: Image Bank, Kernel Bank, computational sub-arrays (CMP, ASR, S-FA), and EPU (Quantizer, Activate, BN); steps: 1) mapping, 2) parallel AND, 3) accumulate, 4) activation]
A. Parallel AND Phase
Figure 4a shows the in-memory processing sub-array architecture
using SOT-MRAM [10], [12], [13]. The array supports
both memory read-write and simple Boolean logic operations
such as AND/XOR. The SOT-MRAM structure includes a
Magnetic Tunnel Junction (MTJ) whose free layer is directly
connected to a Spin Hall Metal (SHM). There are two
stable magnetization states, parallel (low resistance) and anti-parallel
(high resistance), which denote 0 and 1 in binary
information, respectively. Each SOT-MRAM cell requires five
signals, common among all MRAM cells, to perform
memory operations: Write Word Line (WWL), Write
Bit Line (WBL), Read Word Line (RWL), Read Bit Line
(RBL), and Source Line (SL). For more details, refer to
[14].
Figure 3: Three-phase in-memory computation. [Panels: (1) data mapping, (2) parallel AND, (3) accumulation]
The SOT-MRAM based computational sub-array can
readily handle the massive number of AND operations required
for convolutions. Consider an input I and a kernel W of
M and N bits, respectively (for simplicity, 3 bits each, as in Figure 3),
where I is covered by the kernel W. The bits of each I_i / W_i element are
indexed from the least significant bit (LSB) to the most significant bit (MSB) with
m ∈ [0, M−1] / n ∈ [0, N−1]. Then, a second sequence denoted
by C_m(I) can be formed for I by collecting
the m-th bit of all I_i elements. For example, C_2(I) collects the
MSBs of all I_i elements, "0000". A second sequence for
W can be formed likewise as C_n(W). Now, by considering the
set of all m-th value sequences, I can be expressed as
$I = \sum_{m=0}^{M-1} 2^{m} C_m(I)$. Additionally, W can be expressed as
$W = \sum_{n=0}^{N-1} 2^{n} C_n(W)$.
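The bit-plane decomposition and Eq. (1) can be illustrated with a minimal software sketch. The helper names (`bit_plane`, `cmp_count`, `and_accumulate`) are hypothetical and not part of the proposed hardware, and the operand values (I = 0, 1, 2, 3 and W = 1, 4, 1, 2, giving 12) are assumed from what appears to be the 3-bit example of Figure 3. The sketch only checks that AND-Accumulation reproduces an ordinary dot product; it does not model the in-memory execution.

```python
def bit_plane(values, k):
    """C_k(values): the k-th bit (LSB has index 0) of every element."""
    return [(v >> k) & 1 for v in values]


def cmp_count(bits):
    """CMP(X): the number of 1s in the bit vector X."""
    return sum(bits)


def and_accumulate(inputs, weights, m_bits, n_bits):
    """Evaluate Eq. (1): sum over m, n of 2^(m+n) * CMP(AND(C_n(W), C_m(I)))."""
    total = 0
    for m in range(m_bits):
        c_m = bit_plane(inputs, m)
        for n in range(n_bits):
            c_n = bit_plane(weights, n)
            anded = [w & i for w, i in zip(c_n, c_m)]  # parallel AND
            total += cmp_count(anded) << (m + n)       # shift by m + n, then accumulate
    return total


# Assumed example operands (3-bit input and kernel elements).
I = [0, 1, 2, 3]
W = [1, 4, 1, 2]

result = and_accumulate(I, W, m_bits=3, n_bits=3)
reference = sum(i * w for i, w in zip(I, W))  # conventional MAC
print(result, reference)  # 12 12
```

Here `bit_plane(I, 2)` returns `[0, 0, 0, 0]`, which corresponds to the sequence C_2(I) = "0000" mentioned in the text.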
To efficiently load the Quantizer unit's output into the computational
sub-arrays, I and W should be tailored. As illustrated in
the data organization and mapping step of Figure 3, C2(W) to
C0(W) are consecutively mapped to the assigned sub-array.
Accordingly, C2(I) to C0(I) are mapped to the following
memory rows in the same manner. Now, the accelerator can perform the
parallel bit-wise AND operation depicted in Figure 4 within
its computational sub-array.
B. Accumulation Phase
The accumulation phase consists of three main components:
(1) NV 4:2 compressor, (2) adaptive shift register, and (3) NV
full adder.
1) 4:2 Compressor (CMP): Compressors, especially 4:2
and 5:2, are widely used to reduce the delay of the sum-
mation of partial products in multiplier designs. Figure 5a
shows the schematic of a 4:2 compressor and its fundamental
implementation using two serially connected full adders. The
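basic equation of the 4:2 compressor is $x_1 + x_2 + x_3 + x_4 + c_{in} = sum + 2 \times (carry + c_{out})$.
The following equations express the outputs of the 4:2 compressor:
$$ sum = x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus c_{in} $$
$$ carry = (x_1 \oplus x_2 \oplus x_3 \oplus x_4) \cdot c_{in} + \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4)} \cdot x_4 $$
$$ c_{out} = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \qquad (2) $$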
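As a quick sanity check, the 4:2 compressor output equations can be verified exhaustively against the basic equation x1 + x2 + x3 + x4 + cin = sum + 2 × (carry + cout). The sketch below is a plain software model with a hypothetical function name (`compressor_4_2`); it assumes the second product term of carry and cout uses the complemented XOR, as in the standard 4:2 compressor built from two full adders, and it says nothing about the in-memory circuit itself.

```python
from itertools import product


def compressor_4_2(x1, x2, x3, x4, cin):
    """Software model of a 4:2 compressor; all arguments are 0 or 1."""
    p = x1 ^ x2                          # x1 XOR x2
    q = p ^ x3 ^ x4                      # x1 XOR x2 XOR x3 XOR x4
    s = q ^ cin                          # sum output
    carry = (q & cin) | ((1 - q) & x4)   # MUX selected by q
    cout = (p & x3) | ((1 - p) & x1)     # MUX selected by p
    return s, carry, cout


# Exhaustive check over all 32 input combinations.
for bits in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)

# Counting the 1s of a 4-bit AND result such as "0101" (cin tied to 0).
s, carry, cout = compressor_4_2(0, 1, 0, 1, 0)
print(s + 2 * (carry + cout))  # 2
```

The two MUX-style expressions mirror the idea that only the first row needs XOR/XNOR logic, while the remaining outputs are selections driven by those XOR results.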
Figure 4: (a) SOT-MRAM computational sub-array, (b) Monte Carlo simulation result of Vsense.
Figure 5: Implementation of (a) 4:2 compressor, and (b) pro-
posed architecture with the MUX and XOR-XNOR modules.
These equations can be reformed such that XOR/XNOR
modules are located only in the first row and the other
XOR/XNOR elements are replaced by MUXs, as shown
in Figure 5b. Considering the compressor implementation
presented in [15] and the capability of our proposed computational
sub-array design to perform XOR/XNOR
operations, together with MUX elements, the accelerator can be
configured to implement an optimized 4:2 compressor. The
results of the parallel AND operations are written back to the sub-array
and passed through the compressor, which can readily
count the number of "1"s within each resultant vector and pass
it to the next unit. Figure 5b depicts step (1) of the accumulation
phase. In our design, we only need to update the memory
contents once to implement XOR/XNOR logic, namely in-memory
XOR computation. Owing to the 4:2 compressor, the
bitcount operation can be performed in one clock cycle instead
of several clock cycles of shifting operations (the number of
shifts is determined by the memory array size, i.e. 8 bits), yielding
considerable reductions in delay and energy. Due to the non-volatile
XOR/XNOR implementation, our 4:2 compressor is
power-failure resilient. Moreover, it is power efficient, since
the number of write operations is kept at its optimum, equal to the sub-array
length. In general, the kernel length (nk) determines the number
of compressor inputs (n + 1).
2) Adaptive Shift Register (ASR): Since the number of shift
operations varies and is determined by the locations of
the input and the weight in the sub-array, as governed by the
expression m + n − 2, an adaptive shift register (ASR) is
required. One method to implement an ASR is an addition-tree
approach. In general, this structure is composed of 2m+n − 1
bit full adders (FAs), in which the first layer includes 2^(n−1)
FAs, the second layer has 2^(n−2) FAs, and finally the last
layer consists of one FA. Another approach to designing an
ASR, developed herein, is to implement the logic expressions
using multiplexers (MUXs) and then connect them to the flip-flops
(FFs) appropriately. Figure 6 depicts an ASR
design for 4-bit input data, which is able to operate with
three different numbers of shifts: 00=0, 01=1, and 10=2. It
includes seven MUXs, three inverters, four NOR/AND gates,
and six FFs. For instance, assume that IN[3:0] = "1001" and
SHIFT[1:0] is 1 (01), which means SHIFT[0]=1 (red line) and
SHIFT[1]=0 (green line). Because of the MUX-based selection
structure, "0" is stored in FF#5 and FF#0. Then the applied
input is written into FF#1 to FF#4 successively,
which produces "010010" as the output. The number of FFs is
determined by the sum of the number of inputs and
the maximum number of possible shift operations. In our
implementation, a 4-bit ASR is developed, which requires six
FFs to perform three possible shift operations.
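The behavior of the ASR in the example above can be mimicked with a short behavioral sketch. The function name `adaptive_shift` and its parameters are hypothetical; the sketch only reproduces the input/output relation (four input bits placed into six flip-flops at an offset given by SHIFT) and does not model the MUX, inverter, and gate structure of Figure 6.

```python
def adaptive_shift(in_bits: str, shift: int, max_shift: int = 2) -> str:
    """Return the register contents, FF#(width-1) down to FF#0, as a bit string."""
    if not 0 <= shift <= max_shift:
        raise ValueError("unsupported shift amount")
    width = len(in_bits) + max_shift   # number of FFs: inputs + maximum shift
    value = int(in_bits, 2) << shift   # place the input at the selected offset
    return format(value, f"0{width}b")


# IN[3:0] = "1001", SHIFT[1:0] = 01: FF#5 and FF#0 hold 0, FF#4..FF#1 hold the input.
print(adaptive_shift("1001", 1))  # 010010
```

With `shift=0` and `shift=2` the same function returns "001001" and "100100", covering the three shift modes with six flip-flops.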
3) Non-Volatile Full Adder (NV-FA): Due to the usage of
NV elements in the AND-Accumulation process, the structure
is partially power-failure resilient, meaning it can
restore the system's operations to the last good state
under most conditions. While it might fail to restore the last
configuration if power loss occurs during the addition (shift)
operations, the delay for this step is equal to the delay of m+n
FAs, ≈ (m+n) × 58 ps, which is negligible compared to the total delay of
calculating one fmap. Finally, the output of this level should
be added to the results of the previous inputs (Is × Ws). To
make our design more resilient in the presence of power failure,
we developed a Non-Volatile Full Adder (NV-FA), as shown in Figure 7a. The
NV-FA includes two NV flip-flops (NV-FFs) in addition to the
regular FA. Each NV-FF consists of a volatile CMOS FF and
an NV element.
To remove the additional overhead and issues caused by
common checkpointing approaches, the summation results
are written into the NV elements only after the computing steps
for a fixed number of frames, i.e. 20 frames. Otherwise, the
states and results of each step are stored in a volatile FF and summed
with the upcoming results. The period of the
write operation can be modified based on the power-failure rate. In
terms of area and complexity, our checkpointing approach is superior
to those of common energy harvesting systems, which are usually
utilized in intermittent computing architectures.
This is because common checkpointing-based approaches may
suffer from inconsistencies, both internal and external, after
each power loss. Moreover, they need peripheral circuits such as voltage
detection systems and capacitor arrays, which is a
crucial challenge for area-constrained IoT devices. Figure 7b depicts
the functionality of the NV-FA in the presence of power failure.
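The periodic write-back policy described above can be sketched in software as follows. This is a simplified behavioral model under stated assumptions: the class and attribute names are hypothetical, the running sum stands in for the NV-FA state, and one partial sum per frame is assumed. It is not the circuit of Figure 7a.

```python
class CheckpointedAccumulator:
    """Volatile running sum that is copied to an NV element every `period` frames."""

    def __init__(self, period=20):
        self.period = period              # frames between NV writes (20 in the text)
        self.volatile_sum = 0             # models the volatile CMOS FFs
        self.nv_sum = 0                   # models the NV elements
        self.frames_since_checkpoint = 0

    def add_frame(self, partial_sum):
        self.volatile_sum += partial_sum
        self.frames_since_checkpoint += 1
        if self.frames_since_checkpoint == self.period:
            self.nv_sum = self.volatile_sum   # the only NV write in this period
            self.frames_since_checkpoint = 0

    def power_failure(self):
        """Lose the volatile state, restore from NV, return the number of frames to redo."""
        lost = self.frames_since_checkpoint
        self.volatile_sum = self.nv_sum
        self.frames_since_checkpoint = 0
        return lost


acc = CheckpointedAccumulator(period=20)
for _ in range(45):
    acc.add_frame(1)
lost = acc.power_failure()
print(acc.volatile_sum, lost)  # 40 5
```

A shorter `period` reduces the amount of recomputation after a failure at the cost of more NV writes, which is the trade-off the adjustable write period is meant to expose.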
Figure 6: 4-bit adaptive shift register with three shift modes.
III. EXPERIMENTAL RESULTS
A. Accuracy
Bit-width: We consider four different bit-width configurations of W:I (1:1,
1:4, 1:8, 2:2) to explore the accuracy of our accelerator
with an 8-bit gradient. In addition, we consider a 32-bit full-precision
case as the baseline with a 32-bit gradient.
Data-set: Among various data-sets, we select SVHN [16] with 73257
training digits, 26032 testing digits, and 531131 additional
digits for extra training data. The images are pre-processed
to 40×40 from the original 32×32 cropped version and fed
to the model.
CNN Layers: We developed a bit-wise CNN
with 6 convolutional, 2 average pooling, and 2 FC layers,
the FC layers being equivalently implemented by convolutional layers.
Such a model costs about 80 FLOPs for each 40×40 image.
To avert further degradation of prediction accuracy, we do not
quantize the first and last layers [2], [3], [5].
Training: We modified the open-source DoReFa-Net [2] algorithm
by integrating a new bit-wise convolution function that applies the
AND-Accumulation method. To increase accuracy and
avoid over-fitting, we adopted batch normalization, parameter
tuning, and dropout. The CNN design is implemented
in TensorFlow [17] and trained for 100 epochs, and we extract the test
error of each epoch.
Results: Table I shows the computation
complexity and test error of the model under test. We used
W × I and W × I + W × G to compute the computation
complexity of inference and training, respectively, where G is
the gradient bit-width. Our results
replicate the conclusion drawn by [2], [12], whereby kernel
weights and inputs are progressively more vulnerable to bit-width
reductions.
Table I: Test error of the CNN model on SVHN.
Bit-width      Computation Complexity      Error (%)
W     I        Inference     Training      CNN Model
32    32       -             -             2.4
1     1        1             9             3.1
1     4        4             12            2.3
1     8        8             16            2.1
2     2        4             20            1.8
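The complexity columns of Table I follow directly from the two expressions W × I and W × I + W × G. The short sketch below recomputes them, assuming G = 8 for the 8-bit gradient stated in the setup; the function names are hypothetical.

```python
GRADIENT_BITS = 8  # assumed from the 8-bit gradient used for the quantized models


def inference_complexity(w_bits, i_bits):
    """Inference complexity: W x I."""
    return w_bits * i_bits


def training_complexity(w_bits, i_bits, g_bits=GRADIENT_BITS):
    """Training complexity: W x I + W x G."""
    return w_bits * i_bits + w_bits * g_bits


for w, i in [(1, 1), (1, 4), (1, 8), (2, 2)]:
    print(f"{w}:{i}", inference_complexity(w, i), training_complexity(w, i))
```

The printed pairs (1, 9), (4, 12), (8, 16), and (4, 20) agree with the Inference and Training columns of Table I.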
B. Storage
Five bit-width configurations of W:I (32:32, 1:1, 1:4, 1:8, and 2:2) are
selected to evaluate memory storage requirements. The breakdown
of memory storage is shown in Figure 8a. We observe
that as the CNN model's bit-width decreases, less memory
storage is required. For instance, the 1:4 configuration, which has
higher inference accuracy than 32:32, shows a ∼11.7×
memory reduction. To investigate the memory usage on large
data-sets, we implement three different bit-width configurations of W:I
(64:64, 32:32, and 1:1) using the AlexNet model on the ImageNet
data-set on the proposed accelerator, as shown in Figure 8b. We observe that
the 1:1 bit-width configuration demands ∼40MB of memory, which
is ∼6× and ∼12× smaller in comparison with the single- and
double-precision CNNs, respectively.
Figure 7: (a) Circuit level design of proposed NV-FA, and (b) timing diagram for NV-FA operation.
Figure 8: Usage of memory storage by (a) the CNN model for SVHN data-set, and (b) AlexNet for ImageNet data-set. [Y-axis: Memory Storage (MB); legend: C1 to C8; (a) W:I = 32:32, 1:1, 1:4, 1:8, 2:2; (b) W:I = 32:32, 64:64, 1:1]
C. Energy Consumption
In this subsection, we estimate the energy-efficiency of the
CNN model implemented by the proposed accelerator and by
state-of-the-art inference acceleration solutions, i.e. ReRAM,
SOT-MRAM, and CMOS-only ASIC. To evaluate the performance
of the proposed design, the circuit-level simulation is
implemented in Cadence Spectre using the NCSU 45nm CMOS
PDK [18] in conjunction with a SOT-MRAM resistive model.
The NEGF approach is utilized to extract the MTJ resistance
(RMTJ) [19], whereas the heavy metal resistance (RSHM) is
determined based on the resistivity and device dimensions. Accordingly,
we extensively modified the system-level memory
evaluation tool NVSim [20] to co-simulate with an in-house
developed C++ simulator based on circuit-level results.
We configure the memory sub-array organization with 256
rows and 512 columns per mat organized in an H-tree routing
manner, 2×2 mats per bank, 8×8 banks per group; in total 16
groups and 512Mb total capacity.
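The stated 512Mb total capacity can be cross-checked from the memory organization with a few lines of arithmetic. The constant names below are hypothetical, and one bit per SOT-MRAM cell is assumed.

```python
ROWS_PER_MAT = 256
COLS_PER_MAT = 512
MATS_PER_BANK = 2 * 2
BANKS_PER_GROUP = 8 * 8
GROUPS = 16

bits_per_mat = ROWS_PER_MAT * COLS_PER_MAT  # one bit per cell (assumption)
total_bits = bits_per_mat * MATS_PER_BANK * BANKS_PER_GROUP * GROUPS

print(total_bits // 2**20, "Mb")  # 512 Mb
```

Each mat holds 128Kb, each bank 512Kb, and each group 32Mb, so the 16 groups together give the reported 512Mb.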
For comparison, a ReRAM-based in-memory accelerator
based on [6] was developed with 64 fully-functional sub-arrays.
For each mat, there are 256×256 ReRAM cells and
eight 8-bit reconfigurable SAs. For evaluation, the NVSim
simulator [20] was modified to estimate the system energy and
performance. We adopted NVSim's default ReRAM cell
file (.cell) for the assessment. In addition, we developed an IMCE-like
[12] design with the same sub-array configuration as our
design. To compare the results with ASIC accelerators, we
developed a YodaNN-like [21] design with 8×8 tiles for 33MB
eDRAM. Accordingly, we synthesized the design with Design
Compiler [22] under a 45 nm process node. The eDRAM
and SRAM performance were estimated using CACTI [23].
In order to have a fair comparison, area-normalized results
(performance/energy per area) are reported henceforth.
Figure 9 demonstrates the proposed accelerator energy-
efficiency results with batch sizes of 1 and 8 in different
configuration spaces of weight and input. We observe that
the proposed accelerator offers the highest energy-efficiency
normalized to area compared to others owing to its fast,
energy-efficient, and parallel operations. Our design shows
∼2.1× better energy-efficiency compared to IMCE. This en-
ergy reduction comes mainly from using a fast, efficient,
in-memory compressor instead of a serial counter in the
accumulation phase. In addition, 5.4× and 9.7× better energy
efficiencies are reported over ReRAM and ASIC accelerators,
respectively.
D. Performance Estimation
Figure 10 compares the throughput in frames per second
normalized to area of the proposed design with other acceler-
ators. We observe that the AND-Accumulation method leads
to ∼3× higher performance than AND-bitcount employed in
IMCE. In addition, it is 9× and 13.5× faster on average
than ReRAM and ASIC-64 solutions. This arises from two
sources: (1) ultra-fast and parallel in-memory operations of the
proposed design compared to multi-cycle ASIC and ReRAM
solutions and (2) the existing mismatch between computation
and data movement in ASIC design. In addition, the ReRAM
design uses a matrix-splitting approach because of the intrinsically
limited bit levels of ReRAM devices, so that excessive
sub-arrays are occupied. This can further limit parallelism
methods [6].
E. Area-Energy trade-off
In this subsection, we evaluate the energy/area of BCNN
resistive processing-in-memory accelerators (based on ReRAM
[8] and SOT-MRAM [12]) for inference of a single image
over three well-known data-sets under a 45nm technology
node. Table II demonstrates that the proposed SOT-MRAM-based
accelerator can process BCNN very efficiently compared
to the others. It is worth pointing out that the energy reported in
Table II consists of the energy of the convolution computation
of all layers. We observe that our design can execute binary-weight
AlexNet [3] on ImageNet favorably at 471.8µJ/img,
where ∼4.8× and ∼3.5× smaller energy and area are obtained,
respectively, compared to the ReRAM-based design. In addition,
the proposed accelerator exhibits 1.6× better energy
savings compared to SOT-MRAM IMCE on ImageNet, even
though it imposes a larger overhead on the memory chip.
Figure 9: Energy-efficiency of different accelerators (Y-axis=Log scale). [Y-axis: energy efficiency/area (fr./J/mm2); X-axis: batch sizes 1 and 8 for each W:I configuration <1:1>, <1:4>, <1:8>, <2:2>; series: Proposed, IMCE, ReRAM, ASIC-64]
Figure 10: Performance of different accelerators (Y-axis=Log scale). [Y-axis: performance/area (fr./s/mm2); X-axis: batch sizes 1 and 8 for each W:I configuration <1:1>, <1:4>, <1:8>, <2:2>; series: Proposed, IMCE, ReRAM, ASIC-64]
Table II: Energy-area comparison of different NVM-based BCNN accelerators.
              ImageNet                SVHN                    MNIST
Designs       Energy     Area         Energy     Area         Energy     Area
              (µJ/img)   (mm2)        (µJ/img)   (mm2)        (µJ/img)   (mm2)
ReRAM [8]     2275.34    9.19         425.21     0.085        13.55      0.060
IMCE [12]     785.25     2.12         135.26     0.01         0.92       0.009
Proposed      471.8      2.60         84.31      0.039        0.68       0.012
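The ImageNet ratios quoted in this subsection can be recomputed from the Table II entries. The sketch below is only an arithmetic cross-check with hypothetical variable names; it uses the ImageNet energy and area columns as listed.

```python
# ImageNet columns of Table II.
energy_uj = {"ReRAM": 2275.34, "IMCE": 785.25, "Proposed": 471.8}
area_mm2 = {"ReRAM": 9.19, "IMCE": 2.12, "Proposed": 2.60}

energy_vs_reram = energy_uj["ReRAM"] / energy_uj["Proposed"]
area_vs_reram = area_mm2["ReRAM"] / area_mm2["Proposed"]
energy_vs_imce = energy_uj["IMCE"] / energy_uj["Proposed"]
area_vs_imce = area_mm2["Proposed"] / area_mm2["IMCE"]

print(f"energy vs ReRAM: {energy_vs_reram:.2f}x")
print(f"area vs ReRAM:   {area_vs_reram:.2f}x")
print(f"energy vs IMCE:  {energy_vs_imce:.2f}x")
print(f"area overhead vs IMCE: {area_vs_imce:.2f}x")
```

The first three ratios come out close to the ∼4.8×, ∼3.5×, and 1.6× figures quoted in the text, and the last one quantifies the larger area relative to IMCE on ImageNet.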
IV. CONCLUSION
In this work, a bit-wise CNN in-memory accelerator based
on SOT-MRAM computational sub-arrays was proposed. This
new architecture can be leveraged to greatly reduce the energy
consumption of convolutional layers and to accelerate
low bit-width CNN inference within non-volatile MRAM.
Our device-to-architecture co-simulation results show that
the proposed accelerator can attain ∼5.4× higher energy-efficiency
and 9× speedup compared to ReRAM-based accelerators, and
∼9.7× higher energy-efficiency and 13.5× speedup over ASIC
accelerators, while holding almost the same inference accuracy as the
baseline CNN on different data-sets.
We plan to extend this work to mitigate the issue of write
operations for NV elements, which consume a large
amount of power. Choosing a proper thermal barrier, i.e. 30kT,
for MTJ devices could provide retention times ranging from
minutes to hours and achieve at least 50% energy reduction
compared to nanomagnets with a thermal barrier around 40kT.
Another approach to reducing the performance overhead caused by
NV elements is to leverage one NV-FF instead of two NV-FFs
within each FA. After a specified duration, only Cout will be
stored in an NV-FF while sum is saved in a regular FF. If
power failure occurs, the stored value is considered as both
sum and Cout for the next add operation. In this scenario,
power-delay product (PDP) improvements can be achieved at the cost of lower
accuracy. Generally, in a situation with a high occurrence rate
of power failure, the number of completed tasks for a CMOS-only
implementation is significantly reduced, which degrades
the performance of the system [14]. Hence, utilizing the power-failure
resilient architecture, even without further optimization,
can avoid the high bulk-write energy costs of Flash pages and
the complexities of checkpointing/restore protocols.
REFERENCES
[1] R. Andri et al., “Yodann: An architecture for ultra-low power binary-
weight cnn acceleration,” IEEE TCAD, 2017.
[2] S. Zhou et al., “Dorefa-net: Training low bitwidth convolutional neural
networks with low bitwidth gradients,” arXiv preprint arXiv:1606.06160,
2016.
[3] M. Rastegari et al., “Xnor-net: Imagenet classification using binary
convolutional neural networks,” in European Conference on Computer
Vision. Springer, 2016, pp. 525–542.
[4] M. Courbariaux et al., “Binarized neural networks: Training deep neural
networks with weights and activations constrained to +1 or -1,” arXiv
preprint arXiv:1602.02830, 2016.
[5] S. Angizi and D. Fan, “Imc: energy-efficient in-memory convolver
for accelerating binarized deep neural network,” in Proceedings of the
Neuromorphic Computing Symposium. ACM, 2017, p. 3.
[6] P. Chi et al., “Prime: A novel processing-in-memory architecture for
neural network computation in reram-based main memory,” in ISCA,
vol. 43, 2016.
[7] M. Imani, Y. Kim, and T. Rosing, “Mpim: Multi-purpose in-memory
processing using configurable resistive memory,” in ASP-DAC. IEEE,
2017, pp. 757–763.
[8] T. Tang et al., “Binary convolutional neural network on rram,” in 22nd
ASP-DAC. IEEE, 2017, pp. 782–787.
[9] X. Fong et al., “Spin-transfer torque devices for logic and memory:
Prospects and perspectives,” IEEE TCAD, vol. 35, 2016.
[10] Z. He et al., “Exploring stt-mram based in-memory computing paradigm
with application of image edge extraction,” in ICCD. IEEE, 2017, pp.
439–446.
[11] L. Cavigelli et al., “Accelerating real-time embedded scene labeling with
convolutional networks,” in DAC, 2015 52nd ACM/EDAC/IEEE. IEEE,
2015, pp. 1–6.
[12] S. Angizi et al., “Imce: energy-efficient bit-wise in-memory convolution
engine for deep neural network,” in ASP-DAC. IEEE Press, 2018, pp.
111–116.
[13] S. Angizi, Z. He et al., “Cmp-pim: An energy-efficient comparator-based
processing-in-memory neural network accelerator,” IEEE/ACM Design
Automation Conference (DAC), 2018.
[14] A. Roohi and R. F. DeMara, “Nv-clustering: Normally-off computing
using non-volatile datapaths,” IEEE Transactions on Computers, vol. 67,
no. 7, pp. 949–959, July 2018.
[15] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, “Design and anal-
ysis of approximate compressors for multiplication,” IEEE Transactions
on Computers, vol. 64, no. 4, pp. 984–994, 2015.
[16] Y. Netzer et al., “Reading digits in natural images with unsupervised
feature learning,” in NIPS workshop on deep learning and unsupervised
feature learning, vol. 2011, no. 2, 2011, p. 5.
[17] M. Abadi et al., “Tensorflow: Large-scale machine learning on hetero-
geneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
[18] (2011) Ncsu eda freepdk45. [Online]. Available:
http://www.eda.ncsu.edu/wiki/FreePDK45:Contents
[19] G. Panagopoulos et al., “A framework for simulating hybrid mtj/cmos
circuits: Atoms to system approach,” in DATE, 2012.
[20] X. Dong et al., “Nvsim: A circuit-level performance, energy, and
area model for emerging non-volatile memory,” in Emerging Memory
Technologies. Springer, 2014, pp. 15–50.
[21] R. Andri et al., “Yodann: An ultra-low power convolutional neural
network accelerator based on binary weights,” in ISVLSI. IEEE, 2016,
pp. 236–241.
[22] S. D. C. P. V. . Synopsys, Inc.
[23] K. Chen et al., “Cacti-3dd: Architecture-level modeling for 3d die-
stacked dram main memory,” in DATE, 2012. IEEE, 2012, pp. 33–38.
