SMART Paths for Latency Reduction in ReRAM Processing-In-Memory
  Architecture for CNN Inference by Ko, Sho & Yu, Shimeng
Sho Ko
School of ECE
Georgia Tech
Atlanta, GA, USA
sko.45@gatech.edu
Shimeng Yu
School of ECE
Georgia Tech
Atlanta, GA, USA
shimeng.yu@ece.gatech.edu
Abstract—This research work proposes a design of an analog
ReRAM-based PIM (processing-in-memory) architecture for fast
and efficient CNN (convolutional neural network) inference. For
the overall architecture, we use a basic hardware hierarchy
of node, tile, core, and subarray. On top of that, we
design intra-layer pipelining, inter-layer pipelining, and batch
pipelining to exploit parallelism in the architecture and increase
overall throughput for the inference of an input image stream.
We also optimize the performance of the NoC (network-on-
chip) routers by decreasing hop counts using SMART (single-
cycle multi-hop asynchronous repeated traversal) flow control.
Finally, we experiment with weight replications for different
CNN layers in VGG (A-E) on the large-scale ImageNet data set.
In our simulation, we achieve 40.4027 TOPS (tera-operations per
second) for the best-case performance, which corresponds to over
1029 FPS (frames per second). We also achieve 3.5914 TOPS/W
(tera-operations per second per watt) for the best-case energy
efficiency. In addition, the architecture with aggressive pipelining
and weight replications can achieve 14× speedup compared to
the baseline architecture with basic pipelining, and SMART flow
control achieves 1.08× speedup in this architecture compared to
the baseline. Last but not least, we also evaluate the performance
of SMART flow control using synthetic traffic.
Index Terms—hardware accelerator, ReRAM (resistive random
access memory), PIM (processing-in-memory), CNN (convolu-
tional neural network), NoC (network-on-chip), SMART flow
control
I. INTRODUCTION
Recently, artificial intelligence (AI) and machine learning
(ML) techniques have matured considerably, and AI/ML has
gradually become an indispensable part of our lives. Neural
networks (NNs) are an important category of AI/ML that mimics
how humans think, and neuromorphic computing has gained
popularity in both academia and industry. Specifically,
the CNN is one type of NN that has revolutionized deep
learning by achieving unprecedented accuracy in
computer vision and pattern recognition applications.
From the software perspective, the data flow in a CNN
is clear and simple, and programming a CNN in a deep
learning framework is easy. However, from the
hardware perspective, a CNN is extremely computation intensive
and power hungry. The reason is that a CNN requires billions of
MAC (multiply and accumulate) operations, which quickly tie
up a conventional CPU or even a GPU. Therefore, the circuit
and architecture research communities have increasingly focused
on designing efficient digital accelerators for CNNs. The building
blocks of these accelerators are still transistors and logic gates,
and researchers use these basic modules to design
digital circuits and computer architectures that compute CNNs
as efficiently as possible. However, as digital processors approach
the upper limit of their computing capability, researchers are
looking for next-generation hardware solutions. Recently, several
emerging eNVMs (embedded non-volatile memories) have been
developed at the device level. As these technologies mature,
researchers are investigating how to use them to build efficient
PIM circuits and architectures for processing CNNs. Our paper
presents an efficient ReRAM-based PIM architecture for CNN
inference. The contributions are summarized in the following
three points.
1) We utilize one type of eNVM, ReRAM (resistive random
access memory) to design a PIM architecture for fast and
efficient CNN inference.
2) We optimize the architecture from both the processing
side and the interconnect side. From the processing side,
we design intra-layer pipelining, inter-layer pipelining,
and batch pipelining to exploit parallelism in the archi-
tecture and increase overall throughput for the inference
of an input image stream. From the interconnect side,
we maximize the performance of the NoC routers by
decreasing hop counts using SMART flow control. In
addition, we experiment with weight replications to
further accelerate the architecture.
3) To report throughput and energy efficiency, we run the
cycle-accurate simulation for the processing side by
building a C++ simulator from scratch, and we use
the garnet2.0 simulator for the interconnect side. Our
design achieves great speedup compared to the baselines.
Independently, we also evaluate the performance of
SMART flow control using synthetic traffic in garnet2.0.
arXiv:2004.04865v1 [cs.AR] 10 Apr 2020
II. BACKGROUND
A. CNN Training and Inference
In this section, we present the CNN algorithm implemented
in our design. A CNN has two phases: training and inference.
To start with, the weights in a CNN are initialized with
random values. Then the training process updates and
refines the weights for a specific data set. Finally, the
trained CNN can be used for inference on new images.
Typically, training consumes much more power and
time than inference because training
requires forward propagation, back propagation, and weight
updates, while inference requires only forward propagation.
However, a CNN needs to be trained only once, and then it can
be used for inference many times. A CNN used for inference
consists of multiple layers of three basic types:
convolution layers, pooling layers, and fully-connected layers,
as shown in Fig. 1. In this paper, we focus on the inference
process of a CNN.
Fig. 1. CNN Inference.
B. ReRAM Device
From the device perspective, resistance-based eNVMs have
become increasingly mature and manufacturable. Technologies
such as ReRAM [1], PCM (phase change memory)
[2], and STT-MRAM (spin-transfer torque magnetic random
access memory) [3] are gaining popularity.
These eNVMs have a much smaller cell size than SRAM. They
can also operate as MLCs (multi-level cells) that store multiple
bits per cell. Therefore, they can hold all of the weights
on-chip at once and eliminate off-chip accesses. They are also
non-volatile and CMOS-process compatible. Their access time is
within 10 ns, which is on the same order of magnitude as SRAM.
C. Analog PIM Circuit
Fig. 2. ReRAM Array with Peripheral Circuits.
From the circuit perspective, a 2D ReRAM array is a grid
of ReRAM cells, as shown in Fig. 2.
Such a design can exploit the analog characteristics of ReRAM
to perform fast and energy-efficient matrix multiplication
and convolution. Vector-matrix multiplication maps naturally
onto ReRAM because of two basic circuit
laws, Ohm’s law and Kirchhoff’s current law. Ohm’s
law states that the current through a resistor is equal to
the voltage across the resistor divided by its resistance
(I = V/R), which is also equal to the voltage
across the resistor multiplied by its conductance
(I = VG). This law makes analog (real-valued)
multiplication possible. Kirchhoff’s current law states that the
total current leaving a node is equal to the sum of all currents
entering that node. This law makes analog (real-valued)
addition possible. Vector-matrix multiplication
can be mapped to ReRAM in the following three steps. First,
the digital input is converted to analog signals by DACs
(digital-to-analog converters) and applied as voltages
on the horizontal WLs (word lines). Second, the weight matrix is
quantized and mapped to the conductances of the ReRAM
cells. Third, the analog output signals are read as
currents on the vertical BLs (bit lines), stored in sample &
hold units, and converted to digital outputs by ADCs (analog-to-digital
converters); the outputs of some columns are then shifted and added
together to produce the final results.
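As a minimal sketch of the idea above, the following Python function models an idealized crossbar in which each cell contributes a current V × G (Ohm’s law) and each bit line sums the currents of its cells (Kirchhoff’s current law). The function name, the array size, and the voltage and conductance values are illustrative assumptions; DACs, ADCs, wire resistance, and noise are ignored.

```python
def crossbar_column_currents(voltages, conductances):
    """Return the current on each bit line (column) of an idealized crossbar.

    voltages[i]        -- voltage applied to word line (row) i, in volts
    conductances[i][j] -- conductance of the cell at row i, column j, in siemens
    """
    num_cols = len(conductances[0])
    currents = [0.0] * num_cols
    for v, row in zip(voltages, conductances):
        for j, g in enumerate(row):
            # Ohm's law gives the cell current v * g; Kirchhoff's current
            # law sums the cell currents that share a bit line.
            currents[j] += v * g
    return currents


# Hypothetical 3 x 2 array: three word lines, two bit lines.
V = [1.0, 0.0, 1.0]
G = [[1e-6, 2e-6],
     [3e-6, 4e-6],
     [5e-6, 6e-6]]
I = crossbar_column_currents(V, G)  # one vector-matrix product, read in parallel
```

Each entry of `I` is the dot product of the input vector with one column of the conductance matrix, which is why a single read of the array yields a full vector-matrix product.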
D. PIM Architecture
Recently, several ReRAM-based PIM architectures have
been presented for CNN inference, such as PUMA [4], ISAAC
[5], and PRIME [6].
PUMA creates its own compiler and domain-specific ISAs
(instruction set architectures) to make the architecture more
general-purpose, programmable, and reconfigurable. It’s a spa-
tial architecture in which each tile is executing its own ISAs
simultaneously with all other tiles. It uses a state machine
to synchronize among different cores, it has a large synchro-
nization overhead. In addition, the penalty of ISA, instruction
decoder, and instruction memory is also large if the workloads
are only CNNs.
Unlike PUMA, ISAAC and PRIME are ASICs specifically
for CNN inference. PRIME is slightly different from ISAAC
in the sense that PRIME stores positive and negative weights
in separate subarrays while ISAAC stores them in the same
subarray and uses a small trick to differentiate based on the
MSB of a positive 2’s complementary number is 0 while the
MSB of a negative 2’s complementary number is 1. Therefore,
PRIME comes with more area and power penalty, which leads
to less area and energy efficiency.
III. OVERALL ARCHITECTURE
Fig. 3. Overall Architecture of a Node.
The overall chip, also called a node, is shown in Fig. 3. The
node is composed of 16 × 20 = 320 tiles. Each tile has a router
associated with it, and the routers form a mesh structure. Within
each tile, there are 12 cores, a local memory of 64KB eDRAM,
a shift & add unit, an output register of 2KB eDRAM, two
sigmoid units, and a max pooling unit. Within each core, there
are eight ReRAM subarrays of size 128 × 128, 128 × 8 1-bit
DACs, 128 × 8 sample & hold units, eight ADCs with 8-bit
resolution, four shift & add units, an input register of 2KB
eDRAM, and an output register of 2KB eDRAM. There are
buses within each tile and each core. The number of each
component is chosen so that there is no structural hazard
at run time. In our design, the weights and feature
maps are both fixed at 16 bits. Prior work has
shown that 16 bits are accurate enough for CNN inference.
We conservatively assume 2-bit MLC for each ReRAM cell.
Therefore, each weight needs eight cells across eight different
columns to encode its 16 bits. In addition, our DAC has 1-bit
resolution, which makes it trivial to implement. Since a 16-bit DAC
has too much noise and takes too much area and power, we pass in
each 16-bit IFM (input feature map) value bit by bit over 16 cycles.
Therefore, we only need 1-bit DACs. Note that since we partition the
weight spatially across different columns and we also partition
the input temporally within the same column, the shift & add
unit after the ADC is necessary to produce the correct
final results.
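To make the role of the shift & add unit concrete, here is a minimal Python sketch of the arithmetic only. It assumes unsigned 16-bit inputs and weights, slices each weight into eight 2-bit cell values, feeds one input bit per cycle, and recombines the partial column sums by shifting. The function names are hypothetical, and the sketch ignores ADC resolution limits, signed values, and analog noise.

```python
CELL_BITS = 2      # bits stored per ReRAM cell (2-bit MLC, from the text)
WEIGHT_BITS = 16   # weight width (from the text)
INPUT_BITS = 16    # input width (from the text)
CELLS_PER_WEIGHT = WEIGHT_BITS // CELL_BITS  # eight columns per weight


def slice_weight(w):
    """Split an unsigned 16-bit weight into eight 2-bit cell values, LSB slice first."""
    mask = (1 << CELL_BITS) - 1
    return [(w >> (CELL_BITS * k)) & mask for k in range(CELLS_PER_WEIGHT)]


def bit_serial_dot(inputs, weights):
    """Dot product built from 1-bit inputs and 2-bit weight slices."""
    sliced = [slice_weight(w) for w in weights]
    total = 0
    for t in range(INPUT_BITS):                # 16 cycles, one input bit each
        bits = [(x >> t) & 1 for x in inputs]  # what the 1-bit DACs drive
        for k in range(CELLS_PER_WEIGHT):      # the eight columns of one weight
            # Column sum: what the bit line and ADC would deliver this cycle.
            column_sum = sum(b * s[k] for b, s in zip(bits, sliced))
            # Shift by the input-bit position and the weight-slice position.
            total += column_sum << (t + CELL_BITS * k)
    return total


# Illustrative values only.
xs = [3, 65535, 1234]
ws = [7, 65535, 4321]
result = bit_serial_dot(xs, ws)
```

The shifts by `t` and by `CELL_BITS * k` correspond to the temporal partition of the input and the spatial partition of the weight, respectively, so `result` equals the ordinary integer dot product of `xs` and `ws`.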
Fig. 4 shows the power and area of each individual component.
We gather the data from PUMA [4] and ISAAC
[5], both of which use a 32 nm CMOS technology node.
Note that the table shows the power consumption when the
component is active. The node has a total area of 124.848
mm². The total power is 108.26944 W, which is the peak
power consumption assuming every component on the chip is
active in every cycle. For each workload, we analyze the
energy efficiency by summing the energy consumed in each
pipeline stage.
IV. EFFICIENT PIPELINING
Fig. 4. Power and Area of Each Hardware Component.
To better illustrate how intra-layer pipelining, inter-layer
pipelining, and batch pipelining work, we define the IFM
of the current CNN layer to be I, the kernel of the current
CNN layer to be K, and the OFM (output feature map) of the
current CNN layer (also the IFM of the next layer) to be O.
I is a 3D matrix with dimensions c (channel) × h (height) ×
w (width). K is a 4D matrix with dimensions n (kernel) ×
c (channel) × l (length) × l (length). O is a 3D matrix with
dimensions n (kernel) × h (height) × w (width).
A. Intra-layer Pipelining
The intra-layer pipeline takes h × w logical cycles to
pass the entire IFM through. Note that one
intra-layer pipeline processes one pixel from all channels. There
are four different intra-layer pipelines for one CNN layer,
depending on whether the layer is mapped to a single tile
or multiple tiles and whether the layer has a pooling operation
at the end. Specifically, a single-tile mapping without pooling
requires 24 cycles; a single-tile mapping with pooling requires 29
cycles; a multi-tile mapping without pooling requires 26 cycles;
and a multi-tile mapping with pooling requires 31 cycles.
B. Inter-layer Pipelining
For the inter-layer pipeline, we observe that we don’t need
to wait for the current layer to produce the entire OFM
before starting the next layer. We only need to wait until
the current layer has produced enough values to start the first
convolution of the next layer. The number of values in O that
the next layer needs to wait for is
valuesWait = (w × (l − 1) + l) × n    (1)
and the number of cycles the next layer needs to wait is
cyclesWait = w × (l − 1) + l    (2)
where the kernel strides in row-major order.
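Equations (1) and (2) can be evaluated directly, as in the minimal Python sketch below. The function names are our own, and the example values (a 224-wide OFM, a 3 × 3 kernel, and n = 64 kernels) are illustrative assumptions rather than figures reported in the paper.

```python
def cycles_wait(w, l):
    """Eq. (2): cycles the next layer waits before its first convolution.

    w -- width of the OFM O
    l -- side length of the l x l kernel
    """
    return w * (l - 1) + l


def values_wait(w, l, n):
    """Eq. (1): number of values of O the next layer waits for (n kernels)."""
    return cycles_wait(w, l) * n


# Illustrative example: 224-wide OFM, 3 x 3 kernel, 64 kernels.
example_cycles = cycles_wait(224, 3)      # 224 * 2 + 3 = 451
example_values = values_wait(224, 3, 64)  # 451 * 64 = 28864
```

In this example the next layer can start after 451 cycles instead of waiting for all 224 × 224 = 50176 pixel positions of the OFM.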
C. Batch Pipelining
Fig. 5. Speedup of Each VGG due to Different Pipelinings.
Fig. 6. Speedup of Each VGG due to Different NoCs.
For batch pipelining, we observe that we can design a
pipeline that overlaps the processing of input images that
arrive consecutively at a certain rate. We follow two
principles to design the batch pipeline. First, there should
be no structural hazard, which means that in the same cycle, the
pipeline cannot process a specific layer (say layer 1) for two
or more different images. In other words, a specific layer in
a specific cycle can only process one image. Second,
dependencies between consecutive layers (say layer 1 and layer
2) should be strictly followed for all images. In other words,
if layer 2 has to wait for 2 cycles after layer 1 starts, then
layer 2 of every image has to wait for 2 cycles after the
corresponding layer 1 starts.
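The two design principles can be illustrated with a small scheduling sketch in Python. This is not the scheduler used in our simulator; it is one simple policy that satisfies both principles, under the assumption that each layer is busy for a known number of cycles per image. The names `busy`, `wait`, and the example numbers are hypothetical.

```python
def batch_schedule(busy, wait, num_images):
    """Start cycle of every (image, layer) pair.

    busy[k] -- cycles layer k is occupied by one image (hypothetical input)
    wait[k] -- cycles layer k+1 waits after layer k starts
    """
    # Principle 1: issue images no faster than the slowest layer, so no
    # layer ever holds two images in the same cycle.
    interval = max(busy)
    # Principle 2: the offsets between consecutive layers are fixed and
    # identical for every image.
    offsets = [0]
    for w in wait:
        offsets.append(offsets[-1] + w)
    return [[img * interval + off for off in offsets]
            for img in range(num_images)]


def has_structural_hazard(starts, busy):
    """True if any layer would process two images in the same cycle."""
    for k, b in enumerate(busy):
        layer_starts = sorted(s[k] for s in starts)
        for first, second in zip(layer_starts, layer_starts[1:]):
            if second < first + b:
                return True
    return False


# Hypothetical three-layer network and three input images.
busy = [4, 6, 3]
wait = [2, 2]
starts = batch_schedule(busy, wait, 3)  # [[0, 2, 4], [6, 8, 10], [12, 14, 16]]
```

Issuing a new image every `max(busy)` cycles is the simplest interval that avoids structural hazards; a tighter schedule may exist for specific layer timings.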
V. SMART FLOW CONTROL
The topology of a NoC describes how routers are connected
via links/channels. NoC topologies include bus, ring,
mesh, torus, flattened butterfly, fully connected, and so on. The
most common topology is a 2D mesh because it can be laid out
easily. In our design, the NoC is a 16 × 20 2D mesh.
The routing of a NoC describes the links that a flit takes
from the source router to the destination router. For example,
XY routing means that a flit always travels in the horizontal (X)
direction first and then in the vertical (Y) direction. In addition,
a turn model such as the north-last model or the east-first model
can be used; it disallows some turns to avoid deadlocks in the
NoC. In our design, we use XY routing. In addition, we set
the link width to 128 bits, which is also the flit size.
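XY routing on a mesh can be sketched in a few lines of Python. The coordinate convention (x along the 20-tile side, y along the 16-tile side) and the function names are assumptions made for illustration.

```python
def xy_route(src, dst):
    """Routers visited from src to dst under XY routing; coordinates are (x, y)."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:            # travel along X first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:            # then along Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hop_count(src, dst):
    """Number of links traversed under XY routing on a mesh."""
    return abs(dst[0] - src[0]) + abs(dst[1] - src[1])


# Corner to corner on the 16 x 20 mesh, assuming 20 columns and 16 rows.
corner_path = xy_route((0, 0), (19, 15))
corner_hops = hop_count((0, 0), (19, 15))  # 19 + 15 = 34
```

The corner-to-corner case gives the largest hop count on this mesh, which is the value of H that a hop-by-hop flow control pays in full.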
The flow control of a NoC determines when a flit can traverse
to the downstream router and when it has to stay at the upstream
router because of traffic in the NoC. The most common algorithm is
wormhole flow control. In wormhole, the link is allocated
at the packet level and the buffer is allocated at the flit
level. It improves significantly on virtual cut-through
flow control because its buffer can hold flits from
different packets. However, wormhole still suffers from poor
link utilization and HoL (head-of-line) blocking.
HoL blocking means that if the first flit in the buffer cannot move,
none of the remaining flits in the buffer can move either. In our
design, we use wormhole as one baseline.
Fig. 7. Weight Replications of Each VGG.
To express the total latency in cycles to send a packet
from the source to the destination, we define the wire delay
for one link to be tw, the hop count to be H, the contention
delay to be tc, and the serialization delay to be ts. A typical
formula for the latency in cycles is
T = tw × H + tc + ts    (3)
where the bottleneck is the term H. The ideal way to
reduce H to 1 is to use a fully connected topology.
However, since it’s nearly impossible to lay out such a topology,
we resort to the SMART flow control algorithm, which makes
the NoC behave much like an ideal fully connected topology.
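A short Python sketch of Eq. (3) shows why H dominates. The parameter values below (one cycle per link, no contention, four cycles of serialization, and a 34-hop path) are hypothetical and chosen only for illustration.

```python
def packet_latency(t_w, hops, t_c, t_s):
    """Eq. (3): T = t_w * H + t_c + t_s, with all terms in cycles."""
    return t_w * hops + t_c + t_s


# Hypothetical values for a contention-free packet.
mesh_latency = packet_latency(t_w=1, hops=34, t_c=0, t_s=4)   # 38 cycles
ideal_latency = packet_latency(t_w=1, hops=1, t_c=0, t_s=4)   # 5 cycles
```

With these assumed values, reducing H from 34 to 1 removes most of the latency, which is the gap that a fully connected topology would close.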
Our NoC model uses SMART flow control from [7] to
reduce NoC latency and increase the overall throughput.
Repeated wires after place-and-route can span up to 16 mm in 1 ns
in the 45 nm technology node. They can span further in the projected
32 nm technology node because wire delay remains constant or
decreases slightly with technology scaling. Therefore,
on-chip wires are fast enough to carry a signal across the chip
within 1 or 2 clock cycles. The high-level idea of
SMART flow control is to use multiplexers to bypass the routers
on the path from the source to the destination. However,
SMART is limited when two different packets are
sent in the NoC in the same cycle and the two packets share
some common links. In that case, we need to assign different
priorities to the two paths when setting up the SSRs (SMART-hop
setup requests) to ensure correct functionality.
VI. EVALUATION
A. Simulators
In the experiment, we run the cycle-accurate simulation
for the processing side by building a C++ simulator from
scratch. We use the cycle-accurate garnet2.0 simulator for the
interconnect side.
B. Workloads and Benchmarks
Fig. 8. VGG E Throughput.
Fig. 9. Energy Efficiency of Each VGG.
We use VGG (A-E) [8] on the large-scale ImageNet data set
[9] as our workloads. VGG thoroughly evaluates
networks of increasing depth using an architecture with very
small (3 × 3) convolution filters. Compared to previous CNNs,
VGG improves the accuracy of computer vision and pattern
recognition tasks by a wide margin, which is achieved by
pushing the CNN depth from a few layers to tens of layers.
There are a total of 3 × 4 × 5 = 60 benchmarks, formed from
five different CNNs, VGG (A-E); three different NoCs, ideal,
SMART, and wormhole; and four different pipelining scenarios:
without weight replication and without batch pipelining (1),
without weight replication and with batch pipelining (2),
with weight replication and without batch pipelining (3), and with
weight replication and with batch pipelining (4).
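The benchmark space is the Cartesian product of the three axes, which the following Python sketch enumerates. The label strings are our own naming for illustration.

```python
from itertools import product

networks = ["VGG-A", "VGG-B", "VGG-C", "VGG-D", "VGG-E"]
nocs = ["ideal", "SMART", "wormhole"]
# Scenario number -> (weight replication, batch pipelining)
scenarios = {
    1: (False, False),
    2: (False, True),
    3: (True, False),
    4: (True, True),
}

# Iterating a dict yields its keys, so each benchmark is
# (network, NoC, scenario number).
benchmarks = list(product(networks, nocs, scenarios))
num_benchmarks = len(benchmarks)  # 5 * 3 * 4 = 60
```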
C. Weight Replications
Pooling layers degrade the performance of inter-layer
pipelining because the next layer has to wait for the pooled
results which come from different columns of the current
feature map. This introduces extra pipeline bubbles, increases
latency, and decreases throughput. To obtain a more
balanced pipeline design, we replicate the weights more times for
the first few layers and fewer times for the deeper
layers. Specifically, all five VGGs are down-sampled five
times: 224 × 224, 112 × 112, 56 × 56, 28 × 28, 14 × 14,
7 × 7. Each time, a 2 × 2 pooling grid is applied to the whole OFM.
To follow this trend, we replicate the weights 16
times, 8 times, 4 times, 2 times, and 1 time. Fig. 7 shows the
number of times the weights are replicated in each layer for
each VGG. All schemes meet the constraint that there are a
maximum of 320 tiles available.
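The balancing effect can be illustrated with a rough Python estimate. It assumes that a layer with an h × w feature map needs h × w logical cycles per image, and that r weight copies process r pixel positions in parallel; it also assumes the five replication factors apply to the stages with 224, 112, 56, 28, and 14 wide feature maps, respectively. These are simplifying assumptions for illustration, not simulator output.

```python
sizes = [224, 112, 56, 28, 14]   # feature map side length of each stage
replication = [16, 8, 4, 2, 1]   # weight copies per stage (from the text)

# Logical cycles per image for each stage, without and with replication.
unreplicated = [s * s for s in sizes]
replicated = [s * s // r for s, r in zip(sizes, replication)]

# The slowest stage bounds the pipeline throughput.
bottleneck_speedup = max(unreplicated) // max(replicated)
```

Under these assumptions the slowest stage drops from 50176 to 3136 logical cycles, a factor of 16, and the ratio between the slowest and fastest stages shrinks from 256 to 16.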
D. Results and Analysis
To explore the effect of different pipelining schemes on the
performance, we use scenario (1) as the baseline and calculate
the speedup of each scenario by normalizing the throughput to
scenario (1). Fig. 5 shows the speedup in all four scenarios for
each VGG in all three different NoCs. The geometric-mean
speedup of (2) over (1) is 1.0309×, of (3) over (1) is
10.1788×, and of (4) over (1) is 13.6903×. Note that
the best pipelining setup in scenario (4) achieves a
speedup close to 16×. We don’t need to replicate the weights
in all layers 16 times to achieve this speedup. Instead, we
replicate the weights less as the layers become deeper
and the size of the OFM decreases, which gives a balanced pipeline
design. Note that the results in Fig. 5 are projected results,
which are not directly from garnet2.0.
Fig. 10. Latency Comparison.
Fig. 11. Reception Rate Comparison.
To explore the effect of different NoCs on the performance,
we use wormhole as the baseline and calculate the speedup
of all three NoCs (ideal, SMART, wormhole) by normalizing
the throughput to wormhole. Fig. 6 shows the speedup in all
three NoCs for each VGG in all four pipelining scenarios. The
geometric mean of ideal compared to wormhole is 1.0809×
and SMART compared to wormhole is 1.0965×. Note that
the SMART NoC achieves better speedup for more aggressive
pipelining because more aggressive pipelining puts heavier
traffic on the NoC, so the SMART NoC helps more while the
performance of the wormhole NoC degrades further. Note that
the results in Fig. 6 are projected results, which are not
directly from garnet2.0.
The best throughput is achieved when running VGG E. Fig.
8 shows the throughput and the corresponding frame rate of
the architecture when running VGG E in all combinations of
flow controls and pipelining scenarios. We also report the
energy efficiency of the architecture when processing each
VGG, as shown in Fig. 9. Note that weight replications,
batch pipelining, and different flow control algorithms don’t
affect energy efficiency much because the total amount of
energy consumed depends mostly on the number of operations
in the workload.
VII. EVALUATION OF SMART FLOW CONTROL USING
SYNTHETIC TRAFFIC
A. Simulation Setup
We use garnet2.0 to evaluate the performance of SMART
flow control compared to wormhole flow control using six
synthetic traffic patterns: uniform random, transpose, tornado,
shuffle, neighbor, and bit complement. In addition,
we set up the two flow controls using an 8 × 8 mesh and
the XY routing algorithm. For SMART flow control, we assume
HPCmax ≥ 14, where HPCmax is the maximum number of hops per
cycle, since the wire delay for a 1 mm² chip can
be covered within 1 clock cycle [7].
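One way to read the bound of 14 (our interpretation, not a statement from [7]) is that it equals the longest XY path in an 8 × 8 mesh, so any source-destination pair can in principle be covered by a single multi-hop traversal when there is no contention. The Python sketch below checks that path length; the function name is hypothetical.

```python
def max_xy_hops(cols, rows):
    """Largest hop count between any two routers of a cols x rows mesh under XY routing."""
    nodes = [(x, y) for x in range(cols) for y in range(rows)]
    return max(abs(ax - bx) + abs(ay - by)
               for ax, ay in nodes
               for bx, by in nodes)


longest_path = max_xy_hops(8, 8)  # 7 + 7 = 14
```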
B. Results and Analysis
Fig. 10 plots latency against injection rate for all
six synthetic traffic patterns. For uniform random, transpose,
tornado, shuffle, and bit complement, wormhole saturates when the
injection rate is around 0.05 while SMART saturates when the
injection rate is around 0.25. For neighbor, wormhole saturates
when the injection rate is around 0.2 while SMART saturates
when the injection rate is around 0.8. SMART therefore
sustains higher throughput than wormhole.
Fig. 11 plots reception rate against injection rate for all
six synthetic traffic patterns. For uniform random, transpose,
tornado, and shuffle, wormhole saturates when the reception rate is
around 0.07 while SMART saturates when the reception rate
is around 0.3. For neighbor, wormhole saturates when the
reception rate is around 0.2 while SMART saturates when the
reception rate is around 0.8. For bit complement, wormhole
saturates when the reception rate is around 0.04 while SMART
saturates when the reception rate is around 0.14. SMART
therefore reaches a higher reception rate than wormhole.
VIII. CONCLUSION
In this paper, we propose a ReRAM-based PIM architecture
for fast and efficient CNN inference. Our optimizations come
from three aspects. First, we design intra-layer pipelining,
inter-layer pipelining, and batch pipelining to exploit paral-
lelism in the architecture and increase overall throughput.
Second, we optimize the performance of the NoC using
SMART flow control. Third, we leverage weight replications
to maximize parallelism and further accelerate the architecture.
Our simulation shows that the different pipelining schemes and
weight replications achieve a speedup of 14× compared to the
baselines on the processing side. SMART flow control achieves a
speedup of 1.08× compared to the baselines on the interconnect
side. Our evaluation of SMART flow control using
synthetic traffic shows that SMART outperforms wormhole by
a wide margin under communication-heavy synthetic traffic.
Since the NoC represents only a small portion of the performance
and power of the proposed PIM architecture, SMART
flow control enhances the overall performance by a small
margin, while most of the speedup comes from the processing
side through efficient pipelining and weight
replications.
REFERENCES
[1] H.-S. P. Wong, H.-Y. Lee, S. Yu, Y.-S. Chen, Y. Wu, P.-S. Chen, B. Lee,
F. T. Chen, and M.-J. Tsai, “Metal-oxide RRAM,” Proceedings of the IEEE,
vol. 100, no. 6, pp. 1951-1970, 2012.
[2] G. W. Burr, R. M. Shelby, S. Sidler, C. Di Nolfo, J. Jang, I. Boybat,
R. S. Shenoy, P. Narayanan, K. Virwani, E. U. Giacometti, et al.,
“Experimental demonstration and tolerancing of a large-scale neural
network (165 000 synapses) using phase-change memory as the synaptic
weight element,” IEEE Transactions on Electron Devices, vol. 62, no.
11, pp. 3498-3507, 2015.
[3] A. F. Vincent, J. Larroque, W. Zhao, N. B. Romdhane, O. Bichler,
C. Gamrat, J.-O. Klein, S. Galdin-Retailleau, and D. Querlioz,
“Spin-transfer torque magnetic memory as a stochastic memristive synapse,” in
2014 IEEE International Symposium on Circuits and Systems (ISCAS),
IEEE, 2014, pp. 1074-1077.
[4] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S.
Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, K. Roy, et
al., “PUMA: A programmable ultra-efficient memristor-based accelerator
for machine learning inference,” in Proceedings of the Twenty-Fourth
International Conference on Architectural Support for Programming
Languages and Operating Systems, ACM, 2019, pp. 715-731.
[5] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P.
Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: A convolutional
neural network accelerator with in-situ analog arithmetic in crossbars,”
ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14-26,
2016.
[6] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y.
Xie, “PRIME: A novel processing-in-memory architecture for neural
network computation in ReRAM-based main memory,” in ACM SIGARCH
Computer Architecture News, IEEE Press, vol. 44, 2016, pp. 27-39.
[7] T. Krishna, C.-H. O. Chen, W. C. Kwon, and L.-S. Peh, “Breaking the
on-chip latency barrier using SMART,” in 2013 IEEE 19th International
Symposium on High Performance Computer Architecture (HPCA),
IEEE, 2013, pp. 378-389.
[8] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[9] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z.
Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “ImageNet large
scale visual recognition challenge,” International Journal of Computer
Vision, vol. 115, no. 3, pp. 211-252, 2015.
