Efficient Implementation of Multi-Channel Convolution in Monolithic 3D
  ReRAM Crossbar by Ko, Sho et al.
Sho Ko
School of ECE
Georgia Tech
Atlanta, GA, USA
sko.45@gatech.edu
Yun Joon Soh
Department of CSE
UC San Diego
La Jolla, CA, USA
yjsoh@eng.ucsd.edu
Jishen Zhao
Department of CSE
UC San Diego
La Jolla, CA, USA
jzhao@eng.ucsd.edu
Abstract—Convolutional neural networks (CNNs) demonstrate
promising accuracy in a wide range of applications. Among all
layers in CNNs, convolution layers are the most
computation-intensive and consume the most energy. With the
maturing of device and fabrication technology, 3D resistive
random access memory (ReRAM) has received substantial attention
for accelerating large vector-matrix multiplication and
convolution due to its high parallelism and energy efficiency.
However, implementing multi-channel convolution naively in 3D
ReRAM will either produce incorrect results or exploit only
part of the parallelism
of 3D ReRAM. In this paper, we propose a 3D ReRAM-
based convolution accelerator architecture, which efficiently maps
multi-channel convolution to monolithic 3D ReRAM. Our design
has two key principles. First, we exploit the intertwined structure
of 3D ReRAM to implement multi-channel convolution by using
a state-of-the-art convolution algorithm. Second, we propose
a new approach to efficiently implement negative weights by
separating them from non-negative weights using configurable
interconnects. Our evaluation demonstrates that our mapping
scheme in 16-layer 3D ReRAM achieves speedups of 5.79×,
927.81×, and 36.8× over a custom 2D ReRAM baseline, a
state-of-the-art CPU, and a state-of-the-art GPU, respectively.
Our design also reduces energy consumption by 2.12×, 1802.64×,
and 114.1× compared with the same three baselines, respectively.
Index Terms—convolutional neural network (CNN), 3D resis-
tive random access memory (ReRAM), mapping, accelerator.
I. INTRODUCTION
Deep learning algorithms are adopted in a wide range of
systems, whether small edge devices or large data centers [1].
Convolutional neural networks (CNNs) have revolutionized
deep learning applications by achieving unprecedented accu-
racy for object detection and image classification. However,
CNNs are time-consuming and power-hungry during computation.
For example, AlexNet [16] performs $10^9$ operations for a
single image input without batching [4].
Convolution layers are the most computation-demanding in
CNNs. It is estimated that the convolution layers of VGG-
16 [14] take 67.8% of the total execution time [19].
Recently, resistive random access memory (ReRAM) has become
an attractive technology for accelerating convolution layers,
due to its promising parallelism and energy efficiency.
ReRAM is a novel memory technology that consists of a crossbar
structure of memristors. It combines storage and computation
and accelerates deep neural networks in the analog domain.
Several ReRAM-based processing-in-memory (PIM) accelerators
have been proposed, such as PRIME [2], ISAAC [3], and
PipeLayer [4]. These architectures all focus on efficiently
architecting 2D ReRAM for CNN applications. Meanwhile,
monolithic 3D integration technology has grown rapidly.
Compared with 2D ReRAM, 3D ReRAM can provide more parallelism,
take less area, produce less noise, and consume less energy
in computations [7]. Monolithic 3D ReRAM can be either
vertically integrated or horizontally integrated. In our work,
we focus on mapping multi-channel convolution to horizontally
integrated monolithic 3D ReRAM, shown in Fig. 1, because it
can be more reliably fabricated [7].
Fig. 1. Horizontally integrated monolithic 3D ReRAM.
Nonetheless, leveraging 3D ReRAM for processing multi-
channel convolution in parallel still faces three challenges.
First, even though previous works have successfully designed
several accelerators which used 2D ReRAM to process multi-
channel convolution, simply extending 2D ReRAM to 3D
ReRAM without any modification will produce incorrect
results due to the stacked structure. Second, even if
multi-channel convolution can be correctly implemented, a naive
implementation will exploit only part of the parallelism of 3D
ReRAM. Third, kernels in multi-channel convolution, such as
edge detection filters, sometimes have negative weights. An
efficient way to implement negative weights in 3D ReRAM
is therefore necessary.
Our goal in this paper is to efficiently map multi-channel
convolution to horizontally integrated monolithic 3D ReRAM.
arXiv:2004.00243v1 [cs.AR] 1 Apr 2020
To achieve this goal, we propose a convolution accelerator
with two design principles. First, to address challenges 1
and 2, we exploit, for the first time, the massive parallelism
of 3D ReRAM to accelerate CNNs by using a recently proposed
algorithm to implement multi-channel convolution [9].
Second, to address challenge 3, we propose a new approach to
efficiently implement negative weights in 3D ReRAM using
configurable interconnects.
II. BACKGROUND
In this section, we describe ReRAM background and moti-
vate our design.
A. Convolutional Neural Networks
CNNs are the heart of current deep learning applications. A
typical CNN consists of multiple layers, such as convolution
layers, pooling layers, and fully-connected layers, as shown in
Fig. 2.
Fig. 2. Convolutional neural network.
B. Memristor and 2D ReRAM
2D ReRAM is a grid structure consisting of multiple
memristors, as shown in Fig. 3. Each ReRAM cell contains one
memristor. Such a design can exploit the analog characteristics
of ReRAM to perform fast and energy-efficient matrix
multiplication and convolution. Vector-matrix multiplication
can be easily calculated using ReRAM because of two basic
electrical laws, Ohm’s law and Kirchhoff’s current law.
Ohm’s law states that the current through a resistor is equal
to the voltage across the resistor divided by its resistance
(I = V/R), which is also equal to the voltage across the
resistor multiplied by its conductance (I = VG). This law
makes performing analog floating-point multiplication possible.
Kirchhoff’s current law states that the total current leaving
a node in the circuit is equal to the sum of all currents
entering it. This law makes performing analog floating-point
addition possible. Vector-matrix multiplication can be mapped
to ReRAM in the following three steps, as shown in Fig. 3.
First, the digital input is converted to analog signals by
digital-to-analog converters (DACs) and then mapped to the
voltages on the horizontal word lines (WLs). Second, the weight
matrix is quantized and then mapped to the conductances of the
memristors. Third, the output signals are read from the currents
on the vertical bit lines (BLs) and then converted to digital
output by analog-to-digital converters (ADCs).
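As an illustration of the three steps above, the following minimal Python sketch models an ideal crossbar read. It assumes ideal devices, with no DAC/ADC quantization, wire resistance, or noise. The function name crossbar_vmm and all numeric values are hypothetical and chosen only for illustration.

```python
def crossbar_vmm(voltages, conductances):
    """Ideal crossbar read (illustrative model, not a device simulation).

    voltages[i]        : voltage applied to word line (WL) i
    conductances[i][j] : conductance of the memristor at WL i / bit line (BL) j
    Returns the current collected on each BL.
    """
    n_bl = len(conductances[0])
    currents = [0.0] * n_bl
    for i, v in enumerate(voltages):
        for j in range(n_bl):
            # Ohm's law gives the cell current v * G;
            # Kirchhoff's current law sums the cell currents on BL j.
            currents[j] += v * conductances[i][j]
    return currents


# Hypothetical 3-WL by 2-BL crossbar.
v = [1.0, 2.0, 3.0]
g = [[1.0, 0.0],
     [0.5, 1.0],
     [0.0, 2.0]]
print(crossbar_vmm(v, g))  # [2.0, 8.0]
```

The result is the vector-matrix product of the input voltages and the conductance matrix, which is what the DAC, crossbar, and ADC chain computes in the analog domain.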
C. Monolithic 3D ReRAM
Monolithic 3D ReRAM integrates ReRAM cells in either a
vertical or a horizontal manner. For example, B. Chakrabarti
et al. developed an 8-layer vertically integrated monolithic
3D ReRAM [6], while M. Mao et al. designed a horizontally
integrated monolithic 3D ReRAM [7]. Horizontally integrated
monolithic 3D ReRAM has more reliable manufacturing tech-
nology [7]. An example of horizontal 3D ReRAM is shown
in Fig. 1. Its intertwined structure ensures that WLs and BLs
between adjacent layers are shared. The 4-layer 3D ReRAM
has three voltage planes with three WLs on each plane. It also
has two current planes with three BLs on each plane. Different
layers of memristors have different colors.
Such a 3D structure has several advantages compared with
the 2D structure. First, 3D ReRAM occupies less area than the
2D version for the same number of memristors. Second, 3D
ReRAM has shorter WLs and BLs, which reduces the parasitic
resistance that may introduce unnecessary noise in the circuit
and compromise the output integrity [5]. Third, shared WLs and
BLs between adjacent layers in 3D ReRAM lead to better
utilization of peripheral circuits, saving space and energy [8].
Finally, shared BLs connect two adjacent layers of memristors.
According to Kirchhoff’s current law, the current on the BL is
equal to the sum of the current from the two adjacent layers.
For the structure in Fig. 1, it is represented as
$I = V_{\text{above}} G_{\text{above}} + V_{\text{below}} G_{\text{below}}$ (1)
This property is helpful for mapping convolution to 3D ReRAM
and maximizing computational parallelism.
Motivated by Y. Huang et al., who leveraged the massive
parallelism of monolithic 3D ReRAM for graph processing
algorithms [8], we design a convolution accelerator which uses
the same structure for multi-channel convolution.
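The property in equation (1) can be sketched for a BL shared by several cells per layer. The snippet below is a minimal illustration under the same ideal-device assumption; the function name shared_bl_current and the values are hypothetical.

```python
def shared_bl_current(v_above, g_above, v_below, g_below):
    """Current on one BL shared by two adjacent memristor layers.

    Each argument lists one value per WL in the layer above or below the BL.
    Generalizes equation (1) from one cell per layer to several cells per layer.
    """
    above = sum(v * g for v, g in zip(v_above, g_above))
    below = sum(v * g for v, g in zip(v_below, g_below))
    # Kirchhoff's current law: contributions of both layers add on the shared BL.
    return above + below


# Single cell per layer: I = V_above * G_above + V_below * G_below.
print(shared_bl_current([1.0], [2.0], [3.0], [4.0]))  # 14.0
```

With one cell per layer the function reduces exactly to equation (1).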
III. PROPOSED APPROACH
In order to exploit the massive parallelism provided
by monolithic 3D ReRAM to implement the computation-
intensive multi-channel convolution layers in CNNs, we pro-
pose a convolution accelerator, which efficiently maps multi-
channel convolution to monolithic 3D ReRAM.
A. Architecture Design
We employ an optimized architecture for accelerating
CNNs, as shown in Fig. 4. The architecture is composed
of multiple tiles of ReRAM cells connected by an on-chip
mesh [3]. Each tile has an eDRAM buffer, a shared bus, a
controller, and multiple ReRAM-based processing engines.
We substitute the conventional 2D ReRAM crossbar with a
monolithic 3D ReRAM crossbar. Each processing engine
communicates with the buffer via the shared bus. The controller
maps multi-channel convolution to the processing engines and
helps configure interconnects.
Fig. 3. 2D ReRAM crossbar for vector-matrix multiplication.
Fig. 4. Overall architecture.
B. Convolution Algorithm
Common 2D convolution algorithms include single kernel
single channel (SKSC), single kernel multiple channel
(SKMC), and multiple kernel multiple channel (MKMC).
MKMC is widely used in many CNN architectures. To present
how MKMC without batching works, we define the image as $I$
and the kernel as $K$. $I$ is a 3D matrix with dimensions
$c$ (channel) $\times$ $h$ (height) $\times$ $w$ (width).
$K$ is a 4D matrix with dimensions $n$ (kernel) $\times$
$c$ (channel) $\times$ $l$ (length) $\times$ $l$ (length).
SKSC is a simple convolution between one channel of the
image and one channel of one kernel, which is defined as
$\mathrm{SKSC}(I_i, K_{j,i}) = \mathrm{conv}(I_i, K_{j,i})$ (2)
where $i \in [0, c)$ and $j \in [0, n)$. SKMC is calculated by
summing the result of SKSC of every corresponding channel
of the image and one kernel, which is defined as
$\mathrm{SKMC}(I, K_j) = \sum_{i=0}^{c-1} \mathrm{conv}(I_i, K_{j,i})$ (3)
where $j \in [0, n)$. MKMC is computed by concatenating the
result of SKMC of the image and every kernel, which is
defined as
$\mathrm{MKMC}(I, K) = \mathrm{SKMC}(I, K_0) \mid \cdots \mid \mathrm{SKMC}(I, K_{n-1})$ (4)
where $\mid$ represents concatenation.
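Equations (2) to (4) can be written out directly. The sketch below is one possible reading, not the implementation used in the paper. It assumes that conv is the CNN-style sliding dot product (no kernel flip) with stride 1 and zero padding, so that each output keeps the h × w size, and that l is odd. The helper name conv2d_same is hypothetical.

```python
def conv2d_same(channel, kernel):
    """Assumed conv(): stride 1, zero padding, output has the input's h x w size."""
    h, w, l = len(channel), len(channel[0]), len(kernel)
    p = l // 2  # padding on each side (l assumed odd)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in range(l):
                for dx in range(l):
                    yy, xx = y + dy - p, x + dx - p
                    if 0 <= yy < h and 0 <= xx < w:  # zero outside the image
                        out[y][x] += channel[yy][xx] * kernel[dy][dx]
    return out


def sksc(image, kernels, i, j):
    """Equation (2): channel i of the image with channel i of kernel j."""
    return conv2d_same(image[i], kernels[j][i])


def skmc(image, kernels, j):
    """Equation (3): sum of SKSC over all c channels for kernel j."""
    h, w = len(image[0]), len(image[0][0])
    total = [[0.0] * w for _ in range(h)]
    for i in range(len(image)):
        part = sksc(image, kernels, i, j)
        for y in range(h):
            for x in range(w):
                total[y][x] += part[y][x]
    return total


def mkmc(image, kernels):
    """Equation (4): concatenate the SKMC results of all n kernels."""
    return [skmc(image, kernels, j) for j in range(len(kernels))]


# Hypothetical c = 2, h = w = 2 image and n = 1 kernel with l = 1.
image = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
kernels = [[[[2.0]], [[3.0]]]]
print(mkmc(image, kernels))  # [[[5.0, 7.0], [9.0, 11.0]]]
```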
Traditionally, MKMC is calculated by unrolling each kernel
into a row vector in the kernel matrix and corresponding
image pixels to a column vector in the image matrix. Then
multiplying the two matrices gives the result. However, this
approach is not suitable for mapping to 3D ReRAM because
the property in equation (1) cannot be easily applied. Recently,
new approaches to compute MKMC have been proposed [9].
One of them is to compute the $l \times l$ convolution using
$l^2$ different $1 \times 1$ convolutions. It unrolls the
corresponding $1 \times 1$ weights in all channels within one
kernel into a row vector in the kernel matrix and the
corresponding $1 \times 1$ pixels in all channels within the
image into a column vector in the image matrix. After
multiplying the two matrices, there are $l^2$ submatrices of
size $n \times (h \times w)$. We need to superimpose the
submatrices into one matrix and reshape it to be
$h \times w \times n$, as shown in Fig. 5. This algorithm is
suitable for mapping to 3D ReRAM. In particular, the
superimposition step can be efficiently implemented using the
property in equation (1).
Fig. 5. Computation process of MKMC [9].
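The decomposition into 1 × 1 convolutions can be sketched in software as follows. This is a minimal illustration of the idea described above, not the code of [9]. It assumes stride 1, zero padding, an odd kernel length l, and no kernel flip; the function name mkmc_by_1x1 is hypothetical.

```python
def mkmc_by_1x1(image, kernels):
    """MKMC as l*l separate 1x1 convolutions followed by shifted superimposition."""
    c, h, w = len(image), len(image[0]), len(image[0][0])
    n, l = len(kernels), len(kernels[0][0])
    p = l // 2
    # Image matrix: c rows, one column per pixel (h*w columns, row-major).
    x_mat = [[image[i][y][x] for y in range(h) for x in range(w)]
             for i in range(c)]
    out = [[[0.0] * w for _ in range(h)] for _ in range(n)]
    for dy in range(l):
        for dx in range(l):
            # Kernel matrix for this kernel position: n rows, c columns.
            w_mat = [[kernels[j][i][dy][dx] for i in range(c)]
                     for j in range(n)]
            # One 1x1 convolution: (n x c) times (c x h*w) gives an n x (h*w) submatrix.
            sub = [[sum(w_mat[j][i] * x_mat[i][q] for i in range(c))
                    for q in range(h * w)] for j in range(n)]
            # Superimpose the submatrix, shifted by the kernel position.
            for j in range(n):
                for y in range(h):
                    for x in range(w):
                        yy, xx = y + dy - p, x + dx - p
                        if 0 <= yy < h and 0 <= xx < w:
                            out[j][y][x] += sub[j][yy * w + xx]
    return out


# Hypothetical example: c = 2, n = 1, l = 3.
image = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
right = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]   # picks the right neighbor
center = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]  # doubles the center pixel
print(mkmc_by_1x1(image, [[right, center]]))  # [[[22.0, 40.0], [64.0, 80.0]]]
```

Each pass of the two outer loops corresponds to one of the l^2 submatrices. In software the superimposition is an explicit accumulation, which is the step that shared BLs perform by current summation.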
C. Efficient Mapping of Convolution to 3D ReRAM
We design an efficient mapping of the algorithm to 3D
ReRAM. We start with mapping the $n \times c \times l \times l$
kernel to 3D ReRAM. We employ $c$ WLs in each voltage plane
and $n$ BLs in each current plane. Since we use 3D ReRAM with
shared WLs and BLs, the number of layers has to be an even
number for reconfiguration. If $l^2$ is an even number, we use
$l^2$ layers of memristors, $\frac{l^2}{2} + 1$ voltage planes,
and $\frac{l^2}{2}$ current planes. If $l^2$ is an odd number,
we use $l^2 + 1$ layers of memristors, $\frac{l^2+1}{2} + 1$
voltage planes, and $\frac{l^2+1}{2}$ current planes. Note that
when $l^2$ is odd, one layer of memristors is not in use
(dummy layer). To ensure the correct output current, we need to
either set the conductance of its memristors close to zero
(i.e., a very high resistance) or set the voltage on the
relevant WLs to zero. For each voltage plane, the $c$ WLs
correspond to one column of the image matrix. WLs from
different voltage planes but on the same vertical plane have
the same voltage to maximize the parallelism of the 3D
structure. One column of the image matrix can be fed into 3D
ReRAM in one logical cycle. It takes $h \times w$ logical
cycles to pass the $c \times h \times w$ image into 3D ReRAM.
For each current plane, the $n$ BLs correspond to the $n$
kernels in the output. BLs from different current planes but
on the same vertical plane are accumulated simultaneously to
implement the superimposition. 3D ReRAM produces $n$ sums in
one logical cycle. After $h \times w$ logical cycles, it will
have produced the $n \times h \times w$ output.
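The layer and plane counts above follow directly from the parity of l^2. The short sketch below computes them for a given kernel length; the function names plane_counts and logical_cycles are hypothetical.

```python
def plane_counts(l):
    """Layers and planes needed to map an l x l kernel, per the rules above."""
    weights = l * l
    # The number of memristor layers must be even; pad with one dummy layer if odd.
    layers = weights if weights % 2 == 0 else weights + 1
    return {
        "layers": layers,
        "voltage_planes": layers // 2 + 1,
        "current_planes": layers // 2,
        "dummy_layers": layers - weights,
    }


def logical_cycles(h, w):
    """One column of the image matrix (one pixel position) is fed per cycle."""
    return h * w


print(plane_counts(3))
# {'layers': 10, 'voltage_planes': 6, 'current_planes': 5, 'dummy_layers': 1}
print(plane_counts(2))
# {'layers': 4, 'voltage_planes': 3, 'current_planes': 2, 'dummy_layers': 0}
```

The l = 2 case reproduces the 4-layer structure of Fig. 1, with three voltage planes and two current planes.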
In addition, we present a new approach to implement
negative weights using configurable interconnects. The full
mapping scheme is summarized in the flow diagram in Fig. 6.
First, we scan each of the n kernels and count the number of
negative weights and non-negative weights. For each kernel,
one voltage plane (the voltage separation plane) separates
negative weights from non-negative weights. Second, negative
weights are mapped to the lower layers below the voltage
separation plane, and non-negative weights are mapped to the
upper layers above it. Third, interconnects are configured for
negative weights and non-negative weights accordingly. Fourth,
the currents for negative weights and non-negative weights are
accumulated separately and then fed into peripheral circuits,
which read the difference between the two accumulated currents.
Fig. 6. Flow diagram of the full mapping scheme.
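The four steps can be sketched numerically for one channel of one kernel. The snippet assumes that negative weights are stored as their magnitudes, since conductance cannot be negative, and that one weight occupies one layer. The function names and the Laplacian-style kernel are hypothetical and are not taken from Fig. 7.

```python
def map_kernel_by_sign(weights):
    """Steps 1 and 2: split weight positions by sign.

    Returns (lower, upper): positions mapped below and above the
    voltage separation plane, respectively.
    """
    lower = [k for k, wt in enumerate(weights) if wt < 0]
    upper = [k for k, wt in enumerate(weights) if wt >= 0]
    return lower, upper


def read_difference(inputs, weights):
    """Steps 3 and 4: accumulate both groups separately, then subtract."""
    lower, upper = map_kernel_by_sign(weights)
    i_n = sum(inputs[k] * -weights[k] for k in lower)  # magnitudes of negative weights
    i_p = sum(inputs[k] * weights[k] for k in upper)
    return i_p - i_n


# Hypothetical 3x3 kernel, flattened row by row.
kernel = [0.0, -1.0, 0.0, -1.0, 4.0, -1.0, 0.0, -1.0, 0.0]
lower, upper = map_kernel_by_sign(kernel)
print(len(lower), len(upper))  # 4 5

center_only = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(read_difference(center_only, kernel))  # 4.0
```

The difference of the two accumulated values equals the signed dot product of inputs and weights, which is why the separation does not change the convolution result.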
D. Putting It All Together: An Example
We demonstrate our approach using an example of applying
an edge detection filter to an image with three channels. The
filter has two kernels each with three channels of the same
value, as shown in Fig. 7(a)-(b). We use a 10-layer 3D ReRAM
(0 to 9) with six voltage planes (0 to 5) and five current planes
(0 to 4). For kernel 0, we set the voltage on voltage plane 5 to
zero because we do not use memristors in layer 9. We use four
layers (0 to 3) for negative weights and five layers (4 to 8) for
non-negative weights. The separation plane is voltage plane
2. After configuring interconnects, we accumulate two current
planes (0 to 1) for negative weights as In and three current
planes (2 to 4) for non-negative weights as Ip, as shown in
Fig. 7(c). For kernel 1, we set the voltage on voltage plane 0
to zero because we do not use memristors in layer 0. We use
one layer (1) for negative weights and eight layers (2 to 9)
for non-negative weights. The separation plane is voltage plane
1. After configuring interconnects, we accumulate one current
plane (0) for negative weights as In and four current planes
(1 to 4) for non-negative weights as Ip, as shown in Fig. 7(d).
In order to read the current difference, we slightly modify
the typical inverting operational amplifier circuit, as shown in
Fig. 7(e). Measuring the output current I2 gives the difference
between Ip and In.
We prove the correctness of the circuit in Fig. 7(e). Since
the current into the negative input of the op amp is zero,
Kirchhoff’s current law gives I0 = In. Then Ohm’s law states
V0 = InR0. Using Kirchhoff’s voltage law, V1 = −InR0
holds, and then I1 = −In is true according to Ohm’s law
again. Finally, we reach the conclusion that I2 = Ip−In after
applying Kirchhoff’s current law again.
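The chain of steps in the proof can be typeset as follows. The derivation assumes an ideal op amp whose inverting input sits at virtual ground, and, as the step from the voltage V1 to the current I1 implies, a second resistor equal in value to R0. Both assumptions are inferred from the text rather than read off Fig. 7(e).

```latex
\begin{align*}
I_0 &= I_n
    && \text{(KCL at the inverting input; no current enters the op amp)} \\
V_0 &= I_n R_0
    && \text{(Ohm's law across } R_0\text{)} \\
V_1 &= -V_0 = -I_n R_0
    && \text{(KVL, inverting input assumed at virtual ground)} \\
I_1 &= \frac{V_1}{R_0} = -I_n
    && \text{(Ohm's law, second resistor assumed equal to } R_0\text{)} \\
I_2 &= I_p + I_1 = I_p - I_n
    && \text{(KCL at the output node)}
\end{align*}
```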
IV. EVALUATION
In this section, we evaluate the performance and energy
consumption of our proposed design.
We first compare the performance of 2D ReRAM with other
popular memory technologies such as SRAM, eDRAM, PCM,
and STT-RAM. We use the DESTINY tool [10] to simulate 1 GB
of each type of memory using 32 nm technology. Table I shows
the parameters of each memory type. ReRAM has lower
read/write energy and read/write latency than eDRAM and SRAM.
In addition, compared with STT-RAM, ReRAM has lower
read/write energy and read latency at the expense of higher
write latency.
Fig. 7. (a) One channel of kernel 0. (b) One channel of kernel 1. (c)
Interconnects configuration for kernel 0. (d) Interconnects configuration for
kernel 1. (e) Circuit diagram of inverting operational amplifier.
We then explore monolithic 3D ReRAM by evaluating
the relationship between the number of layers in 3D ReRAM
and its performance. Again, we use the DESTINY tool [10] to
simulate 1 GB of 3D ReRAM using 32 nm technology. We
use the read/write latency and energy of 2-layer 3D ReRAM
as the baseline and normalize 3D ReRAM with more layers to
it, as shown in Fig. 8. We observe that for the same memory
capacity, as the number of layers increases, read/write energy
and read/write latency also increase. In our experiment, we
use profiling results to choose the number of layers in 3D
ReRAM, balancing greater parallelism against higher
read/write latency and energy.
TABLE I: Parameters of several memory types.
Memory     Write Energy (nJ)   Read Energy (nJ)   Write Latency (ns)   Read Latency (ns)
ReRAM      1.907               1.623              15.274               13.948
eDRAM      3.407               3.324              34.207               66.661
SRAM       6.687               6.688              144.556              279.546
STT-RAM    2.102               1.975              13.469               18.06
A. Experiment Setup
Configuration and Simulation. In our experiment, we use
3D ReRAM with 16 layers for two reasons. First, 16 layers
are enough to handle a typical kernel size of 3×3. Second, it
provides the optimal latency based on the extended report of
DESTINY [12]. Supporting larger kernels, such as 5×5, comes
at the cost of parallelism, because the computation must be
repeated. With a smaller number of layers, such as 10 or 12,
we would have to repeat the computation more than twice to
support the larger kernel. For ReRAM crossbars,
we use DESTINY [10] to measure the execution time and
energy consumption. For interconnects, we model with CACTI
6.5 [11] at 32 nm. For DACs and ADCs, we use the results
from B. Murmann’s “ADC Performance Survey” [13].
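The trade-off between layer count and kernel size can be made concrete with a short calculation. The sketch assumes that each kernel weight position occupies one memristor layer per pass, which is our reading of the mapping in Section III-C. The function name passes_needed is hypothetical.

```python
def passes_needed(kernel_length, layers):
    """Number of passes to cover all l*l weight positions with the given layers."""
    weights = kernel_length * kernel_length
    return -(-weights // layers)  # ceiling division with integers


print(passes_needed(3, 16))  # 1
print(passes_needed(5, 16))  # 2
print(passes_needed(5, 12))  # 3
print(passes_needed(5, 10))  # 3
```

Under this assumption, a 3×3 kernel fits in a single pass on 16 layers, a 5×5 kernel needs two passes, and 10 or 12 layers would require three.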
Workload and Baseline. We benchmark several selected
MKMC layers from the inference phase of three popular
CNN architectures: VGG-16 [14], GoogLeNet [15], and
AlexNet [16]. All three architectures are trained in the
TensorFlow framework [18] and evaluated on the widely used
ImageNet database [20]. For our baseline, we do not use
experimental results from previous works for two reasons.
First, it is unfair to compare performance across designs with
different focuses. Second, it is difficult to obtain all the
detailed design parameters from previous papers.
Instead, we compare the execution time and energy
consumption of our design with a custom 2D ReRAM baseline,
a CPU platform, and a GPU platform. We implement MKMC
using the algorithm in [9] for the CPU and GPU platforms. For
the custom 2D ReRAM baseline, we assume 2D ReRAM
crossbars in the same architecture with the same number of
memristors as our proposed 3D ReRAM design for a fair
comparison. For the CPU platform, we choose an Intel Core
i7-5700HQ processor, which has 4 cores and 6 MB of cache and
operates at around 2.7 GHz. For the GPU platform, we choose an
NVIDIA GeForce GTX 1080 Ti, which has 3584 CUDA cores and
11 GB of GDDR5X graphics memory and operates at around 1582
MHz. The CPU and GPU execution time is measured within
the framework. The CPU energy consumption is estimated
from the Intel Product Specifications [17], and the GPU energy
consumption is estimated with the NVIDIA System Management
Interface (nvidia-smi).
Fig. 8. Normalized read/write latency and energy for monolithic 3D
ReRAM with different numbers of layers. (Plot: x-axis, # of layers in
3D ReRAM; curves for Write Energy, Read Energy, Write Latency, and
Read Latency.)
B. Performance Results
Fig. 9(a) compares the performance of 3D ReRAM with
a custom 2D ReRAM baseline, a CPU platform, and a GPU
platform. We use the CPU performance as the baseline and
normalize GPU, 2D ReRAM, and 3D ReRAM to it. The
speedups of 3D ReRAM over 2D ReRAM, CPU, and GPU are
5.79×, 927.81×, and 36.8×, respectively. 3D ReRAM achieves
the same inference accuracy as our baseline.
Although 3D ReRAM has slightly larger read/write latency
than 2D ReRAM, the massive parallelism that 3D ReRAM
provides compensates for this disadvantage and computes
multi-channel convolution faster. In addition, 2D ReRAM does
not have shared WLs and BLs, resulting in more complex
interconnects and longer computation time. 3D ReRAM also
achieves significant speedup compared with CPU and GPU
because convolution layers require intensive computation and
easily tie up digital processors with constant memory access
and data movement.
C. Energy Results
Fig. 9(b) compares the energy consumption of 3D ReRAM
with a custom 2D ReRAM baseline, a CPU platform, and a
GPU platform. We use the CPU energy consumption as the
baseline and normalize GPU, 2D ReRAM, and 3D ReRAM
to it. The energy savings of 3D ReRAM compared with 2D
ReRAM, CPU, and GPU are 2.12×, 1802.64×, and 114.1×,
respectively. 3D ReRAM consumes less energy than 2D ReRAM
because shared WLs and BLs reduce the number of
digital-to-analog and analog-to-digital conversions by roughly
half. It also benefits from less complex interconnects. 3D
ReRAM achieves a large energy reduction compared with CPU and
GPU for two reasons. First, 3D ReRAM uses analog
properties to compute vector-matrix multiplication, which is
more energy efficient than most digital computations. Second,
3D ReRAM passes data through stacked layers, a shorter path
than the data movement between processing units and the
memory hierarchy in most digital processors.
V. RELATED WORKS
Recently, several architectures have been proposed to
use ReRAM to accelerate CNN applications. PRIME [2],
ISAAC [3], and PipeLayer [4] demonstrate the promising
performance gain when off-loading the CNN computation to
2D ReRAM crossbar. Our work differs from these designs in two
respects. First, 2D ReRAM has limited parallelism in
computation, while our work extends the structure to 3D with
shared WLs and BLs to fully exploit its computational
capability to process multi-channel convolution layers. Because
our design focus is different from previous works, we
cannot use their works as our baseline. Instead, we compare
our design with a custom 2D ReRAM baseline and
state-of-the-art CPU and GPU. Second, kernel mapping is not
efficiently addressed in PRIME, while we present a more
efficient mapping based on a recently proposed approach to
compute MKMC [9].
VI. CONCLUSION
This paper presents a convolution accelerator that
efficiently maps multiple kernel multiple channel convolution
to monolithic 3D ReRAM. By using a recently proposed
algorithm, we take advantage, for the first time, of the
property in equation (1) and maximize the parallelism of 3D
ReRAM to improve the performance and energy efficiency of
convolution layers in convolutional neural networks. To
implement negative weights, we present a new approach that
accumulates negative weights and non-negative weights
separately using configurable interconnects and calculates the
final results using peripheral circuits. Our experiment
demonstrates that the proposed mapping achieves speedups of
5.79×, 927.81×, and 36.8× over a custom 2D ReRAM baseline, a
state-of-the-art CPU, and a state-of-the-art GPU, respectively.
Our design also reduces energy consumption by 2.12×, 1802.64×,
and 114.1× compared with the same three baselines, respectively.
Fig. 9. (a) Normalized 3D ReRAM speedup against 2D ReRAM, CPU, and GPU. (b) Normalized 3D ReRAM energy saving against 2D ReRAM, CPU, and
GPU.
REFERENCES
[1] T. Chen et al., “DianNao: A Small-Footprint High-Throughput Ac-
celerator for Ubiquitous Machine-Learning,” Architectural Support for
Programming Languages and Operating Systems, vol. 49, no. 4, pp.
269-284, 2014.
[2] P. Chi et al., “PRIME: A Novel Processing-in-memory Architecture
for Neural Network Computation in ReRAM-based Main Memory,”
International Symposium on Computer Architecture, vol. 44, no. 3, pp.
27-39, 2016.
[3] A. Shafiee et al., “ISAAC: A Convolutional Neural Network Accelerator
with In-Situ Analog Arithmetic in Crossbars,” International Symposium
on Computer Architecture, vol. 44, no. 3, pp. 14-26, 2016.
[4] L. Song, X. Qian, H. Li, and Y. Chen, “PipeLayer: A Pipelined ReRAM-
Based Accelerator for Deep Learning,” International Symposium on
High Performance Computer Architecture, pp. 541-552, 2017.
[5] L. Xia et al., “Technological Exploration of RRAM Crossbar Array
for Matrix-Vector Multiplication,” Journal of Computer Science and
Technology, vol. 31, no. 1, pp. 3-19, 2016.
[6] B. Chakrabarti et al., “A multiply-add engine with monolithically inte-
grated 3D memristor crossbar/CMOS hybrid circuit,” Scientific reports
7, pp. 42429, 2017.
[7] M. Mao, S. Yu, and C. Chakrabarti, “Design and Analysis of Energy-
Efficient and Reliable 3-D ReRAM Cross-Point Array System,” IEEE
Transactions on Very Large Scale Integration Systems, vol. 26, no. 7,
pp. 1290-1300, 2018.
[8] Y. Huang et al., “RAGra: Leveraging Monolithic 3D ReRAM for
Massively-Parallel Graph Processing,” Design, Automation & Test in
Europe Conference & Exhibition, pp. 1273-1276, 2019.
[9] A. Anderson, A. Vasudevan, C. Keane, and D. Gregg, “Low-memory
GEMM-based convolution algorithms for deep neural networks,” CoRR,
abs/1709.03395, 2017.
[10] M. Poremba, S. Mittal, D. Li, J.S. Vetter, and Y. Xie, “DESTINY: A
Tool for Modeling Emerging 3D NVM and eDRAM caches,” Design,
Automation & Test in Europe Conference & Exhibition, pp. 1543-1546,
2015.
[11] “CACTI,” http://www.hpl.hp.com/research/cacti/.
[12] S. Mittal, M. Poremba, J.S. Vetter, and Y. Xie, “Exploring Design Space
of 3D NVM and eDRAM Caches Using DESTINY Tool,” Oak Ridge
National Laboratory, USA, Tech. Rep. ORNL/TM-2014/636, 2014.
[13] B. Murmann, “ADC Performance Survey 1997-2016,”
http://web.stanford.edu/~murmann/adcsurvey.html.
[14] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks
for Large-Scale Image Recognition,” CoRR, vol. abs/1409.1556, 2014.
[15] C. Szegedy et al., “Going Deeper with Convolutions,” Computer Vision
and Pattern Recognition, pp. 1-9, 2015.
[16] A. Krizhevsky, I. Sutskever, and G.E. Hinton, “ImageNet Classification
with Deep Convolutional Neural Networks,” Advances in Neural
Information Processing Systems, pp. 1097-1105, 2012.
[17] “Intel Core i7-5700HQ Processor,”
https://ark.intel.com/content/www/us/en/ark/products/87716/
intel-core-i7-5700hq-processor-6m-cache-up-to-3-50-ghz.html.
[18] M. Abadi et al., “Tensorflow: A System for Large-Scale Machine
Learning,” USENIX Symposium on Operating Systems Design and
Implementation, vol. 16, pp. 265-283, 2016.
[19] R. Girshick, “Fast R-CNN,” International Conference on Computer
Vision, pp. 1440-1448, 2015.
[20] O. Russakovsky et al., “ImageNet Large Scale Visual Recognition
Challenge,” International Journal of Computer Vision, vol. 115, no. 3,
pp. 211-252, 2015.
