RED: A ReRAM-based Deconvolution Accelerator by Fan, Zichen et al.
Zichen Fan∗§, Ziru Li∗§, Bing Li∗†‡, Yiran Chen†, Hai (Helen) Li†
§ECE Dept., Tsinghua University, Beijing, China
†ECE Dept., Duke University, Durham, NC, USA ‡Army Research Office, Research Triangle Park, USA
§{fanzc15, lizr15}@mails.tsinghua.edu.cn †{bing.li.ece, yiran.chen, hai.li}@duke.edu
Abstract—Deconvolution has been widespread in neural net-
works. For example, it is essential for performing unsupervised
learning in generative adversarial networks or constructing
fully convolutional networks for semantic segmentation. Resistive
RAM (ReRAM)-based processing-in-memory architecture has
been widely explored in accelerating convolutional computation
and demonstrates good performance. Performing deconvolution
on existing ReRAM-based accelerator designs, however, suffers
from long latency and high energy consumption because deconvo-
lutional computation includes not only convolution but also extra
add-on operations. To execute deconvolution more efficiently,
we analyze its computation requirements and propose
a ReRAM-based accelerator design, namely RED. More specifically,
RED integrates two orthogonal methods: the pixel-wise mapping
scheme, which reduces the redundancy caused by zero-inserting
operations, and the zero-skipping data flow, which increases the
computation parallelism and therefore improves performance.
Experimental evaluations show that, compared to the state-of-the-art
ReRAM-based accelerator, RED speeds up operation by
3.69∼31.15× and reduces energy consumption by 8%∼88.36%.
I. INTRODUCTION
Generative adversarial networks (GANs) and fully convo-
lutional networks (FCNs) have been widely explored for their
superior performance in processing complicated image tasks.
For example, GANs are used to reconstruct 3D models from
2D images [1] and recover corrupted images [2]. FCNs are
applied in semantic segmentation [3] and object detection [4].
Deconvolution layers are important for these networks
to carry out the up-sampling from low-resolution to high-resolution
images. As most current platforms are mainly
optimized for regular convolution, deconvolutional
computation suffers from low efficiency due to the additional
operations involved. Designing an efficient accelerator that caters
to deconvolutional computation is therefore of considerable significance
and is the focus of this work.
Among the existing neural network accelerators, processing-in-memory
(PIM), which moves the computation close to
and even within the memory elements, demonstrates great
potential. Resistive RAM (ReRAM) has been taken as a
competitive technology for PIM implementation due to the
low-energy and highly efficient vector-matrix multiplications
performed on its crossbar structure. Various ReRAM-based
accelerators [5–10] have been presented for fast and efficient
convolution, showing great advantages of ReRAM over the
CMOS-based counterparts.
∗These three authors contributed equally to this work.
‡This author is supported by the NRC Associate Fellowship Award and is
the corresponding author. bing.li.ece@duke.edu
Unfortunately, the unique computation patterns of decon-
volution make its implementation on existing ReRAM-based
accelerators very challenging. For example, the common zero-
padding in deconvolution inserts plenty of zeros into input
feature maps before convolution, resulting in massive re-
dundant operations. The padding-free deconvolution excludes
the zero-inserting but involves extra operations, i.e., addition
and cropping after convolution. Though padding-free deconvolution is
friendlier to CMOS-based accelerators [11], its add-on
operations require modified circuits on ReRAM-based
accelerators.
This work aims to develop an efficient ReRAM-based
deconvolution accelerator. We first analyze the
efficiency of the zero-padding and padding-free deconvolution
algorithms on existing ReRAM-based platforms. Considering
the inefficiency of zero-padding and the high overhead
induced by padding-free, we propose RED, a ReRAM-based
accelerator tailored for deconvolutional computation.
Our approach integrates optimizations of the data mapping
and the data flow. More specifically, the pixel-wise mapping
dramatically reduces the redundancy caused by zero-inserting
operations, and the zero-skipping data flow further raises the
computation parallelism without add-on periphery circuits.
We evaluated the power, latency, and area of RED when
performing the deconvolutional layers in GANs and FCNs
and compared it with state-of-the-art ReRAM-based accelerators
using the zero-padding design and the padding-free design [12].
Experimental results show that RED achieves 3.69∼31.15× speedup
and 8%∼88.36% energy consumption reduction, with a 22.14%
increase in design area.
This paper is organized as follows. Section II introduces the
background, including ReRAM-based accelerators
and deconvolutional computation. Section III elaborates
the principle and implementation of RED. In Section IV, we
evaluate RED in terms of power, latency, and area and compare
it with state-of-the-art ReRAM-based counterparts. Finally,
we conclude the paper in Section V.
II. PRELIMINARY
A. ReRAM-based CNN Accelerator
As an emerging memory technology, the ReRAM crossbar
structure can effectively execute vector-matrix multiplication
operations and has therefore gained significant attention. As
illustrated in Fig. 1(a), the elements of a weight matrix are
represented as the conductance of ReRAM cells located at the
cross-points of the wordlines and bitlines. During operation,
an input vector in the form of voltage spikes enters the crossbar
along the wordlines (rows), and the currents flowing out
of the bitlines (columns) denote the computed output
vector of the vector-matrix multiplication. The integrated &
fire circuit converts the output currents to digital output
data, which are then summed by the shift adder.
Fig. 1. ReRAM crossbar, kernel mapping and ReRAM-based PIM: (a) ReRAM crossbar; (b) kernel mapping; (c) ReRAM-based PIM core.
arXiv:1907.02987v1 [cs.ET] 5 Jul 2019
By leveraging the ReRAM structure, various CNN accelerators
have been proposed for inference or training [5, 8].
The kernel in a convolution layer is a tensor with 4 dimensions,
and its mapping onto a crossbar requires a complicated
design [8, 9]. Fig. 1(b) illustrates an example of the kernel
mapping design in a ReRAM-based accelerator. Each filter,
spanning C channels, is flattened into a one-dimensional vector and stored
in one column.
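The column-wise kernel mapping and the crossbar multiplication can be sketched in a few lines. This is a minimal, idealized illustration: the function names, the row ordering (kernel row, then kernel column, then channel) and the toy numbers are assumptions made here, and analog effects, spike encoding and the shift adder are ignored.

```python
def map_kernel_to_crossbar(kernel):
    """Flatten a KH x KW x C x M kernel into a (KH*KW*C) x M matrix.

    Column m holds filter m unrolled into one vector, so each filter
    occupies one crossbar column. The row order is an assumption.
    """
    kh, kw = len(kernel), len(kernel[0])
    c, m = len(kernel[0][0]), len(kernel[0][0][0])
    rows = []
    for i in range(kh):
        for j in range(kw):
            for ch in range(c):
                rows.append([kernel[i][j][ch][f] for f in range(m)])
    return rows  # rows[r][f]: weight at wordline r, bitline f


def crossbar_vmm(input_vector, crossbar):
    """Ideal vector-matrix multiply: bitline f sums input * weight."""
    n_cols = len(crossbar[0])
    return [sum(v * row[f] for v, row in zip(input_vector, crossbar))
            for f in range(n_cols)]


# Toy example: 2x2 kernel window, C = 1 channel, M = 2 filters.
kernel = [[[[1, 0]], [[2, 1]]],
          [[[3, 0]], [[4, 1]]]]
crossbar = map_kernel_to_crossbar(kernel)
window = [1, 1, 1, 1]  # one flattened 2x2x1 input window
outputs = crossbar_vmm(window, crossbar)  # one value per filter
```

Each call to `crossbar_vmm` stands for one crossbar cycle: one input vector on the wordlines, M results on the bitlines.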
Fig. 1(c) depicts a full ReRAM-based PIM architecture [5,
8], which is based on the main memory structure with supportive
periphery circuits. For instance, the wordline/bitline drivers
(WL/BL D. in Fig. 1) generate input pulses and control the
switch of each cell.
B. Deconvolutional Computation
Fig. 2 illustrates two kinds of deconvolutional algorithms:
zero-padding and padding-free. Similar to convolutional com-
putation, stride s and padding p are two hyper-parameters in
the deconvolutional computation. Suppose that the input data
I, which consists of a series of feature maps, is an IH × IW × C
tensor. Here, IH and IW are respectively the height and width
of each input feature map, and C is the number of channels. The
convolution kernel K contains a set of filters and is represented
by a KH × KW × C × M tensor. M is the number of filters,
which is equivalent to the number of output feature maps. Like
the input data, the output O is composed of M feature maps, each
of which is OH × OW. In this work, each element in the input
and output is referred to as a pixel. Deconvolution is an up-sampling
operation, and therefore OH ≥ IH and OW ≥ IW.
The zero-padding deconvolution (Algorithm 1) includes two
steps: a) Padding: insert zeros between the pixels in the
input feature maps (the result is denoted as Ipad); b) Convolution: perform
regular convolution on Ipad with the kernels. As can be seen, zero-padding
incurs massive redundant multiplications because of the zero-valued
input operands in the padded input feature maps.
Fig. 2. Pseudo codes of traditional deconvolution algorithms (Algorithm 1: zero-padding deconvolution; Algorithm 2: padding-free deconvolution).
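The two steps of the zero-padding algorithm can be illustrated with a minimal single-channel sketch. It is written from the textual description rather than from Algorithm 1 itself. The function names, the square-kernel restriction and the border width KH − 1 − p (a common convention for transposed convolution) are assumptions made for illustration.

```python
def zero_insert(x, stride, border):
    """Step a) Padding: put stride-1 zeros between pixels, plus a zero border."""
    h, w = len(x), len(x[0])
    height = (h - 1) * stride + 1 + 2 * border
    width = (w - 1) * stride + 1 + 2 * border
    padded = [[0] * width for _ in range(height)]
    for i in range(h):
        for j in range(w):
            padded[border + i * stride][border + j * stride] = x[i][j]
    return padded


def conv2d(x, k):
    """Step b) Convolution: stride-1 sliding-window MAC, no extra padding."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[y + a][z + b] * k[a][b]
                 for a in range(kh) for b in range(kw))
             for z in range(out_w)]
            for y in range(out_h)]


def deconv_zero_padding(x, k, stride, padding):
    """Zero-padding deconvolution of one feature map with a square kernel."""
    border = len(k) - 1 - padding  # assumed border width, see lead-in
    return conv2d(zero_insert(x, stride, border), k)


x = [[1, 2], [3, 4]]
k = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = deconv_zero_padding(x, k, stride=2, padding=1)
# y == [[5, 16, 10], [26, 64, 36], [15, 36, 20]]
```

In this toy case the convolution performs 81 multiplications (9 output pixels times 9 weights), of which only 16 involve a non-zero input pixel. This is the redundancy the text refers to.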
Algorithm 2 describes the padding-free algorithm with the
following four major steps: a) Rotation: rotate the weight kernel
by 180°; b) Convolution: compute the intermediate results
by multiplying and accumulating (MAC) an input pixel with the
corresponding kernel in the channel direction; c) Addition:
add together the overlapped pixels obtained in step b); and d)
Cropping: crop the data at the edge of the output matrices to
fit the size of the final output. Unlike zero-padding, the padding-free
algorithm avoids inserting zeros into the input.
However, it introduces two additional operations: addition
and cropping. Previously, Xu et al. [11] successfully utilized
the padding-free algorithm to adapt CMOS-based hardware
for efficient deconvolutional computation. As we shall show
in Section III-A, existing ReRAM-based accelerators need
substantial effort to realize these operations, incurring a large
overhead.
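The four steps can be traced in a minimal single-channel sketch, again written from the textual description rather than from Algorithm 2 itself. The function name is hypothetical, and cropping `padding` pixels from every edge is an assumption. With several channels, step b) would additionally accumulate over the channel direction.

```python
def deconv_padding_free(x, k, stride, padding):
    """Padding-free deconvolution of one feature map (single channel)."""
    kh, kw = len(k), len(k[0])
    h, w = len(x), len(x[0])
    # a) Rotation: rotate the kernel by 180 degrees.
    rot = [row[::-1] for row in k[::-1]]
    # b) + c) Convolution and Addition: every input pixel scales the whole
    # rotated kernel; overlapping contributions are accumulated.
    full_h, full_w = (h - 1) * stride + kh, (w - 1) * stride + kw
    full = [[0] * full_w for _ in range(full_h)]
    for i in range(h):
        for j in range(w):
            for a in range(kh):
                for b in range(kw):
                    full[i * stride + a][j * stride + b] += x[i][j] * rot[a][b]
    # d) Cropping: drop `padding` pixels from every edge.
    return [row[padding:full_w - padding]
            for row in full[padding:full_h - padding]]


x = [[1, 2], [3, 4]]
k = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = deconv_padding_free(x, k, stride=2, padding=1)
# y == [[5, 16, 10], [26, 64, 36], [15, 36, 20]]
```

No multiplication in this sketch involves an inserted zero, but steps c) and d) are exactly the extra addition and cropping work that the text identifies as costly on a crossbar.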
III. RERAM-BASED DECONVOLUTION ACCELERATOR
In this section, we first analyze the computation inefficiency
when mapping the two popular deconvolutional algorithms to
existing ReRAM-based accelerators. Then we elaborate on the
proposed RED, a ReRAM-based deconvolution accelerator design
that exploits pixel-wise mapping and zero-skipping data
flow to perform highly efficient deconvolution computation. We
also analyze the trade-off in RED between the area overhead
and parallelism.
A. Analysis & Observations
Fig. 3(a) illustrates the zero-padding deconvolution implementation.
The kernel mapping of zero-padding deconvolution
is the same as that of the standard convolutional computation
described in Section II: the M filters of a deconvolutional layer are
mapped onto M columns of a ReRAM crossbar. In each cycle,
one input vector is fed into the crossbar for computation, and
each element of the produced M-element output vector corresponds
to one pixel of one of the M output feature maps. As such,
it takes OH × OW cycles to obtain the complete data of
the M output feature maps, each of shape OH × OW. After
the padding step (Section II-B), the input vectors contain a
large number of inserted zeros and become very sparse, inducing
redundant computations on the zero pixels. Fig. 4 presents the
zero redundancy ratio (i.e., the ratio of redundant computation
induced by zero-padding over total computation) when varying
the stride. Typically, the deconvolution layers in GANs (e.g.,
SNGAN [13]) set the stride to 2, while FCNs [3] usually
prefer larger strides in deconvolution layers, such as 8, 16,
or 32. As shown in Fig. 4, the zero redundancy ratio is
already 86.8% when stride = 2 and rises to as much as
99.8% when stride = 32. The high zero redundancy ratios
indicate that a large number of operations are redundant
when a ReRAM-based accelerator performs zero-padding deconvolution. Note
that ReGAN [12] adopted the zero-padding deconvolution but
neglected the redundant operations.
Fig. 3. Deconvolution on ReRAM-based accelerator: (a) ReRAM-based zero-padding deconvolution; (b) ReRAM-based padding-free deconvolution. Legend: pixels with non-zero value; pixels with zero value.
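One simple proxy for the zero redundancy ratio is the fraction of zero pixels in the padded input feature map. The sketch below computes that proxy. The exact counting behind Fig. 4 is not spelled out in the text, so the function name, the border width KH − 1 − p and the choice of padding p = 1 are assumptions.

```python
def zero_fraction(in_size, kernel_size, stride, padding):
    """Fraction of zero pixels in a square, zero-inserted input feature map."""
    border = kernel_size - 1 - padding            # assumed border width
    padded = (in_size - 1) * stride + 1 + 2 * border
    return 1.0 - (in_size * in_size) / (padded * padded)


# 4x4 input with a 4x4 kernel and stride 2 (SNGAN-like layer, padding assumed 1)
ratio = zero_fraction(in_size=4, kernel_size=4, stride=2, padding=1)
```

Under these assumptions the padded map is 11×11 with 16 non-zero pixels, so `ratio` is about 0.868, in line with the 86.8% quoted for stride = 2.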
Padding-free is an alternative deconvolution algorithm that
avoids the zero redundancy. A previous study [11] showed
that padding-free deconvolution achieved up to 44.9× performance
improvement on CMOS-based platforms, such
as ASICs. However, our analysis shows that directly mapping
the padding-free algorithm onto a ReRAM architecture might
not be efficient. As depicted in Fig. 3(b), unlike the
zero-padding deconvolution with a compact output in M
columns, the implementation of padding-free deconvolution
on a crossbar requires KH × KW × M columns. As the
wordline/bitline driving power increases quadratically
with the number of columns, the padding-free deconvolution is
expected to consume much more power than the zero-padding
deconvolution. Moreover, the output from the crossbar is
not the final result but requires further processing (addition
and cropping), which calls for dedicated circuit support and
extra area.
B. RED Architecture
To overcome the aforementioned problems, we propose
RED, a new ReRAM-based deconvolution accelerator. The
design combines two orthogonal approaches: one
minimizes the redundant operations induced by the padded
zeros, and the other enhances the execution parallelism without
additional operations. For ease of explanation, we take
a deconvolutional computation with stride = 2 and a kernel
filter size of 3×3 as the example in the following description.
Fig. 4. The zero redundancy ratio in zero-padding deconvolution changing with the stride (x-axis: stride, from 1 to 32; y-axis: zero redundancy; curves: SNGAN with 4×4 input and FCN [3] with 16×16 input).
The overall RED architecture is presented in Fig. 5(a).
Here, the computation of the deconvolution is executed by
KH × KW sub-crossbars (denoted as “SC”), each of which
has a size of C × M. For clarity, the padded
zeros are shown in Fig. 5. During the deconvolutional computation, RED
takes only the non-zero pixels (in purple) to form the
input vectors. The partial results from the corresponding sub-crossbars
are summed up to obtain the output pixels. In each
clock cycle, multiple pixels for each output feature map are
generated concurrently. In the following, we elaborate on
the details of the pixel-wise mapping in Fig. 5(b) and the
zero-skipping data flow in Fig. 5(c).
1) Pixel-wise Mapping: We propose the pixel-wise mapping
to eliminate the high zero redundancy induced by the
zero-padding algorithm. Fig. 6 explains the design principle. In
the figure, the large grid refers to a padded input feature
map, whose non-zero and zero pixels are denoted in purple
and white, respectively. The small grid with numbers
indicates the kernel with its weight locations, in which only the
purple cells are the weights actually in use. Due to the
high redundancy of the padded image, only a small portion of the
weights take part in each convolution operation.
Fig. 6(a)∼(d) illustrate the four computation modes when
sliding the kernel filter over the input feature map. For the
given configuration, a kernel filter has nine weights, labeled
with numbers 1∼9. Starting from the first convolutional
computation in Fig. 6(a), only four weights (1, 3,
7 and 9) contribute to the calculation result. The following
convolution, obtained by sliding the kernel filter horizontally by one step,
involves only two weights, 4 and 6, as shown in Fig. 6(b).
Similarly, the computation modes in Fig. 6(c) & (d) occur
when moving the kernel window down by one grid cell from the
positions in Fig. 6(a) & (b), respectively. We observe that the
convolution operations in the deconvolution are repetitions
of the four computation modes. Furthermore, the weights of
the kernel filter are mutually exclusive among these modes. Thus, we
propose pixel-wise mapping to execute the computation modes
(a)∼(d) in parallel.
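The grouping of weights into mutually exclusive modes can be checked with a short sketch. It assumes weights are labeled 1 to 9 in row-major order and that the window starts aligned with a non-zero pixel, as in the first mode. The function name and the (dy, dx) offset keys are hypothetical. Which two-weight group corresponds to which panel of Fig. 6 depends on the axis convention, which the sketch does not try to match.

```python
def computation_modes(kernel_size, stride):
    """Group kernel weights by the window offset at which they are active.

    In a zero-inserted input, non-zero pixels sit on a grid with pitch
    `stride`. A weight at (row a, col b) meets a non-zero pixel only when
    the window offset (dy, dx) satisfies (dy + a) % stride == 0 and
    (dx + b) % stride == 0.
    """
    modes = {}
    for dy in range(stride):
        for dx in range(stride):
            modes[(dy, dx)] = [
                a * kernel_size + b + 1          # row-major label, 1-based
                for a in range(kernel_size)
                for b in range(kernel_size)
                if (dy + a) % stride == 0 and (dx + b) % stride == 0
            ]
    return modes


modes = computation_modes(kernel_size=3, stride=2)
# Four modes with 4, 2, 2 and 1 weights; every weight belongs to exactly one.
```

Because the groups are disjoint and together cover all nine weights, each weight can be assigned to its own sub-crossbar without duplication.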
We map a kernel of size KH × KW × C × M into
KH × KW sub-crossbars. Each sub-crossbar has C inputs and
M outputs and can thus be expressed as a matrix of shape
C × M. Combining all the sub-crossbars forms
a sub-crossbar tensor (SCT) of shape C × M × (KH × KW),
as shown in Fig. 5(b). Our pixel-wise mapping
approach can then be expressed as:
$$\mathrm{SCT}[c, m, i \cdot K_W + j] = W[i, j, c, m], \qquad (1)$$
where W denotes the kernel weights, 0 ≤ i < KH and 0 ≤ j < KW indicate the location of
the weight in each filter, 0 ≤ c < C denotes the c-th channel
of the weight filter, and 0 ≤ m < M refers to the m-th weight filter.
Fig. 5. Illustration of (a) the overall RED architecture, (b) pixel-wise mapping, and (c) zero-skipping data flow. Legend: pixels with non-zero value; pixels with zero value.
Once a round of computation in sub-crossbars is completed,
we add the output from corresponding SCs to obtain the final
deconvolution results. Thanks to the vertical sum-up design
in the existing ReRAM-based accelerators [8, 12], no extra
circuitry is needed to realize the addition operations in pixel-
wise mapping.
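Eq. (1) is a pure re-indexing of the kernel, which the following sketch makes explicit with nested lists. The function names and the toy kernel, whose entries encode their own indices, are assumptions chosen so the mapping is easy to inspect.

```python
def pixel_wise_mapping(weights):
    """Build SCT[c][m][i*KW + j] = W[i][j][c][m] as nested lists (Eq. (1))."""
    kh, kw = len(weights), len(weights[0])
    n_c, n_m = len(weights[0][0]), len(weights[0][0][0])
    return [[[weights[n // kw][n % kw][c][m] for n in range(kh * kw)]
             for m in range(n_m)]
            for c in range(n_c)]


def sub_crossbar(sct, n):
    """Extract sub-crossbar n as a C x M matrix."""
    return [[sct[c][m][n] for m in range(len(sct[0]))]
            for c in range(len(sct))]


# Toy kernel with KH=2, KW=3, C=2, M=2; each entry encodes its own indices.
KH, KW, C, M = 2, 3, 2, 2
W = [[[[1000 * i + 100 * j + 10 * c + m for m in range(M)]
       for c in range(C)]
      for j in range(KW)]
     for i in range(KH)]
SCT = pixel_wise_mapping(W)
sc4 = sub_crossbar(SCT, 1 * KW + 1)  # holds the weights at kernel position (1, 1)
```

Each sub-crossbar thus holds the C × M weights of exactly one kernel position, which is what lets different kernel positions receive different input pixels in the same cycle.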
2) Zero-skipping Data Flow: Based on the pixel-wise
mapping scheme, we develop the zero-skipping data flow,
which takes only non-zero pixels as the inputs of the SCT. Fig. 5(c)
illustrates the operation for the given example with stride = 2
and a kernel size of 3×3. Accordingly, there are 9 sub-crossbars.
The output vectors from the sub-crossbars on the same row
are added together, and sub-crossbars along the same
column take the same input vectors. For brevity, we use
I(i, j) to denote an input vector, which has C pixels, one
from each channel. The proposed data flow bypasses the
padded zeros, so i and j, which index the padded
image, are always even numbers. In Cycle 1, I(0, 0) goes to
SC1, I(2, 0) is provided to SC2 and SC3, I(0, 2) is taken
by SC4 and SC7, and I(2, 2) is applied to SC5, SC6, SC8
and SC9. The 9 sub-crossbars operate simultaneously, and
their outputs are combined as explained above
to form the final deconvolution results. In the following cycle,
RED continues to compute the kernels with the next batch
of non-zero pixels, e.g., I(0, 2), I(0, 4), I(2, 2) and I(2, 4) in
Cycle 2, as illustrated in the figure. Compared to the zero-padding
deconvolution, the zero-skipping data flow increases
the computation parallelism of this example by 4×.
Fig. 6. The four computation modes, (a) to (d), in deconvolution when the kernel size is 3×3 and the stride is 2. In each panel the kernel weights are numbered 1 to 9 in row-major order.
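The effect of skipping the padded zeros can be emulated in software. The sketch below is a sequential, single-channel stand-in for what the sub-crossbars do in parallel. The function name, the border width KH − 1 − p and the grouping of each s × s block of output pixels into one cycle are assumptions made for illustration.

```python
def deconv_zero_skipping(x, k, s, p):
    """Single-channel deconvolution that never touches inserted zeros.

    Returns (output, cycles); one cycle produces an s x s block of output
    pixels, one pixel per computation mode.
    """
    kk, h, w = len(k), len(x), len(x[0])
    border = kk - 1 - p                      # zeros around the padded map
    oh, ow = (h - 1) * s + kk - 2 * p, (w - 1) * s + kk - 2 * p
    out = [[0] * ow for _ in range(oh)]
    cycles = 0
    for by in range(0, oh, s):               # one s x s output block per cycle
        for bx in range(0, ow, s):
            cycles += 1
            for y in range(by, min(by + s, oh)):
                for xo in range(bx, min(bx + s, ow)):
                    acc = 0
                    # Visit only the weights aligned with non-zero pixels.
                    for a in range((border - y) % s, kk, s):
                        for b in range((border - xo) % s, kk, s):
                            i = (y + a - border) // s
                            j = (xo + b - border) // s
                            if 0 <= i < h and 0 <= j < w:
                                acc += x[i][j] * k[a][b]
                    out[y][xo] = acc
    return out, cycles


x = [[1, 2], [3, 4]]
k = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y, cycles = deconv_zero_skipping(x, k, s=2, p=1)
# y == [[5, 16, 10], [26, 64, 36], [15, 36, 20]], cycles == 4
```

For this 3×3 output the sketch needs 4 block cycles instead of the 9 cycles of a one-pixel-per-cycle schedule. The ratio approaches s² = 4 as the output grows and edge effects vanish.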
C. Design Trade-off
We use stride = 2 to illustrate the RED design. The
deconvolution with stride = 2 can be decomposed into four
computation modes, and RED therefore achieves a 4× speedup.
The number of computation modes is stride², indicating
that the speedup brought by RED increases quadratically with the
stride.
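As a worked instance of this relation, the cycle counts can be written out explicitly. This sketch counts cycles only and uses the symbols N and s (the stride), which are introduced here. It assumes one output pixel per cycle for the zero-padding design (Section III-A) and one output pixel per computation mode per cycle for RED.

```latex
\[
N_{\text{zero-padding}} = O_H \times O_W, \qquad
N_{\text{RED}} = \frac{O_H \times O_W}{s^2}, \qquad
\frac{N_{\text{zero-padding}}}{N_{\text{RED}}} = s^2 ,
\]
\[
s = 2 \;\Rightarrow\; 4\times, \qquad s = 8 \;\Rightarrow\; 64\times .
\]
```

These ratios bound the gain from the cycle count alone; the measured speedup also depends on the array and periphery latency of each cycle.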
The kernel size usually grows with the stride. For the
FCN [3] with stride = 8, the kernel filter size is 16 × 16.
Accordingly, 256 sub-crossbars are needed to execute all
the computation modes simultaneously. More sub-crossbars
increase the area because of the extra periphery circuits
(wordline/bitline drivers, column muxes, shift adders, etc.).
There exists a trade-off between the area and the execution
speedup in RED. When the kernel filter size is too large, RED
can take more time for computation in exchange for area
efficiency. We can halve the number of sub-crossbars
by adding zeros to the input vector.
Suppose that the size of the sub-crossbar tensor SCT is C × M ×
(KH × KW). In the area-efficient design, the shape of SCT
becomes 2C × M × (KH × KW)/2. The data flow changes as follows:
$$
\begin{aligned}
\text{Cycle 1:}\quad & I_n[c]_{c=1,\dots,C} = I_{2n,\mathrm{ori}}[c]_{c=1,\dots,C}; \quad I_n[c]_{c=C+1,\dots,2C} = 0; \\
\text{Cycle 2:}\quad & I_n[c]_{c=1,\dots,C} = 0; \quad I_n[c]_{c=C+1,\dots,2C} = I_{2n+1,\mathrm{ori}}[c]_{c=1,\dots,C};
\end{aligned}
\qquad (2)
$$
where I_n denotes the input vector of the n-th modified sub-crossbar
(0 ≤ n < (KH × KW)/2) and I_{n,ori} is the original input vector in the
pixel-wise mapping method. In this way, we employ 128 sub-arrays
to complete the 64 computation modes in two cycles
when stride = 8 and the kernel filter size is 16 × 16.
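The two-cycle scheme of Eq. (2) can be checked on a toy example. Two C × M sub-crossbars are stacked into one 2C × M sub-crossbar, and the half that is idle in each cycle receives zeros. The helper names and the numbers below are hypothetical, and the crossbar is modeled as an ideal matrix product.

```python
def vmm(vec, mat):
    """Ideal crossbar product: one output per column of `mat`."""
    return [sum(v * row[m] for v, row in zip(vec, mat))
            for m in range(len(mat[0]))]


def merge_sub_crossbars(sc_even, sc_odd):
    """Stack sub-crossbars 2n and 2n+1 (each C x M) into one 2C x M array."""
    return sc_even + sc_odd


def two_cycle_inputs(in_even, in_odd):
    """Inputs of Eq. (2): zero the half that is idle in each cycle."""
    zeros = [0] * len(in_even)
    return in_even + zeros, zeros + in_odd


sc0 = [[1, 2], [3, 4]]        # sub-crossbar 2n     (C = 2, M = 2)
sc1 = [[5, 6], [7, 8]]        # sub-crossbar 2n + 1
i0, i1 = [1, 1], [2, 0]       # their original input vectors
merged = merge_sub_crossbars(sc0, sc1)
cycle1, cycle2 = two_cycle_inputs(i0, i1)
out1 = vmm(cycle1, merged)    # equals vmm(i0, sc0)
out2 = vmm(cycle2, merged)    # equals vmm(i1, sc1)
```

The merged array reproduces both original outputs, at the price of one extra cycle and twice as many rows per sub-crossbar.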
TABLE I
BENCHMARKS USED IN THIS WORK
Layer Name  | Network Model     | Dataset    | Input Size (IH, IW, C) | Output Size (OH, OW, M) | Kernel Size (KH, KW, C, M) | Stride
GAN Deconv1 | DCGAN [14]        | LSUN       | (8, 8, 512)            | (16, 16, 256)           | (5, 5, 512, 256)           | 2
GAN Deconv2 | Improved GAN [15] | Cifar-10   | (4, 4, 512)            | (8, 8, 256)             | (5, 5, 512, 256)           | 2
GAN Deconv3 | SNGAN [13]        | Cifar-10   | (4, 4, 512)            | (8, 8, 256)             | (4, 4, 512, 256)           | 2
GAN Deconv4 | SNGAN [13]        | STL-10     | (6, 6, 512)            | (12, 12, 256)           | (4, 4, 512, 256)           | 2
FCN Deconv1 | voc-fcn8s 2x [3]  | PASCAL VOC | (16, 16, 21)           | (34, 34, 21)            | (4, 4, 21, 21)             | 2
FCN Deconv2 | voc-fcn8s 8x [3]  | PASCAL VOC | (70, 70, 21)           | (568, 568, 21)          | (16, 16, 21, 21)           | 8
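TABLE II
BREAKDOWN COMPONENTS
Group          | Component                                | Abbr.
Array (a)      | Computation                              | c
Array (a)      | Wordline Driving                         | wd
Array (a)      | Bitline Driving                          | bd
Periphery (pp) | Multiplexer                              | mux
Periphery (pp) | Decoder                                  | dec
Periphery (pp) | Read Circuit / Integrated & Fire Circuit | rc
Periphery (pp) | Shift Adder                              | sa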
IV. EXPERIMENTS
This section evaluates RED in terms of performance, energy
consumption, and area overhead. We compare RED with the
conventional zero-padding and padding-free designs using the
deconvolutional layers from representative neural networks.
A. Experimental Setup
We modified NeuroSim+ [16] to implement the conventional
zero-padding design, the padding-free design, and our
proposed RED design. The system runs at a 2 GHz clock
frequency and employs the 1T1R ReRAM cell structure and the
65 nm technology node. The benchmark includes several deconvolutional
layers from a set of representative neural network
models, including GANs and FCNs. The details of the
benchmark used in our work are summarized in Table I.
The performance of the three designs for each benchmark
is reported below, including latency, energy consumption,
and area overhead. All the results are normalized to those of
the zero-padding design. For analysis purposes, we present the
results by separating the contributions from the array and the periphery
circuitry. Table II lists the detailed breakdown components
and their abbreviations.
We select the layers from GANs and FCNs as the benchmark
in order to evaluate the performance of RED in various
deconvolution applications. The deconvolution layers in GANs
usually have a larger number of input and output
channels. As such, the kernel size is usually large, e.g.,
5 × 5 × 512 × 256 for GAN Deconv1 (Table I). In contrast, the kernel
size in FCNs is usually much smaller, such as 16 × 16 × 21 × 21
in voc-fcn8s. Such a difference in configuration indicates that
in GANs, the array resources could outweigh the peripheral
circuitry, while the situation in FCNs is the opposite. This distinction
between GAN and FCN deconvolution is clearly reflected
in the evaluation results presented in the following.
B. Experimental Results & Analysis
1) Latency: Fig. 7 presents the total latency and its breakdown
for the three design implementations, obtained from the
following calculation:
$$L_{\mathrm{total}} = (L_{wd} + L_{bd})_{a} + (L_{dec} + L_{mux} + L_{rc} + L_{sa})_{pp}. \qquad (3)$$
Fig. 7(a) shows that RED combines the advantages of both the
padding-free and zero-padding designs. It attains the lowest
total latency and achieves the highest speedup across all the
benchmarks. The performance improvement of RED benefits
from two aspects: 1) it eliminates the zero redundancy in input
vectors and reduces the number of cycles; and 2) the size of its
output vectors is the same as in the zero-padding design, hence
the two designs have similar array latency, which is much
lower than that of the padding-free design. Compared to the
zero-padding design, RED achieves 3.69∼31.15× speedup.
Fig. 7(b) presents the breakdown of the execution time.
Compared to the padding-free design, RED reduces the array
latency because of the smaller size of its output vectors and thus
the lower latency caused by wordline driving. The padding-free
design has longer array latency because of its much longer output
vector. Compared to the zero-padding design, RED incurs
76.9%∼96.8% less array and periphery latency. The zero-padding
design requires stride²× as many cycles as
the other two designs after adding zero redundancy to the
input vectors, which adds extensive periphery latency to
the computation. When stride = 2 (as in the GAN layers and
FCN Deconv1), the zero-padding design has 4× the periphery
latency of the padding-free design and RED. Despite
the fact that the padding-free design produces more array
latency than the zero-padding design and RED in GANs, the
zero-padding design still has 1.55∼2.62× longer latency
than the padding-free design.
Fig. 7. The latency comparison of the zero-padding, padding-free, and RED designs: (a) the speedup performance; (b) the execution time breakdown (normalized latency in %, split into array latency and periphery latency).
Fig. 8. The energy comparison of the zero-padding, padding-free, and RED designs: (a) energy saving results; (b) the energy breakdown (normalized energy in %, split into array energy and periphery energy).
2) Energy: Fig. 8 presents the total energy consumption and its
breakdown for the padding-free, zero-padding, and
RED designs. The following equation shows the breakdown of the
energy consumption:
$$E_{\mathrm{total}} = (E_{c} + E_{wd} + E_{bd})_{a} + (E_{dec} + E_{mux} + E_{rc} + E_{sa})_{pp}. \qquad (4)$$
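The bookkeeping behind Eq. (3) and Eq. (4) amounts to summing the Table II components per group. A small sketch follows, using the abbreviations of Table II. The numbers are hypothetical, unitless placeholders and not measured results, and the function name is an assumption.

```python
ARRAY_LATENCY_TERMS = ("wd", "bd")          # array terms of Eq. (3)
ARRAY_ENERGY_TERMS = ("c", "wd", "bd")      # array terms of Eq. (4)
PERIPHERY_TERMS = ("dec", "mux", "rc", "sa")


def total(breakdown, array_terms):
    """Return (array, periphery, total) for one design and one metric."""
    array = sum(breakdown[t] for t in array_terms)
    periphery = sum(breakdown[t] for t in PERIPHERY_TERMS)
    return array, periphery, array + periphery


# Hypothetical per-component values, for illustration only.
example = {"c": 5.0, "wd": 2.0, "bd": 1.0,
           "dec": 0.5, "mux": 0.5, "rc": 3.0, "sa": 1.0}
latency = total(example, ARRAY_LATENCY_TERMS)   # (3.0, 5.0, 8.0)
energy = total(example, ARRAY_ENERGY_TERMS)     # (8.0, 5.0, 13.0)
```

The only structural difference between the two equations is the computation term c, which appears in the array energy but not in the array latency.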
Experimental results demonstrate that RED outperforms the
other two implementations in total energy efficiency. Owing
to the large energy consumption for wordline/bitline
driving, the array energy of the padding-free design is considerably
higher, about 4.48∼7.53× that of
the other two designs. For this reason, the padding-free
design consumes up to 6.68× more energy than the others
when implementing GANs, where the array contributes more.
Because the total size of the ReRAM crossbar
array remains the same, the zero-padding design and RED
have similar array energy. The periphery energy of RED
is lower than that of the zero-padding design because the input
data size of each crossbar is reduced, and thereby the decoders
consume less energy. In total, RED consumes 8%∼88.36%
less energy than the zero-padding design.
3) Area: Fig. 9 shows the breakdown of the area overhead
of the three designs. For the sake of brevity, we show only
a handful of cases. Similar area overhead is observed for
all the layers of GANs and FCNs considered in our study.
Likewise, the area overhead has two parts: array area and
periphery area. The results demonstrate that the three designs incur
the same array area because of their identical kernel size. The
padding-free design incurs higher area overhead (9.79% in
GANs and 116.57% in FCNs) on account of its numerous
output-related circuits. The disparity between the area overhead of the
padding-free design and the zero-padding design is remarkable
in FCNs. The reason is the difference (KH × KW times)
in the output sizes of the two designs. More specifically, it is
25× in GAN Deconv1 but 256× in FCN Deconv2. Compared
with the zero-padding design, the proposed RED introduces
21.41% higher area overhead. The overhead increases mainly
because the pixel-wise mapping method adds output-related
periphery circuits by splitting the crossbar apart.
Fig. 9. The area comparison of the zero-padding, padding-free, and RED designs for GAN Deconv1 and FCN Deconv2 (normalized area in %, split into array area and periphery area).
V. CONCLUSION
This work introduces RED, a high-performance and energy-
efficient ReRAM-based deconvolution accelerator. Through
the optimization of the mapping design and data flow, RED
eliminates the redundant computations and avoids the over-
head of the incremental periphery circuitry. Experimental
evaluation shows that RED outperforms the existing ReRAM-
based accelerators for the common deconvolutional compu-
tation algorithms, with up to 31.15× speedup and 88.36%
energy consumption reduction.
ACKNOWLEDGEMENTS
This work was supported by US Department of En-
ergy (DOE) SC0017030. Bing Li acknowledges the National
Academy of Sciences (NAS), USA for awarding the NRC
research fellowship.
REFERENCES
[1] Jiajun Wu et al. Learning a probabilistic latent space of object
shapes via 3d generative-adversarial modeling. In NIPS, pages
82–90, 2016.
[2] Raymond Yeh et al. Semantic image inpainting with perceptual
and contextual losses. arxiv preprint. arXiv:1607.07539.
[3] Jonathan Long et al. Fully convolutional networks for semantic
segmentation. In CVPR, pages 3431–3440, 2015.
[4] Shifeng Zhang et al. Single-shot refinement neural network for
object detection. In IEEE CVPR, 2018.
[5] Ping Chi et al. Prime: A novel processing-in-memory archi-
tecture for neural network computation in reram-based main
memory. In SIGARCH Comput. Archit. News, volume 44, pages
27–39, 2016.
[6] Ming Cheng et al. Time: A training-in-memory architecture for
rram-based deep neural networks. TCAD, 2018.
[7] Ali Shafiee et al. Isaac: A convolutional neural network accel-
erator with in-situ analog arithmetic in crossbars. SIGARCH
Comput. Archit. News, 44(3):14–26, 2016.
[8] Linghao Song et al. Pipelayer: A pipelined reram-based accel-
erator for deep learning. In HPCA, pages 541–552, 2017.
[9] Ximing Qiao et al. Atomlayer: a universal reram-based cnn
accelerator with atomic layer computation. In DAC.
[10] Bing Li et al. Reram-based accelerator for deep learning. In
DATE, pages 815–820, 2018.
[11] Dawen Xu et al. Fcn-engine: Accelerating deconvolutional
layers in classic cnn processors. In ICCAD, 2018.
[12] Fan Chen et al. Regan: A pipelined reram-based accelerator for
generative adversarial networks. In ASP-DAC.
[13] Takeru Miyato et al. Spectral normalization for generative
adversarial networks. arXiv:1802.05957, 2018.
[14] Alec Radford et al. Unsupervised representation learn-
ing with deep convolutional generative adversarial networks.
arXiv:1511.06434, 2015.
[15] Tim Salimans et al. Improved techniques for training gans. In
NIPS, pages 2234–2242, 2016.
[16] Pai Yu Chen et al. Neurosim+: An integrated device-to-
algorithm framework for benchmarking synaptic devices and
array architectures. In IEDM, pages 6–1, 2018.
