Communication Lower Bound in Convolution Accelerators by Chen, Xiaoming et al.
Communication Lower Bound in Convolution Accelerators
Xiaoming Chen1,2, Yinhe Han1,2, Yu Wang3
1Center for Intelligent Computing Systems, State Key Laboratory of Computer Architecture,
Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
2University of Chinese Academy of Sciences, Beijing, China
3Department of Electronic Engineering, Tsinghua University, Beijing, China
Email: {chenxiaoming, yinhes}@ict.ac.cn, yu-wang@tsinghua.edu.cn
Abstract—In current convolutional neural network (CNN)
accelerators, communication (i.e., memory access) dominates the
energy consumption. This work provides comprehensive analysis
and methodologies to minimize the communication for CNN
accelerators. For the off-chip communication, we derive the
theoretical lower bound for any convolutional layer and propose a
dataflow to reach the lower bound. This fundamental problem has
never been solved by prior studies. The on-chip communication
is minimized based on an elaborate workload and storage
mapping scheme. We in addition design a communication-optimal
CNN accelerator architecture. Evaluations based on the 65nm
technology demonstrate that the proposed architecture nearly
reaches the theoretical minimum communication in a three-
level memory hierarchy and it is computation dominant. The
gap between the energy efficiency of our accelerator and the
theoretical best value is only 37-87%.
Index Terms—Convolutional neural network (CNN), CNN
accelerator, communication lower bound
I. INTRODUCTION
Convolutional neural networks (CNNs) have achieved great
successes in numerous practical applications (e.g., [1]–[3]).
The reliable results produced by modern CNNs rely heavily
on complex models and large amounts of data, which
in turn bring significant demands in both performance and
energy efficiency. Recently, a number of hardware accelerators
based on either application-specific integrated circuits (ASICs)
or field-programmable gate arrays (FPGAs) have been pro-
posed to boost the performance and the energy efficiency of
CNNs.
Due to the large amount of data and complex data reuse
patterns in convolution computation, CNN accelerators often
involve a great number of memory accesses. Inputs and
weights are typically stored in the off-chip dynamic random-
access memory (DRAM). A static random-access memory
(SRAM) based on-chip global buffer (GBuf) stores portions of
inputs and weights which are loaded from the DRAM. Each
processing element (PE) has some registers (Regs) to store
inputs and weights which are read from the GBuf. Partial sums
(Psums) are stored in the GBuf or Regs. During computation,
there is complex data transmission in the memory hierarchy.
Normally, communication, rather than computation, dominates the
energy consumption of a CNN accelerator. A DRAM access
consumes 2 to 3 orders of magnitude more energy than an
arithmetic operation [4], and the DRAM access energy can be
more than 90% of the total energy consumption of a CNN
accelerator [5], [6]. For the on-chip aspects, Regs can take up
a large portion (>50%) of the chip energy while arithmetic
units consume less than 20% [7]. Therefore, from an energy
point of view, current CNN accelerators are communication
dominant, and minimizing the communication is the
key to improving the energy efficiency of CNN accelerators.
This paper will appear in 2020 26th IEEE International Symposium on
High-Performance Computer Architecture (HPCA’20).
Maximizing data reuse in convolutions helps reduce com-
munication. Data reuse heavily depends on the convolu-
tional dataflow. There are various approaches to optimize
the dataflow: 1) designing an elaborate dataflow [5], [7]–
[18], 2) selecting the best dataflow from several candidates
[19]–[23], and 3) design space exploration (DSE) [24]–[32].
A fair number of these studies focus on the performance
and/or the energy efficiency of the computational components.
The energy-dominant component, communication, has not
been comprehensively investigated. Moreover, in most existing
studies, the dataflow is designed based on intuitive/heuristic
analysis, which may not guarantee the optimality.
If the inputs, weights, and outputs of a convolutional layer
are accessed exactly once at every level of the memory hi-
erarchy, the layer-wise minimum communication is obviously
reached. However, such an ambitious goal requires a huge on-
chip memory. The requirement of memory resources varies for
different applications. Thus, under given hardware resources,
searching for a dataflow and an architecture that minimize
the communication has much more practical significance. This
problem has never been solved. In this work, we provide
detailed analysis and methodologies to reach the lower bounds
of both off-chip communication and on-chip communication.
Specifically, we make the following contributions in this paper.
• We solve a fundamental problem in CNN accelerators:
what the lower bound of the off-chip communication of
a convolutional layer is, if it is implemented on a CNN
accelerator with a limited on-chip memory. We provide
a mathematical derivation for this problem.
• We demonstrate that convolutions have only one more
level of data reuse (sliding window reuse) than matrix
multiplications (MMs). Based on this conclusion, we
elaborate a dataflow which fuses sliding window reuse
and a communication-optimal MM implementation, to
minimize the off-chip communication.
• We propose a workload and storage mapping scheme such
that both GBuf communication and Reg communication
respectively reach their lower bounds.
• A communication-optimal CNN accelerator architecture
is proposed, which not only reaches the minimum
communication, but also can adapt to different convolutional
layer dimensions with high resource utilization.
arXiv:1911.05662v3 [cs.DC] 17 Jan 2020
[Figure: inputs (B images) with a WK×HK sliding window, weights (CO kernels), and outputs (B images) with WO×HO output channels.]
Fig. 1: Convolutional layer in CNNs.
for (i = 0; i < B; i++)           //Images in a batch
 for (oz = 0; oz < Co; oz++)      //Output channels
  for (oy = 0; oy < Ho; oy++)     //Output rows
   for (ox = 0; ox < Wo; ox++)    //Output columns
    for (kz = 0; kz < Ci; kz++)   //Input channels
     for (ky = 0; ky < Hk; ky++)  //Kernel rows
      for (kx = 0; kx < Wk; kx++) //Kernel columns
       out[i][oz][oy][ox] +=
        in[i][kz][oy+ky][ox+kx] * w[oz][kz][ky][kx];
Fig. 2: Pseudo code of a convolutional layer.
The significance of this work is not purely on the proposed
dataflow and/or architecture, but more importantly, from the
point of view of a theoretical basis, to reveal the design
methodology and principle to minimize the communication
for CNN accelerators.
II. BACKGROUND
A. Convolutional Layers
Fig. 1 illustrates a general convolutional layer in CNNs. We
have B input images in a batch and CO kernels of weights,
producing B output images (only 1 input image and 1 output
image are shown in Fig. 1). Each input image has CI channels
and each output image has CO channels. The output channel
dimension is HO×WO. Each kernel is a CI×HK×WK 3D
matrix. Each output is computed by an inner product between
the inputs in a sliding window on the input image and the
weights in a kernel. The stride size is the position difference
between two adjacent sliding windows. Fig. 2 lists the pseudo
code of a convolutional layer. It contains 7 levels of loops and
assumes the unit stride size.
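The loop nest of Fig. 2 can be sketched as a small runnable reference, assuming unit stride, no padding, and plain nested lists. The function name `conv_layer` and the toy data are illustrative only and do not come from the paper.

```python
def conv_layer(inp, w):
    """Direct convolution with unit stride and no padding (the 7 loops of Fig. 2).

    inp is indexed [B][Ci][Hi][Wi]; w is indexed [Co][Ci][Hk][Wk].
    Returns nested lists indexed [B][Co][Ho][Wo].
    """
    B, Ci = len(inp), len(inp[0])
    Hi, Wi = len(inp[0][0]), len(inp[0][0][0])
    Co, Hk, Wk = len(w), len(w[0][0]), len(w[0][0][0])
    Ho, Wo = Hi - Hk + 1, Wi - Wk + 1  # output size for unit stride, no padding
    out = [[[[0] * Wo for _ in range(Ho)] for _ in range(Co)] for _ in range(B)]
    for i in range(B):                          # images in a batch
        for oz in range(Co):                    # output channels
            for oy in range(Ho):                # output rows
                for ox in range(Wo):            # output columns
                    for kz in range(Ci):        # input channels
                        for ky in range(Hk):    # kernel rows
                            for kx in range(Wk):  # kernel columns
                                out[i][oz][oy][ox] += (
                                    inp[i][kz][oy + ky][ox + kx] * w[oz][kz][ky][kx])
    return out


# One 3x3 single-channel image and one 2x2 kernel of ones (toy data)
image = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]
kernel = [[[[1, 1], [1, 1]]]]
result = conv_layer(image, kernel)  # [[[[12, 16], [24, 28]]]]
```

Every multiply-accumulate here reads two operands, which is the access pattern that the rest of the analysis tries to reduce through data reuse.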
From a quick glance at Fig. 2, finding a dataflow with
minimized communication is challenging, due to the huge
search space caused by different loop orders, loop stride sizes,
loop unrolling schemes, etc. There are several data reuse
patterns in convolutions, including input reuse (InR, an input
is used by multiple kernels), sliding window reuse (WndR,
an input is used by multiple overlapped sliding windows),
weight reuse (WtR, a weight is used by multiple inputs), and
output reuse (OutR, an output resides on chip during the entire
computational process). Multiple data reuse patterns can be
combined to form more complicated dataflows. Maximizing
data reuse also involves a huge search space.
In this work, we only consider the ordinary convolution
algorithm, which is the most popular approach adopted by
hardware accelerators. Those convolution algorithms with
lower computational complexity, such as the Winograd algo-
rithm [33] and fast Fourier transform based approaches [34],
are not considered. We aim to minimize the communication
of general convolution operations, so that our approach
can be adopted in both inference and training of CNNs.
B. Related Work
A number of CNN accelerators are designed with an elabo-
rate dataflow to optimize some objective(s) (e.g., performance,
bandwidth, etc.) [5], [7]–[18]. Unfortunately, their dataflows
are designed almost entirely based on intuitive/heuristic analysis. In
other words, they claimed the superiorities of the dataflows
and/or accelerators but failed to explain why the designs
are essentially the best. Such designs may not guarantee the
optimality. A representative state-of-the-art design is Eyeriss [7], [10]
which claimed that the communication is optimized. We will
show by experiments that neither its off-chip communication
nor its on-chip communication is minimized.
Rather than using a single dataflow, several studies have in-
tegrated multiple dataflows into an accelerator (with increased
hardware cost) and selected the best one according to the layer
dimensions [19]–[23]. These approaches usually perform bet-
ter than the approaches based on a single dataflow. However,
the claimed optimality is only the best one among the given
candidates. If the de facto best solution is not included in the
candidates, they cannot find the optimal solution.
To find the optimal dataflow with a particular objective, a
possible approach is to exhaustively consider all possible loop
orders and tiling sizes (i.e., the stride sizes for the loops). This
is the DSE approach [24]–[32]. However, the search space is so
huge that an exhaustive search is extremely time-consuming.
For instance, only considering two loops to minimize the off-
chip communication of a particular layer leads to an enormous
search space of 7.2×10^13 [29]. Heuristics have to be adopted
to find sub-optimal solutions. Exhaustive methods lack uni-
versality since they cannot tell people why the found dataflow
is essentially the best. In this sense, for a new convolutional
layer, re-conducting an exhaustive search is usually needed,
as we do not know whether a known dataflow is still the best
for the new layer.
In the aforementioned approaches, some studies ([7], [11],
[19], [26]–[29]) have considered communication optimiza-
tion, while the others mainly focus on the computational
components. Besides the three categories, there are other
communication optimization approaches for CNN accelerators
(e.g., the fused-layer approach [35] that optimizes data move-
ment between convolutional layers). Currently, no study has
comprehensively analyzed the lower bound of communication
in CNN accelerators.
Ref. [36] analyzed the on-chip memory requirement such
that both inputs and weights are read from the off-chip DRAM
exactly once. This is the minimum possible off-chip communi-
cation. However, the required on-chip memory to achieve this
goal is quite large (from several million bytes to hundreds of
million bytes). On the other hand, the hardware resources are
fixed but applications’ requirements vary, so it is impossible
to guarantee the goal all the time. In practice, searching for
the minimum communication under given hardware resources
has much more significance.
C. Preliminary: Red-blue Pebble Game
Our derivation for the lower bound of the off-chip commu-
nication heavily depends on the red-blue pebble game [37],
which is a theoretical model to estimate the minimum volume
of data transmission between two levels of memories. The
derived lower bound is the best possible, in the sense that it
is achievable by certain algorithm implementations. Here we
review an important theorem of the red-blue pebble game as
a preliminary of our derivation.
Suppose that the memory hierarchy comprises an unlimited
slow memory and a limited fast memory. When optimizing the
off-chip communication, they refer to the off-chip DRAM and
the on-chip memory (e.g., SRAMs or Regs), respectively. The
fast memory can hold only S data entries. An algorithm is
described by a directed acyclic graph (DAG), in which each
node represents a data entry or an operation (producing a data
entry as the output of the operation) and each edge represents
an inter-data dependency. We skip the definition of the original
red-blue pebble game here because it is usually difficult to
use [38]. Instead, the red-blue pebble game can equivalently
be converted to an easier S-partition problem [37], which is
defined as follows.
Let G(V,E) be a DAG describing an algorithm, where V
and E are the node and edge sets, respectively. A partition
on G is called an S-partition, if the following four properties
hold.
• Property 1: V is partitioned into h subsets V1,V2,· · · ,Vh,
such that Vi’s are disjoint and their union is V .
• Property 2: there is no cyclic dependency among Vi’s.
• Property 3: for any Vi (1≤ i≤h), there exists a dominator set
Di (nodes in Di are not necessarily in Vi) such that |Di|≤S.
A dominator set Di for Vi is a set of nodes in V such that
any path from an input of G to a node in Vi contains some
nodes in Di.
• Property 4: for any Vi (1≤ i≤h), the output set size |Oi|≤S.
The output set Oi of Vi is a set of nodes in Vi that do not
have any immediate successors in Vi.
Let P (S) be the minimum number of subsets that any
S-partition of a DAG must have. The following theorem
describes the communication lower bound based on the S-
partition model (the proof is provided in [37]).
Theorem 1. Given a fast memory of size S, to finish a DAG
that describes an algorithm, the minimum communication
volume Q between the fast memory and the slow memory
satisfies
Q ≥ S · (P (2S)− 1) . (1)
[Figure: the unfolded input matrix, of size (B·WO·HO)×(WK·HK·CI), multiplied by the weight matrix, of size (WK·HK·CI)×CO, gives the output matrix, of size (B·WO·HO)×CO.]
Fig. 3: Convolution-to-MM conversion.
III. LAYER-WISE LOWER BOUND OF OFF-CHIP
COMMUNICATION
We now derive the layer-wise lower bound of the off-
chip communication, based on the S-partition model [37] and
the relation between convolutions and MMs. Typically there
are at least three levels in the memory hierarchy of a CNN
accelerator: an off-chip DRAM, an on-chip GBuf, and Regs.
The red-blue pebble game is still applicable. We define an
effective on-chip memory as the maximum on-chip memory
that does not contain duplicated data. For example, if the GBuf
stores inputs and weights while some Regs store Psums (other
Regs store inputs and weights that are copied from the GBuf),
the effective on-chip memory refers to the GBuf (storing
inputs and weights) plus those Regs which store Psums. A
specific implementation may be sub-optimal, since the red-blue
pebble game assumes a homogeneous on-chip memory
without any specific splitting.
A. Relation Between Convolutions and MM
Fig. 3 shows how to convert a convolutional layer into
an MM. We first consider a simple case with batch size 1.
The input image is unfolded to the input matrix, each row
of which contains the inputs in a sliding window. Different
rows in the unfolded input matrix correspond to different
sliding windows, which also correspond to different locations
on the output channels. All kernels are reshaped into a weight
matrix, each column of which contains the weights of a kernel.
The output image is reshaped into an output matrix, each
column of which contains the outputs of an output channel.
Reshaping means reorganizing the elements without adding or
removing elements. If the batch size is B, we just stack up B
unfolded input matrices and B output matrices, respectively,
while the weight matrix remains unchanged. The stacked input
and output matrices are still called unfolded input matrix and
output matrix, respectively.
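The conversion described above can be sketched in a few lines, assuming unit stride, no padding, and nested lists. The helper names `unfold`, `reshape_weights`, and `matmul` are hypothetical and only illustrate the logical equivalence.

```python
def unfold(inp, Hk, Wk):
    """Unfold [B][Ci][Hi][Wi] inputs into a matrix with one row per sliding window."""
    B, Ci = len(inp), len(inp[0])
    Ho = len(inp[0][0]) - Hk + 1
    Wo = len(inp[0][0][0]) - Wk + 1
    rows = []
    for i in range(B):
        for oy in range(Ho):
            for ox in range(Wo):
                rows.append([inp[i][kz][oy + ky][ox + kx]
                             for kz in range(Ci)
                             for ky in range(Hk)
                             for kx in range(Wk)])
    return rows  # (B*Ho*Wo) x (Ci*Hk*Wk)


def reshape_weights(w):
    """Reshape [Co][Ci][Hk][Wk] kernels into a (Ci*Hk*Wk) x Co matrix, one kernel per column."""
    cols = [[v for ch in kernel for row in ch for v in row] for kernel in w]
    return [list(r) for r in zip(*cols)]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


inp = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]      # B=1, Ci=1, 3x3 image (toy data)
w = [[[[1, 0], [0, 1]]], [[[1, 1], [1, 1]]]]      # Co=2, Ci=1, 2x2 kernels (toy data)
A = unfold(inp, 2, 2)          # 4 sliding windows, 4 inputs each
Wm = reshape_weights(w)        # 4 x 2 weight matrix
C = matmul(A, Wm)              # each column is one output channel
```

In this toy case the input value 5 appears in all four rows of `A`, which is exactly the duplication that the unfolded matrix introduces.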
The convolution-to-MM conversion is only logically equivalent,
not algorithmically equivalent. The difference is that, in a
convolutional layer, inputs in overlapped sliding windows can
be reused. This level of data reuse does not exist in MMs. This
is why the input matrix is “unfolded” instead of “reshaped”. In
the conversion, the input images are unfolded by expanding all
sliding windows, i.e., the common inputs in overlapped sliding
windows have multiple explicit copies in the unfolded input
matrix. We define R to denote the reuse number of each input
by WndR, whose maximum value is
$$R = \frac{W_K \times H_K}{D^2} \tag{2}$$
where D is the stride size. We will show that the derived lower
bound of the off-chip communication relies on R.
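As a quick sanity check of Eq. (2), the sketch below counts how many sliding windows cover each input position of one channel. The function name `window_reuse_counts` and the example sizes are assumptions made for illustration.

```python
from collections import Counter


def window_reuse_counts(Hi, Wi, Hk, Wk, D=1):
    """Count how many sliding windows cover each input position of one channel."""
    counts = Counter()
    for oy in range(0, Hi - Hk + 1, D):      # top-left corner of each window
        for ox in range(0, Wi - Wk + 1, D):
            for ky in range(Hk):
                for kx in range(Wk):
                    counts[(oy + ky, ox + kx)] += 1
    return counts


counts = window_reuse_counts(Hi=8, Wi=8, Hk=3, Wk=3, D=1)
max_reuse = max(counts.values())   # 9, i.e. Wk*Hk / D**2 for interior inputs
corner_reuse = counts[(0, 0)]      # 1, border inputs are covered by fewer windows
```

Only interior inputs reach the maximum, which is why R in Eq. (2) is stated as a maximum value rather than an exact reuse count.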
One may argue that there are other data reuse patterns
in a convolutional layer (e.g., InR, WtR, etc.). These data
reuse patterns are actually included in the converted MM. For
example, each column of the weight matrix can be shared
by multiple rows in the unfolded input matrix, which is
WtR, and each row of the input matrix can be reused by
multiple columns in the weight matrix, which is InR. From the
conversion process, it is clear that the computational process of
a convolutional layer is not changed except for WndR, because
only the input matrix is unfolded, whereas the other matrices
are just reshaped. This implies that, although a convolutional
layer involves 7 levels of loops, it only has one more level
of data reuse than MMs. In order to take into account WndR
in the converted MM, we have defined R to denote the reuse
number of each input by WndR. If R is 1 (i.e., no WndR), a
convolutional layer is exactly equivalent to an MM. Since a
fully-connected (FC) layer is also equivalent to an MM, our
conclusion with R=1 can be applied to FC layers. Note that
the convolution-to-MM conversion is only a logical operation
used for our derivation. It is not a real operation in our dataflow
or architecture.
B. Theoretical Derivation
Here we provide the theoretical derivation for the layer-wise
lower bound of the off-chip communication. We consider a
general case in which the on-chip memory cannot hold all
inputs or all weights of a convolutional layer. Otherwise, it is
just the ideal case (both the inputs and the weights are read
exactly once).
Lemma 1. If a convolutional layer is represented by a DAG,
the number of internal and output nodes in the DAG is
2BWOHOCOWKHKCI .
Proof. Fig. 4 illustrates a DAG that describes a convolutional
layer. It has three levels. The first level comprises all input
nodes, including inputs and weights. The second level is com-
posed of all multiplication nodes. The last level is composed
of all add nodes. The multiplication and add nodes associated
with the same output form an add tree (multiplication nodes
are also included in add trees). The detailed connections
between the input nodes and the multiplication nodes are not
shown because we are not interested in them.
There are BWOHOCO outputs in total. Each output is the
inner product of two vectors, both of which are of length
WKHKCI. Hence, there are WKHKCI multiplication nodes
and WKHKCI add nodes in an add tree. Since no internal
node can be shared by different add trees, the number of
internal and output nodes is 2BWOHOCOWKHKCI.
[Figure: input nodes (input activations ai and weights wj), multiplication nodes producing terms aiwj, and add nodes; the multiplication and add nodes of one output form an add tree between an input image and an output image.]
Fig. 4: DAG describing a convolutional layer.
In Fig. 4, the inputs are marked as a1, a2, · · · , and the
weights are marked as w1, w2, · · · . Each multiplication node
produces a term aiwj . Note that the sum of multiple terms
(e.g., a1w1+a2w2) is not called a term. Instead, a term belongs
to a sum (i.e., an add tree). We have the following lemma.
Lemma 2. Let T(S) be the maximum number of terms that
can be produced in no more than S add trees by using no more
than S on-chip memory units. For a convolutional layer with
each input reused R times by WndR, $T(S) = O(S\sqrt{RS})$.
Proof. The proof is based on the relation between convolu-
tions and MMs. We use A, B and C to denote the unfolded
input matrix, the weight matrix, and the output matrix, re-
spectively. Then the MM is represented as AB = C. The
produced terms using no more than S on-chip memory units
can be arbitrarily distributed in C. Note that an element in C
is the sum of multiple terms belonging to the said sum, so the
produced terms may overlap in C.
We first demonstrate that in order to maximize the produced
terms, the produced terms must form a single block or be able
to form a single block. This phenomenon can be explained
intuitively. We consider any two elements (i.e., two sums) in
C, each of which is a product of a row vector in A and
a column vector in B. Obviously, if we move one element
such that the two elements are overlapped in C (then the
input vectors are also overlapped), the number of produced
terms remains unchanged but the number of required on-chip
memory units is minimized. In what follows, we provide a
mathematical proof for this statement.
Without loss of generality, we consider any two non-overlapped
rectangular sum blocks C1 and C2 in C. The size
of Ci is ui×zi (the minimum size is 1×1). Block Ci is the
product of two corresponding blocks in A and B, respectively,
say Ai and Bi, which are of sizes ui×ki and ki×zi, as shown
in Fig. 5. Then each element in Ci is the sum of ki terms. Note
that A1 and A2 can overlap, or B1 and B2 can overlap, but
they cannot overlap at the same time (otherwise C1 and C2
will overlap). According to the definition of T(S), we intend
to maximize
$$T(S) = u_1 k_1 z_1 + u_2 k_2 z_2. \tag{3}$$
[Figure: unfolded input matrix A of size (B·WO·HO)×(WK·HK·CI), weight matrix B of size (WK·HK·CI)×CO, and output matrix C; block Ci of size ui×zi is the product of block Ai (ui×ki) and block Bi (ki×zi).]
Fig. 5: Converted matrix multiplication.
If there is no overlap in any Ai or Bi blocks, since all inputs
and outputs should be in no more than S on-chip memory
units, we have the following constraint
$$\frac{u_1 k_1}{R} + \frac{u_2 k_2}{R} + z_1 k_1 + z_2 k_2 + u_1 z_1 + u_2 z_2 \le S \tag{4}$$
where the first two terms on the left-hand side are reduced
by a factor R because an element in A can be reused at
most R times by WndR. Equation (4) also implies that the
produced terms are in no more than S add trees. Based on the
generalized mean inequality [39], we have
$$\frac{u_i k_i}{R} + z_i k_i + u_i z_i \ge \frac{3}{\sqrt[3]{R}} (u_i k_i z_i)^{2/3} \tag{5}$$
where the equality holds iff $\frac{u_i k_i}{R} = z_i k_i = u_i z_i$. Combining (4)
and (5), we have
$$(u_1 k_1 z_1)^{2/3} + (u_2 k_2 z_2)^{2/3} \le \frac{S \sqrt[3]{R}}{3}. \tag{6}$$
Let $t_i \triangleq (u_i k_i z_i)^{2/3}$; then we have formulated a maximum value
problem to maximize $T(S) = t_1^{3/2} + t_2^{3/2}$ under the constraint
$t_1 + t_2 \le \frac{S \sqrt[3]{R}}{3}$. Since $T(S) = t_1^{3/2} + t_2^{3/2}$ is continuous and strictly
convex, its upper bound is reached on the boundary of the
variables’ value range, i.e., when there is one $t_i = \frac{S \sqrt[3]{R}}{3}$ and
the other $t_i$ is 0. Accordingly, the upper bound of T(S) is
$\frac{S\sqrt{RS}}{3\sqrt{3}} = O(S\sqrt{RS})$. The upper bound can be reached iff there
is only one i such that $z_i = \frac{\sqrt{S}}{\sqrt{3R}}$ and $u_i = k_i = \frac{\sqrt{SR}}{\sqrt{3}}$, implying
that there is only a single block in C.
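If there is overlap between A1 and A2 or between B1 and
B2, without loss of generality, we assume that A1 and A2 are
overlapped (then B1 and B2 cannot overlap). The constraint
is as follows
$$\begin{aligned}
\frac{u_1 k_1}{R} + z_1 k_1 + z_2 k_2 + u_1 z_1 + u_2 z_2 &\le S, \\
\frac{u_2 k_2}{R} + z_1 k_1 + z_2 k_2 + u_1 z_1 + u_2 z_2 &\le S.
\end{aligned} \tag{7}$$
This is equivalent to
$$\begin{aligned}
u_1 k_1 &\le R \left(S - z_1 k_1 - z_2 k_2 - u_1 z_1 - u_2 z_2\right), \\
u_2 k_2 &\le R \left(S - z_1 k_1 - z_2 k_2 - u_1 z_1 - u_2 z_2\right).
\end{aligned} \tag{8}$$
Then
$$\begin{aligned}
T(S) &= \sqrt{u_1 k_1}\sqrt{u_1 k_1}\, z_1 + \sqrt{u_2 k_2}\sqrt{u_2 k_2}\, z_2 \\
&\le \sqrt{R \left(S - z_1 k_1 - z_2 k_2 - u_1 z_1 - u_2 z_2\right)} \left(\sqrt{u_1 k_1}\, z_1 + \sqrt{u_2 k_2}\, z_2\right).
\end{aligned} \tag{9}$$
Since $z_i k_i + u_i z_i \ge 2\sqrt{k_i u_i}\, z_i$, we have
$$T(S) \le \sqrt{R \left(S - 2\sqrt{k_1 u_1}\, z_1 - 2\sqrt{k_2 u_2}\, z_2\right)} \left(\sqrt{k_1 u_1}\, z_1 + \sqrt{k_2 u_2}\, z_2\right) \tag{10}$$
where the equality holds iff ki = ui. Let $t \triangleq \sqrt{k_1 u_1}\, z_1 + \sqrt{k_2 u_2}\, z_2$.
Based on the generalized mean inequality, we get
$$\begin{aligned}
T(S) &\le \sqrt{R(S - 2t)}\, t = \sqrt{R(S - 2t)}\sqrt{t}\sqrt{t} \\
&\le \sqrt{R} \left(\frac{S - 2t + t + t}{3}\right)^{3/2} = \frac{S\sqrt{RS}}{3\sqrt{3}} = O(S\sqrt{RS})
\end{aligned} \tag{11}$$
where the last equality holds iff $t = \frac{S}{3}$. When tracing back the
derivation process, the upper bound of T(S) can be reached
iff $k_1 = k_2 = u_1 = u_2 = \frac{\sqrt{SR}}{\sqrt{3}}$ and $z_1 + z_2 = \frac{\sqrt{S}}{\sqrt{3R}}$, where the
latter condition implies that C1 and C2 are able to be merged
into a single block, and the resulting case is identical to the
case without overlap.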
We have proved that for any two blocks in C, the number of
produced terms is maximized only when they form, or are able
to form, a single block. When extending this
conclusion to the general case with multiple blocks in C, they
should also form a single block or be able to form a single
block (say, C1) with u1 = Rz1. If so, the upper bound of
T(S), $\frac{S\sqrt{RS}}{3\sqrt{3}} = O(S\sqrt{RS})$, can be reached.
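The single-block optimum in the proof can be checked numerically for small cases. The sketch below is an illustrative brute-force search, not part of the paper: it assumes an integer R and one block, and it maximizes u·k·z under the single-block form of constraint (4). The names `best_single_block` and `term_upper_bound` are hypothetical.

```python
import math


def best_single_block(S, R):
    """Exhaustively search integer block sizes (u, k, z) that maximize the number
    of terms u*k*z subject to u*k/R + z*k + u*z <= S (one block, integer R)."""
    best_terms, best_dims = 0, None
    for z in range(1, S + 1):
        for u in range(1, S // z + 1):
            # Constraint multiplied by R: k * (u + R*z) <= R * (S - u*z)
            k = (R * (S - u * z)) // (u + R * z)   # largest feasible k for (u, z)
            if k >= 1 and u * k * z > best_terms:
                best_terms, best_dims = u * k * z, (u, k, z)
    return best_terms, best_dims


def term_upper_bound(S, R):
    """Closed-form bound S*sqrt(R*S) / (3*sqrt(3)) from the proof of Lemma 2."""
    return S * math.sqrt(R * S) / (3 * math.sqrt(3))


terms, dims = best_single_block(S=300, R=4)   # (2000, (20, 20, 5))
bound = term_upper_bound(S=300, R=4)          # about 2000.0
```

For S = 300 and R = 4 the search lands on u = k = 20 and z = 5, which matches u = k = Rz and meets the closed-form bound; for other values of S the integer optimum generally falls slightly below the bound.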
Lemma 3. Let {V1, V2,· · ·, Vh} be an S-partition of the DAG
associated with a convolutional layer. Each Vi (1≤ i≤h) can
have at most 2T (S)+S internal and output nodes.
Proof. By Property 4 of the S-partition model, the output set
of Vi has at most S nodes. This implies that Vi can have nodes
in at most S add trees. To bound the internal and output nodes
that Vi can have, we only need to consider S add trees. By
Property 3 of the S-partition model, there is a dominator set
Di for Vi that has no more than S nodes. By the definition
of T (S), from Di at most T (S) terms can be formed in S
add trees. T (S) terms can form at most T (S) add nodes in
Vi. Considering that nodes in Di (|Di| ≤ S) can possibly be
internal or output nodes of Vi, Vi can have at most 2T (S)+S
internal and output nodes.
Based on Lemmas 1, 2 and 3, for a DAG that describes a
convolutional layer, the minimum number of subsets that any
S-partition must have is
$$P(S) = \Omega\left(\frac{B W_O H_O C_O W_K H_K C_I}{S\sqrt{RS}}\right). \tag{12}$$
According to Theorem 1, we get the following theorem, which
is also the key conclusion of this paper.
Theorem 2. The lower bound of the off-chip communication
of a convolutional layer is
$$Q_{DRAM} = \Omega\left(\frac{B W_O H_O C_O W_K H_K C_I}{\sqrt{RS}}\right). \tag{13}$$
The off-chip communication volume of a naive convolution
implementation (without any data reuse) is simply
2BWOHOCOWKHKCI. The lower bound reduces it by a
factor of $\sqrt{RS}$. If R is 1, then a convolutional layer is
exactly equivalent to an MM. In this case, the reduction factor
is $\sqrt{S}$, which is consistent with the communication-optimal
implementation of MMs [37].
It is worth mentioning that the derived lower bound is in
the form of Ω instead of a precise value. It represents the
asymptotic relation between the off-chip communication and
the on-chip memory capacity when the problem scale is large
enough. It is possible that some dataflows can bring less off-
chip communication in some special cases (e.g., cases of small
workloads).
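To give a feel for the scaling in Theorem 2, the sketch below evaluates the naive volume and the leading term of the bound for one layer. The layer dimensions and the capacity S are hypothetical values chosen for illustration, and the constant hidden by the Ω notation is deliberately left out, so the result is a scaling indicator rather than an exact minimum.

```python
import math


def naive_dram_volume(B, Wo, Ho, Co, Wk, Hk, Ci):
    """Off-chip accesses of a convolution without data reuse: two operands per MAC."""
    return 2 * B * Wo * Ho * Co * Wk * Hk * Ci


def lower_bound_scale(B, Wo, Ho, Co, Wk, Hk, Ci, S, D=1):
    """Leading term of Theorem 2, omitting the constant hidden by the Omega notation."""
    R = (Wk * Hk) / (D * D)                 # maximum sliding-window reuse, Eq. (2)
    macs = B * Wo * Ho * Co * Wk * Hk * Ci
    return macs / math.sqrt(R * S)


# Hypothetical layer and on-chip capacity, chosen only for illustration
layer = dict(B=1, Wo=56, Ho=56, Co=128, Wk=3, Hk=3, Ci=128)
S = 65536                                   # effective on-chip memory in data entries
naive = naive_dram_volume(**layer)
scale = lower_bound_scale(S=S, **layer)
reduction = naive / scale                   # 2 * sqrt(R * S) = 1536 for these values
```

The factor of 2 in `reduction` comes from the two operands per multiply-accumulate in the naive count; the part that grows with the hardware is the square root of R·S.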
IV. COMMUNICATION-OPTIMAL DATAFLOW
In this section, we elaborate our dataflow with minimized
off-chip communication based on the above derivation. The
on-chip communication is minimized based on a proposed
workload and storage mapping scheme.
A. Dataflow with Minimized Off-Chip Communication
The dataflow with minimized off-chip communication is
derived from the proof process of Lemma 2. More precisely,
in Fig. 5, the output matrix C is partitioned into equal-sized
blocks of size u×z. The block size should satisfy u≈Rz and
also meet the on-chip memory capacity. A block needs the
data in the two yellow bands in the unfolded input matrix A
and the weight matrix B. Actually, the communication-optimal
implementation of MM is also the blocked method described
in Fig. 5 [40]. When we map the blocked implementation back
to a convolutional layer, we get Fig. 6.
A block in C can be mapped to a z×y×x (u=xy) 3D sub-matrix
in the output images (e.g., the green block in Fig. 6). If
the output channel dimension is too small (i.e., WOHO<xy),
the said output sub-matrix may be from multiple (say, b)
images in a batch. In this case, u = bxy. To compute the
output sub-matrix, the inputs in the corresponding x′ × y′
(x′ = x+WK−1 and y′ = y+HK−1 if D = 1) locations
from all input channels of b images (i.e., the yellow block
in the input images) and z kernels associated with the partial
output channels (i.e., the kernels colored yellow) are needed,
as shown in Fig. 6. Due to the limited on-chip memory, it
might be impossible to load all required data at a time. Instead,
it is computed by a series of iterations. In each iteration, in
the yellow blocks, we load the inputs from a portion (say, k)
of the input channels and the corresponding weights to the
on-chip memory, shown by the red blocks in Fig. 6. Then
we can perform a partial update to the output sub-matrix. To
complete the output sub-matrix, we continuously load inputs
and weights in the yellow blocks and perform partial updates.
For an output sub-matrix, the needed inputs and weights are
read from the off-chip DRAM exactly once. Different output
sub-matrices in the output images are computed sequentially
in the same way. Fig. 7 lists the pseudo code of the dataflow.
[Figure: an x′×y′ input block and z kernels of size WK×HK produce a z×y×x output sub-matrix (u=xy) within the WO×HO output channels.]
Fig. 6: Dataflow to achieve the lower bound of off-chip
communication.
for (i = 0; i < B; i+=b)           //Batch tiling
 for (oz = 0; oz < Co; oz+=z)      //Output channel tiling
  for (oy = 0; oy < Ho; oy+=y)     //Output row tiling
   for (ox = 0; ox < Wo; ox+=x)    //Output column tiling
   {for (kz = 0; kz < Ci; kz+=k)   //Iterations
    {//Load y' = y+Hk-1 input rows and x' = x+Wk-1 input columns
     GBuf<=in[i:i+b-1][kz:kz+k-1][oy:oy+y+Hk-2][ox:ox+x+Wk-2];
     GBuf<=w[oz:oz+z-1][kz:kz+k-1][:][:];
     out[i:i+b-1][oz:oz+z-1][oy:oy+y-1][ox:ox+x-1] += in * w;}
    DRAM<=out[i:i+b-1][oz:oz+z-1][oy:oy+y-1][ox:ox+x-1];}
Fig. 7: Pseudo code of the proposed dataflow.
Any quadruple {b, z, y, x} (i.e., tiling sizes) defines an imple-
mentation of the dataflow. For a fixed quadruple {b, z, y, x},
k does not affect the off-chip communication. However, under
a given on-chip memory capacity, smaller k results in larger
output sub-matrices, and thus, fewer output sub-matrices. Hence,
k should be the smallest value, namely, 1.
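A runnable sketch of the tiled dataflow may help make the traffic accounting concrete. It is a simulation under stated assumptions rather than the authors' implementation: unit stride, no padding, nested lists, partial tiles clipped at the layer boundary, and DRAM traffic counted in data entries. The function name `tiled_conv` and the toy tensors are hypothetical.

```python
def tiled_conv(inp, w, b, z, y, x, k=1):
    """Simulate the tiled dataflow (unit stride, no padding) and count DRAM traffic.

    inp is indexed [B][Ci][Hi][Wi]; w is indexed [Co][Ci][Hk][Wk].
    Returns (out, dram_reads, dram_writes), counted in data entries.
    """
    B, Ci = len(inp), len(inp[0])
    Co, Hk, Wk = len(w), len(w[0][0]), len(w[0][0][0])
    Ho, Wo = len(inp[0][0]) - Hk + 1, len(inp[0][0][0]) - Wk + 1
    out = [[[[0] * Wo for _ in range(Ho)] for _ in range(Co)] for _ in range(B)]
    reads = writes = 0
    for i0 in range(0, B, b):                    # batch tiling
        for z0 in range(0, Co, z):               # output channel tiling
            for y0 in range(0, Ho, y):           # output row tiling
                for x0 in range(0, Wo, x):       # output column tiling
                    i1, z1 = min(i0 + b, B), min(z0 + z, Co)
                    y1, x1 = min(y0 + y, Ho), min(x0 + x, Wo)
                    for c0 in range(0, Ci, k):   # iterations over input channels
                        c1 = min(c0 + k, Ci)
                        # Load the input patch (x' by y' locations) and its weights
                        reads += (i1 - i0) * (c1 - c0) * (y1 - y0 + Hk - 1) * (x1 - x0 + Wk - 1)
                        reads += (z1 - z0) * (c1 - c0) * Hk * Wk
                        for i in range(i0, i1):
                            for oz in range(z0, z1):
                                for oy in range(y0, y1):
                                    for ox in range(x0, x1):
                                        for kz in range(c0, c1):
                                            for ky in range(Hk):
                                                for kx in range(Wk):
                                                    out[i][oz][oy][ox] += (
                                                        inp[i][kz][oy + ky][ox + kx]
                                                        * w[oz][kz][ky][kx])
                    # Psums stay on chip and are written back once per block
                    writes += (i1 - i0) * (z1 - z0) * (y1 - y0) * (x1 - x0)
    return out, reads, writes


# Toy layer: B=1, Ci=2, 6x6 inputs, Co=4, 3x3 kernels, so Ho=Wo=4
inp = [[[[(r * 6 + c + ch) % 5 for c in range(6)] for r in range(6)] for ch in range(2)]]
w = [[[[(oz + ch + ky * 3 + kx) % 3 for kx in range(3)] for ky in range(3)]
      for ch in range(2)] for oz in range(4)]
out_tiled, reads_k1, writes_k1 = tiled_conv(inp, w, b=1, z=2, y=2, x=2, k=1)
_, reads_k2, _ = tiled_conv(inp, w, b=1, z=2, y=2, x=2, k=2)
out_full, reads_full, _ = tiled_conv(inp, w, b=1, z=4, y=4, x=4, k=2)
```

With the same {b, z, y, x}, `reads_k1` and `reads_k2` are equal, which illustrates the statement that k does not affect the off-chip communication; the untiled call reads every input and weight exactly once.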
To explain why this dataflow is superior, we notice that it
fully exploits OutR, since Psums reside on chip during the
computational process and are written back to the off-chip
DRAM only once after the computation is finished. WndR (an
input is reused by at most R sliding windows on each x′×y′
plane) is also fully exploited. More importantly, it also takes
into account InR (an input is reused by weights in z kernels)
and WtR (a weight is reused by b×x×y inputs) at the same time.
However, neither InR nor WtR is fully utilized (for example,
the loaded inputs are only reused by the loaded weights but
not by all kernel weights). This implies that maximizing either
InR or WtR is never the optimal solution. In fact, our approach
utilizes InR and WtR in a balanced way, generating equal
loading volumes of inputs and weights. To sum it up, our
dataflow fully exploits OutR and WndR and also combines
InR and WtR in a balanced way.
We now verify that the proposed dataflow is able to achieve
the lower bound of the off-chip communication. There are
(BWOHOCO)/(bxyz) blocks in total in the output images.
For each block, WKHKCIz weights and bx′y′CI inputs are
needed. The DRAM read volume is
QRead =
BWOHOCO
bxyz
(WKHKCIz + bx
′y′CI)
≈ BWOHOCO
bxyz
(
WKHKCIz +
WKHKbxyCI
R
) (14)
where the approximation holds if R≈ WKHKD2 , x′≈Dx, and
y′ ≈Dy (when x1 and y1, these approximations hold).
The DRAM write volume is BWOHOCO, which does not
depend on {b, z, y, x}. If bxy=u≈Rz, then
QDRAM≈ 2BWOHOCOWKHKCI√
Ruz
+BWOHOCO. (15)
If uz≈S (for minimizing the read volume) and WKHKCI√
RS
1
(for ignoring the write volume), (15) satisfies Theorem 2. This
implies that, to reach the minimum off-chip communication,
most of the effective on-chip memory should be assigned
to Psums (since uz ≈ S). The fundamental principle behind
this conclusion is to use the least inputs to produce the most
outputs, implying that data reuse is maximized. In addition, for
layers with few weights, the lower bound of (13) may not be
tight, since $W_K H_K C_I / \sqrt{RS} \gg 1$ does not hold and the write volume
BWOHOCO cannot be ignored.
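As a rough numerical illustration of (14) and (15), the sketch below evaluates the approximate read volume for a balanced tiling (bxy≈Rz) and for a skewed tiling with the same on-chip budget uz≈S. The function names, the layer dimensions, and the skewed tile sizes are hypothetical values chosen for illustration; volumes are counted in entries and the second (approximate) form of (14) is used.

```python
import math

def read_volume(outputs, WK, HK, CI, u, z, R):
    """Approximate DRAM read volume of (14), in entries.

    outputs = B*WO*HO*CO and u = b*x*y; x'y' is approximated by
    WK*HK*x*y/R, as in the second form of (14).
    """
    blocks = outputs / (u * z)
    weights_per_block = WK * HK * CI * z
    inputs_per_block = WK * HK * u * CI / R
    return blocks * (weights_per_block + inputs_per_block)

def dram_volume_balanced(outputs, WK, HK, CI, u, z, R):
    """Total DRAM volume of (15); meaningful only when u = R*z."""
    return 2 * outputs * WK * HK * CI / math.sqrt(R * u * z) + outputs

# Hypothetical layer and memory budget (illustration only).
outputs = 3 * 56 * 56 * 256      # B*WO*HO*CO
WK = HK = 3
CI = 256
R = 9                            # WK*HK/D^2 with D = 1
S = 32768                        # Psum entries kept on chip

# Balanced tiling: u*z = S and u = R*z.
z_bal = math.sqrt(S / R)
u_bal = R * z_bal
balanced = read_volume(outputs, WK, HK, CI, u_bal, z_bal, R)

# Same budget u*z = S, but with u much larger than R*z.
skewed = read_volume(outputs, WK, HK, CI, u=4096, z=8, R=R)

print(f"balanced read volume: {balanced:.3e} entries")
print(f"skewed read volume:   {skewed:.3e} entries")
```

Under these assumptions the balanced tiling reproduces (15) exactly, and the skewed tiling reads more data for the same on-chip capacity, which is the behaviour the analysis above predicts.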
In fact, a few prior studies have discussed similar dataflows
[7], [19], [41]. However, they did not identify the advantages
of this dataflow, because their analyses were intuitive and
lacked a theoretical basis. Ref. [7] evaluated several OutR
dataflows, but in its implementations ∼50% of the energy was
consumed by inter-PE communication, which is actually
unnecessary. Ref. [19] considered an OutR dataflow
but the tiling sizes were not properly selected. The convolution
implementation proposed in [41] is for graphics processing
units rather than for hardware accelerators. Ref. [42] proposed
a dataflow for CNN accelerators which explicitly converts any
convolutional layer into an MM without exploiting WndR.
B. Workload and Storage Mapping with Minimized On-Chip
Communication
Here we focus on the computation of an iteration (i.e., the
red line in Fig. 7). The required inputs and weights for an
iteration have been loaded to the GBuf. The workload of an
iteration is mapped to a PE array that consists of p×q PEs.
We will introduce a workload and storage mapping scheme to
minimize both GBuf communication and Reg communication.
1) Minimizing GBuf Communication: There is a major dif-
ference between optimizations of the off-chip communication
and the on-chip communication. When minimizing the off-
chip communication, since the problem scale can be arbitrary
but the hardware resources are fixed, tiling is necessary and
the workload is finished by a number of sequential iterations.
For an iteration, however, since the output sub-matrix size is
limited by the on-chip memory capacity, it is possible to design
the PE array size and the Reg capacity such that the hardware
resources can handle the workload of an iteration at a time.
This difference leads to a different lower bound: the loaded
inputs and weights (in the GBuf) can be read exactly once.
This is clearly the minimum possible GBuf communication.
Without loss of generality, a PE is the smallest computa-
tional unit that has a multiplication-accumulation (MAC) unit.
Fig. 8: Workload mapping in an iteration. [Panels: loaded inputs in GBuf (b×x′×y′ by k), loaded weights in GBuf (k×WK×HK by z), and the output sub-matrix (u=b×x×y by z), each reshaped into a sub-matrix; sliding window reuse on an xs′×ys′ plane; the zs×ys×xs output block for one PE.]
Fig. 9: Workload and storage mapping in an iteration. [Panels: GBuf (SRAMs) holding the reshaped input and weight sub-matrices; GRegs with WndR; bxy×z Regs (LRegs); PE1,1, PE1,2, and PE2,1 sharing inputs and weights; workload of one PE from the 1st pass to the WKHK-th pass.]
Like the dataflow to minimize the off-chip communication,
each PE computes xs×ys outputs in zs output channels, so
each PE contributes to a zs×ys×xs (≥(bxyz)/(pq)) block in
the output sub-matrix, as illustrated in Fig. 8. The produced
outputs by p×q PEs should cover the reshaped output sub-
matrix (bxy×z). Fig. 9 details the workload mapping for two
PEs (PEs 1,1 and 1,2) and the workload of one PE. For one PE,
x′s y′s k inputs and zs k WKHK weights are needed (remember
that k=1 in practice). However, we do not need to load them
at a time. To enable WndR on each x′s×y′s plane (see Fig. 8),
x′s×y′s inputs (for one PE) are loaded to the Regs. Since WndR
cannot be applied to weights, we just load zs weights (for one
PE) to the Regs. We call one update of all outputs in an
iteration a pass (the ith pass computes the ith Psums of all
outputs). In each pass, a PE uses xsys inputs and zs weights to
produce xsyszs Psums (see the workload of one PE shown in
Fig. 9). A pass needs xsyszs clock cycles. The loaded inputs
can be used for WKHK passes (because WndR is exploited
in the Regs). The loaded weights can be used just for 1 pass,
so a PE needs to load zs weights to the Regs in every pass.
To complete an iteration, p×q PEs need kWKHK passes.
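The per-PE workload described above can be sketched as follows. The split of the u=bxy output positions over the p PE rows and of the z channels over the q PE columns by ceiling division, as well as the function name and the example tile sizes, are assumptions made for this illustration.

```python
import math

def pe_workload(u, z, p, q, WK, HK, k=1):
    """Per-PE workload of one iteration on a p x q PE array.

    u = b*x*y output positions are split over the p PE rows and the
    z output channels over the q PE columns (ceiling division is an
    assumption of this sketch).
    """
    outputs_per_pe = math.ceil(u / p)        # xs * ys
    zs = math.ceil(z / q)                    # output channels per PE
    psums_per_pe = outputs_per_pe * zs       # LReg entries needed per PE
    passes = k * WK * HK                     # passes per iteration
    cycles_per_pass = psums_per_pe           # one MAC per cycle per PE
    return {
        "outputs_per_pe": outputs_per_pe,
        "zs": zs,
        "psums_per_pe": psums_per_pe,
        "passes": passes,
        "cycles_per_iteration": passes * cycles_per_pass,
        "weight_loads_per_pe": passes * zs,  # zs weights reloaded every pass
    }

# Hypothetical tile: u = 512 output positions, z = 64 channels, 16x16 PEs.
workload = pe_workload(u=512, z=64, p=16, q=16, WK=3, HK=3)
print(workload)
```

With these example numbers each PE holds 128 Psums, which happens to match the 128-entry LRegs of the design example in Section V.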
When considering the PE array, PEs in the same row share
the loaded inputs and PEs in the same column share the loaded
weights. As a result, each weight in the GBuf is read exactly
once, reaching the minimum communication. The average read
count of each input in the GBuf is (x′s y′s)/(xsys), which is
larger than 1. The extra reads are from the halos (i.e., the
inputs out of the xs×ys rectangle but in the x′s×y′s rectangle)
on each input channel. It is possible to avoid reading extra
halos by designing a complicated data transmission network,
as an input in a block’s halo is also an input of another block,
such that each input in the GBuf is also read exactly once.
We prefer reading extra halos as it simplifies the hardware
design and regularizes the read patterns. Ideally, the Regs for
storing inputs and weights can be global Regs (GRegs) instead
of PEs’ local Regs (LRegs) (for example, in Fig. 9, the x′s×y′s
GRegs are shared by the first PE row). In practice, to avoid
large fanouts and long latency of long wires, we partition the
PE array into groups and each group shares a set of GRegs,
with little extra Reg communication.
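The halo overhead can be made concrete with a short sketch. It assumes the usual relation between an output tile and its input footprint, x′s = D(xs−1)+WK (and likewise for y′s), which is consistent with x′≈Dx for large tiles but is not restated in this section; the function name and the tile sizes are illustrative only.

```python
def input_read_ratio(xs, ys, WK, HK, D=1):
    """Average GBuf read count per input, (xs' * ys') / (xs * ys).

    Assumes xs' = D*(xs - 1) + WK and ys' = D*(ys - 1) + HK, i.e. the
    input footprint of an xs x ys output tile (an assumption here).
    """
    xs_halo = D * (xs - 1) + WK
    ys_halo = D * (ys - 1) + HK
    return (xs_halo * ys_halo) / (xs * ys)

# A small tile pays a larger relative halo cost than a large one.
small_tile = input_read_ratio(xs=8, ys=4, WK=3, HK=3)
large_tile = input_read_ratio(xs=16, ys=16, WK=3, HK=3)
print(small_tile, large_tile)
```

For a 1×1 kernel with D=1 the ratio is exactly 1, so the extra reads come only from the halos.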
We choose to store Psums in PEs’ LRegs. An alternative
way is to store Psums in the GBuf, which reduces the Reg
capacity. However, a Psum then needs to be loaded to a Reg
each time it is updated and stored back to the GBuf afterwards,
resulting in a large amount of data shuffling between the GBuf
and Regs, and thus, high energy consumption. Hence, storing
Psums in the GBuf is not energy efficient. Keeping Psums
in Regs completely avoids GBuf access for Psums. Thus, the
communication between the GBuf and Regs is minimized.
By utilizing our workload and storage mapping, the GBuf
capacity can be reduced. Since weights are read row by row
from the reshaped weight sub-matrix and inputs are read
column by column from the reshaped input sub-matrix (see
Fig. 9), we do not need to load kWKHK × z weights and
bx′y′×k inputs to the GBuf at a time. Instead, we only need
one row of SRAMs for weights and one column of SRAMs
for inputs. Once data in the GBuf are loaded to the GRegs,
the GBuf is used for prefetching data for the subsequent pass.
2) Minimizing Reg Communication: Psums are stored in
PEs’ LRegs. Since each MAC operation needs a Reg write,
the minimum number of Reg writes is the number of MAC
operations, i.e.,
QReg = # of MACs. (16)
This is no doubt the minimum Reg communication. Keeping
Psums in LRegs naturally reaches this lower bound, which
minimizes the dynamic energy of LRegs. On the other hand,
the static energy of LRegs should also be optimized.
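To give a sense of scale for (16), the sketch below counts the MAC operations of the 13 convolutional layers of VGGNet-16 at batch size 3 (the workload used in Section VI). The layer list is the commonly published VGGNet-16 configuration (3×3 kernels, stride 1, output size equal to input size) and is an assumption here, since the dimensions are not listed in this paper; the names are illustrative.

```python
def conv_macs(B, WO, HO, CO, CI, WK, HK):
    """Number of MAC operations of one convolutional layer."""
    return B * WO * HO * CO * CI * WK * HK

# (CI, CO, output width = output height); assumed VGGNet-16 configuration.
VGG16_CONV = [
    (3, 64, 224), (64, 64, 224),
    (64, 128, 112), (128, 128, 112),
    (128, 256, 56), (256, 256, 56), (256, 256, 56),
    (256, 512, 28), (512, 512, 28), (512, 512, 28),
    (512, 512, 14), (512, 512, 14), (512, 512, 14),
]
BATCH = 3

per_layer = [conv_macs(BATCH, w, w, co, ci, 3, 3) for ci, co, w in VGG16_CONV]
total_macs = sum(per_layer)

# By (16), the minimum number of Reg writes equals the number of MACs.
print(f"Q_Reg lower bound: {total_macs} Reg writes")
```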
Suppose that each PE has r (≥ xsyszs) LRegs to store
Psums. For a PE, in each cycle, at most one Reg is written and
the other r−1 Regs just consume static energy. If r is large, the
static energy consumption of the Regs may dominate the total
Reg energy. Increasing the PE array size (i.e., pq) can reduce
r, with increased arithmetic component power. However, with
more PEs, the execution time is reduced, so the energy
of the arithmetic components remains almost unchanged. From
an energy point of view, using more PEs therefore lowers the
static energy consumption of the Regs, even though the
arithmetic power dissipation increases.
Using GRegs to share inputs and weights to the PE array
completely avoids inter-PE communication. Duplicating inputs
and weights from the GBuf to GRegs brings little extra Reg
communication. Thus, the Reg communication is minimized.
C. Summary
We summarize the communication lower bound here. The
theoretical lower bound of the off-chip communication is de-
fined in (13), while a more practical lower bound is described
in (15). The lower bound of the GBuf communication is the
off-chip communication of inputs and weights. The lower
bound of the Reg communication is defined in (16). There
are two key conditions to achieve the lower bound: bxy≈Rz
(for setting the tiling sizes) and bxyz≈S (most of the on-chip
memory capacity should be assigned to Psums).
The advantages of our dataflow and workload and storage
mapping scheme come from three aspects. First, our dataflow
and workload mapping scheme fully exploit OutR and WndR,
and also combine InR and WtR in a balanced way. Our
dataflow is actually a combination of a communication-optimal
MM implementation and WndR. The optimal dataflow and
workload mapping scheme help reduce both DRAM commu-
nication and GBuf communication. Second, the concurrency of
PEs is exploited to share inputs and weights by GRegs. Third,
Psums are stored in PEs’ LRegs. The last two points both
help reduce GBuf communication and Reg communication.
By combining these techniques, our approach can practically
reach the minimum communication in a three-level memory
hierarchy for convolution acceleration.
V. COMMUNICATION-OPTIMAL CNN ACCELERATOR
ARCHITECTURE
In this section, we propose a CNN accelerator architecture
with minimized communication, based on the theoretical con-
clusions of the previous section. According to the implication
of (15), most of the effective on-chip memory should be as-
signed to Psums to minimize the off-chip communication. We
use an example containing 64KB Psums and p×q=16×16 PEs
to describe the design methodology of our CNN accelerator.
We use 16-bit fixed-point arithmetic units, so there are 32K
(32768) entries for Psums and each PE has 128 entries.
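The entry counts quoted above follow from simple arithmetic, sketched below. The function names are illustrative, and an even distribution of Psum entries over the PE array is assumed.

```python
def psum_entries(psum_bytes, bits_per_entry=16):
    """Number of Psum entries that fit in psum_bytes of on-chip memory."""
    return psum_bytes * 8 // bits_per_entry

def entries_per_pe(total_entries, p, q):
    """LReg entries per PE when Psums are spread evenly over a p x q array."""
    return total_entries // (p * q)

total = psum_entries(64 * 1024)           # 64KB of Psums, 16-bit entries
print(total, entries_per_pe(total, 16, 16))
```

The same budget spread over a larger array needs fewer LRegs per PE, for example 64 entries per PE for 32×16 PEs.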
Based on the workload and storage mapping scheme illus-
trated in Fig. 9, we design our architecture as shown in Fig. 10.
The architecture mainly comprises a PE array, GRegs, two
GBufs (an input GBuf (IGBuf) and a weight GBuf (WGBuf)),
a controller, and some first-in first-out (FIFO) buffers that
connect the off-chip DRAM and the on-chip memories.
Fig. 10: Architecture of our CNN accelerator. [Components: DRAM, FIFO, Input GBuf, Weight GBuf, GRegs, PE groups (pg×qg PEs each), Controller.]
GBufs: According to the discussions of Section IV-B1, to
avoid long wires, the PE array is partitioned into PE groups
and each PE group (pg×qg PEs) shares a set of GRegs (see
Fig. 10). In our example, pg=qg=4. All GReg rows (columns)
store the same weights (inputs), and the same position in all
GReg rows (columns) is written at the same time.
We discuss how to determine the sizes of the GBufs.
Remember that most of the effective on-chip memory should
be assigned to Psums (i.e., S ≈ 32768) and the tiling sizes
{b, z, y, x} should satisfy bxy≈Rz to minimize the off-chip
communication. If R=1 (i.e., no WndR), bxy≈z≈181. This
is the approximate maximum value of z, so we set the size of
the WGBuf to 256 entries (0.5KB). With larger R, bxy also
becomes larger. Considering that the maximum R is typically
9 (WK =HK = 3 and D= 1, see (2)), the maximum bxy is
543. Since the IGBuf should store bx′y′ (slightly larger than
bxy) inputs from b y′×x′ input channel planes (see Fig. 8),
we set the size of the IGBuf to 1024 entries (2KB). We leave
some extra entries in the GBufs to adapt to various tiling sizes.
Even so, the GBuf capacity is still very small. Once data in
the GBufs are loaded to the GRegs, the GBufs are used for
prefetching inputs and weights for the subsequent pass. The
prefetching is (partially) overlapped with computation.
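The tile sizes quoted above follow from the two conditions bxy·z≈S and bxy≈Rz, which give z≈√(S/R) and bxy≈√(RS). A minimal sketch (the function name is an assumption for illustration):

```python
import math

def balanced_tiles(S, R):
    """Return (bxy, z) satisfying bxy * z = S and bxy = R * z."""
    z = math.sqrt(S / R)
    return R * z, z

S = 32768                                  # Psum entries in the example
u1, z1 = balanced_tiles(S, R=1)            # no WndR
u9, z9 = balanced_tiles(S, R=9)            # WK = HK = 3, D = 1
print(round(u1), round(z1), round(u9), round(z9))
```

Both the largest z (R=1) and the largest bxy (R=9) fit within the chosen 256-entry WGBuf and 1024-entry IGBuf, leaving the spare entries mentioned above.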
Inputs and weights stored in the GBufs are just in the order
as in the reshaped input and weight sub-matrices (see Fig. 8).
This is the natural order when loading them from the DRAM.
No special order is needed. Inputs are not unfolded so we can
exploit WndR on chip.
GRegs: A GReg row (storing weights) is shared by pg PE
rows so pg×q (4×16) PEs share a GReg row. Data stored
in each GReg row are copied from the WGBuf. To adapt to
different z values, we design a multiplexer (MUX) structure,
as shown in Fig. 11. There are q (16) (256/q)-to-1 (16-to-1) weight
MUXes connecting the WGBuf and the q (16) PE columns.
Slightly different from the workload mapping shown in Fig. 9,
here the zs channels computed by a PE are not consecutive but
have a stride of q (16). The inputs of the q (16) weight
MUXes are arranged in a round-robin way, so that the input
range exactly covers all entries of the WGBuf. To adapt to
Fig. 11: PE and GReg architectures (numbers are for our example). [Labels: 256-entry WGBuf; 16 16-to-1 weight MUXes with stride 16; GRegs (weights); 16 GReg segments (inputs) of 64 entries, each with a 64-to-1 input MUX; PE with 128-entry LRegs; outputs.]
different z and zs values, we just control the selection signals
of the weight MUXes. For instance, if z=64 (so zs=4), the
selection signals of the weight MUXes are from 0 to 3, so that
only the first 64 entries of the WGBuf can be selected. Such
a weight MUX structure avoids the use of a complicated data
transmission network (e.g., a network-on-chip).
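One possible reading of the round-robin arrangement is that the weight MUX of PE column c sees WGBuf entries c, c+q, c+2q, and so on, so that selection signal s picks entry s·q+c. The sketch below follows that reading; the index formula and the function names are assumptions made for illustration and are not taken from the paper's RTL. It also assumes z is a multiple of q.

```python
def wgbuf_entry(column, select, q=16):
    """WGBuf entry seen by PE column `column` for MUX selection `select`.

    Assumed round-robin wiring: column c is connected to entries
    c, c + q, c + 2q, ...
    """
    return select * q + column

def channels_for_column(column, z, q=16):
    """Output channels (WGBuf entries) handled by one PE column."""
    zs = z // q                       # channels per PE; select runs 0..zs-1
    return [wgbuf_entry(column, s, q) for s in range(zs)]

# Example from the text: z = 64 gives zs = 4 and selection signals 0..3.
first = channels_for_column(0, z=64)
last = channels_for_column(15, z=64)
covered = {e for c in range(16) for e in channels_for_column(c, z=64)}
print(first, last)
```

Under this wiring, selection signals 0 to 3 reach exactly the first 64 WGBuf entries, and the channels of each PE have a stride of 16, as described above.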
To exploit WndR in the GRegs, each GReg column (storing
inputs) is partitioned into p (16) segments. A GReg segment
has 64 entries and is shared by 1×qg (1×4) PEs. Each GReg
segment loads x′s y′s inputs (see Fig. 9) from the IGBuf. The
x′s y′s inputs can be used in WKHK passes to compute xsyszs
Psums. Each GReg segment has a 64-to-1 MUX to provide
inputs to the 1×qg (1×4) PEs. The selection signals of the
input MUXes are from 0 to x′s y′s −1 so that only the first x′s y′s
entries of the GReg segments can be selected.
PEs: A PE comprises a MAC unit and a set of LRegs (128
entries) for Psums. Our architecture does not need LRegs in
each PE to store inputs or weights. A PE computes a Psum
and writes the accumulated result to an LReg in each cycle.
All PEs operate synchronously. This means that, at the same
moment, the selection signals of all input MUXes are identical,
the selection signals of all weight MUXes are identical, and
the read and write positions of all LRegs are also identical.
Controller: Our architecture has a global controller, which
schedules the computational process. It is a finite-state ma-
chine that generates control signals for all components, in-
cluding the read/write signals and addresses of all memories
and the selection signals of all MUXes. No local controller is
needed in each PE.
VI. EXPERIMENTAL RESULTS
Our CNN accelerator is implemented in Verilog. We syn-
thesize it with Design Compiler based on the 65nm technol-
ogy. We use Memory Compiler to generate the GBufs. The
power dissipation is evaluated with PrimeTime. CACTI [43]
is employed to evaluate the latency and energy consumption
of a 2GB DDR3 DRAM (the peak bandwidth is 6.4GB/s).
The core frequency is 500MHz and the DRAM frequency
is 100MHz. A cycle-accurate simulator is built to evaluate
the performance with memory access latency taken into ac-
count. The representative state-of-the-art, Eyeriss [7], [10],
is the baseline for comparison (detailed off-chip and on-chip
communication volumes are reported in [10]). The workload
is VGGNet-16 [44] with batch size 3, the same as the
workload used in [10]. VGGNet has diverse layer dimensions,
including large/shallow layers, small/deep layers, and layers
with medium size/depth.
TABLE I: Five implementations of our architecture.
Implementation # | 1 | 2 | 3 | 4 | 5
# of PEs | 16×16 | 32×16 | 32×32 | 32×32 | 64×32
GBuf size (KB) | 2.5 | 2.5 | 2.5 | 3.625 | 3.625
LReg size/PE (B) | 256 | 128 | 64 | 128 | 64
GReg size (KB) | 10 | 15 | 18 | 27 | 36
Effective on-chip memory size (KB) | 66.5 | 66.5 | 66.5 | 131.625 | 131.625
TABLE II: Energy consumption of operations.
Operation | Energy
MAC | 4.16pJ
GBuf (0.5KB) access | 0.30pJ
GBuf (2KB) access | 1.39pJ
GBuf (3.125KB) access | 2.36pJ
LReg (256B) access | 3.39pJ
LReg (128B) access | 1.92pJ
LReg (64B) access | 1.16pJ
DRAM (2GB) access | 427.9pJ
Fig. 12: Different dataflows for comparison. [Panels: OutR-A, OutR-B, InR-A, InR-B, InR-C, WtR-A, WtR-B. Labels of the on-chip resident blocks: an x×y plane, CO outputs, k×y×x inputs, k input channels, CI×y×x inputs, z×k×WK×HK weights, z kernels.]
We evaluate five implementations of our accelerator with
different PE numbers and on-chip memory sizes, as listed in
Table I. Table II lists the energy consumption of the basic
operations, estimated by our simulations.
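The effective on-chip memory sizes in Table I appear to be the total LReg capacity (Psums) plus the GBuf capacity, with the GRegs not counted. The sketch below checks this reading against the table; the dictionary layout and the function name are illustrative assumptions.

```python
# Values copied from Table I.
IMPLEMENTATIONS = {
    1: {"pes": 16 * 16, "gbuf_kb": 2.5,   "lreg_bytes": 256, "effective_kb": 66.5},
    2: {"pes": 32 * 16, "gbuf_kb": 2.5,   "lreg_bytes": 128, "effective_kb": 66.5},
    3: {"pes": 32 * 32, "gbuf_kb": 2.5,   "lreg_bytes": 64,  "effective_kb": 66.5},
    4: {"pes": 32 * 32, "gbuf_kb": 3.625, "lreg_bytes": 128, "effective_kb": 131.625},
    5: {"pes": 64 * 32, "gbuf_kb": 3.625, "lreg_bytes": 64,  "effective_kb": 131.625},
}

def effective_kb(pes, lreg_bytes, gbuf_kb):
    """Total LReg capacity (KB) plus GBuf capacity (KB)."""
    return pes * lreg_bytes / 1024 + gbuf_kb

for idx, impl in IMPLEMENTATIONS.items():
    computed = effective_kb(impl["pes"], impl["lreg_bytes"], impl["gbuf_kb"])
    print(idx, computed, impl["effective_kb"])
```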
A. DRAM Access Volume
We compare our dataflow with other dataflows based on
different data reuse patterns, as shown in Fig. 12, in which
the colored blocks reside on chip for reuse. For example, in
InR-A, a k× y× x block resides on chip for reuse, while
the associated weights and outputs are shuffled on and off
chip when necessary. These dataflows should cover the most
popular ones used in literature. For example, ShiDianNao [12]
uses OutR-A.
Fig. 13 compares the DRAM access volume under different
effective on-chip memory sizes. The lower bound is calculated
by (15). To make a fair comparison and to remove the impact
of improper tiling sizes, the tiling sizes of all dataflows are
obtained by exhaustive searches (since the loop order is fixed,
searching for the best tiling sizes is fast, typically shorter
than 0.1s). The found minimum is obtained by searching for
the best dataflow with the best tiling sizes for each layer.
Fig. 13 demonstrates that our dataflow produces almost the
same DRAM access volume as the found minimum, and the
difference is only 4.5% on average. Our dataflow does not
produce the least DRAM access volume for every layer because,
as mentioned at the end of Section III, the derived lower bound
is in the form of Ω instead of a precise value.
Nevertheless, it is unnecessary to select
 
Fig. 13: Comparison of different dataflows under different effective on-chip memory sizes. [Plot: DRAM access volume (GB, axis ticks from 0.125 to 32) versus effective on-chip memory (16 to 256 KB); curves: Lower bound, Found minimum, Our dataflow, OutR-A, OutR-B, WtR-A, WtR-B, InR-A, InR-B, InR-C.]
Fig. 14: Per-layer comparison of different dataflows (66.5KB effective on-chip memory). [Bar chart: DRAM access volume (MB, 0 to 100) for convolutional layers 1 to 13, split into Inputs, Weights, and Outputs/Psums; bars: Lower bound, Our dataflow, WtR-A, InR-A, Our accelerator (implems. 1-3).]
the best dataflow from multiple candidates, as the expected
improvement in the DRAM access volume is less than 5%.
Our dataflow produces 10% more DRAM access volume on
average than the theoretical lower bound. The 2nd and 3rd best
dataflows, InR-A and WtR-A, respectively produce 45.1% and
45.8% more DRAM access volume than ours.
Fig. 14 shows the per-layer DRAM access volume of the
lower bound, our dataflow, our implementations 1-3, InR-
A, and WtR-A. The difference between our dataflow and
our implementation is that the latter has a fixed on-chip
memory splitting (e.g., 64KB Psums plus 2.5KB GBufs in our
implementations 1-3). For this reason, our implementations
1-3 produce 3-4% more DRAM access than our dataflow,
indicating that the fixed on-chip memory splitting has little impact.
Our dataflow and implementations produce balanced input and
weight access volumes, while outputs take up a small portion
of the DRAM access volume. For InR-A and WtR-A (the 2nd
and 3rd best dataflows), outputs account for a large portion of
the DRAM access volume, and the input and weight access
volumes are not balanced, leading to much larger memory
access volumes.
We try to make an apples-to-apples comparison with pub-
lished data but find it difficult. Ref. [10] reported the DRAM
access volume of VGGNet-16 with input compression on
Eyeriss. Ref. [19] selected the best dataflow with the minimum
DRAM access volume from three candidates. Inputs, weights,
and outputs are pruned in [19]. Our work targets general
CNN accelerators without data pruning/compression, so the
results reported in [10], [19] are not directly comparable to
ours. Instead, we try to make an approximate comparison.
Eyeriss has a 108KB GBuf but the effective on-chip mem-
ory capacity is 173.5KB, since 100KB of the GBuf stores
inputs and outputs (the other 8KB is used for prefetching
weights), while weights are stored in PEs’ local SRAMs
(each PE has 448B local SRAMs) [10]. Under the 173.5KB
 
Fig. 15: Comparison with Eyeriss on DRAM access (173.5KB effective on-chip memory). [Plot: DRAM access volume (MB, 0 to 100) for convolutional layers 1 to 13; series: Lower bound, Our dataflow, Eyeriss (compressed), Eyeriss (uncompressed).]
TABLE III: Comparison with Eyeriss on DRAM access
(173.5KB effective on-chip memory).
Dataflow | DRAM access (MB) | DRAM access/MAC
Lower bound | 274.8 | 0.0030
Our dataflow | 299.7 | 0.0033
Eyeriss (compr.) | 321.3 | 0.0035
Eyeriss (uncompr.) | 528.8 | 0.0057
effective on-chip memory limit, we compare our dataflow
and Eyeriss with and without input compression, as shown
in Fig. 15 and Table III. Ref. [10] has reported the per-layer
input compression ratios of VGGNet-16 but the proportion
of the input access volume in the total access volume is not
reported. We use the proportion of our dataflow to evaluate
the off-chip DRAM access volume for Eyeriss without input
compression. Our dataflow requires 43.3% less DRAM access
volume than Eyeriss without input compression, and even
6.7% less DRAM access volume than Eyeriss
with input compression.
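The percentages above can be recomputed directly from the DRAM access volumes in Table III, as in the following sketch (the dictionary keys and the function name are illustrative).

```python
# DRAM access volumes in MB, copied from Table III.
DRAM_MB = {
    "lower_bound": 274.8,
    "ours": 299.7,
    "eyeriss_compressed": 321.3,
    "eyeriss_uncompressed": 528.8,
}

def reduction_percent(ours, baseline):
    """Relative reduction of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline

vs_uncompressed = reduction_percent(DRAM_MB["ours"], DRAM_MB["eyeriss_uncompressed"])
vs_compressed = reduction_percent(DRAM_MB["ours"], DRAM_MB["eyeriss_compressed"])
above_bound = 100.0 * (DRAM_MB["ours"] / DRAM_MB["lower_bound"] - 1.0)

print(f"{vs_uncompressed:.1f}% {vs_compressed:.1f}% {above_bound:.1f}%")
```

The last value is the excess of our dataflow over the lower bound at 173.5KB, which these table entries put at about 9%.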
We notice from Fig. 15 that for layer 1, Eyeriss produces
a lower DRAM access volume than the lower bound. This is
because the derived lower bound is in the form of Ω instead of
a precise value. It represents the asymptotic relation between
the off-chip communication volume and the on-chip memory
capacity when the problem scale is large enough. Special cases
exist for small workloads. However, the first layer typically
takes up a small portion of the off-chip communication vol-
ume, so that its impact on the overall energy efficiency and
performance is negligible.
Compared with FlexFlow [22] with 192KB on-chip memory
(64KB GBuf and 512B/PE local storage) which selects the
best dataflow from several candidates, the DRAM access/MAC
metric of our dataflow (173.5KB effective on-chip memory)
is 33% better (0.0033 versus 0.0049).
B. GBuf Access Volume
Fig. 16 shows the GBuf access volume of our accelerator
and the comparison with Eyeriss. Our implementations (with
smaller total and effective on-chip memory capacities) produce
much less GBuf communication than Eyeriss, and the reduc-
tion factors are 10.9-15.8×. The large reduction is due to the
elimination of data shuffling between the GBuf and LRegs.
To understand how our accelerator reaches the minimum
GBuf communication, we list the DRAM and GBuf access
 
Fig. 16: Comparison with Eyeriss on GBuf access (vertical axis is in logarithmic scale). [Plot: GBuf access volume (MB, 1 to 1000) for convolutional layers 1 to 13; series: Eyeriss and Our accelerator (implems. 1 to 5).]
TABLE IV: Ratio of GBuf access volume to DRAM access
volume (for our accelerator implementation 1).
Data | DRAM read | DRAM write | GBuf read | GBuf write
Inputs | 187.5MB | 0 | 313.5MB (1.67×) | 216.2MB (1.15×)
Weights | 196.6MB | 0 | 196.6MB (1.00×) | 196.6MB (1.00×)
Outputs | 0 | 77.5MB | 0 | 0
 
Fig. 17: Reg access volume of our accelerator. [Plot: Reg access volume (GB, 0 to 12) for convolutional layers 1 to 13; series: Lower bound and Our accelerator (implems. 1 to 5).]
volumes of implementation 1 in Table IV. For weights, the
GBuf read and write volumes each equal the DRAM
read volume, reaching the theoretical lower bound. For inputs,
the GBuf write volume is slightly larger than the DRAM read
volume, because the tiling-based dataflow makes some input
or output blocks extend beyond the input or output boundaries,
resulting in a few redundant GBuf writes. The GBuf read volume
for inputs is 1.67× the DRAM read volume for inputs. The
extra reads come from the halos of convolution inputs, as
explained in Section IV-B1. Overall, the GBuf read and write volumes
are respectively 1.33× and 1.07× the DRAM read volume,
indicating that our accelerator roughly reaches the theoretical
lower bound of the GBuf communication.
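The ratios quoted above can be reproduced from the entries of Table IV. The sketch below does so; the variable names are illustrative.

```python
# Volumes in MB, copied from Table IV (implementation 1).
DRAM_READ = {"inputs": 187.5, "weights": 196.6}
GBUF_READ = {"inputs": 313.5, "weights": 196.6}
GBUF_WRITE = {"inputs": 216.2, "weights": 196.6}

def ratio(gbuf, dram):
    """GBuf volume divided by the corresponding DRAM read volume."""
    return gbuf / dram

input_read_ratio = ratio(GBUF_READ["inputs"], DRAM_READ["inputs"])
input_write_ratio = ratio(GBUF_WRITE["inputs"], DRAM_READ["inputs"])

dram_total = sum(DRAM_READ.values())
total_read_ratio = ratio(sum(GBUF_READ.values()), dram_total)
total_write_ratio = ratio(sum(GBUF_WRITE.values()), dram_total)

print(f"{input_read_ratio:.2f} {input_write_ratio:.2f} "
      f"{total_read_ratio:.2f} {total_write_ratio:.2f}")
```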
C. Reg Access Volume
Fig. 17 shows the Reg access volume of our accelerator and
the comparison with the lower bound. The lower bound is cal-
culated from (16). The Reg access volume of our accelerator
is only 5.9-11.8% larger than the lower bound, indicating that
our accelerator almost reaches the theoretical lower bound of
the Reg communication. The extra Reg communication is from
a) the GReg communication, and b) Psums that are out of the
output boundary caused by the tiling-based approach.
We are not able to make a numerical comparison with any
existing CNN accelerator on the Reg communication since no
similar result was found. As an intuitive comparison with
Eyeriss (and other accelerators which propagate data in the PE
array, e.g., [17], [18]), our architecture is expected to reduce
the Reg communication severalfold, because Eyeriss not only
writes Psums to Regs in each cycle (which our accelerator
also has), but also propagates inputs, weights, and Psums in
the PE array (which our accelerator does not have).
D. Energy Efficiency and Performance
Fig. 18 shows the energy efficiency (in pJ/MAC) of our ac-
celerator and the comparison with the lower bound. The lower
bound is calculated by adding together the DRAM access
energy (under the corresponding effective on-chip memory
capacity limit), the MAC energy, and the Reg write energy (of
Fig. 18: Energy efficiency of our accelerator. [Bar chart: pJ/MAC (0 to 10) for Lower bound (1-3), Lower bound (4-5), and Implems. 1 to 5, broken down into DRAM, GBufs (SRAM), MAC units, LRegs, GRegs, and Others.]
Fig. 19: Power dissipation and performance of our accelerator. [Chart: computing time and waiting time (s) and power (W) for Implem. 1 (256 PEs), Implem. 2 (512 PEs), Implem. 3 (1024 PEs), Implem. 4 (1024 PEs), and Implem. 5 (2048 PEs).]
(# of MACs) writes). The lower bound describes the essential
energy consumption to complete the MAC operations. MAC
operations and Regs dominate the energy consumption of our
accelerator. Our accelerator almost reaches the lower bound
for DRAM communication and MAC operations. The Reg
energy of our accelerator, however, is higher than the lower
bound. The extra Reg energy is mainly due to the static energy
consumption of the LRegs. With fewer LRegs in each PE,
the Reg energy consumption is decreased. Even so, MAC
operations take up the largest portion in the total energy
consumption, implying that our accelerator is computation
dominant. The gap between the energy efficiency of our
implementations and the best value is only 37-87%, indicating
that our accelerator roughly reaches the best energy efficiency.
According to the measured data reported in [10], the energy
efficiency of Eyeriss with input compression and zero gating
is 22.1pJ/MAC (for on-chip aspects). As a direct numeric
comparison, our accelerator (by simulations) without data
compression or gating is 2.61-3.68× more energy efficient
than Eyeriss for on-chip aspects.
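As a back-of-the-envelope reading of this comparison, dividing the reported Eyeriss figure by the stated gains gives the implied on-chip energy per MAC of our implementations. The sketch below is derived only from the two numbers quoted above and is not an independently reported measurement; the names are illustrative.

```python
EYERISS_PJ_PER_MAC = 22.1          # on-chip energy reported in [10]
GAIN_RANGE = (2.61, 3.68)          # energy-efficiency gains stated above

def implied_pj_per_mac(gain, baseline=EYERISS_PJ_PER_MAC):
    """On-chip pJ/MAC implied by an energy-efficiency gain over the baseline."""
    return baseline / gain

worst = implied_pj_per_mac(GAIN_RANGE[0])
best = implied_pj_per_mac(GAIN_RANGE[1])
print(f"implied on-chip energy: {best:.2f}-{worst:.2f} pJ/MAC")
```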
Fig. 19 shows the performance and power dissipation of
our accelerator. With more PEs, the execution time is reduced
and the power is increased. The proportion of waiting time
increases with more PEs. With reduced computational time,
the memory access latency cannot be fully overlapped by
computation so it affects the execution time. Compared with
Eyeriss, our five implementations achieve 9.8-42.3× perfor-
mance gain, with memory access latency taken into account.
E. Memory and PE Utilizations
Fig. 20 shows the memory and PE utilizations of our accel-
erator. The GBuf and GReg utilizations are low because we
have some redundant SRAMs and GRegs to adapt to diverse
tiling sizes caused by different layer dimensions. The LReg
utilization remains high (>88%) in different implementations,
indicating that most of the LRegs are utilized. Increasing the
PE number can lower the LReg utilization, due to the small
workload of each PE. Since the LRegs dominate the on-chip
 
Fig. 20: Memory and PE utilizations (average utilization of all layers) of our accelerator. [Bar chart: utilization (%, 0 to 100) of GBufs, GRegs, LRegs, Memory overall, and PEs for Implems. 1 to 5.]
memories, the overall memory utilization is also high (80.6-91.0%).
The PE utilization remains very high (>97%). In fact,
all PEs are busy in our implementations. The small quantity of
useless PE workload is caused by the tiling-based approach.
VII. CONCLUSIONS
In current CNN accelerators, communication dominates the
energy consumption and consumes much more energy than
computation. In this work, we provide the theoretical lower
bounds of both off-chip communication and on-chip commu-
nication. Based on the theoretical results, we elaborate our
communication-optimal dataflow as well as a communication-
optimal accelerator architecture. We demonstrate by both
theoretical analysis and experimental results that our dataflow
and architecture are able to practically reach the minimum
communication in a three-level memory hierarchy. Our CNN
accelerator is computation dominant and the energy efficiency
is close to the theoretical best value.
ACKNOWLEDGEMENTS
This work was supported in part by National Key R&D
Program of China under Grant 2018YFA0701500, in part
by Key Research Program of Frontier Sciences, CAS, under
Grant ZDBS-LY-JSC012, in part by National Natural Science
Foundation of China under Grants 61804155 & 61834006,
in part by Youth Innovation Promotion Association CAS, in
part by Young Elite Scientists Sponsorship Program by CAST
under Grant 2018QNRC001, and in part by Beijing Academy
of Artificial Intelligence (BAAI). We thank Profs. Mingzhe
Zhang and Zidong Du for their great help in answering the
reviewers’ comments.
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classifica-
tion with Deep Convolutional Neural Networks,” in 25th International
Conference on Neural Information Processing Systems, 2012, pp. 1097–
1105.
[2] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich Feature
Hierarchies for Accurate Object Detection and Semantic Segmentation,”
in IEEE Conference on Computer Vision and Pattern Recognition, 2014,
pp. 580–587.
[3] R. Collobert and J. Weston, “A Unified Architecture for Natural Lan-
guage Processing: Deep Neural Networks with Multitask Learning,” in
25th International Conference on Machine Learning, 2008, pp. 160–167.
[4] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J.
Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural
Network,” in ACM/IEEE 43rd International Symposium on Computer
Architecture (ISCA), June 2016, pp. 243–254.
[5] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Di-
anNao: A Small-footprint High-throughput Accelerator for Ubiquitous
Machine-learning,” in 19th International Conference on Architectural
Support for Programming Languages and Operating Systems, 2014, pp.
269–284.
[6] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and
Y. Chen, “Cambricon-X: An accelerator for sparse neural networks,”
in 49th IEEE/ACM International Symposium on Microarchitecture (MI-
CRO), Oct 2016, pp. 1–12.
[7] Y. H. Chen, J. Emer, and V. Sze, “Eyeriss: A Spatial Architecture
for Energy-Efficient Dataflow for Convolutional Neural Networks,” in
ACM/IEEE 43rd International Symposium on Computer Architecture
(ISCA), June 2016, pp. 367–379.
[8] N. Shah, P. Chaudhari, and K. Varghese, “Runtime Programmable and
Memory Bandwidth Optimized FPGA-Based Coprocessor for Deep
Convolutional Neural Network,” IEEE Transactions on Neural Networks
and Learning Systems, vol. 29, no. 12, pp. 5922–5934, Dec 2018.
[9] Y. Lin and T. S. Chang, “Data and Hardware Efficient Design for
Convolutional Neural Network,” IEEE Transactions on Circuits and
Systems I: Regular Papers, vol. 65, no. 5, pp. 1642–1651, May 2018.
[10] Y. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-
Efficient Reconfigurable Accelerator for Deep Convolutional Neural
Networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp.
127–138, Jan 2017.
[11] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, “Memory-
centric accelerator design for Convolutional Neural Networks,” in IEEE
31st International Conference on Computer Design (ICCD), Oct 2013,
pp. 13–19.
[12] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen,
and O. Temam, “ShiDianNao: Shifting vision processing closer to the
sensor,” in ACM/IEEE 42nd International Symposium on Computer
Architecture (ISCA), June 2015, pp. 92–104.
[13] M. Sankaradas, V. Jakkula, S. Cadambi, S. Chakradhar, I. Durdanovic,
E. Cosatto, and H. P. Graf, “A Massively Parallel Coprocessor for
Convolutional Neural Networks,” in 2009 20th IEEE International Con-
ference on Application-specific Systems, Architectures and Processors,
July 2009, pp. 53–60.
[14] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang,
N. Xu, S. Song, Y. Wang, and H. Yang, “Going Deeper with Embedded
FPGA Platform for Convolutional Neural Network,” in ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, 2016,
pp. 26–35.
[15] S. Wang, D. Zhou, X. Han, and T. Yoshimura, “Chain-NN: An energy-
efficient 1D chain architecture for accelerating deep convolutional neural
networks,” in Design, Automation & Test in Europe Conference & Exhibition
(DATE), March 2017, pp. 1032–1037.
[16] R. Shi, Z. Xu, Z. Sun, M. Peemen, A. Li, H. Corporaal, and D. Wu,
“A Locality Aware Convolutional Neural Networks Accelerator,” in
Euromicro Conference on Digital System Design, Aug 2015, pp. 591–
598.
[17] J. Jo, S. Kim, and I. Park, “Energy-Efficient Convolution Architecture
Based on Rescheduled Dataflow,” IEEE Transactions on Circuits and
Systems I: Regular Papers, vol. 65, no. 12, pp. 4196–4207, Dec 2018.
[18] C. Xin, Q. Chen, M. Tian, M. Ji, C. Zou, X. Wang, and B. Wang,
“COSY: An Energy-Efficient Hardware Architecture for Deep Con-
volutional Neural Networks Based on Systolic Array,” in IEEE 23rd
International Conference on Parallel and Distributed Systems (ICPADS),
Dec 2017, pp. 180–189.
[19] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, and X. Li, “SmartShuttle:
Optimizing off-chip memory accesses for deep learning accelerators,”
in Design, Automation & Test in Europe Conference & Exhibition (DATE),
March 2018, pp. 343–348.
[20] L. Song, Y. Wang, Y. Han, X. Zhao, B. Liu, and X. Li, “C-Brain:
A deep learning accelerator that tames the diversity of CNNs through
adaptive data-level parallelization,” in 53rd ACM/EDAC/IEEE Design
Automation Conference (DAC), June 2016, pp. 1–6.
[21] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, and S. Wei, “Deep Convolu-
tional Neural Network Architecture With Reconfigurable Computation
Patterns,” IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, vol. 25, no. 8, pp. 2220–2233, Aug 2017.
[22] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “FlexFlow: A
Flexible Dataflow Accelerator Architecture for Convolutional Neural
Networks,” in IEEE International Symposium on High Performance
Computer Architecture (HPCA), Feb 2017, pp. 553–564.
[23] B. Liu, X. Chen, Y. Wang, Y. Han, J. Li, H. Xu, and X. Li, “Addressing
the Issue of Processing Element Under-utilization in General-purpose
Systolic Deep Learning Accelerators,” in 24th Asia and South Pacific
Design Automation Conference, 2019, pp. 733–738.
[24] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Opti-
mizing FPGA-based Accelerator Design for Deep Convolutional Neu-
ral Networks,” in ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, 2015, pp. 161–170.
[25] K. Yang, S. Wang, J. Zhou, and T. Yoshimura, “Energy-efficient schedul-
ing method with cross-loop model for resource-limited CNN accelerator
designs,” in IEEE International Symposium on Circuits and Systems
(ISCAS), May 2017, pp. 1–4.
[26] Y. Ma, Y. Cao, S. Vrudhula, and J. Seo, “Optimizing the Convolution
Operation to Accelerate Deep Neural Networks on FPGA,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26,
no. 7, pp. 1354–1367, July 2018.
[27] X. Yang, J. Pu, B. B. Rister, N. Bhagdikar, S. Richardson, S. Kvatin-
sky, J. Ragan-Kelley, A. Pedram, and M. Horowitz, “A systematic
approach to blocking convolutional neural networks,” arXiv preprint
arXiv:1606.04209, 2016.
[28] M. Peemen, B. Mesman, and H. Corporaal, “Inter-tile reuse optimization
applied to bandwidth constrained embedded accelerators,” in Design,
Automation & Test in Europe Conference & Exhibition (DATE), March 2015,
pp. 169–174.
[29] Y. Ma, Y. Cao, S. Vrudhula, and J.-s. Seo, “Optimizing Loop Op-
eration and Dataflow in FPGA Acceleration of Deep Convolutional
Neural Networks,” in ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, 2017, pp. 45–54.
[30] M. Motamedi, P. Gysel, V. Akella, and S. Ghiasi, “Design space
exploration of FPGA-based Deep Convolutional Neural Networks,” in
21st Asia and South Pacific Design Automation Conference (ASP-DAC),
Jan 2016, pp. 575–580.
[31] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang,
and J. Cong, “Automated systolic array architecture synthesis for high
throughput CNN inference on FPGAs,” in 54th ACM/EDAC/IEEE De-
sign Automation Conference (DAC), June 2017, pp. 1–6.
[32] S. I. Venieris and C. Bouganis, “fpgaConvNet: A Framework for
Mapping Convolutional Neural Networks on FPGAs,” in IEEE 24th
International Symposium on Field-Programmable Custom Computing
Machines (FCCM), May 2016, pp. 40–47.
[33] A. Lavin and S. Gray, “Fast algorithms for convolutional neural net-
works,” in IEEE Conference on Computer Vision and Pattern Recogni-
tion, 2016, pp. 4013–4021.
[34] M. Mathieu, M. Henaff, and Y. LeCun, “Fast Training of Convolutional
Networks through FFTs,” CoRR, vol. abs/1312.5851, 2013. [Online].
Available: http://arxiv.org/abs/1312.5851
[35] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN
accelerators,” in 49th IEEE/ACM International Symposium on Microar-
chitecture (MICRO), Oct 2016, pp. 1–12.
[36] K. Siu, D. M. Stuart, M. Mahmoud, and A. Moshovos, “Memory
Requirements for Convolutional Neural Network Hardware Accelera-
tors,” in IEEE International Symposium on Workload Characterization
(IISWC), Sep. 2018, pp. 111–121.
[37] J.-W. Hong and H. T. Kung, “I/O Complexity: The Red-blue Pebble
Game,” in Thirteenth ACM Symposium on Theory of Computing, 1981,
pp. 326–333.
[38] Q. Liu, “Red-Blue and Standard Pebble Games: Complexity and Ap-
plications in the Sequential and Parallel Models,” Ph.D. dissertation,
Massachusetts Institute of Technology, 2017.
[39] “Generalized mean.” [Online]. Available:
https://en.wikipedia.org/wiki/Generalized_mean
[40] K. Goto and R. Van De Geijn, “High-performance Implementation of the
Level-3 BLAS,” ACM Trans. Math. Softw., vol. 35, no. 1, pp. 4:1–4:14,
Jul. 2008.
[41] X. Chen, J. Chen, D. Z. Chen, and X. S. Hu, “Optimizing Memory
Efficiency for Convolution Kernels on Kepler GPUs,” in 54th Design
Automation Conference, 2017, pp. 68:1–68:6.
[42] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep
Learning with Limited Numerical Precision,” arXiv e-prints, Feb. 2015.
[43] N. P. Jouppi, A. B. Kahng, N. Muralimanohar, and V. Srinivas,
“CACTI-IO Technical Report,” HP Laboratories, Tech. Rep., 2013. [Online].
Available: https://www.labs.hp.com/techreports/2013/HPL-2013-79.pdf
[44] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
