In current convolutional neural network (CNN) accelerators, communication (i.e., memory access) dominates the energy consumption. This work provides comprehensive analysis and methodologies to minimize the communication for CNN accelerators. For the off-chip communication, we derive the theoretical lower bound for any convolutional layer and propose a dataflow to reach the lower bound. This fundamental problem has never been solved by prior studies. The on-chip communication is minimized based on an elaborate workload and storage mapping scheme. In addition, we design a communication-optimal CNN accelerator architecture. Evaluations based on the 65 nm technology demonstrate that the proposed architecture nearly reaches the theoretical minimum communication in a three-level memory hierarchy and that it is computation dominant. The gap between the energy efficiency of our accelerator and the theoretical best value is only 37-87%.
I. INTRODUCTION
Convolutional neural networks (CNNs) have achieved great success in numerous practical applications (e.g., [1]-[3]). The reliable results produced by modern CNNs rely on complex models and large amounts of data, which in turn place significant demands on both performance and energy efficiency. Recently, a number of hardware accelerators based on either application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) have been proposed to boost the performance and the energy efficiency of CNNs.
Due to the large amount of data and complex data reuse patterns in convolution computation, CNN accelerators often involve a great number of memory accesses. Inputs and weights are typically stored in the off-chip dynamic random-access memory (DRAM). A static random-access memory (SRAM) based on-chip global buffer (GBuf) stores portions of inputs and weights which are loaded from the DRAM. Each processing element (PE) has some registers (Regs) to store inputs and weights which are read from the GBuf. Partial sums (Psums) are stored in the GBuf or Regs. During computation, there is complex data transmission in the memory hierarchy. Normally, communication, rather than computation, dominates the energy consumption of a CNN accelerator. A DRAM access consumes 2 to 3 orders of magnitude more energy than an arithmetic operation [4], and the DRAM access energy can be more than 90% of the total energy consumption of a CNN accelerator [5], [6]. For the on-chip aspects, Regs can take up a large portion (>50%) of the chip energy while arithmetic units consume less than 20% [7]. Therefore, from an energy point of view, current CNN accelerators are communication dominant, and minimizing the communication is the key to improving their energy efficiency.
This paper will appear in 2020 26th IEEE International Symposium on High-Performance Computer Architecture (HPCA '20).
Maximizing data reuse in convolutions helps reduce communication. Data reuse heavily depends on the convolutional dataflow. There are various approaches to optimize the dataflow: 1) designing an elaborate dataflow [5] , [7] - [18] , 2) selecting the best dataflow from several candidates [19] - [23] , and 3) design space exploration (DSE) [24] - [32] . A fair number of these studies focus on the performance and/or the energy efficiency of the computational components. The energy-dominant component, communication, has not been comprehensively investigated. Moreover, in most existing studies, the dataflow is designed based on intuitive/heuristic analysis, which may not guarantee the optimality.
If the inputs, weights, and outputs of a convolutional layer are accessed exactly once at every level of the memory hierarchy, the layer-wise minimum communication is obviously reached. However, such an ambitious goal requires a huge on-chip memory. The requirement of memory resources varies for different applications. Thus, under given hardware resources, searching for a dataflow and an architecture that minimize the communication has much more practical significance. This problem has never been solved. In this work, we provide detailed analysis and methodologies to reach the lower bounds of both off-chip communication and on-chip communication. Specifically, we make the following contributions in this paper.
• We solve a fundamental problem in CNN accelerators: what the lower bound of the off-chip communication of a convolutional layer is, if it is implemented on a CNN accelerator with a limited on-chip memory. We provide a mathematical derivation for this problem.
• We demonstrate that convolutions have only one more level of data reuse (sliding window reuse) than matrix multiplications (MMs). Based on this conclusion, we elaborate a dataflow which fuses sliding window reuse and a communication-optimal MM implementation, to minimize the off-chip communication.
• We propose a workload and storage mapping scheme such that both GBuf communication and Reg communication respectively reach their lower bounds.
• A communication-optimal CNN accelerator architecture is proposed, which not only reaches the minimum communication, but also can adapt to various convolutional layer dimensions with high resource utilization.
Fig. 2. Pseudo code of a convolutional layer (loop nest):
for (i = 0; i < B; i++)                    //Images in a batch
  for (oz = 0; oz < Co; oz++)              //Output channels
    for (oy = 0; oy < Ho; oy++)            //Output rows
      for (ox = 0; ox < Wo; ox++)          //Output columns
        for (kz = 0; kz < Ci; kz++)        //Input channels
          for (ky = 0; ky < Hk; ky++)      //Kernel rows
            for (kx = 0; kx < Wk; kx++)    //Kernel columns
The significance of this work lies not only in the proposed dataflow and/or architecture but, more importantly, in revealing, on a theoretical basis, the design methodology and principle to minimize the communication for CNN accelerators.
II. BACKGROUND
A. Convolutional Layers
Fig. 1 illustrates a general convolutional layer in CNNs. We have $B$ input images in a batch and $C_O$ kernels of weights, producing $B$ output images (only 1 input image and 1 output image are shown in Fig. 1). Each input image has $C_I$ channels and each output image has $C_O$ channels. The output channel dimension is $H_O \times W_O$. Each kernel is a $C_I \times H_K \times W_K$ 3D matrix. Each output is computed by an inner product between the inputs in a sliding window on the input image and the weights in a kernel. The stride size is the position difference between two adjacent sliding windows. Fig. 2 lists the pseudo code of a convolutional layer. It contains 7 levels of loops and assumes the unit stride size.
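For concreteness, the seven-level loop nest of a convolutional layer can be written out as runnable code. The following is a minimal sketch, not the authors' implementation: the array names `inp` and `wt`, the nested-list layout, and the optional `stride` argument are assumptions made for illustration.

```python
def conv_layer(inp, wt, stride=1):
    """Direct convolution with seven nested loops.

    inp: nested lists indexed [i][kz][y][x]     (B x Ci x Hi x Wi), assumed layout
    wt:  nested lists indexed [oz][kz][ky][kx]  (Co x Ci x Hk x Wk), assumed layout
    Returns nested lists indexed [i][oz][oy][ox] (B x Co x Ho x Wo).
    """
    B, Ci = len(inp), len(inp[0])
    Hi, Wi = len(inp[0][0]), len(inp[0][0][0])
    Co, Hk, Wk = len(wt), len(wt[0][0]), len(wt[0][0][0])
    Ho = (Hi - Hk) // stride + 1
    Wo = (Wi - Wk) // stride + 1
    out = [[[[0] * Wo for _ in range(Ho)] for _ in range(Co)] for _ in range(B)]
    for i in range(B):                          # images in a batch
        for oz in range(Co):                    # output channels
            for oy in range(Ho):                # output rows
                for ox in range(Wo):            # output columns
                    for kz in range(Ci):        # input channels
                        for ky in range(Hk):    # kernel rows
                            for kx in range(Wk):  # kernel columns
                                out[i][oz][oy][ox] += (
                                    inp[i][kz][oy * stride + ky][ox * stride + kx]
                                    * wt[oz][kz][ky][kx]
                                )
    return out
```

Every output element accumulates one inner product over a $C_I \times H_K \times W_K$ sliding window, which is the operation whose memory traffic the rest of the paper bounds.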
From a quick glance of Fig. 2 , finding a dataflow with minimized communication is challenging, due to the huge search space caused by different loop orders, loop stride sizes, loop unrolling schemes, etc. There are several data reuse patterns in convolutions, including input reuse (InR, an input is used by multiple kernels), sliding window reuse (WndR, an input is used by overlapped sliding windows), weight reuse (WtR, a weight is used by multiple inputs), and output reuse (OutR, an output resides on chip during the entire computational process). Multiple data reuse patterns can be combined to form more complicated dataflows. Maximizing data reuse also involves a huge search space.
In this work, we only consider the ordinary convolution algorithm, which is the most popular approach adopted by hardware accelerators. Convolution algorithms with lower computational complexity, such as the Winograd algorithm [33] and fast Fourier transform based approaches [34], are not considered. We aim to minimize the communication of general convolution operations, so that our approach can be adopted in both inference and training of CNNs.
B. Related Work
A number of CNN accelerators are designed with an elaborate dataflow to optimize some objective(s) (e.g., performance, bandwidth, etc.) [5], [7]-[18]. Unfortunately, their dataflows are designed almost entirely based on intuitive/heuristic analysis. In other words, they claimed the superiority of the dataflows and/or accelerators but did not explain why the designs are essentially the best. Such designs may not guarantee optimality. A representative state-of-the-art design is Eyeriss [7], [10], which claimed that the communication is optimized. We will show by experiments that neither its off-chip communication nor its on-chip communication is minimized.
Rather than using a single dataflow, several studies have integrated multiple dataflows into an accelerator (with increased hardware cost) and selected the best one according to the layer dimensions [19]-[23]. These approaches usually perform better than the approaches based on a single dataflow. However, the claimed optimum is only the best among the given candidates. If the de facto best solution is not included in the candidates, they cannot find the optimal solution.
To find the optimal dataflow with a particular objective, a possible approach is to exhaustively consider all possible loop orders and tiling sizes (i.e., the stride sizes for the loops). This is the DSE approach [24]-[32]. However, the search space is so huge that an exhaustive search is extremely time-consuming. For instance, only considering two loops to minimize the off-chip communication of a particular layer leads to an enormous search space of $7.2 \times 10^{13}$ [29]. Heuristics have to be adopted to find sub-optimal solutions. Exhaustive methods lack universality since they cannot explain why the found dataflow is essentially the best. In this sense, for a new convolutional layer, re-conducting an exhaustive search is usually needed, as we do not know whether a known dataflow is still the best for the new layer.
In the aforementioned approaches, some studies ( [7] , [11] , [19] , [26] - [29] ) have considered communication optimization, while the others mainly focus on the computational components. Besides the three categories, there are other communication optimization approaches for CNN accelerators (e.g., the fused-layer approach [35] that optimizes data movement between convolutional layers). Currently, no study has comprehensively analyzed the lower bound of communication in CNN accelerators.
Ref. [36] analyzed the on-chip memory requirement such that both inputs and weights are read from the off-chip DRAM exactly once. This is the minimum possible off-chip communication. However, the required on-chip memory to achieve this goal is quite large (from several million bytes to hundreds of million bytes). On the other hand, the hardware resources are fixed but applications' requirements vary, so it is impossible to guarantee the goal all the time. In practice, searching for the minimum communication under given hardware resources has much more significance.
C. Preliminary: Red-blue Pebble Game
Our derivation for the lower bound of the off-chip communication heavily depends on the red-blue pebble game [37] , which is a theoretical model to estimate the minimum volume of data transmission between two levels of memories. The derived lower bound is the best possible, in the sense that it is achievable by certain algorithm implementations. Here we review an important theorem of the red-blue pebble game as a preliminary.
Suppose that the memory hierarchy comprises an unlimited slow memory and a limited fast memory. When optimizing the off-chip communication, they refer to the off-chip DRAM and the on-chip memory (e.g., SRAMs or Regs), respectively. The fast memory can hold only S data entries. An algorithm is described by a directed acyclic graph (DAG), in which each node represents a data entry or an operation (producing a data entry as the output of the operation) and each edge represents an inter-data dependency. We skip the definition of the original red-blue pebble game here because it is usually difficult to use [38] . Instead, the red-blue pebble game can equivalently be converted to an easier S-partition problem [37] , which is defined as follows.
Let G(V, E) be a DAG describing an algorithm, where V and E are the node and edge sets, respectively. A partition on G is called an S-partition, if the following four properties hold.
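• Property 1: $V$ is partitioned into $h$ subsets $V_1, V_2, \cdots, V_h$ such that the $V_i$'s are disjoint and their union is $V$.
• Property 2: there is no cyclic dependency among the $V_i$'s.
• Property 3: for any $V_i$ ($1 \le i \le h$), there exists a dominator set $D_i$ (nodes in $D_i$ are not necessarily in $V_i$) such that $|D_i| \le S$. A dominator set $D_i$ for $V_i$ is a set of nodes in $V$ such that any path from an input of $G$ to a node in $V_i$ contains some nodes in $D_i$.
• Property 4: for any $V_i$ ($1 \le i \le h$), the output set of $V_i$ has at most $S$ nodes.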
Let $P(S)$ be the minimum number of subsets that any S-partition of a DAG must have. The following theorem describes the communication lower bound based on the S-partition model (the proof is provided in [37]). Theorem 1. Given a fast memory of size $S$, to finish a DAG that describes an algorithm, the minimum communication volume $Q$ between the fast memory and the slow memory satisfies
$$Q \ge S \cdot \left(P(2S) - 1\right).$$
III. LAYER-WISE LOWER BOUND OF OFF-CHIP COMMUNICATION
We now derive the layer-wise lower bound of the off-chip communication, based on the S-partition model [37] and the relation between convolutions and MMs. Typically there are at least three levels in the memory hierarchy of a CNN accelerator: an off-chip DRAM, an on-chip GBuf, and Regs. The red-blue pebble game is still applicable. We define an effective on-chip memory as the maximum on-chip memory that does not contain duplicated data. For example, if the GBuf stores inputs and weights while some Regs store Psums (other Regs store inputs and weights that are copied from the GBuf), the effective on-chip memory refers to the GBuf (storing inputs and weights) plus those Regs which store Psums. A specific implementation may be sub-optimal, since the red-blue pebble game assumes a homogeneous on-chip memory without any specific splitting.
A. Relation Between Convolutions and MMs
Fig. 3 shows how to convert a convolutional layer into an MM. We first consider a simple case with batch size 1. The input image is unfolded to the input matrix, each row of which contains the inputs in a sliding window. Different rows in the unfolded input matrix correspond to different sliding windows, which also correspond to different locations on the output channels. All kernels are reshaped into a weight matrix, each column of which contains the weights of a kernel. The output image is reshaped into an output matrix, each column of which contains the outputs of an output channel. Reshaping means reorganizing the elements without adding or removing elements. If the batch size is $B$, we just stack up $B$ unfolded input matrices and $B$ output matrices, respectively, while the weight matrix remains unchanged. The stacked input and output matrices are still called unfolded input matrix and output matrix, respectively.
The convolution-to-MM conversion is only logically equivalent, not algorithmically equivalent. The difference is that, in a convolutional layer, inputs in overlapped sliding windows can be reused. This level of data reuse does not exist in MMs. This is why the input matrix is "unfolded" instead of "reshaped". In the conversion, the input images are unfolded by expanding all sliding windows, i.e., the common inputs in overlapped sliding windows have multiple explicit copies in the unfolded input matrix. We define $R$ to denote the reuse number of each input by WndR, whose maximum value is
$$R = \frac{W_K H_K}{D^2},$$
where $D$ is the stride size. We will show that the derived lower bound of the off-chip communication relies on $R$.
One may argue that there are other data reuse patterns in a convolutional layer (e.g., InR, WtR, etc.). These data reuse patterns are actually included in the converted MM. For example, each column of the weight matrix can be shared by multiple rows in the unfolded input matrix, which is WtR, and each row of the input matrix can be reused by multiple columns in the weight matrix, which is InR. From the conversion process, it is clear that the computational process of a convolutional layer is not changed except for WndR: only the input matrix is unfolded, whereas the other matrices are merely reshaped. This implies that, although a convolutional layer involves 7 levels of loops, it only has one more level of data reuse than MMs. In order to take into account WndR in the converted MM, we have defined $R$ to denote the reuse number of each input by WndR. If $R$ is 1 (i.e., no WndR), a convolutional layer is exactly equivalent to an MM. Since a fully-connected (FC) layer is also equivalent to an MM, our conclusion with $R = 1$ can be applied to FC layers. Note that the convolution-to-MM conversion is only a logical operation used for our derivation. It is not a real operation in our dataflow or architecture.
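The conversion described above can be illustrated with a small sketch for batch size 1. It is only a logical illustration under assumed names (`unfold_input`, `reshape_weights`, `matmul`, and `max_copies` are hypothetical helpers, and the nested-list layout is an assumption); it is not an operation of the proposed dataflow.

```python
from collections import Counter


def unfold_input(image, Hk, Wk, D=1):
    """Unfold one image [Ci][Hi][Wi] into the input matrix.

    Each row holds the Ci*Hk*Wk inputs of one sliding window, so inputs
    shared by overlapped windows are copied explicitly.
    """
    Ci, Hi, Wi = len(image), len(image[0]), len(image[0][0])
    rows = []
    for oy in range(0, Hi - Hk + 1, D):
        for ox in range(0, Wi - Wk + 1, D):
            rows.append([image[kz][oy + ky][ox + kx]
                         for kz in range(Ci)
                         for ky in range(Hk)
                         for kx in range(Wk)])
    return rows


def reshape_weights(kernels):
    """Reshape kernels [Co][Ci][Hk][Wk] into a (Ci*Hk*Wk) x Co weight matrix."""
    cols = [[w for ch in k for row in ch for w in row] for k in kernels]
    return [list(r) for r in zip(*cols)]


def matmul(A, Bm):
    """Plain matrix multiplication on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Bm)] for row in A]


def max_copies(unfolded):
    """Largest number of copies of any value (assumes distinct input values)."""
    return max(Counter(v for row in unfolded for v in row).values())
```

For a single-channel 5×5 image with distinct values and a 3×3 kernel at unit stride, the centre input appears in every one of the 9 windows, so `max_copies` reports 9 copies; these duplicated entries are exactly what sliding window reuse avoids loading more than once.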
B. Theoretical Derivation
Here we provide the theoretical derivation for the layer-wise lower bound of the off-chip communication. We consider a general case in which the on-chip memory cannot hold all inputs or all weights of a convolutional layer. Otherwise, it is just the ideal case (both the inputs and the weights are read exactly once).
Lemma 1. If a convolutional layer is represented by a DAG, the number of internal and output nodes in the DAG is
$$(2 W_K H_K C_I - 1)\, B W_O H_O C_O.$$
Proof: Fig. 4 illustrates a DAG that describes a convolutional layer. It has three levels. The first level comprises all input nodes, including inputs and weights. The second level is composed of all multiplication nodes. The last level is composed of all add nodes. The multiplication and add nodes associated with the same output form an add tree (multiplication nodes are also included in add trees). The detailed connections between the input nodes and the multiplication nodes are not shown because we are not interested in them.
There are $B W_O H_O C_O$ outputs, and each output is the sum of $W_K H_K C_I$ terms. Hence, there are $W_K H_K C_I$ multiplication nodes and $W_K H_K C_I - 1$ add nodes in an add tree. Since no internal node can be shared by different add trees, the number of internal and output nodes is
$$(2 W_K H_K C_I - 1)\, B W_O H_O C_O.$$
In Fig. 4, the inputs are marked as $a_1, a_2, \cdots$, and the weights are marked as $w_1, w_2, \cdots$. Each multiplication node produces a term $a_i w_j$. Note that the sum of multiple terms (e.g., $a_1 w_1 + a_2 w_2$) is not called a term. Instead, a term belongs to a sum (i.e., an add tree). We have the following lemma. Lemma 2. Let $T(S)$ be the maximum number of terms that can be produced in no more than $S$ add trees by using no more than $S$ on-chip memory units. For a convolutional layer with each input reused $R$ times by WndR, $T(S) = O(S\sqrt{RS})$.
Proof: The proof is based on the relation between convolutions and MMs. We use $A$, $B$ and $C$ to denote the unfolded input matrix, the weight matrix, and the output matrix, respectively. Then the MM is represented as $AB = C$, as shown in Fig. 5. The terms produced using no more than $S$ on-chip memory units can be arbitrarily distributed in $C$. Note that an element in $C$ is the sum of multiple terms belonging to the said sum, so the produced terms may overlap in $C$. Without loss of generality, suppose that the produced sums form $n$ rectangular blocks $C_1, C_2, \cdots, C_n$ in $C$. Any two different blocks in $C$ cannot overlap (if there are overlaps, we can always re-partition the blocks to eliminate overlaps). The size of $C_i$ is $u_i \times z_i$ (the minimum is $1 \times 1$). Block $C_i$ is the product of two corresponding blocks in $A$ and $B$, respectively, say $A_i$ and $B_i$, which are of sizes $u_i \times k_i$ and $k_i \times z_i$. Then each element in $C_i$ is the sum of $k_i$ terms. Note that different $A_i$ blocks may overlap, and the same holds for the $B_i$ blocks.
By the definition of $T(S)$, all produced terms must be in no more than $S$ add trees (namely, $S$ sums), i.e.,
$$\sum_{i=1}^{n} u_i z_i \le S. \tag{3}$$
The elements in all $A_i$ and $B_i$ blocks must be in the on-chip memory, so the number of these elements cannot exceed $S$. However, since an element in $A$ can be reused at most $R$ times by WndR, the actual number of required on-chip memory units for elements in $A$ is reduced by a factor of $R$. This means that
$$\sum_{i=1}^{n} \left( \frac{u_i k_i}{R} + k_i z_i \right) - |OV| \le S, \tag{4}$$
where set $OV$ contains all overlapped elements in all $A_i$ and $B_i$ blocks. Clearly, $|OV| \le S$. Hence, (4) becomes
$$\sum_{i=1}^{n} k_i \left( \frac{u_i}{R} + z_i \right) \le 2S. \tag{5}$$
According to the definition of $T(S)$, we have
$$T(S) = \max \sum_{i=1}^{n} u_i z_i k_i.$$
In what follows we will derive the maximum T (S) under the constraints defined in (3) and (5).
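Let $E_i = u_i z_i$ and $F_i = k_i \left( u_i / R + z_i \right)$ for $1 \le i \le n$, so that the constraints (3) and (5) become
$$\sum_{i=1}^{n} E_i \le S, \qquad \sum_{i=1}^{n} F_i \le 2S. \tag{7}$$
Based on the generalized mean inequality [39], we have
$$u_i z_i k_i = \frac{E_i F_i}{u_i / R + z_i} \le \frac{E_i F_i}{2\sqrt{u_i z_i / R}} = \frac{\sqrt{R}}{2} \sqrt{E_i}\, F_i, \tag{10}$$
where the equality holds iff $u_i = R z_i$ for $1 \le i \le n$. Now we have formulated a maximum value problem to maximize
$$\sum_{i=1}^{n} \sqrt{E_i}\, F_i$$
under the constraints defined in (7). By utilizing the Cauchy-Schwarz inequality [40], we have
$$\sum_{i=1}^{n} \sqrt{E_i}\, F_i \le \sqrt{\sum_{i=1}^{n} E_i} \cdot \sqrt{\sum_{i=1}^{n} F_i^2}, \tag{11}$$
where the equality holds iff there exists a nonzero constant $\lambda$ such that $\sqrt{E_i} = \lambda F_i$ holds for $1 \le i \le n$. Since $\sum_{i=1}^{n} F_i^2$ is continuous and strictly convex, its maximum value ($4S^2$) is reached on the boundary. We have
$$T(S) \le \frac{\sqrt{R}}{2} \sqrt{\sum_{i=1}^{n} E_i} \cdot \sqrt{\sum_{i=1}^{n} F_i^2} \le \frac{\sqrt{R}}{2} \cdot \sqrt{S} \cdot 2S = S\sqrt{RS},$$
where the rightmost bound is attained on the boundary of the variables' value range, i.e., when there is only one $i$ such that $E_i = S$ and $F_i = 2S$, and all the other $E_i$'s and $F_i$'s are 0. In this case, the conditions for the equality of (10) ($u_i = R z_i$ for $1 \le i \le n$) and (11) can both be satisfied.
Without loss of generality, we assume that $E_1 = S$ and $F_1 = 2S$, and all the other $E_i$'s and $F_i$'s are 0. In this case, we can derive that
$$u_1 = \sqrt{RS}, \qquad z_1 = \sqrt{S/R}, \qquad k_1 = \sqrt{RS}, \qquad u_1 z_1 k_1 = S\sqrt{RS}.$$
The upper bound of $T(S)$ is reached. This also implies that, to produce the most terms using limited on-chip memory units, the produced sums should be able to be merged into a single rectangular block (say, $C_1$).
Lemma 3. For any S-partition of the DAG of a convolutional layer, each subset $V_i$ can have at most $2T(S) + S$ internal and output nodes.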
Proof: By Property 4 of the S-partition model, the output set of $V_i$ has at most $S$ nodes. This implies that $V_i$ can have nodes in at most $S$ add trees. To bound the internal and output nodes that $V_i$ can have, we only need to consider $S$ add trees. By Property 3 of the S-partition model, there is a dominator set $D_i$ for $V_i$ that has no more than $S$ nodes. By the definition of $T(S)$, from $D_i$ at most $T(S)$ terms can be formed in $S$ add trees. $T(S)$ terms can form at most $T(S)$ add nodes in $V_i$. Considering that nodes in $D_i$ ($|D_i| \le S$) can possibly be internal or output nodes of $V_i$, $V_i$ can have at most $2T(S) + S$ internal and output nodes.
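The single-block optimum used in the proof of Lemma 2 can be checked numerically on a small instance. The sketch below is an illustration only: it assumes the single-block constraints take the form $uz \le S$ and $k(u/R + z) \le 2S$, restricts $R$ to an integer, and uses the hypothetical helper name `best_single_block`.

```python
import math


def best_single_block(S, R):
    """Exhaustively maximize u*z*k for one block, assuming the constraints
    u*z <= S and k*(u/R + z) <= 2*S with positive integers u, z, k.

    Returns (max_terms, (u, z, k)).
    """
    best = (0, None)
    for u in range(1, S + 1):
        for z in range(1, S // u + 1):          # enforces u*z <= S
            # Largest integer k with k*(u/R + z) <= 2S, i.e. k*(u + R*z) <= 2*S*R.
            k = (2 * S * R) // (u + R * z)
            if k >= 1 and u * z * k > best[0]:
                best = (u * z * k, (u, z, k))
    return best


if __name__ == "__main__":
    S, R = 16, 4
    terms, (u, z, k) = best_single_block(S, R)
    print(terms, (u, z, k), S * math.sqrt(R * S))
```

For $S = 16$ and $R = 4$ the search returns the block $u = 8$, $z = 2$, $k = 8$, which satisfies $u = Rz$ and produces $128 = S\sqrt{RS}$ terms, matching the bound of Lemma 2 for this instance.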
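Based on Lemmas 1, 2 and 3, we can calculate the minimum number of subsets that any S-partition must have, which is
$$P(S) \ge \frac{(2 W_K H_K C_I - 1)\, B W_O H_O C_O}{2T(S) + S} = \Omega\!\left( \frac{B W_O H_O C_O W_K H_K C_I}{S\sqrt{RS}} \right).$$
According to Theorem 1, we get the following theorem, which is also the key conclusion of this paper. Theorem 2. For a convolutional layer with each input reused $R$ times by WndR, implemented with an effective on-chip memory of size $S$, the off-chip communication volume $Q$ satisfies
$$Q = \Omega\!\left( \frac{B W_O H_O C_O W_K H_K C_I}{\sqrt{RS}} \right). \tag{14}$$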
The off-chip communication volume of a naive convolution implementation (without any data reuse) is $\Theta(B W_O H_O C_O W_K H_K C_I)$, i.e., proportional to the number of multiplication-accumulation operations.
The lower bound reduces it by a factor of $\sqrt{RS}$. If $R$ is 1, then a convolutional layer is exactly equivalent to an MM. In this case, the reduction factor is $\sqrt{S}$, which is consistent with the communication-optimal implementation of MMs [37].
It is worth mentioning that the derived lower bound is in the form of $\Omega$ instead of a precise value. It represents the asymptotic relation between the off-chip communication and the on-chip memory capacity when the problem scale is large enough. It is possible that some dataflows incur less off-chip communication in some special cases (e.g., for small workloads).
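As a rough illustration of how the asymptotic bound scales, the sketch below evaluates the quantity (number of multiply-accumulate operations) divided by $\sqrt{RS}$ for a given layer. The function name is hypothetical, the constant factor hidden by $\Omega$ is deliberately omitted, and taking $R = W_K H_K / D^2$ (the maximum sliding window reuse) is an assumption of this example.

```python
import math


def offchip_bound_order(B, Co, Ho, Wo, Ci, Hk, Wk, S, D=1):
    """Order-of-magnitude value MACs / sqrt(R*S) for one convolutional layer.

    S is the effective on-chip memory size in data entries.
    R = Hk*Wk / D**2 is assumed (maximum sliding window reuse).
    Constant factors of the asymptotic bound are not included.
    """
    R = (Hk * Wk) / (D * D)
    macs = B * Co * Ho * Wo * Ci * Hk * Wk
    return macs / math.sqrt(R * S)


if __name__ == "__main__":
    # A small, made-up layer: 1 image, 4 output channels of 4x4,
    # 4 input channels, 2x2 kernels, 16 on-chip entries.
    print(offchip_bound_order(B=1, Co=4, Ho=4, Wo=4, Ci=4, Hk=2, Wk=2, S=16))
```

Quadrupling `S` halves the returned value, which reflects the square-root dependence of the bound on the on-chip memory capacity.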
IV. COMMUNICATION-OPTIMAL DATAFLOW
In this section, we elaborate our dataflow with minimized off-chip communication based on the above derivation. The on-chip communication is minimized based on a proposed workload and storage mapping scheme.
A. Dataflow with Minimized Off-Chip Communication
The dataflow with minimized off-chip communication is derived from the proof process of Lemma 2. More precisely, in Fig. 5 , the output matrix C is partitioned into equal-sized blocks of size u×z. The block size should satisfy u ≈ Rz and also meet the on-chip memory capacity. A block needs the data in the two yellow bands in the unfolded input matrix A and the weight matrix B. Actually, the communication-optimal implementation of MM is also the blocked method described in Fig. 5 [41] . When we map the blocked implementation back to a convolutional layer, we get Fig. 6 .
A block in $C$ can be mapped to a $z \times y \times x$ ($u = xy$) 3D sub-matrix in the output images (e.g., the green block in Fig. 6). If the output channel dimension is too small (i.e., $W_O H_O < xy$), the said output sub-matrix may be from multiple (say, $b$) images in a batch. In this case, $u = bxy$. To compute the output sub-matrix, the inputs in the corresponding $x' \times y'$ ($x' = x + W_K - 1$ and $y' = y + H_K - 1$ if $D = 1$) locations from all input channels of $b$ images (i.e., the yellow block in the input images) and $z$ kernels associated with the partial output channels (i.e., the kernels colored yellow) are needed, as shown in Fig. 6. Due to the limited on-chip memory, it might be impossible to load all required data at a time. Instead, the sub-matrix is computed by a series of iterations. In each iteration, in the yellow blocks, we load the inputs from a portion (say, $k$) of the input channels and the corresponding weights to the on-chip memory, shown by the red blocks in Fig. 6. Then we can perform a partial update to the output sub-matrix. To complete the output sub-matrix, we continuously load inputs and weights in the yellow blocks and perform partial updates. For an output sub-matrix, the needed inputs and weights are read from the off-chip DRAM exactly once. Different output sub-matrices in the output images are computed sequentially in the same way. Fig. 7 lists the pseudo code of the dataflow. Any quadruple $\{b, z, y, x\}$ (i.e., tiling sizes) defines an implementation of the dataflow. For a fixed quadruple $\{b, z, y, x\}$, $k$ does not affect the off-chip communication. However, under a given on-chip memory capacity, a smaller $k$ results in larger output sub-matrices, and thus, fewer output sub-matrices. Hence, $k$ should be the smallest value, namely, 1.
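The tiled dataflow described above can be sketched in software to make the DRAM traffic explicit. This is a minimal model under stated assumptions, not the pseudo code of Fig. 7: unit stride, $k = 1$, nested-list arrays, and a simple counter that charges one DRAM read per loaded input or weight and one DRAM write per finished output. The name `tiled_conv` and its return convention are hypothetical.

```python
def tiled_conv(inp, wt, b, z, y, x):
    """Tiled convolution with on-chip partial sums (unit stride, k = 1).

    inp: [B][Ci][Hi][Wi], wt: [Co][Ci][Hk][Wk] (assumed layouts).
    Returns (out, dram_reads, dram_writes).
    """
    B, Ci = len(inp), len(inp[0])
    Hi, Wi = len(inp[0][0]), len(inp[0][0][0])
    Co, Hk, Wk = len(wt), len(wt[0][0]), len(wt[0][0][0])
    Ho, Wo = Hi - Hk + 1, Wi - Wk + 1
    out = [[[[0] * Wo for _ in range(Ho)] for _ in range(Co)] for _ in range(B)]
    reads = writes = 0
    for i0 in range(0, B, b):
        for z0 in range(0, Co, z):
            for y0 in range(0, Ho, y):
                for x0 in range(0, Wo, x):
                    ib = range(i0, min(i0 + b, B))
                    zb = range(z0, min(z0 + z, Co))
                    yb = range(y0, min(y0 + y, Ho))
                    xb = range(x0, min(x0 + x, Wo))
                    # Psums of this output sub-matrix stay on chip until complete.
                    psum = {(i, oz, oy, ox): 0
                            for i in ib for oz in zb for oy in yb for ox in xb}
                    for kz in range(Ci):  # one input channel per iteration
                        # Input patch of (y + Hk - 1) x (x + Wk - 1) per image.
                        reads += len(ib) * (len(yb) + Hk - 1) * (len(xb) + Wk - 1)
                        # Hk x Wk weights of this channel for each of the z kernels.
                        reads += len(zb) * Hk * Wk
                        for (i, oz, oy, ox) in psum:
                            for ky in range(Hk):
                                for kx in range(Wk):
                                    psum[(i, oz, oy, ox)] += (
                                        inp[i][kz][oy + ky][ox + kx] * wt[oz][kz][ky][kx]
                                    )
                    # Each finished output is written back exactly once.
                    for (i, oz, oy, ox), v in psum.items():
                        out[i][oz][oy][ox] = v
                    writes += len(psum)
    return out, reads, writes
```

On a 3×3 single-channel input with one 2×2 kernel, a single 2×2 output tile needs 13 reads, whereas four 1×1 tiles need 32 reads for the same result, which shows how larger output sub-matrices cut the read volume while the write volume stays fixed.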
To explain why this dataflow is superior, we notice that it fully exploits OutR, since Psums reside on chip during the computational process and are written back to the off-chip DRAM only once. More importantly, it also takes into account other data reuse patterns at the same time, including InR (an input is reused by weights in $z$ kernels), WtR (a weight is reused by $b \times x \times y$ outputs) and WndR (an input is reused by at most $R$ sliding windows on each $x' \times y'$ plane). However, none of these data reuse patterns is fully utilized (for example, the loaded inputs are only reused by the loaded weights but not by all kernel weights). This implies that maximizing any single data reuse pattern is never the optimal solution. In summary, our dataflow fully exploits OutR and also combines InR, WtR and WndR in a balanced way.
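We now verify that the proposed dataflow is able to achieve the lower bound of the off-chip communication. There are $(B W_O H_O C_O)/(bxyz)$ blocks in total in the output images. For each block, $W_K H_K C_I z$ weights and $b x' y' C_I$ inputs are needed. The DRAM read volume is
$$\frac{B W_O H_O C_O}{bxyz} \left( W_K H_K C_I z + b x' y' C_I \right) \approx \frac{B W_O H_O C_O W_K H_K C_I}{uz} \left( z + \frac{u}{R} \right),$$
where $u = bxy$ and $b x' y' \approx u W_K H_K / R$. Each output is written back exactly once, so the DRAM write volume is $B W_O H_O C_O$. With $u \approx Rz$, the total off-chip communication volume is
$$Q_{\mathrm{DRAM}} \approx \frac{2 B W_O H_O C_O W_K H_K C_I}{\sqrt{R u z}} + B W_O H_O C_O. \tag{16}$$
If $uz \approx S$ (for minimizing the read volume) and $W_K H_K C_I / \sqrt{RS} \gg 1$ (for ignoring the write volume), (16) satisfies Theorem 2. This implies that, to reach the minimum off-chip communication, most of the effective on-chip memory should be assigned to Psums (since $uz \approx S$). The fundamental principle behind this conclusion is to use the least inputs to produce the most outputs, implying that data reuse is maximized. In addition, for layers with few weights, the lower bound of (14) may not be tight, since $W_K H_K C_I / \sqrt{RS} \gg 1$ does not hold and the write volume cannot be ignored.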
In fact, a few prior studies have more or less discussed similar dataflows [7] , [19] , [42] . However, they failed to find the superiorities of this dataflow due to the intuitive analysis and the lack of theoretical basis. Ref. [7] evaluated several OutR dataflows but the poor implementations brought ∼50% of the energy consumed by inter-PE communication which is actually unnecessary. Ref. [19] considered an OutR dataflow but the tiling sizes were not properly selected. The convolution implementation proposed in [42] is for graphics processing units rather than for hardware accelerators. Ref. [43] proposed a dataflow for CNN accelerators which explicitly converts any convolutional layer into an MM without exploiting WndR.
B. Workload and Storage Mapping with Minimized On-Chip Communication
Here we focus on the computation of an iteration (i.e., the red line in Fig. 7) . The required inputs and weights for an iteration have been loaded to the GBuf. The workload of an iteration is mapped to a PE array that consists of p×q PEs. We will introduce a workload and storage mapping scheme to minimize both GBuf communication and Reg communication.
1) Minimizing GBuf Communication: There is a major difference between optimizations of the off-chip communication and the on-chip communication. When minimizing the off-chip communication, since the problem scale can be arbitrary but the hardware resources are fixed, tiling is necessary and the workload is finished by a number of sequential iterations. For an iteration, however, since the output sub-matrix size is limited by the on-chip memory capacity, it is possible to design the PE array size and the Reg capacity such that the hardware resources can handle the workload of an iteration at a time. This difference leads to a different lower bound: the loaded inputs and weights (in the GBuf) can be read exactly once. This is no doubt the minimum possible GBuf communication.
Without loss of generality, a PE is the smallest computational unit that has a multiplication-accumulation (MAC) unit. Like the dataflow to minimize the off-chip communication, each PE computes $x_s \times y_s$ outputs in $z_s$ output channels, so each PE contributes to a $z_s \times y_s \times x_s$ ($\ge (bxyz)/(pq)$) block in the output sub-matrix, as illustrated in Fig. 8. The outputs produced by the $p \times q$ PEs should cover the reshaped output sub-matrix ($bxy \times z$). Fig. 9 details the workload mapping for two PEs (PEs 1,1 and 1,2) and the workload of one PE. For one PE, $x'_s y'_s k$ inputs and $z_s k W_K H_K$ weights are needed (remember that $k = 1$ in practice). However, we do not need to load them at a time. To enable WndR on each $x'_s \times y'_s$ plane (see Fig. 8), $x'_s \times y'_s$ inputs (for one PE) are loaded to the Regs. Since WndR cannot be applied to weights, we just load $z_s$ weights (for one PE) to the Regs. In an iteration, if updating all outputs once is called a pass (the $i$th pass computes the $i$th Psums of all outputs), in each pass, a PE uses $x_s y_s$ inputs and $z_s$ weights to produce $x_s y_s z_s$ Psums (see the workload of one PE shown in Fig. 9). A pass needs $x_s y_s z_s$ clock cycles. The loaded inputs can be used for $W_K H_K$ passes (because WndR is exploited in the Regs). The loaded weights can be used for just 1 pass, so a PE needs to load $z_s$ weights to the Regs in every pass. To complete an iteration, the $p \times q$ PEs need $k W_K H_K$ passes.
When considering the PE array, PEs in the same row share the loaded inputs and PEs in the same column share the loaded weights. As a result, each weight in the GBuf is read exactly once, reaching the minimum communication. The average read count of each input in the GBuf is $(x'_s y'_s)/(x_s y_s)$, which is larger than 1. The extra reads are from the halos (i.e., the inputs out of the $x_s \times y_s$ rectangle but in the $x'_s \times y'_s$ rectangle) on each input channel. It is possible to avoid reading extra halos by designing a complicated data transmission network, as an input in a block's halo is also an input of another block, such that each input in the GBuf is also read exactly once. We prefer reading extra halos as it simplifies the hardware design and regularizes the read patterns. Ideally, the Regs for storing inputs and weights can be global Regs (GRegs) instead of PEs' local Regs (LRegs) (for example, in Fig. 9, the $x'_s \times y'_s$ GRegs are shared by the first PE row). In practice, to avoid large fanouts and long latency of long wires, we partition the PE array into groups and each group shares a set of GRegs, with little extra Reg communication.
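The per-iteration quantities discussed above (passes, cycles, and the halo-induced read overhead) can be tabulated with a few lines of code. This is a sketch under assumptions: unit stride, so that $x'_s = x_s + W_K - 1$ and $y'_s = y_s + H_K - 1$, and a hypothetical function name `pe_iteration_stats`.

```python
def pe_iteration_stats(xs, ys, zs, Hk, Wk, k=1):
    """Passes, cycles, and average GBuf input read count for one iteration.

    xs, ys, zs: per-PE output block sizes; Hk, Wk: kernel size;
    k: input channels per iteration. Unit stride is assumed.
    Returns (passes, cycles, input_read_ratio).
    """
    passes = k * Wk * Hk                 # passes needed to complete an iteration
    cycles = passes * xs * ys * zs       # each pass takes xs*ys*zs clock cycles
    xs_halo = xs + Wk - 1                # x'_s for unit stride (assumption)
    ys_halo = ys + Hk - 1                # y'_s for unit stride (assumption)
    input_read_ratio = (xs_halo * ys_halo) / (xs * ys)
    return passes, cycles, input_read_ratio


if __name__ == "__main__":
    print(pe_iteration_stats(xs=4, ys=4, zs=2, Hk=3, Wk=3))
```

With a 4×4 output block per PE and 3×3 kernels, the ratio is 36/16 = 2.25; enlarging the per-PE block shrinks the relative halo and pushes the ratio toward 1.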
We choose to store Psums in PEs' LRegs. An alternative way is to store Psums in the GBuf, which reduces the Reg capacity. However, a Psum needs to be loaded to a Reg when it is being updated and stored back to the GBuf when updated, resulting in lots of data shuffling between the GBuf and Regs, and thus, high energy consumption. Hence, storing Psums in the GBuf is not energy efficient. Keeping Psums in Regs completely avoids GBuf access for Psums. Thus, the communication between the GBuf and Regs is minimized.
By utilizing our workload and storage mapping, the GBuf capacity can be reduced. Since weights are read row by row from the reshaped weight sub-matrix and inputs are read column by column from the reshaped input sub-matrix (see Fig. 9), we do not need to load k W_K H_K × z weights and bx'y' × k inputs to the GBuf at once. Instead, we only need one row of SRAMs for weights and one column of SRAMs for inputs. Once data in the GBuf are loaded to the GRegs, the GBuf is used for prefetching data for the subsequent pass.
2) Minimizing Reg Communication: Psums are stored in PEs' LRegs. Since each MAC operation needs a Reg write, the minimum number of Reg writes is the number of MAC operations, i.e.,
This is clearly the minimum Reg communication. Keeping Psums in LRegs naturally reaches this lower bound, which minimizes the dynamic energy of LRegs. On the other hand, the static energy of LRegs should also be optimized. Suppose that each PE has r (≥ x_s y_s z_s) LRegs to store Psums. For a PE, in each cycle, at most one Reg is written and the other r−1 Regs just consume static energy. If r is large, the static energy consumption of the Regs may dominate the total Reg energy. Increasing the PE array size (i.e., pq) reduces r, at the cost of higher power in the arithmetic components. However, with more PEs, the execution time is also reduced, so the energy of the arithmetic components remains almost unchanged. From an energy point of view, using more PEs therefore lowers the static energy consumption of the Regs, even though the arithmetic power dissipation increases.
Using GRegs to share inputs and weights to the PE array completely avoids inter-PE communication. Duplicating inputs and weights from the GBuf to GRegs brings little extra Reg communication. Thus, the Reg communication is minimized.
C. Summary
We summarize the communication lower bound here. The lower bound of the off-chip communication is defined in (14) . The lower bound of the GBuf communication is the off-chip communication of inputs and weights. The lower bound of the Reg communication is defined in (17) . There are two key equations to achieve the lower bound: bxy ≈ Rz (for setting the tiling sizes) and bxyz ≈ S (most of the on-chip memory capacity should be assigned to Psums).
The advantages of our dataflow and our workload and storage mapping scheme come from three aspects. First, our dataflow and workload mapping scheme fully exploit OutR and also combine InR and WtR in the best way, which is in effect a combination of a communication-optimal MM implementation and WndR. The optimal dataflow and workload mapping scheme help reduce both DRAM communication and GBuf communication. Second, the concurrency of PEs is exploited to share inputs and weights through GRegs. Third, Psums are stored in PEs' LRegs. The last two points both help reduce GBuf communication and Reg communication. Our approach can practically reach the minimum communication in a three-level memory hierarchy.
V. COMMUNICATION-OPTIMAL CNN ACCELERATOR ARCHITECTURE
In this section, we propose a CNN accelerator architecture with minimized communication, based on the theoretical conclusions of the previous section. According to the implication of (16), most of the effective on-chip memory should be assigned to Psums to minimize the off-chip communication. We use an example containing 64KB Psums and p×q = 16×16 PEs to describe the design methodology of our CNN accelerator. We use 16-bit fixed-point arithmetic units, so there are 32K (32768) entries for Psums and each PE has 128 entries.
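The entry counts in this example follow from simple arithmetic, sketched below. The function name is hypothetical; the numbers are the ones used in the example above.

```python
def psum_entries(psum_bytes, word_bits, p, q):
    """Total Psum entries on chip and the share held by each PE."""
    total = psum_bytes * 8 // word_bits  # one entry per fixed-point word
    per_pe = total // (p * q)            # entries split evenly over the PE array
    return total, per_pe


# 64KB of Psum storage, 16-bit words, 16x16 PEs.
total, per_pe = psum_entries(64 * 1024, 16, 16, 16)
print(total, per_pe)
```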
Based on the workload and storage mapping scheme illustrated in Fig. 9 , we design our architecture as shown in Fig. 10 . The architecture mainly comprises a PE array, GRegs, two GBufs (an input GBuf (IGBuf) and a weight GBuf (WGBuf)), a controller, and some first-in first-out (FIFO) buffers that connect the off-chip DRAM and the on-chip memories.
GBufs: According to the discussions of Section IV-B1, to avoid long wires, the PE array is partitioned into PE groups and each PE group (p_g × q_g PEs) shares a set of GRegs (see Fig. 10). In our example, p_g = q_g = 4. All GReg rows (columns) store the same weights (inputs), and the same position in all GReg rows (columns) is written at the same time.
We now discuss how to determine the sizes of the GBufs. Remember that most of the effective on-chip memory should be assigned to Psums (i.e., S ≈ 32768) and the tiling sizes {b, z, y, x} should satisfy bxy ≈ Rz to minimize the off-chip communication. If R = 1 (i.e., no WndR), bxy ≈ z ≈ 181. This is the approximate maximum value of z, so we set the size of the WGBuf to 256 entries (0.5KB). With larger R, bxy also becomes larger. Considering that the maximum R is typically 9 (W_K = H_K = 3 and D = 1, see (2)), the maximum bxy is 543. Since the IGBuf should store bx'y' (slightly larger than bxy) inputs from b y' × x' input channel planes (see Fig. 8), we set the size of the IGBuf to 1024 entries (2KB). We leave some extra entries in the GBufs to adapt to various tiling sizes. Even so, the GBuf capacity is still very small. Once data in the GBufs are loaded to the GRegs, the GBufs are used for prefetching inputs and weights for the subsequent pass. The prefetching is (partially) overlapped with computation.
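The values 181 and 543 follow from solving the two conditions bxy ≈ Rz and bxyz ≈ S together, which gives z ≈ sqrt(S/R) and bxy ≈ sqrt(RS). A minimal sketch of that calculation (the function name is hypothetical):

```python
import math


def tiling_targets(S, R):
    """Solve bxy = R*z and bxy*z = S for the ideal (real-valued) tile sizes."""
    z = math.sqrt(S / R)
    bxy = math.sqrt(S * R)
    return bxy, z


S = 32768  # Psum entries in the example architecture
for R in (1, 9):
    bxy, z = tiling_targets(S, R)
    print(f"R={R}: bxy ~ {bxy:.1f}, z ~ {z:.1f}")
```

The largest z (R = 1) sizes the WGBuf and the largest bxy (R = 9) sizes the IGBuf; the chosen 256 and 1024 entries leave headroom above both.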
Inputs and weights stored in the GBufs are kept in the same order as in the reshaped input and weight sub-matrices (see Fig. 8). This is the natural order when loading them from the DRAM. No special order is needed. Inputs are not unfolded so we can exploit WndR on chip.
GRegs: A GReg row (storing weights) is shared by p_g PE rows, so p_g × q (4 × 16) PEs share a GReg row. Data stored in each GReg row are copied from the WGBuf. To adapt to different z values, we design a multiplexer (MUX) structure, as shown in Fig. 11. There are q (16) weight MUXes, each (256/q)-to-1 (16-to-1), connecting the WGBuf and the q (16) PE columns. Slightly different from the workload mapping shown in Fig. 9, here the z_s channels computed by a PE are not consecutive but have a stride of q (16). The inputs of the q (16) weight MUXes are arranged in a round-robin way, so that the input range exactly covers all entries of the WGBuf. To adapt to different z and z_s values, we just control the selection signals of the weight MUXes. For instance, if z = 64 (so z_s = 4), the selection signals of the weight MUXes range from 0 to 3, so that only the first 64 entries of the WGBuf can be selected. Such a weight MUX structure avoids the use of a complicated data transmission network (e.g., a network-on-chip).
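The round-robin arrangement can be illustrated with a small index model. This is a sketch of one plausible reading of the description, not the actual wiring of Fig. 11: it assumes that input `select` of the MUX feeding PE column `column` is connected to WGBuf entry `select * q + column`, and that z is a multiple of q. Both function names are hypothetical.

```python
def wgbuf_entry(column, select, q=16):
    """WGBuf entry seen by PE column `column` when its MUX select is `select`."""
    return select * q + column  # assumed round-robin wiring


def reachable_entries(z, q=16):
    """All WGBuf entries selectable when z output channels are in use."""
    z_s = z // q  # channels per PE column; z assumed to be a multiple of q
    return sorted(wgbuf_entry(c, s, q) for c in range(q) for s in range(z_s))


# z = 64 gives z_s = 4: selects 0..3 reach exactly the first 64 entries.
print(reachable_entries(64) == list(range(64)))
# Column 1 handles channels 1, 17, 33, 49, i.e., a stride of q.
print([wgbuf_entry(1, s) for s in range(4)])
```

Under this wiring a single shared select value suffices for all columns, which matches the statement that only the selection signals need to change when z changes.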
To exploit WndR in the GRegs, each GReg column (storing inputs) is partitioned into p (16) segments. A GReg segment has 64 entries and is shared by 1 × q_g (1 × 4) PEs. Each GReg segment loads x_s y_s inputs (see Fig. 9) from the IGBuf. The x_s y_s inputs can be used in W_K H_K passes to compute x_s y_s z_s Psums. Each GReg segment has a 64-to-1 MUX to provide inputs to the 1 × q_g (1 × 4) PEs. The selection signals of the input MUXes range from 0 to x_s y_s − 1 so that only the first x_s y_s entries of the GReg segments can be selected.
PEs: A PE comprises a MAC unit and a set of LRegs (128 entries) for Psums. Our architecture does not need LRegs in each PE to store inputs or weights. A PE computes a Psum and writes the accumulated result to an LReg in each cycle. All PEs operate synchronously. This means that, at the same moment, the selection signals of all input MUXes are identical, the selection signals of all weight MUXes are identical, and the read and write positions of all LRegs are also identical.
Controller: Our architecture has a global controller, which schedules the computational process. It is a finite-state machine that generates control signals for all components, including the read/write signals and addresses of all memories and the selection signals of all MUXes. No local controller is needed in each PE.
VI. EXPERIMENTAL RESULTS
Our CNN accelerator is implemented in Verilog. We synthesize it with Design Compiler based on the 65nm technology. We use Memory Compiler to generate the GBufs. The power dissipation is evaluated with PrimeTime. CACTI [44] is employed to evaluate the latency and energy consumption of a 2GB DDR3 DRAM (the peak bandwidth is 6.4GB/s). The core frequency is 500MHz and the DRAM frequency is 100MHz. A cycle-accurate simulator is built to evaluate the performance with memory access latency taken into account. The representative state-of-the-art, Eyeriss [7] , [10] , is the baseline for comparison (detailed off-chip and on-chip communication volumes are reported in [10] ). The workload is VGGNet-16 [45] with batch size 3, the same as the workload used in [10] . VGGNet has diverse layer dimensions, including large/shallow layers, small/deep layers, and layers with medium size/depth.
We evaluate five implementations of our accelerator with different PE numbers and on-chip memory sizes, as listed in Table I . Table II lists the energy consumption of the basic operations, estimated by our simulations. 
[Fig. 12: dataflows based on different data reuse patterns. Panels: InR-A, InR-B, InR-C, OutR-A, OutR-B. Block labels: an x×y plane, C_O outputs, k×y×x inputs, k input channels, C_I×y×x inputs, z×k×W_K×H_K weights, z kernels.]
A. DRAM Access Volume
We compare our dataflow with other dataflows based on different data reuse patterns, as shown in Fig. 12, in which the colored blocks reside on chip for reuse. For example, in InR-A, a k × y × x block resides on chip for reuse, while the associated weights and outputs are shuffled on and off chip when necessary. These dataflows should cover the most popular ones used in the literature. For example, ShiDianNao [12] uses OutR-A. Fig. 13 compares the DRAM access volume under different effective on-chip memory sizes. The lower bound is calculated by (16). To make a fair comparison and to remove the impact of improper tiling sizes, the tiling sizes of all dataflows are obtained by exhaustive searches (since the loop order is fixed, searching for the best tiling sizes is fast, typically taking less than 0.1s). The found minimum is obtained by searching for the best dataflow with the best tiling sizes for each layer. Fig. 13 demonstrates that our dataflow produces almost the same DRAM access volume as the found minimum, and the difference is only 4.5% on average. Our dataflow does not produce the least DRAM access volume for all layers because, as mentioned at the end of Section III, the derived lower bound is in the form of Ω instead of a precise value. Nevertheless, it is unnecessary to select the best dataflow from multiple candidates, as the expected improvement in the DRAM access volume is less than 5%. Our dataflow produces 10% more DRAM access volume on average than the theoretical lower bound. The 2nd and 3rd best dataflows, InR-A and WtR-A, respectively produce 45.1% and 45.8% more DRAM access volume than ours. Fig. 14 shows the per-layer DRAM access volume of the lower bound, our dataflow, our implementations 1-3, InR-A, and WtR-A.
The difference between our dataflow and our implementation is that the latter has a fixed on-chip memory splitting (e.g., 64KB Psums plus 2.5KB GBufs in our implementations 1-3). For this reason, our implementations 1-3 produce 3-4% more DRAM access than our dataflow, indicating that the fixed on-chip memory splitting has only a tiny impact. Our dataflow and implementations produce balanced input and weight access volumes, while outputs take up a small portion of the DRAM access volume. For InR-A and WtR-A (the 2nd and 3rd best dataflows), outputs account for a large portion of the DRAM access volume, and the input and weight access volumes are not balanced, leading to much larger memory access volumes. We tried to make an apples-to-apples comparison with published data but found it difficult. Ref. [10] reported the DRAM access volume of VGGNet-16 with input compression on Eyeriss. Ref. [19] selected the best dataflow with the minimum DRAM access volume from three candidates. Inputs, weights, and outputs are pruned in [19]. Our work targets general CNN accelerators without pruning/compression, so the results reported in [10], [19] are not directly comparable to ours. We therefore make an approximate comparison.
Eyeriss has a 108KB GBuf, but its effective on-chip memory capacity is 173.5KB: 100KB of the GBuf stores inputs and outputs (the other 8KB is used for prefetching weights), while weights are stored in PEs' local SRAMs (each PE has 448B of local SRAM) [10]. Under the 173.5KB effective on-chip memory limit, we compare our dataflow and Eyeriss with and without input compression, as shown in Fig. 15 and Table III. Ref. [10] reports the per-layer input compression ratios of VGGNet-16 but not the proportion of the input access volume in the total access volume. We therefore use the proportion from our dataflow to estimate the off-chip DRAM access volume of Eyeriss without input compression. Our dataflow reduces the DRAM access volume by 43.3% compared with Eyeriss without input compression, and it even produces 6.7% less DRAM access volume than Eyeriss with input compression. We notice from Fig. 15 that for layer 1, Eyeriss produces a lower DRAM access volume than the lower bound. This is because the derived lower bound is in the form of Ω instead of a precise value. It represents the asymptotic relation between the off-chip communication volume and the on-chip memory capacity when the problem scale is large enough. Special cases exist for small workloads.
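The 173.5KB figure can be reproduced as follows. Note the assumption: the PE count of 168 is not stated in this text; it is taken from the Eyeriss publication's 12×14 PE array. The function name is hypothetical.

```python
def effective_capacity_kb(gbuf_data_kb, num_pes, local_bytes_per_pe):
    """GBuf space holding inputs/outputs plus all per-PE local weight storage."""
    return gbuf_data_kb + num_pes * local_bytes_per_pe / 1024


# 100KB of the GBuf for inputs/outputs, 448B of local SRAM per PE,
# and an assumed 168 PEs (12x14 array, from the Eyeriss publication).
print(effective_capacity_kb(100, 168, 448))
```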
Compared with FlexFlow [22] (64KB GBuf and 512B/PE local storage), which selects the best dataflow from several candidates, the DRAM access/MAC metric of our implementation 1 (2.5KB GBuf and 512B/PE local storage) is 33% better (0.0033 vs. 0.0049).
B. GBuf Access Volume
Fig. 16 shows the GBuf access volume of our accelerator and the comparison with Eyeriss. Our implementations (with smaller total and effective on-chip memory capacities) produce much less GBuf communication than Eyeriss, with reduction factors of 10.9-15.8×. The large reduction is due to the elimination of data shuffling between the GBuf and LRegs.
To understand how our accelerator reaches the minimum GBuf communication, we list the DRAM and GBuf access volumes of implementation 1 in Table IV. For weights, the GBuf read and write volumes each equal the DRAM read volume, reaching the theoretical lower bound. For inputs, the GBuf write volume is slightly larger than the DRAM read volume, because the tiling-based dataflow causes some input or output blocks to extend beyond the input or output boundaries, resulting in a few redundant GBuf writes. The GBuf read volume for inputs is 1.67× the DRAM read volume for inputs. The extra reads are from the halos of convolution inputs, as explained in Section IV-B1. Overall, the GBuf read and write volumes are respectively 1.33× and 1.07× the DRAM read volume, indicating that our accelerator roughly reaches the theoretical lower bound of the GBuf communication.
C. Reg Access Volume
Fig. 17 shows the Reg access volume of our accelerator and the comparison with the lower bound. The lower bound is calculated from (17). The Reg access volume of our accelerator is only 5.9-11.8% larger than the lower bound, indicating that our accelerator almost reaches the theoretical lower bound of the Reg communication. The extra Reg communication is from a) the GReg communication, and b) Psums that are out of the output boundary caused by the tiling-based approach.
We are not able to make a numerical comparison with any existing CNN accelerator on the Reg communication since no similar result was found. For an intuitive comparison with Eyeriss (and other accelerators which propagate data in the PE array, e.g., [17], [18]), our architecture is expected to reduce the Reg communication severalfold, because Eyeriss not only writes Psums to Regs in each cycle (which our accelerator also does), but also propagates inputs, weights, and Psums in the PE array (which our accelerator does not).
D. Energy Efficiency
[...] (# of MACs) writes). The lower bound describes the essential energy consumption to complete the MAC operations. MAC operations and Regs dominate the energy consumption of our accelerator. Our accelerator almost reaches the lower bound for DRAM communication and MAC operations. For the Reg energy, our accelerator consumes more energy than the lower bound. The extra Reg energy is mainly due to the static energy consumption of the LRegs. With fewer LRegs in each PE, the Reg energy consumption is decreased. Even so, MAC operations take up the largest portion of the total energy consumption, implying that our accelerator is computation dominant. The gap between the energy efficiency of our implementations and the best value is only 37-87%, indicating that our accelerator roughly reaches the best energy efficiency. According to the measured data reported in [10], the energy efficiency of Eyeriss with input compression and zero gating is 22.1pJ/MAC (for on-chip aspects). As a direct numeric comparison, our accelerator (by simulations) without data compression or gating is 2.61-3.68× more energy efficient than Eyeriss for on-chip aspects.
Fig. 19 shows the performance and power dissipation of our accelerator. With more PEs, the execution time is reduced and the power is increased. The proportion of waiting time increases with more PEs. With reduced computational time, the memory access latency cannot be fully overlapped by computation, so it affects the execution time. Compared with Eyeriss, our five implementations achieve a 9.8-42.3× performance gain, with memory access latency taken into account.
E. Memory and PE Utilizations
[...] workload of each PE. Since the LRegs dominate the on-chip memories, the overall memory utilization is also high (80.6-91.0%). The PE utilization remains very high (>97%). In fact, all PEs are busy in our implementations. The small quantity of useless PE workload is caused by the tiling-based approach.
VII. CONCLUSIONS
In current CNN accelerators, communication dominates the energy consumption and consumes much more energy than computation. In this work, we provide the theoretical lower bounds of both off-chip communication and on-chip communication. Based on the theoretical results, we elaborate our communication-optimal dataflow as well as a communication-optimal accelerator architecture. We demonstrate by both theoretical analysis and experimental results that our dataflow and architecture are able to practically reach the minimum communication in a three-level memory hierarchy. Our CNN accelerator is computation dominant and the energy efficiency is close to the theoretical best value.
