NP-CGRA: Extending CGRAs for Efficient Processing of Light-weight Deep Neural Networks by Lee, Jungi
 
 
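Attribution-NonCommercial-NoDerivs 2.0 Korea
You are free to copy, distribute, transmit, display, perform, and broadcast this work, provided that you follow the conditions below.
For any reuse or distribution, you must make clear the license terms applied to this work.
These conditions do not apply if you obtain separate permission from the copyright holder.
Your rights under copyright law are not affected by the above.
Attribution. You must attribute the original author.
NonCommercial. You may not use this work for commercial purposes.
NoDerivs. You may not alter, transform, or build upon this work.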
Master’s Thesis
NP-CGRA: Extending CGRAs for Efficient
Processing of Light-weight Deep Neural Networks
Jungi Lee
Department of Electrical Engineering
Ulsan National Institute of Science and Technology
2021




Coarse-grained reconfigurable architectures (CGRAs) can provide both high energy efficiency
and flexibility, making them well-suited for machine learning applications. However, previous
work on CGRAs has very limited support for deep neural networks (DNNs), especially for
recent light-weight models based on depthwise separable convolution (DSC), which are an
important workload for mobile environments. In this paper, we propose a set of architecture
extensions and a mapping scheme to greatly enhance CGRA performance for DSC kernels. Our
experimental results using MobileNets demonstrate that our proposed CGRA enhancement can
deliver 8∼18× improvement in area-delay product, depending on layer type, over a baseline CGRA
with a state-of-the-art CGRA compiler. Moreover, our proposed CGRA architecture can also speed
up 3D convolution with efficiency similar to that of previous work, demonstrating the
effectiveness of our architectural features beyond DSC layers.

Contents
I Introduction
II Related Work
2.1 Baseline CGRA
2.2 CGRA Architecture Exploration
2.3 DPU Optimization
III NP-CGRA Architecture
3.1 CGRA Performance Bottleneck Analysis
3.2 Our Proposed Architecture Extension
3.3 Instruction Format and Global Configuration
IV Application Mapping for NP-CGRA: DWC Case
4.1 Depthwise Convolution with Arbitrary Stride
4.2 Depthwise Convolution with S = 1
V Address Generation
5.1 Pointwise Convolution
5.2 Depthwise Convolution with Arbitrary Stride
5.3 Depthwise Convolution with S = 1
5.4 Performance Analysis
VI Experiments
6.1 Experimental Setup
6.2 Depthwise Separable Convolution Results
6.3 Hardware Overhead Evaluation
6.4 Comparison with Previous Work Using MobileNet
6.5 AlexNet Convolution Layer Results
VII Conclusion
References
Acknowledgements
List of Figures
1 Mapping PWC (or matrix mult.) to a 2×2 CGRA.
2 Proposed PE architecture (our extension shown in red).
3 Instruction definition.
4 Extended CGRA architecture.
5 Mapping DWC on a 2×2 CGRA (K = 3, S = 2).
6 Schedule and data movement for DWC (S = 1).
7 Data access patterns in DWC.
8 Mapping DWC with stride 1 on a 2×2 CGRA.
9 PWC IFM data in external memory and H-MEM.
10 DWC2 IFM data in external memory and H-MEM.
11 IFM data in external memory and in V-MEM for DWC (shown in red is a tile).
12 Area comparison.
List of Tables
1 Theoretical min latency (ms, sum of 7 DWC layers)
2 Parameters and variables
3 Performance analysis
4 NP-CGRA specifications
5 MobileNet DSC result
6 Comparison with previous CGRA and DPU implementations
I Introduction
Recently, a number of DNN (Deep Neural Network) processing units, or DPUs, have been
proposed, which can be classified as soft DPUs (implemented on an FPGA) and hard DPUs
(fabricated into a chip) [1]. Hard DPUs such as TPU [2] can have higher performance and energy
efficiency but lack flexibility, and may have difficulty supporting future application changes.
Soft DPUs such as BrainWave [1] can be easily upgraded, but typically have much lower
performance per cost and energy efficiency than hard DPUs. CGRAs can strike a balance between
energy efficiency and flexibility, for example by supporting new activation functions (e.g.,
leaky ReLU [3]) and skip connections. CGRAs can also be used for applications other than DNNs.
Previous work on mapping DNNs to CGRAs includes new architectures [4,5] and a new
compilation method [6], but all of it targets 3D convolution only (such as used in AlexNet [7]).
However, for mobile applications, conventional 3D convolutions are superseded by light-weight
models exploiting depthwise separable convolution (DSC), such as MobileNets [8,9], due to their
significantly higher inference performance and greatly reduced model size and computation
complexity. Depthwise separable convolution is realized as a combination of depthwise
convolution (DWC) and pointwise convolution (PWC) layers. While PWC typically accounts for
over 90% of MAC operations, DWC can account for up to 40% of runtime due to its low
computation-to-data-transfer ratio and the difficulty of mapping it. Hence it is important to
provide optimized mapping for DWC as well as PWC.
In this paper we first present our analysis showing that CGRAs are not necessarily slower
than hard DPUs on machine learning workloads, provided that the right set of architectural
features is available. Based on the analysis, we present three generic architecture extensions
for CGRAs (crossbar-style memory bus, dual-mode MAC (multiply-accumulate) unit, and operand
reuse network) along with a mapping scheme that can greatly enhance CGRA performance
for DSC kernels.
Our experimental results using MobileNet V1 and V2 [10,11] demonstrate that our proposed
features can improve the efficiency of CGRA for DWC and PWC layers by 8× and 18×,
respectively, in terms of area-delay product (ADP) over a compiler approach [12]. Moreover,
though not explicitly optimized for it, 3D convolution on our architecture is also quite
efficient, with performance and ADP competitive with a CGRA [5] explicitly optimized for
machine learning algorithms including 3D convolution.
In this paper we make the following contributions. First, we analyze the performance
bottleneck of CGRAs for DNN acceleration. Second, we propose a small set of generic architecture
extensions and a mapping scheme for DWC and PWC kernels. Third, we evaluate our proposed
II Related Work
2.1 Baseline CGRA
While CGRA is a generic term encompassing many different architectures [4–6,12–14], we
consider ADRES-like CGRAs [14] as our baseline, which have been most extensively studied. The
main datapath consists of a 2D array of PEs (Processing Elements) interconnected with a
mesh-like network, plus local memory implemented as multi-banked SRAM blocks for high on-chip
bandwidth. PEs can perform arithmetic/logic and memory operations, though details vary. The
PE operations and inter-PE connections are dynamically reconfigurable with no runtime
overhead, thereby supporting pipelining of loops with II (Initiation Interval) greater than 1.
There are two kinds of memory operations on CGRAs in the literature: addressed and streamed
load-store. Addressed load-store [14] is more common among CGRA compilers as it supports
random memory access, but it requires explicit address computation (which uses PE cycles).
Streamed load-store [13] requires dedicated AGUs (Address Generation Units), which support only
a limited set of access patterns. In either case, all PEs connected to a memory bus can read it
simultaneously if needed.
2.2 CGRA Architecture Exploration
CGRA architecture exploration has been performed in [15], which however does not take into
account DNN workloads or specific mapping schemes. While single-cycle MAC operation is
common in DPUs, it is rarely supported on CGRAs by default. Our dual-mode MAC is
configurable at the application granularity to minimize the cycle time impact of operation
chaining. An extreme version of operation chaining has been proposed [16] in order to accelerate
narrow acyclic subgraphs at subcycle granularity, which however complicates the datapath,
control, and compiler scheduling significantly. Our operand reuse network is an input-to-input
network, whereas operand networks in the literature [17, 18] generally refer to output-to-input
networks.
2.3 DPU Optimization
DSC computation has been targeted by both hard DPUs [11] and soft DPUs, but not by CGRAs.
We do not consider pruning [19] directly in this work, since DSC is already a form of sparsity at
coarse granularity [20] while being much more amenable to hardware parallelization than
fine-grained sparsity. We also do not consider aggressive quantization, but the width of the
datapath is trivially configurable at design time.
Table 1: Theoretical min latency (ms, sum of 7 DWC layers)
Architecture          Compute time   L1 transfer   Layer latency
CGRA baseline (4x4)   1.68           0.75∼4.10     1.68∼4.10
CGRA enhanced (8x8)   0.21           0.19          0.21
Eyeriss (168 PEs)     0.20           0.23          0.23
III NP-CGRA Architecture
3.1 CGRA Performance Bottleneck Analysis
Table 1 compares a baseline CGRA [6] with Eyeriss [10], a reference hard DPU, in terms of
minimum theoretical latency, using 7 DWC layers from MobileNet V2, one from each bottleneck
(we see similar results with other layers as well). The baseline CGRA has 4x4 PEs and runs
at 500 MHz with a 4-byte word size, and Eyeriss has 168 PEs and runs at 200 MHz with a 2-byte
word size.
We calculate minimum theoretical latency simply as the maximum of compute time (assuming
100% PE utilization), L1 transfer time (i.e., on-chip memory access latency), and external
memory DMA (Direct Memory Access) time, the last of which is very small in all the cases
compared and is not shown. To estimate L1 transfer time for the baseline CGRA, we assume all
4 load-store units (one per row) are 100% utilized, and consider two scenarios: the least and
the most data reuse of IFM (Input Feature Map). For Eyeriss we assume 32 load-store units and
the most data reuse.
Our result suggests that there is a ∼8× compute time difference between the baseline CGRA
and the Eyeriss DPU even if we assume 100% PE utilization, which may be harder to achieve on a
CGRA. The difference grows if the CGRA fails to reuse IFM data optimally.
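The latency model described above can be written down in a few lines. The following is a minimal sketch, assuming the three stages overlap perfectly so that the slowest one determines the layer latency; the function name `min_latency` is hypothetical, and DMA time, which the text does not report, is taken as zero.

```python
def min_latency(compute_ms, l1_ms, dma_ms=0.0):
    """Theoretical minimum layer latency: the slowest of the three overlapped stages."""
    return max(compute_ms, l1_ms, dma_ms)

# Values from Table 1 (ms, summed over 7 DWC layers).
# DMA time is not reported in the table; it is assumed to be negligible (0) here.
baseline_best = min_latency(1.68, 0.75)    # most IFM reuse
baseline_worst = min_latency(1.68, 4.10)   # least IFM reuse
eyeriss = min_latency(0.20, 0.23)

print(baseline_best, baseline_worst, eyeriss)
```

With the Table 1 inputs, the baseline CGRA is compute-bound in the best-reuse case and L1-bound in the worst-reuse case, while Eyeriss is slightly L1-bound.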
To fill the gap, we consider CGRA enhanced, which is the same CGRA but with 8x8 PEs
and a 2-byte word size. The PEs of CGRA enhanced can also perform a MAC operation in a single
cycle, like Eyeriss (a CGRA baseline PE can do either MUL or ADD in a cycle, not both). These
changes can bring compute time to the Eyeriss level, but layer performance would still suffer
from the L1 transfer bottleneck. To become compute-bound, CGRA enhanced needs 16 load-store
units, one per row or column, and the most data reuse scenario.
To summarize, our analysis suggests that a CGRA is capable of delivering hard-DPU-level
performance, but needs a few major changes: single-cycle MAC, a larger array, at least 2×
on-chip memory bandwidth, and extremely high PE utilization.
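As a plausibility check on the CGRA enhanced row of Table 1, the sketch below rescales the baseline numbers. It rests on assumptions that the text does not spell out: that compute time scales inversely with the number of PEs and with the MACs a PE completes per cycle, and that L1 transfer time scales inversely with the number of load-store units at an unchanged clock. The helper name `scaled_time` is hypothetical.

```python
def scaled_time(base_ms, speedup):
    """Scale a baseline time by an assumed ideal speedup factor."""
    return base_ms / speedup

pe_factor = (8 * 8) / (4 * 4)   # 4x more PEs than the 4x4 baseline
mac_factor = 2                  # single-cycle MAC instead of separate MUL and ADD cycles
compute = scaled_time(1.68, pe_factor * mac_factor)

lsu_factor = 16 / 4             # 16 load-store units instead of 4
l1 = scaled_time(0.75, lsu_factor)   # starting from the most-reuse baseline figure

print(round(compute, 2), round(l1, 2))
```

Under these assumptions the rescaled values round to the 0.21 ms and 0.19 ms listed for CGRA enhanced, and compute time exceeds L1 transfer time, which is the compute-bound condition the text aims for.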
3.2 Our Proposed Architecture Extension
Our driving application is pointwise convolution (PWC), which is also known as 1x1 convolution
and is algorithmically equivalent to matrix multiplication. While one can use a CGRA compiler
(e.g., [21,22]) to compile matrix multiplication for a CGRA, it would yield a vastly suboptimal
schedule. For matrix multiplication, it is straightforward to find an optimal schedule
manually if one is allowed to modify the architecture slightly. The most critical architectural
change is crossbar-type memory busses, as opposed to parallel busses.
Figure 1: Mapping PWC (or matrix mult.) to a 2×2 CGRA.
Crossbar-style Memory Bus
Fig. 1 illustrates our proposed mapping for a 2×2 CGRA, in which we use the first 2 rows of
one source matrix (X) and the first 2 columns of the other source matrix (W) to generate the
top-left 2×2 submatrix of the result matrix. The result submatrix is generated on the 2×2 PEs
through a series of MAC operations (thus output stationary), as indicated by the schedule.
To provide the four PEs with the correct operands, all we need is two horizontal busses and
two vertical busses. Note that the data on a bus can be accessed by all connected PEs, and
that we add only vertical busses; horizontal busses already exist. For instance, PE0,0 and PE0,1
can access the same X(0, i) at cycle i (0 ≤ i ≤ 8) through a horizontal bus (called H-bus), and
similarly, PE0,0 and PE1,0 share W(i, 0) through a vertical bus (V-bus). To use all PEs for
MAC operations, streamed load-store is necessary. This mapping achieves 100% PE utilization,
each PE performing MUL (multiplication) and ADD (addition) operations every cycle, given
Figure 2: Proposed PE architecture (our extension shown in red).
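The output-stationary PWC mapping with H- and V-busses can be sketched as a small functional simulation. This is an illustrative model only, assuming one word per bus per cycle and a chained MUL and ADD in every PE; the function name `pwc_tile` and the example matrices are hypothetical.

```python
def pwc_tile(X, W, Nr, Nc):
    """Output-stationary tile: X holds Nr IFM rows of length Ni, W is Ni x Nc.

    Returns the Nr x Nc result tile and the number of MAC cycles used.
    """
    Ni = len(W)
    acc = [[0] * Nc for _ in range(Nr)]
    for i in range(Ni):                          # one cycle per input channel
        h_bus = [X[r][i] for r in range(Nr)]     # H-bus r broadcasts X(r, i) along PE row r
        v_bus = [W[i][c] for c in range(Nc)]     # V-bus c broadcasts W(i, c) along PE column c
        for r in range(Nr):
            for c in range(Nc):
                acc[r][c] += h_bus[r] * v_bus[c]  # every PE does one MAC per cycle
    return acc, Ni

# 2x2 array with nine accumulation steps (i = 0..8), as in the Fig. 1 example.
X = [[1, 2, 3, 4, 5, 6, 7, 8, 9],
     [9, 8, 7, 6, 5, 4, 3, 2, 1]]
W = [[1, 0], [0, 1], [1, 1], [2, 0], [0, 2], [1, -1], [-1, 1], [3, 0], [0, 3]]
tile, cycles = pwc_tile(X, W, 2, 2)
print(tile, cycles)
```

Each cycle reads only Nr + Nc words from memory while keeping all Nr × Nc PEs busy, which is why the extra vertical busses matter.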
Dual-mode MAC
In most CGRAs a PE performs only one operation per cycle, either MUL or ADD, which is fine
if they are used intermittently. We propose configurable chaining of MUL and ADD operations,
which can halve PWC latency, though it may also increase cycle time. We make
chaining configurable at the application granularity, so that the higher clock speed is selected
if the application does not use MAC chaining. We call this dual-mode MAC. A detailed diagram of
the dual-mode MAC is omitted due to the page limit, but it is straightforward to design one.
Operand Reuse Network
To make it easy to realize spatial data reuse on CGRAs, we propose an operand reuse network,
which enables input-to-input routing as opposed to output-to-input routing. Consider an FIR
filter example: $y_i \leftarrow w_0 x_i + w_1 x_{i+1} + w_2 x_{i+2}$, where i is the index
variable of a loop that is pipelined. One way to map this loop to a CGRA is to place the output
variables y_i on different PEs (i.e., y_i on PE_i), called output stationary, and route inputs
and coefficients to the PEs. In this scheme the same input data is used by multiple PEs at
different cycles (e.g., x_2 is used by PE_0, PE_1, and PE_2 at consecutive cycles). Thus the
operand reuse network allows one of the source operands of a PE (i.e., the output of an input
MUX) to be passed to neighbor PEs without affecting other computation that the PEs may be
doing, as illustrated in Fig. 2.
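The FIR example can be simulated to see how much memory traffic input-to-input routing saves. The sketch below assumes one possible schedule, in which all PEs use the same coefficient in a given cycle and each PE takes the operand its higher-indexed neighbor used in the previous cycle; the function name `fir_output_stationary` is hypothetical.

```python
def fir_output_stationary(x, w, num_pe):
    """PE_i accumulates y_i = sum_t w[t] * x[i + t]; returns (outputs, memory loads)."""
    taps = len(w)
    opnd = list(x[:num_pe])        # cycle 0: every PE loads its own x_i from memory
    loads = num_pe
    acc = [0] * num_pe
    for t in range(taps):
        if t > 0:
            # Operand reuse: PE_i receives what PE_{i+1} used last cycle;
            # only the last PE loads a new input from memory.
            opnd = opnd[1:] + [x[num_pe - 1 + t]]
            loads += 1
        for i in range(num_pe):
            acc[i] += w[t] * opnd[i]
    return acc, loads

y, loads = fir_output_stationary([1, 2, 3, 4, 5, 6], [1, 2, 3], num_pe=4)
print(y, loads)   # 6 loads, versus 4 * 3 = 12 if every PE loaded each operand itself
```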
While a weight stationary scheme could realize spatial data reuse without an operand reuse
network, it cannot easily utilize more PEs than the number of weight parameters. Also, the
Figure 3: Instruction definition.
3.3 Instruction Format and Global Configuration
Because the CGRA PE structure has changed, the bit width of the instruction increases. We start
from the 32-bit R-type instruction of the CCF framework instruction format [21]. Leaving out
the distinction between R-type and P-type instructions, this instruction takes 31 bits. Fig. 3
shows our instruction format. Reg a and Reg b are the register file indices that feed muxA and
muxB. Wr-en is the register file write-enable bit. Wr-reg is the index of the register to be
written. Wr-op selects the data to be written to the register: either the PE's own output
register or the muxA output of a neighbor PE. In-op selects which neighbor PE's muxA output
is written to the register file. AB is the bit for sending a read request to memory using the
output as the address, and DB is the write request bit for storing the output in memory. Our
instruction format needs 36 bits: op, muxB, and wr-op each require 1 additional bit, and in-op
requires 2 additional bits. The configuration memory additionally needs 4 bits for the index of
the global register file and 2 bits for the H and V memory read requests. The width of a
configuration memory word is therefore 36 × (number of CGRA PEs) + 6 bits.
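The bit counts above can be checked with a short calculation. This is a sketch under the numbers stated in the text (a 31-bit starting point, 5 additional bits per PE instruction, and 6 global bits per configuration word); the function name `config_word_bits` and the example array sizes are illustrative choices, not values taken from the specification table.

```python
BASE_BITS = 31                                        # CCF R-type instruction without the type bit
EXTRA_BITS = {"op": 1, "muxB": 1, "wr-op": 1, "in-op": 2}
PE_INSTR_BITS = BASE_BITS + sum(EXTRA_BITS.values())  # per-PE instruction width

def config_word_bits(nr, nc, grf_index_bits=4, mem_req_bits=2):
    """Width of one configuration memory word for an nr x nc PE array."""
    return PE_INSTR_BITS * nr * nc + grf_index_bits + mem_req_bits

print(PE_INSTR_BITS, config_word_bits(4, 4), config_word_bits(8, 8))
```

For a 4×4 array this gives 582 bits per configuration word, and for an 8×8 array 2310 bits.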
Other Changes
Fig. 4 shows our extended CGRA architecture. It adds vertical memory (V-MEM), a memory
access module (MAM), and a global register file, among others. We also change the PE structure
to support the dual-mode MAC and the operand reuse network.
The crossbar-style memory bus implies that the local memory should be divided into two:
V-MEM connected to the V-bus and H-MEM connected to the H-bus. We set the combined size of
V-MEM and H-MEM equal to that of the baseline CGRA's local memory. In addition, the MAM, which
consists of AGUs, is needed for streamed load-store.
Furthermore, for efficient mapping of DWC with a stride of 1, our architecture includes a small
single-port global register file (GRF), which is used to broadcast DWC weights to all PEs. The
index for the GRF is given in the configuration. The GRF can be filled either by DMA or through
a dedicated buffer, called the Weight Buffer, which can be very small as it is used for DWC only.
Figure 4: Extended CGRA architecture.
IV Application Mapping for NP-CGRA: DWC Case
We now present our application mapping for DWC kernels (PWC mapping is already outlined in
Section 3.2). Table 2 lists the parameters and variables used in this section and Section V.
We first present a general method that works for any stride, then an optimized version for
S = 1, which is the most common case. In the following, a tile refers to the amount of work (or
the corresponding data) that is done simultaneously by a CGRA, with its size determined by the
CGRA size. A block is the amount of work that can be done using local data only. Block size is
a multiple of tile size.
While in this paper we mainly describe PE scheduling and data routing, which are crucial for
maximizing PE utilization and minimizing memory access, our implementation and evaluation
results include the complete mapping, including data layout and AGU algorithms.
4.1 Depthwise Convolution with Arbitrary Stride
Consider mapping DWC with K = 3 to a 2×2 CGRA. Here we parallelize the computation of
one channel across the PE array. The following equations reveal the terms needed to compute
the first 2×2 output, an output tile.
Table 2: Parameters and variables
Symbol          Meaning
N_r, N_c        Number of rows/columns of a CGRA's PE array
K, S            Kernel size and stride of convolution
N_i, N_o        The number of input/output channels
N_h, N_w        The height and width of OFM (Output Feature Map)
B_r × B_c       Number of tiles in the current block (row × column)
AID_r, AID_c    Zero-based row (or column) number of an H(V)-AGU
N_a             Number of address bits of a bank (H-MEM or V-MEM)
tid_r, tid_c    Zero-based row/column coordinate of a tile within a block
t_cycle         Cycle count. A variable whose value is incremented every clock cycle and reset when a new tile starts
t_wrap          Wrap count. A variable whose value is incremented on every row index change and reset on every tile start
t_wcycle        Same as t_cycle but reset when t_wrap changes

$$
\begin{aligned}
y_{0,0} &= w_{0,0}x_{0,0} + w_{0,1}x_{0,1} + w_{0,2}x_{0,2} + w_{1,0}x_{1,0} + \cdots + w_{2,2}x_{2,2} \\
y_{0,1} &= w_{0,0}x_{0,2} + w_{0,1}x_{0,3} + w_{0,2}x_{0,4} + w_{1,0}x_{1,2} + \cdots + w_{2,2}x_{2,4} \\
y_{1,0} &= w_{0,0}x_{2,0} + w_{0,1}x_{2,1} + w_{0,2}x_{2,2} + w_{1,0}x_{3,0} + \cdots + w_{2,2}x_{4,2} \\
y_{1,1} &= w_{0,0}x_{2,2} + w_{0,1}x_{2,3} + w_{0,2}x_{2,4} + w_{1,0}x_{3,2} + \cdots + w_{2,2}x_{4,4}
\end{aligned}
$$
The input tile, which is the set of IFM data needed to produce an output tile, is the gray-filled
rectangle in Fig. 5a.
Fig. 5b illustrates our proposed schedule. For instance, to compute the top row of the output
tile, the top three rows of the input tile are needed, which are given sequentially through an
H-bus. Each PE performs a MAC operation when it sees the corresponding input data on the
H-bus, which simplifies the schedule. One can see that our schedule achieves maximal data reuse
within each row, since the data needed for each row is presented only once. Weight parameters
can be provided through V-busses because every PE in a column uses the same weight parameter
in every cycle.
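The schedule for arbitrary stride can be expressed as a functional simulation. The sketch below assumes that each H-bus streams one input element per cycle, that a PE in column c applies weight column `col - c * S` when that index is valid, and that the PE idles otherwise; the function name `dwc_general_tile` and the random test data are hypothetical.

```python
import random

def dwc_general_tile(x, w, Nr, Nc, S):
    """Simulate one output tile of DWC with stride S; returns (outputs, cycles)."""
    K = len(w)
    width = (Nc - 1) * S + K                 # input columns needed by one PE row
    acc = [[0] * Nc for _ in range(Nr)]
    cycles = 0
    for kr in range(K):                      # K input rows are streamed per PE row
        for col in range(width):             # one element per H-bus per cycle
            for r in range(Nr):
                h_bus = x[r * S + kr][col]   # value broadcast on H-bus r
                for c in range(Nc):
                    kc = col - c * S         # weight column delivered on V-bus c
                    if 0 <= kc < K:          # otherwise this PE idles in this cycle
                        acc[r][c] += w[kr][kc] * h_bus
            cycles += 1
    return acc, cycles

random.seed(1)
K, S, Nr, Nc = 3, 2, 2, 2
rows, cols = (Nr - 1) * S + K, (Nc - 1) * S + K
x = [[random.randint(-5, 5) for _ in range(cols)] for _ in range(rows)]
w = [[random.randint(-3, 3) for _ in range(K)] for _ in range(K)]
out, cycles = dwc_general_tile(x, w, Nr, Nc, S)
print(out, cycles)
```

For K = 3, S = 2 on a 2×2 array the tile takes K × ((N_c − 1)·S + K) = 15 streaming cycles, and every input row is read from memory exactly once per PE row.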
4.2 Depthwise Convolution with S = 1
Consider an example where K = 3 and the CGRA size is 2×2. Again we handle one channel
at a time. Similar to the general version, our scheme is output stationary such that after a
certain number of cycles the 2×2 PE array will contain the data for the first 2×2 output. The
key problem is how to feed all the PEs with necessary input/weight data every cycle without
oversubscribing memory access resources.
Fig. 6 illustrates our solution. During the initial N_c − 1 cycles (called the prologue), IFM
data (the top-left N_r × (N_c − 1) submatrix) is loaded through H-busses into all PEs except
the first column. For the next K cycles, the PE array processes the first row of the weight
matrix using IFM data partially reused from the previous cycle (from the east-side PEs) and
partially loaded from local memory (for the easternmost column), which is called the EE
(Expand East) phase. In the next cycle, the PE array processes W1,2, which requires reusing IFM
data from the south-side PEs and loading new IFM data into the southernmost PEs, called the SS
(Shift South) phase. In the next K − 1 cycles, the PE array processes the remaining elements of
the 2nd row of weights, which is similar to the EE phase except that we expand west, and is thus
called EW (Expand West). This pattern of EE-SS-EW-SS is repeated until all weights have been
processed. In this schedule all PEs use the same weight element, which is provided by the GRF,
indexed by the CGRA controller.
Figure 5: Mapping DWC on a 2×2 CGRA (K = 3, S = 2). (a) IFM data (shown in gray is an input tile); (b) schedule.
Figure 6: Schedule and data movement for DWC (S = 1). (a) Weight processing order; (b) Phase definition; (c) How IFM data is accessed; (d) Data reuse and loading.
Phase definition (Fig. 6b):
Step  Weights      Phase    #cycles
1                  Prolog   N_c − 1
2     W00 ~ W02    EE       K
3     W12          SS       1
4     W11 ~ W10    EW       K − 1
5     W20          SS       1
Data reuse and loading (Fig. 6d):
Prolog   load via H-bus (all but first column)
EE       reuse ← (from the east); load via H-bus (last column)
SS       reuse ↑ (from the south); load via V-bus (last row)
EW       reuse → (from the west); load via H-bus (first column)
This schedule takes N_c − 1 + K^2 cycles including the prologue, excluding the initial memory
streaming delay and the final cycles for writing output data back to local memory. The data
layout and AGU logic to support the above access pattern are a little complicated due to the SS
phase. An alternative would be to load data for the southernmost PEs through an H-bus over N_c
cycles, which would increase latency significantly. We place the full IFM data in H-MEM and the
part needed for the SS phases in V-MEM. Loading data into both H-MEM and V-MEM is done by
DMA.
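The EE-SS-EW-SS schedule can be modeled to confirm that neighbor-to-neighbor reuse always delivers the right operand. The following is a sketch under stated assumptions: each PE keeps its current IFM operand, boundary PEs load fresh data, and the serpentine weight order is generated by a hypothetical helper `serpentine_schedule`. Memory port widths and streaming delays are not modeled.

```python
import random

def serpentine_schedule(K):
    """Weight visiting order with the phase of each step: EE, SS, EW, SS, EE, ..."""
    steps = []
    for kr in range(K):
        cols = range(K) if kr % 2 == 0 else range(K - 1, -1, -1)
        for j, kc in enumerate(cols):
            if kr > 0 and j == 0:
                phase = "SS"                       # step down to the next weight row
            else:
                phase = "EE" if kr % 2 == 0 else "EW"
            steps.append((phase, kr, kc))
    return steps

# Neighbor that supplies the reused operand in each phase (row offset, column offset).
REUSE_FROM = {"EE": (0, 1), "EW": (0, -1), "SS": (1, 0)}

def dwc_s1_tile(x, w, Nr, Nc):
    """Simulate one output tile; returns (outputs, cycles, words loaded from memory)."""
    K = len(w)
    # Prologue (Nc - 1 cycles): every PE except the first column receives x[r][c - 1].
    opnd = [[x[r][c - 1] if c > 0 else None for c in range(Nc)] for r in range(Nr)]
    loads = Nr * (Nc - 1)
    cycles = Nc - 1
    acc = [[0] * Nc for _ in range(Nr)]
    for phase, kr, kc in serpentine_schedule(K):
        dr, dc = REUSE_FROM[phase]
        nxt = [[None] * Nc for _ in range(Nr)]
        for r in range(Nr):
            for c in range(Nc):
                if 0 <= r + dr < Nr and 0 <= c + dc < Nc:
                    nxt[r][c] = opnd[r + dr][c + dc]   # operand reuse network
                else:
                    nxt[r][c] = x[r + kr][c + kc]      # fresh load for boundary PEs
                    loads += 1
                acc[r][c] += w[kr][kc] * nxt[r][c]     # same weight for all PEs (GRF)
        opnd = nxt
        cycles += 1
    return acc, cycles, loads

random.seed(0)
Nr, Nc, K = 2, 2, 3
x = [[random.randint(-5, 5) for _ in range(Nc + K - 1)] for _ in range(Nr + K - 1)]
w = [[random.randint(-3, 3) for _ in range(K)] for _ in range(K)]
out, cycles, loads = dwc_s1_tile(x, w, Nr, Nc)
print(out, cycles, loads)
```

For the 2×2, K = 3 example the model uses N_c − 1 + K^2 = 10 cycles and loads 20 words in total, compared with 36 loads if every PE fetched every operand itself.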
Fig. 7b illustrates how data reuse can help achieve high performance in DWC. In this example,
the DWC weight matrix is a 3 × 3 matrix (for one channel), the stride is one, and the CGRA size
is 2×2. Only one channel is considered in this mapping, which is repeated for all channels to
complete DWC.
To achieve 100% PE utilization, we must generate a 2×2 output in 9 cycles (= K^2 for our
example), assuming each PE can do one MAC operation per cycle. Fig. 7b shows how to achieve
that, with details such as which elements of the IFM (indicated by red boxes) and which weight
element are used by the CGRA in each cycle. Moreover, only the gray elements are loaded from
memory; the white IFM elements in red boxes are received from neighbor PEs and thus
reused, which is crucial to achieving an optimal mapping with limited memory bandwidth. Most
of the memory accesses can be fulfilled by H-busses, with a few exceptions: T = 6 (or T = 9)
can be done in a single cycle by utilizing V-busses, and step 1 takes two cycles but can be done
as part of initialization and potentially merged with other operations.
Fig. 8 illustrates the data paths between PEs and memory and among PEs, in detail, for the
previous example. Cycles 0∼2 are the prologue phase. Cycles 3, 4, 5, 10, and 11 are EE phases.
Cycles 6 and 9 are SS phases. Cycles 7 and 8 are EW phases. It shows that this schedule
Figure 7: Data access patterns in DWC. (a) Weight data access pattern; (b) IFM data access pattern (panels T=0~3, T=4, T=5, T=6, T=7, T=8, T=9, T=10, T=11).
V Address Generation
We now present our mapping methods for PWC and DWC kernels in more detail, focusing on
data movement within a PE array and between PEs and memories.
5.1 Pointwise Convolution
Fig. 9a illustrates how IFM data can be stored in the external memory. To move this block
of data to the local memory (H-MEM), we first assign consecutive rows of the IFM data to
different banks, as illustrated in Fig. 9b. The rows assigned to the same bank are then
concatenated and stored sequentially in that bank, as shown in the IFM data layout. This
ensures that all the IFM data can be fed to the correct PEs without stalling. The weight data
is stored in V-MEM in a similar fashion. One important difference is that weight data needs to
be partitioned along the column direction, which may require a matrix transpose or reshaping.
But since weight data is constant during inference, any preprocessing, if needed, can be done
in advance, before runtime.
To utilize all PEs for MAC operations, we delegate address generation to the AGUs (address
generation units) in the memory access module. For PWC, generating addresses to access V-MEM
and H-MEM is straightforward. The V-MEM address is given as:
addr = (AID_c << N_a) | (tid_c · N_i + t_cycle),
where << and | have the same meaning as in C. This address can be easily computed
by an AGU using t_cycle, which is shared among all AGUs. The H-MEM address must distinguish
between load and store, since H-MEM is used for both OFM and IFM; it is given in
Algorithm 1.
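The V-MEM address computation for PWC can be sketched directly from the formula. The function name `pwc_vmem_addr` and the example parameter values are hypothetical; the range check is an added safeguard that is not part of the formula itself.

```python
def pwc_vmem_addr(aid_c, tid_c, t_cycle, n_i, n_a):
    """V-MEM address for PWC: bank index in the high bits, in-bank offset in the low n_a bits."""
    offset = tid_c * n_i + t_cycle
    # Added check (not in the formula): the offset must fit in the bank's address bits.
    if offset >= (1 << n_a):
        raise ValueError("offset does not fit in n_a address bits")
    return (aid_c << n_a) | offset

# Example with assumed values: 32 input channels, 10 address bits per bank.
print(pwc_vmem_addr(aid_c=1, tid_c=0, t_cycle=3, n_i=32, n_a=10))
print(pwc_vmem_addr(aid_c=2, tid_c=1, t_cycle=5, n_i=32, n_a=10))
```

Because `t_cycle` is shared, every V-AGU computes the same offset and differs only in the bank index it prepends.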
Figure 8: Mapping DWC with stride 1 on a 2×2 CGRA. Legend: M denotes Out Reg ← Out Reg + OpA * OpB; connections shown are the H-MEM bus, the V-MEM bus, weight register file data, and PE internal connections. Annotation: PE: idle, MAM: read data & put on H-MEM bus.
Figure 9: PWC IFM data in external memory and H-MEM. (b) Partitioning of IFM data into banks, and IFM data layout. In the example shown, rows X0 and X3 are placed in one bank, X1 and X4 in the next, and X2 and X5 in the third.
Algorithm 1 Generate H-MEM addresses for PWC
if t_cycle < N_i then
    // generate load address
    addr ← tid_r · N_i + t_cycle + addr_IFM
else
    // generate store address
    addr ← tid_c · N_c + tid_r · N_c · B_c + t_cycle − N_i + addr_OFM
end if
// prepend bank index
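A Python rendering of Algorithm 1 is sketched below. Two points are assumptions rather than statements from the listing: the load phase is taken to last N_i cycles, which is what the store offset t_cycle − N_i implies, and the final "prepend bank index" step is assumed to place AID_r above the N_a in-bank address bits, by analogy with the V-MEM formula. The function name `pwc_hmem_addr` is hypothetical.

```python
def pwc_hmem_addr(aid_r, tid_r, tid_c, t_cycle, n_i, n_c, b_c, n_a,
                  addr_ifm=0, addr_ofm=0):
    """H-MEM address for PWC: loads stream IFM data, stores write back the OFM tile."""
    if t_cycle < n_i:
        # Load phase (assumed to span n_i cycles): one IFM element per cycle.
        addr = tid_r * n_i + t_cycle + addr_ifm
    else:
        # Store phase: write the n_c outputs of this PE row for the current tile.
        addr = tid_c * n_c + tid_r * n_c * b_c + (t_cycle - n_i) + addr_ofm
    # Assumed form of the "prepend bank index" step.
    return (aid_r << n_a) | addr

# Assumed example values: 32 input channels, 8 PE columns, 2 tile columns per block.
load = pwc_hmem_addr(aid_r=1, tid_r=0, tid_c=0, t_cycle=5, n_i=32, n_c=8, b_c=2, n_a=12)
store = pwc_hmem_addr(aid_r=0, tid_r=1, tid_c=1, t_cycle=34, n_i=32, n_c=8, b_c=2, n_a=12,
                      addr_ofm=1024)
print(load, store)
```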




Figure 10: DWC2 IFM data in external memory and H-MEM. (a) Logical view of IFM data, and H-MEM bank assignment; (b) Partitioning of IFM data into banks, and IFM data layout. In the example shown, rows X0, X1, X6, X7, and X12 are placed in the first bank, rows X2, X3, X8, and X9 in the second, and rows X4, X5, and X10 in the third.
5.2 Depthwise Convolution with Arbitrary Stride
The data layout to support the proposed mapping is illustrated in Fig. 10. First, the rows of
the IFM data (which can be regarded as 2D since we consider only one channel at a time) are
mapped to banks as in Fig. 10a, where the idea is to map each set of S consecutive rows,
starting from the top, to the next bank. Second, all the rows mapped to a bank are concatenated
and placed into the bank sequentially (see Fig. 10b).
Note that, contrary to PWC, the data layout for DWC does not place all the IFM data
needed for one row of CGRA PEs into one bank. This does not cause a problem, however, since
(i) there is a crossbar switch between the set of H-AGUs and the set of memory banks and (ii)
there is no bank conflict (i.e., all H-AGUs access different memory banks at all times). To show
the absence of bank conflicts, it suffices to see that the 2nd H-AGU always accesses an input
row that is S rows below what the 1st H-AGU accesses, and so on.
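The row-to-bank assignment and the no-conflict argument can be checked with a short sketch. It assumes the number of H-MEM banks equals N_r, as the modulo in Algorithm 2 suggests; the function names `hmem_bank_of_row` and `banks_accessed` are hypothetical.

```python
def hmem_bank_of_row(row, S, Nr):
    """Map an IFM row to (bank, slot): groups of S consecutive rows go to successive banks."""
    group = row // S
    bank = group % Nr
    slot = (group // Nr) * S + row % S     # position of the row inside its bank
    return bank, slot

def banks_accessed(k_row, S, Nr, tile_row=0):
    """Banks hit by the Nr H-AGUs when every PE row needs kernel row k_row."""
    return [hmem_bank_of_row((tile_row * Nr + r) * S + k_row, S, Nr)[0]
            for r in range(Nr)]

# Layout for S = 2 and three banks: rows 0, 1, 6, 7, 12 land in bank 0.
print([hmem_bank_of_row(row, 2, 3) for row in (0, 1, 6, 7, 12)])
# H-AGU r reads a row that is S rows below the row read by H-AGU r - 1, so banks differ.
print([banks_accessed(k, 2, 3) for k in range(3)])
```

Since consecutive H-AGUs are always exactly one row group apart, they fall into consecutive banks modulo N_r, which is why no two of them collide.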
The weight parameters needed by PEs are uniform vertically but not horizontally,
which suggests the use of V-busses. Thus we store weight parameters in V-MEM (duplicated in
all banks) and use V-busses to provide weight parameters to PEs, as in the PWC mapping.
The address generated by V-AGUs is:
addr = (AID_c << N_a) | (t_wcycle − AID_c · S + t_wrap · K).
Here t_wrap tracks which IFM row the CGRA is currently processing, or the row number of the
weight parameters accessed by PEs. The H-MEM addresses are given in Algorithm 2.
Table 3: Performance analysis
                PWC        DWC General                   DWC Optimized
Tile latency    N_i + λ    K × ((N_c − 1)·S + K) + λ     K^2 + N_c − 1 + λ
Block latency   B_r · B_c · T
Algorithm 2 Generate addresses for DWC2 H-MEM access
blockw ← S · (Bc · Nc − 1) + K
if twrap < K then
    // generate load bank index and address
    over_bank ← ((twrap/S) + AIDr)/Nr
    banknumber ← ((twrap/S) + AIDr) % Nr
    addr ← tidr · blockw · S + tidc · S · Nc + over_bank · blockw · S + twcycle + (twrap % S) · blockw
else
    // generate store bank index and address
    banknumber ← AIDr
    addr ← AIDc · Nc + tidr · Nc · Bc + twcycle − 1 + addrOFM
end if
return (banknumber ≪ Na) | addr
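For readers who prefer executable code, the following is a minimal Python transcription of Algorithm 2. It assumes that the divisions in the pseudocode are integer divisions and that the bank number is placed above the Na address bits by a left shift. The `Dwc2Params` container and the function name are hypothetical, introduced only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dwc2Params:
    # Field meanings follow our reading of the text and are assumptions.
    S: int         # stride
    K: int         # kernel size
    Nr: int        # number of CGRA rows (taken as the number of H-MEM banks)
    Nc: int        # number of CGRA columns
    Bc: int        # tiles per block along the column direction
    Na: int        # number of address bits within a bank
    addr_ofm: int  # base address of the OFM region (addrOFM)


def dwc2_hmem_address(p, aid_r, aid_c, tid_r, tid_c, t_wrap, t_wcycle):
    """Return the combined (bank, address) word for one H-AGU access."""
    block_w = p.S * (p.Bc * p.Nc - 1) + p.K
    if t_wrap < p.K:
        # Load phase: pick the bank from the row group, then the offset in it.
        row_group = t_wrap // p.S + aid_r
        over_bank = row_group // p.Nr
        bank = row_group % p.Nr
        addr = (tid_r * block_w * p.S + tid_c * p.S * p.Nc
                + over_bank * block_w * p.S + t_wcycle
                + (t_wrap % p.S) * block_w)
    else:
        # Store phase: each H-AGU writes to its own bank.
        bank = aid_r
        addr = (aid_c * p.Nc + tid_r * p.Nc * p.Bc
                + t_wcycle - 1 + p.addr_ofm)
    return (bank << p.Na) | addr
```

With S = 2, K = 3, Nr = Nc = 4, Bc = 1, and Na = 10 (values chosen only for illustration), a load at t_wrap = 1 and t_wcycle = 3 issued by the H-AGU with AIDr = 1 targets bank 1 at offset 12.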
5.3 Depthwise Convolution with S = 1
DWC with S = 1 uses the same method as the general case for storing data in H-MEM. This method
cannot be used for V-MEM, however, because V-MEM stores only the data required for the SS
phases. The data layout to support the proposed mapping is illustrated in Fig. 11a. Since the
required data elements are separated by an interval equal to the CGRA column size, they are
stored in V-MEM according to this interval. For example, X3,2, X3,5, and X3,8 are stored in
Bank 0. Fig. 11 shows the data layout in external memory and the partial IFM data partitioned
into banks and laid out in V-MEM. Weight data for DWC with S = 1 is stored in the GRF.
5.4 Performance Analysis
Table 3 summarizes the latency of each mapping, where λ captures the constant delay due to
the initial and final phases of pipelining. Note that the analytical performance models are
provided only to characterize our mapping scheme. Our performance evaluation is based on
cycle-accurate simulation (see Section 6.1).
Algorithm 3 Generate addresses for DWC1 H-MEM access
blockw ← 2 + Bc · Nc
tile_latency ← 1 + 2 · Nc + K²
// generate bank index
if twrap ≥ K then
    banknumber ← AIDr
else
    over_bank ← (twrap + AIDr)/Nr
    // generate load bank index
    banknumber ← (twrap + AIDr) % Nr
end if
if twrap ≥ K then
    // generate store address
    addr ← tidc · Nc + tidr · Nc · Bc + Nc + cycle − tile_latency + 1 + addrOFM
else
    std_addr ← tidc · Nc + tidr · blockw
    // generate load address
    if twrap = 0 then
        // Kernel 0 row
        addr ← std_addr + twcycle + over_bank · blockw
    else
        if twrap % 2 = 1 then
            // Kernel odd row
            addr ← std_addr + K − 1 − twcycle + over_bank · blockw
        else
            // Kernel even row

return (banknumber ≪ Na) | addr
X0,0 X0,1 X0,2 X0,3 X0,4 X0,5 X0,6 X0,7 X0,8 X0,9 X0,10
X1,0 X1,1 X1,2 X1,3 X1,4 X1,5 X1,6 X1,7 X1,8 X1,9 X1,10
X2,0 X2,1 X2,2 X2,3 X2,4 X2,5 X2,6 X2,7 X2,8 X2,9 X2,10
X3,0 X3,1 X3,2 X3,3 X3,4 X3,5 X3,6 X3,7 X3,8 X3,9 X3,10
X4,0 X4,1 X4,2 X4,3 X4,4 X4,5 X4,6 X4,7 X4,8 X4,9 X4,10
X5,0 X5,1 X5,2 X5,3 X5,4 X5,5 X5,6 X5,7 X5,8 X5,9 X5,10
X6,0 X6,1 X6,2 X6,3 X6,4 X6,5 X6,6 X6,7 X6,8 X6,9 X6,10
X7,0 X7,1 X7,2 X7,3 X7,4 X7,5 X7,6 X7,7 X7,8 X7,9 X7,10
X8,0 X8,1 X8,2 X8,3 X8,4 X8,5 X8,6 X8,7 X8,8 X8,9 X8,10
X9,0 X9,1 X9,2 X9,3 X9,4 X9,5 X9,6 X9,7 X9,8 X9,9 X9,10
X10,0 X10,1 X10,2 X10,3 X10,4 X10,5 X10,6 X10,7 X10,8 X10,9 X10,10
(a) Logical view of IFM data and V-MEM bank assignment
(b) Partial IFM data partitioned into banks, and laid out on V-MEM
Figure 11: IFM data in external memory and in V-MEM for DWC (shown in red is a tile).
PWC
PWC mapping multiplies an $N_w \times N_i$ IFM matrix with an $N_i \times N_o$ weight matrix
$N_h$ times. To multiply the IFM and weight matrices, they are divided into
$B_r N_r \times N_i$ and $N_i \times B_c N_c$ blocks, respectively. The number of blocks is
$\lceil N_w/(B_r N_r) \rceil \times \lceil N_o/(B_c N_c) \rceil$, each of which takes
$B_r B_c T$ cycles, producing the layer latency.
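The PWC latency model can be written out as a small sketch. It assumes that T is the PWC tile latency Ni + λ from Table 3 and that the Nh repetitions simply multiply the per-matrix latency; the thesis does not spell out the final product, so treat that step as an assumption. The function names are illustrative.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def pwc_layer_cycles(n_h, n_w, n_i, n_o, n_r, n_c, b_r=1, b_c=1, lam=0):
    """Estimated PWC layer latency in cycles under the assumptions stated above."""
    tile = n_i + lam                      # tile latency T (Table 3)
    block = b_r * b_c * tile              # block latency Br * Bc * T
    num_blocks = ceil_div(n_w, b_r * n_r) * ceil_div(n_o, b_c * n_c)
    return n_h * num_blocks * block       # assumed: Nh matrix multiplications
```

As a sanity check, a 112×112 layer with 32 input and 64 output channels on a 4×4 array with λ = 0 gives 1,605,632 cycles, which equals the layer's MAC count divided by 16 PEs, i.e. the ideal case of one MAC per PE per cycle.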
DWC
DWC General (i.e., arbitrary stride) mapping divides the IFM data of each channel into blocks.
To compute one channel, $\lceil N_h/(B_r N_r) \rceil \times \lceil N_w/(B_c N_c) \rceil$ blocks
are generated. Processing one tile uses $(N_c - 1) \cdot S + K$ IFM data elements along with
$1 \times K$ weight data, and this is repeated $K$ times. Thus the tile latency is
$K \cdot ((N_c - 1) \cdot S + K) + \lambda$. DWC General shares the layer latency formula with
DWC Optimized (i.e., S = 1), though the tile latency is different.
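The two DWC tile latencies from Table 3 and the per-channel block count can be compared numerically with the sketch below. The function names and the default λ = 0 are assumptions made for illustration.

```python
def dwc_general_tile_cycles(k, n_c, s, lam=0):
    """Tile latency of the general (arbitrary stride) mapping: K * ((Nc - 1) * S + K) + lambda."""
    return k * ((n_c - 1) * s + k) + lam


def dwc_optimized_tile_cycles(k, n_c, lam=0):
    """Tile latency of the S = 1 mapping: K^2 + Nc - 1 + lambda."""
    return k * k + n_c - 1 + lam


def dwc_blocks_per_channel(n_h, n_w, n_r, n_c, b_r=1, b_c=1):
    """Number of blocks needed for one channel: ceil(Nh / (Br Nr)) * ceil(Nw / (Bc Nc))."""
    rows = -(-n_h // (b_r * n_r))
    cols = -(-n_w // (b_c * n_c))
    return rows * cols
```

For K = 3 and Nc = 4 with S = 1, the general mapping needs 18 cycles per tile (ignoring λ) while the optimized mapping needs 12, which illustrates why a dedicated S = 1 mapping pays off.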
Further optimization
We execute standard convolution by applying im2col and then using the PWC mapping schedule.
Even if the PWC mapping itself runs quickly, the im2col overhead may slow down processing. We
can reduce the im2col overhead by processing the data along the channel direction first, rather
than running im2col to process the data within a channel first.
In addition, our DWC algorithm has a problem when processing data for multiple channels: it
alternates between processing one channel and loading data, rather than processing multiple
channels per load. As a result, communication takes longer than computation when the height
and width of the IFM are small. We therefore plan to add a method that processes channel data
continuously when the input has a small height and width. In future work, we will address these
two problems to improve efficiency by reducing network inference time despite the AGU overhead.
Table 4: NP-CGRA specifications
Number of PEs                   64 (8×8)
Word size                       16-bit
Clock frequency                 500 MHz
Off-chip memory bandwidth       12.5 GB/s
DMA latency                     200 cycles
H-MEM size (= V-MEM size)       39 KB (×2 sets)
Configuration memory size       9248 bytes (2312×32 bits)
Weight buffer size              1152 bytes (144×64 bits)
VI Experiments
6.1 Experimental Setup
To evaluate the effectiveness of our proposed architecture, we use MobileNets and compare
against previous CGRA approaches as well as other DPUs. However, since MobileNet results
are not reported by previous CGRA architectures, we also map AlexNet convolution layers to
NP-CGRA and compare our result with those of previous CGRAs and DPUs as reported in the
literature.
Our main comparison metrics are inference throughput (frames/s) and cost efficiency (in ADP).
We have developed a cycle-accurate simulator and also designed RTL for the baseline CGRA
and our NP-CGRA, including PE array, AGUs, GRF, and the CGRA controller, which we
have validated in terms of functionality and cycle-level behavior. For area estimation we have
synthesized RTL designs with Synopsys Design Compiler using Samsung 65 nm standard-cell
library. The on-chip memory area is estimated using Cacti 7.0 [23].
Table 4 lists the specifications of NP-CGRA. The off-chip memory bandwidth is set to 12.5 GB/s
as in SDT-CGRA [5]. H-MEM and V-MEM have the same size, which is set to $N_i K^2 \times N_r$
words to make mapping AlexNet easier, although smaller memory sizes can also be accommodated by
our mapping strategy. The number of configuration bits per cycle is 2312 = 36×64 + 8: each
PE needs 4 more bits than a baseline PE due to increased input MUX sizes (1 bit) and the
operand reuse network’s MUXes (3 bits), and 8 more bits are needed globally for the GRF index
and to control streamed load-store. The Weight Buffer, which is optional, is set to hold 64
copies of the GRF contents.
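The configuration and buffer sizes in Table 4 follow from simple arithmetic, reproduced in the sketch below. The baseline figure of 32 bits per PE is derived here from the statement that an NP-CGRA PE needs 4 more bits; it is not quoted from the thesis, and the helper names are illustrative.

```python
def config_bits_per_cycle(num_pes, bits_per_pe, global_bits):
    """Total configuration bits consumed by the array in one cycle."""
    return num_pes * bits_per_pe + global_bits


def memory_bytes(width_bits, depth):
    """Size in bytes of a memory that is width_bits wide and depth entries deep."""
    return width_bits * depth // 8


np_cgra_bits = config_bits_per_cycle(64, 36, 8)     # 36 bits per PE plus 8 global bits
baseline_bits = config_bits_per_cycle(64, 36 - 4, 0)  # derived: 4 fewer bits per PE, no global bits
config_mem_bytes = memory_bytes(np_cgra_bits, 32)   # Table 4: 2312 x 32 bits
weight_buf_bytes = memory_bytes(144, 64)            # Table 4: 144 x 64 bits
```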
6.2 Depthwise Separable Convolution Results
We use the first three layers right after the first standard convolution (i.e., 3D convolution) layer
in MobileNet V1 [8] (width multiplier 1, resolution 224). We compare three cases:
• Baseline + CCF: Baseline CGRA with CCF compiler [21]
• Matmul DWC: NP-CGRA + Matrix multiplication-based DWC
• Our mapping: NP-CGRA + Our mapping scheme for PWC/DWC
Table 5: MobileNet DSC result
Metric                   Layer       CCF            Matmul DWC     Our mapping
Latency, ms (util, %)    PWC         78.91 (8.14)   3.72 (86.42)   3.72 (86.42)
                         DWC (S=1)   11.10 (8.14)   2.82 (16.04)   0.92 (49.00)
                         DWC (S=2)    7.74 (5.83)   1.41 (16.01)   0.81 (28.00)
ADP (mm²·ms)             PWC         122.48         6.83           6.83
                         DWC (S=1)    17.22         5.17           1.69
                         DWC (S=2)    12.02         2.59           1.48
For this experiment only, the CGRA size is set to 4×4 for all three cases because of the CCF
compilation flow. The clock speed is 500 MHz for both the baseline and NP-CGRA.
The first case represents the state-of-the-art CGRA solution. For CCF, we apply loop pipelining
to the loop level with the largest trip count, which is image height ($N_h$). The second case
uses our mapping scheme for PWC only. DWC is converted into matrix multiplication by im2col,
essentially using only one column of a CGRA, to which the $K^2$ dimension is mapped. The im2col
time is not taken into account in this part.
Table 5 summarizes the results. The architectural factor is about 2×, since our NP-CGRA
has a 2× higher arithmetic and memory operation rate than the baseline CGRA, so the remaining
large performance difference is attributed to mapping. A close look at the generated code
reveals that CCF generates 1 extra MUL and 3 extra ADD operations for every MAC operation
(1 MUL, 1 ADD) in the program, which is due to address generation, as it uses addressed
load-store. The scheduled code also has some empty slots, which further lowers the PE
utilization. Overall, the mapping efficiency difference is about 10× in the case of PWC for the
relatively small CGRA size. We expect the difference to increase for larger CGRA sizes. All in
all, our NP-CGRA achieves over 20× speedup and close to 18× ADP reduction for PWC over the
baseline (our architecture has 18% larger total area including SRAM memory; the synthesis result
is discussed in Section 6.3).
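The headline PWC ratios can be re-derived from the Table 5 entries, as in the sketch below. Dividing ADP by latency to recover the implied area, and dividing the speedup by the 2× architectural factor to isolate the mapping contribution, are our own reading of the argument rather than a calculation given in the thesis.

```python
# PWC row of Table 5: latency in ms, ADP in mm^2 * ms.
CCF = {"latency_ms": 78.91, "adp": 122.48}
OURS = {"latency_ms": 3.72, "adp": 6.83}

speedup = CCF["latency_ms"] / OURS["latency_ms"]       # a little over 21x
adp_reduction = CCF["adp"] / OURS["adp"]               # a little under 18x

# ADP = area * delay, so area = ADP / latency.
area_ccf = CCF["adp"] / CCF["latency_ms"]
area_ours = OURS["adp"] / OURS["latency_ms"]
area_overhead = area_ours / area_ccf - 1.0             # roughly 0.18

ARCHITECTURAL_FACTOR = 2.0                             # stated in the text
mapping_factor = speedup / ARCHITECTURAL_FACTOR        # roughly 10x

print(f"speedup {speedup:.2f}x, ADP reduction {adp_reduction:.2f}x, "
      f"area overhead {area_overhead:.1%}, mapping factor {mapping_factor:.1f}x")
```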
For DWC our NP-CGRA continues to deliver better performance and ADP than the baseline.
While the utilization of the Matmul DWC case is around 16% (and cannot exceed 25% using
only one CGRA column), our DWC mapping generates about 1.75∼3× higher performance
and efficiency than the matmul-based mapping. Note that DWC (S=2) layers are the rarest
in MobileNets while PWC accounts for the majority of MAC operations, which may justify
relatively low effort optimizing for the former case.
6.3 Hardware Overhead Evaluation
Fig. 12 compares the synthesis area of two 8×8 CGRAs at the target frequency of 500 MHz
(timing met in both). The largest core increase comes from AGUs, which may be justified given
the many PEs freed by AGUs. The common logic and variables used by AGUs, such as iterators,
are implemented in the controller, shown in the graph. The increase in the PE array is modest
(the baseline architecture has a homogeneous operation set, meaning all PEs support MUL and
ADD operations). Not surprisingly, the total area is dominated by SRAM memory, which puts the
overall area overhead of NP-CGRA at 22.2%.
Legend: PE array, AGU, Controller, GRF
(b) Total area comparison
Figure 12: Area comparison.
Table 6: Comparison with previous CGRA and DPU implementations
                                        Eyeriss [10]   Eyeriss-v2 [11]   Auto-tuning [6]   SDT-CGRA [5]   NP-CGRA (Ours)
Technology                              ASIC (65 nm)   ASIC (65 nm)      CGRA (32 nm)      CGRA (55 nm)   CGRA (65 nm)
Clock frequency (MHz)                   200            200               500               450            500
#PEs (#Ops/cycle)                       168 (336)      192 (768)         16 (16)           25 (205)       64 (128)
Data width (bits)                       16             8                 32                16             16
On-chip data memory (kB)                108            192               320               54.6           156
Reported area (mm²)                     12.25          ≥ 12.25           1.55†             5.19           2.14
Converted area (65 nm, 16-bit) (mm²)    12.25          ≥ 24.50           1.55†             7.25           2.14
MobileNet V1 (DSC runtime, ms)          -              0.78              -                 -              4.01
MobileNet V2 (DSC runtime, ms)          -              -                 -                 -              18.06
MobileNet V1 ADP (DSC only, mm²·ms)     -              19.11             -                 -              8.60
AlexNet (Conv. runtime, ms)             28.82          9.79              990               23.24          40.07‡
AlexNet ADP (Conv. only, mm²·ms)        353.03         239.96            1536.68           168.59         87.28‡
†Not reported in the paper, and assumed to be the area of the 4×4 baseline CGRA.
While we use the same clock frequency for both CGRAs in our ADP evaluation, our dual-mode
MAC does increase the critical path delay. When driven for maximum speed, the critical path
delay increases from 1.23 ns (baseline) to 1.65 ns (NP-CGRA), which is due to the difference
between the MAC delay (1.08 ns) and the MUL delay (0.68 ns). Considering the potential 2×
increase in computation throughput, the 34% increase in cycle time seems justifiable. On the
other hand, MAC operations are not utilized by current CGRA compilers (e.g., CCF), which can
limit applicability.
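The trade-off between the longer cycle time and the doubled operation rate can be quantified with the sketch below. It assumes both designs run at their maximum frequency and that the full 2× per-cycle gain is realized, which is an idealization; the names are illustrative.

```python
BASELINE_CRITICAL_PATH_NS = 1.23
NP_CGRA_CRITICAL_PATH_NS = 1.65
OPS_PER_CYCLE_GAIN = 2.0  # potential gain stated in the text


def max_frequency_mhz(critical_path_ns):
    """Maximum clock frequency implied by a critical path delay."""
    return 1000.0 / critical_path_ns


cycle_time_increase = NP_CGRA_CRITICAL_PATH_NS / BASELINE_CRITICAL_PATH_NS - 1.0
net_throughput_gain = (OPS_PER_CYCLE_GAIN
                       * BASELINE_CRITICAL_PATH_NS / NP_CGRA_CRITICAL_PATH_NS)

print(f"baseline f_max {max_frequency_mhz(BASELINE_CRITICAL_PATH_NS):.0f} MHz, "
      f"NP-CGRA f_max {max_frequency_mhz(NP_CGRA_CRITICAL_PATH_NS):.0f} MHz")
print(f"cycle time +{cycle_time_increase:.0%}, net gain {net_throughput_gain:.2f}x")
```

Under these assumptions the net throughput gain at maximum speed is still close to 1.5×, which supports the claim that the longer critical path is acceptable.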
6.4 Comparison with Previous Work Using MobileNet
No previous CGRA work reports MobileNet or DSC performance. A few MobileNet accelerators
for FPGAs exist, but they report no ASIC area, which makes direct comparison difficult.
Eyeriss v2 [11] targets MobileNet V1 with width multiplier 0.5 and resolution 128, which we use
for the comparison in Table 6. Eyeriss v2 has much more capable PEs than NP-CGRA, each
performing 2 MAC ops per cycle, which partially explains its higher absolute performance
compared with NP-CGRA. On the other hand, NP-CGRA is much smaller. While Eyeriss v2 reports
gate count only, it appears larger than Eyeriss, so we assume that Eyeriss v2 has the same area
as Eyeriss. Also, since Eyeriss v2 uses an 8-bit data width, we convert its area to a 16-bit
equivalent by multiplying by 2, which we believe is conservative. Overall, NP-CGRA turns out to
have a 2.22× lower ADP, though this is due to its faster clock speed.
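The area conversion and ADP comparison can be sketched as follows. The scaling rule, quadratic in feature size and linear in data width, is an assumption: the thesis only states the factor of 2 for the 8-bit to 16-bit conversion, although the rule does reproduce the converted areas listed in Table 6 for Eyeriss v2 and SDT-CGRA. The function names are illustrative.

```python
def convert_area(area_mm2, node_nm, data_bits, target_nm=65, target_bits=16):
    """Scale an area to the target node and data width (assumed scaling rule)."""
    return area_mm2 * (target_nm / node_nm) ** 2 * (target_bits / data_bits)


def adp(area_mm2, runtime_ms):
    """Area-delay product in mm^2 * ms."""
    return area_mm2 * runtime_ms


eyeriss_v2_area = convert_area(12.25, 65, 8)   # Table 6 lists >= 24.50
sdt_cgra_area = convert_area(5.19, 55, 16)     # Table 6 lists 7.25

eyeriss_v2_adp = adp(eyeriss_v2_area, 0.78)    # MobileNet V1 DSC runtime
np_cgra_adp = 8.60                             # taken directly from Table 6
adp_ratio = eyeriss_v2_adp / np_cgra_adp       # about 2.22
```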
6.5 AlexNet Convolution Layer Results
While 3D convolution is not explicitly optimized for by our architecture, we map AlexNet
convolution layers to NP-CGRA for quantitative comparison with previous CGRA results, as well
as to explore broader applications of our extensions outside DSC layers (see Table 6). For
NP-CGRA, we convert convolution into matrix multiplication using im2col and use the PWC mapping.
The im2col part is assumed to be done on the ARMv8 processor of a Xilinx Ultra96-V2 board, which
we have used to measure the runtime of the im2col functions. The auto-tuning approach [6] applies
various combinations of loop transformations (e.g., interchange, unrolling) to find the best loop
nest for CGRA mapping, which is done by an in-house CGRA compiler. SDT-CGRA [5] is
a novel architecture optimized for machine learning algorithms including convolutional neural
networks (CNNs). Eyeriss [10] and Eyeriss v2 [11] are hard DPUs optimized for CNNs.
To allow comparisons among different technologies and data widths, we convert the reported
areas into 65 nm, 16-bit equivalents, which are then multiplied by runtime to calculate ADP.
As expected, the auto-tuning approach has the lowest performance and efficiency, attributed to
poor scheduling. Eyeriss and Eyeriss v2 are among the fastest while SDT-CGRA is the most
efficient in terms of ADP, which is again due to its faster clock speed. Our NP-CGRA result
does not include the area of the ARM processor, but it is quite competitive with other CGRA or




We presented a set of generic architecture extensions for CGRAs that can greatly improve
performance and efficiency for light-weight DNN models. We have also demonstrated that our
proposed features are useful beyond DSC layers, such as for 3D convolution. We plan to apply
our NP-CGRA to accelerating other machine learning algorithms and digital filters, many of
which are based on matrix multiplication and convolution. Automatic generation of efficient
code that exploits our proposed architectural features is left for future work.
References
[1] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu, D. Lo, S. Alkalay,
M. Haselman, L. Adams, M. Ghandi et al., “A configurable cloud-scale DNN processor for
real-time AI,” in 2018 ACM/IEEE 45th ISCA. IEEE, 2018, pp. 1–14.
[2] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia,
N. Boden, A. Borchers et al., “In-datacenter performance analysis of a tensor processing
unit,” in Proceedings of the 44th ISCA, 2017, pp. 1–12.
[3] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network
acoustic models,” in ICML, 2013.
[4] M. Tanomoto, S. Takamaeda-Yamazaki, J. Yao, and Y. Nakashima, “A CGRA-based approach
for accelerating convolutional neural networks,” in 2015 IEEE 9th MCSoC, 2015.
[5] X. Fan, D. Wu, W. Cao, W. Luk, and L. Wang, “Stream processing dual-track CGRA for
object inference,” IEEE Trans. VLSI, vol. 26, no. 6, pp. 1098–1111, 2018.
[6] I. Bae, B. Harris, H. Min, and B. Egger, “Auto-tuning CNNs for coarse-grained
reconfigurable array-based accelerators,” IEEE TCAD, vol. 37, no. 11, 2018.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep
convolutional neural networks,” in Advances in NIPS, 2012, pp. 1097–1105.
[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto,
and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision
applications,” arXiv preprint arXiv:1704.04861, 2017.
[9] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted
residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2018, pp. 4510–4520.
[10] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable
accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits,
vol. 52, no. 1, pp. 127–138, 2016.
[11] Y. Chen, T. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging
deep neural networks on mobile devices,” IEEE Journal on Emerging and Selected Topics
in CAS, vol. 9, no. 2, pp. 292–308, 2019.
[12] S. Dave, M. Balasubramanian, and A. Shrivastava, “RAMP: Resource-aware mapping for
CGRAs,” in 2018 55th ACM/ESDA/IEEE DAC. IEEE, 2018, pp. 1–6.
[13] H. Singh, M.-H. Lee, G. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. Chaves Filho,
“MorphoSys: An integrated reconfigurable system for data-parallel and computation-intensive
applications,” IEEE Transactions on Computers, vol. 49, no. 5, pp. 465–481, 2000.
[14] B. Mei, S. Vernalde, D. Verkest, H. De Man, and R. Lauwereins, “ADRES: An architecture
with tightly coupled VLIW processor and coarse-grained reconfigurable matrix,” in
International Conference on FPL. Springer, 2003, pp. 61–70.
[15] D. Suh, K. Kwon, S. Kim, S. Ryu, and J. Kim, “Design space exploration and implementation
of a high performance and low area coarse grained reconfigurable processor,” in 2012
International Conference on FPT. IEEE, 2012, pp. 67–70.
[16] Y. Park, H. Park, and S. Mahlke, “CGRA Express: Accelerating execution using dynamic
operation fusion,” in CASES, 2009, pp. 271–280.
[17] M. B. Taylor, W. Lee, S. Amarasinghe, and A. Agarwal, “Scalar operand networks: On-chip
interconnect for ILP in partitioned architectures,” in HPCA. IEEE, 2003.
[18] J. Balfour, R. Harting, and W. Dally, “Operand registers and explicit operand forwarding,”
IEEE Computer Architecture Letters, vol. 8, no. 2, pp. 60–63, 2009.
[19] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient
neural network,” in Advances in NIPS, 2015, pp. 1135–1143.
[20] R. Zhao, Y. Hu, J. Dotzel, C. D. Sa, and Z. Zhang, “Building efficient deep neural networks
with unitary group convolutions,” in CVPR, 2019.
[21] S. Dave and A. Shrivastava, “CCF: A CGRA compilation framework.”
[22] S. A. Chin, N. Sakamoto, A. Rui, J. Zhao, J. H. Kim, Y. Hara-Azumi, and J. Anderson,
“CGRA-ME: A unified framework for CGRA modelling and exploration,” in ASAP, 2017, pp.
184–189.
[23] R. Balasubramonian, A. B. Kahng, N. Muralimanohar, A. Shafiee, and V. Srinivas, “CACTI
7: New tools for interconnect exploration in innovative off-chip memories,” ACM TACO,
vol. 14, no. 2, pp. 1–25, 2017.
Acknowledgements
Thank you to my advisor for helping me write this thesis, and to everyone who came to me
when I was in trouble.

