ShortcutFusion: From Tensorflow to FPGA-based accelerator with
  reuse-aware memory allocation for shortcut data by Nguyen, Duy Thanh et al.
 
Abstract—The residual block is a very common component in
recent state-of-the-art CNNs such as EfficientNet/EfficientDet.
Shortcut data accounts for nearly 40% of feature-map accesses in
ResNet152 [8]. Most previous DNN compilers/accelerators
ignore shortcut data optimization. This paper presents
ShortcutFusion, an optimization tool for FPGA-based accelerators
with a reuse-aware static memory allocation for shortcut data that
maximizes on-chip data reuse under given resource constraints. From
TensorFlow DNN models, the proposed design generates
instruction sets for groups of nodes, using an optimized data
reuse scheme for each residual block. The accelerator design implemented
on the Xilinx KCU1500 FPGA card significantly outperforms the
NVIDIA RTX 2080 Ti, Titan Xp, and GTX 1080 Ti for
EfficientNet inference. Compared to the RTX 2080 Ti, the proposed
design is 1.35-2.33× faster and 6.7-7.9× more power efficient.
Compared to the baseline, in which the
weights/inputs/outputs are accessed from the off-chip memory
exactly once per layer, ShortcutFusion reduces the DRAM
access by 47.8-84.8% for RetinaNet, Yolov3, ResNet152, and
EfficientNet. Given a buffer size similar to that of ShortcutMining [8],
which also “mines” the shortcut data in hardware, the proposed
work reduces off-chip access for feature-maps by 5.27× while
accessing weights from the off-chip memory exactly once.
 
Index Terms— End-to-end, CNN accelerator, FPGA, shortcut 
reuse, reuse-aware, shared MAC 
I. INTRODUCTION 
There have been many works trying to reduce the
complexity and model size of CNNs using depthwise
convolution [1]-[5], [40]. However, whether
they are really efficient when running on general-purpose
processors, such as CPUs/GPUs, has not been studied
thoroughly. A previous study [47] showed that depthwise
convolution achieves low performance in both the training and
inference of various deep learning frameworks such as
Tensorflow [29], Darknet [24], and Pytorch [30].
Recent state-of-the-art compact CNNs such as EfficientNet
[1], EfficientDet [2] and MobileNet v3 [40], which combine the
mobile inverted bottleneck (MBconv) [4] and Squeeze-and-Excitation
optimization (SE block) [6] as shown in Fig. 1, have
set new accuracy records in
classification/detection/segmentation tasks while being less
complex (i.e., fewer BFLOPS) and more compact than
previous works. For instance, EfficientNet-B1 achieves a
higher accuracy than ResNet152 [22] (78.8 vs 77.8) with
7.6 times fewer parameters and 16 times fewer FLOPS. Despite
being much more compact, its inference on an Intel CPU,
NVIDIA Titan Xp (12 TFLOPS) and NVIDIA GTX 1080 Ti
(11.3 TFLOPS) is not fast, as shown in Fig. 2. For running
on an edge accelerator such as the Google edge-TPU, the
original EfficientNets are optimized by replacing depth-wise
convolutions with normal convolutions and removing SE blocks
at the cost of some accuracy loss [41]. Because of the very deep
architecture and lightweight model, besides the kernel
scheduling overhead, the memory bottleneck is also an important
factor in the inference of compact CNNs. For example, for a
768x768 input size and 8-bit precision, EfficientNet-B1
requires 13.34 BFLOPS and 475 MB of intermediate data
access when the inputs/outputs are accessed from the off-chip
memory exactly once, while the model size is merely 9 MB.
Duy Thanh Nguyen, Hyeonseung Je, Tuan Nghia Nguyen, Soojung Ryu, Kyujung Lee, and Hyuk-Jae
Lee, Member, IEEE

   D. T. Nguyen, H.-S. Jae were with Department of Electrical and Computer
Engineering, Seoul National University, Seoul 08826, Korea. They are now
with Samsung Electronics, Korea (email: thanhnd@capps.snu.ac.kr)
   T. N. Nguyen, and H.-J. Lee are with the Inter-University Semiconductor
Research Center, Department of Electrical and Computer Engineering, Seoul
National University, Seoul 08826, Korea
   S.-J Ryu is with SK Telecom, Korea.
   K.-J. Lee is with the Department of Electronic Engineering, Sun Moon
University, Asan, Korea
Corresponding authors: H.-J. Lee, K.-J. Lee.

Fig. 1 (labels recovered from the figure): Block A; Block B; Squeeze & Excitation (Global Average Pooling, 1 x 1 Conv + Swish, 1 x 1 Conv + Sigmoid); k x k DW Conv + BN + Swish; 1 x 1 Conv + BN; 1 x 1 Conv + BN + Swish; element-wise addition (+).

Fig. 2: Latency (ms) of the EfficientNet-B1 inference (batch size 1)
for different input sizes. (Recovered legend entry: CPU (i7) w/ OpenMP.)

There are previous works on end-to-end frameworks for
accelerator designs on FPGAs [15]-[19], [33], [42]-[46]. FP-DNN
in [15] allocates a minimal number of physical buffers in
DRAM (not SRAM) as a memory pool for implementation; it
does not leverage the on-chip buffer in the FPGA efficiently to
reduce the off-chip access. Argus in [16] provides an end-to-end
framework for a multi-CLP type accelerator [53]. The
studies in [18] and [19] also provide end-to-end frameworks
for a multi-layer processor. Because the BRAMs of an FPGA are
not sufficient for multiple hardware units, the data have to be
stored in the off-chip memory. Therefore, in [16], [18], and [19],
a large amount of data access is required for deeper networks.
Even though they work well for shallow networks, their
scalability to a wide range of CNNs is limited. There are
existing frameworks from Tensorflow to FPGAs, such as
[17], [33], [42] and [44]-[46]. DNNWeaver [42] does not
support layer fusion, while the others only support the fusion of
Convolution, Activation and Batch Norm and/or Pooling. As
mentioned in [8], the shortcut connection accounts for 40% of
the feature-map accesses in ResNet-152. None of these frameworks
supports flexible in-hardware data reuse schemes, and all of them
neglect cross-layer shortcut data optimization, which might
cause sub-optimal off-chip access.
There have been many previous studies on the dataflow of
CNN computing [8]-[12], [51]. Among [9], FlexFlow [11], SmartShuttle [12],
and [51], which optimize the dataflow for the shallow network
VGG16, SmartShuttle achieves the highest number of
MAC/DRAM accesses. It switches between two data
reuse schemes: partial-sum oriented and weight oriented. It
requires 434.8 MAC/DRAM accesses (i.e., 214 MB) for
running the CONV layers of VGG16. For a deeper network
such as ResNet-152 or EfficientNet, the larger off-chip access
might cause a long latency because more data are required for
shortcut connections and feature-maps. As reported in that paper,
a buffer size larger than 512 KB does not help to
reduce the DRAM access despite the support for flexible tiling
factors. An efficient scheme needs to read the weights and
feature-maps exactly once from the off-chip memory with a
limited on-chip buffer.
The contributions of this paper are listed below.
• An end-to-end design flow from a Tensorflow frozen model
to CNN inference on an FPGA-based accelerator. A CNN
compiler with reuse-aware static memory allocation
supports cross-layer shortcut reuse to overcome the
challenge of latency optimization under on/off-chip memory
constraints for a wide range of CNNs.
• A hardware accelerator architecture with shared MAC
(Multiplication-and-ACcumulation) arrays that is tailored to the
CNN compiler.
• Comprehensive experiments demonstrating that
ShortcutFusion is more efficient in reducing the off-chip
access with a similar buffer size compared to
previous works. Even though ShortcutFusion is validated with an
FPGA-based accelerator in this study, it is also
applicable to ASIC designs with a unified buffer to optimize
both on-chip buffer size and off-chip DRAM access. It
outperforms recent CPUs/GPUs when running recent
state-of-the-art SE-based CNNs such as
EfficientNet/EfficientDet/MobileNet v3.
II. BACKGROUND 
Low latency with a small batch size is very important in real-time
CNN inference. Therefore, this work optimizes the latency
with a batch size of 1. There are two main weight block
reuse schemes for tile-based convolutional computation:
frame-based weight reuse and row-based weight reuse [23]. It
should be noted that the term weight reuse here refers to each
tiled weight block. For example, for a 3×3 CONV layer, each
tiled weight block is 3×3×Ti×To, where Ti and To are the
parallelism factors for input and output channels, respectively.
The computation of the row-based weight reuse scheme is
depicted in Fig. 3(a). The inputs are
loaded once in a row-by-row manner. The input sliding cube,
while sliding along the width of the input image, is convolved
with To weight blocks each time to produce To temporary output
values. These weight blocks are reused for a row. The input
sliding cube then shifts by Ti channels toward the end of the N input
channels until all input channels are read. To finish the
processing for one row, the entire set of weights is
accessed from the memory. Therefore, to process the whole
input feature-maps of height H, the weights are read H times.
Fig. 3(b) illustrates the computation flow of the frame-based
weight reuse scheme. Because each weight block is reused for
an entire frame (i.e., width × height), it needs to be read from
the buffer exactly once. Therefore, only a small weight block
needs to be buffered. It is noteworthy that the input/output
feature-maps are accessed multiple times from the on-chip
buffer. This scheme is efficient for layers with a large weight
size and a small feature-map size, in which the input/output
feature-maps can be kept on-chip. Meanwhile, the latency
of reading the weight blocks from the off-chip memory can be
hidden by the computation of the sub-frame input.
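The difference between the two schemes can be illustrated by counting weight reads. The following is a back-of-the-envelope sketch, not the authors' timing simulator: the function names and the example layer shape are assumptions made for illustration, and the model only encodes the two statements above (row-based reuse re-reads the whole weight set once per row, frame-based reuse reads each weight block once).

```python
def weight_bytes(k, n_in, n_out, bytes_per_weight=1):
    """Size of all weights of a k x k CONV layer (8-bit weights by default)."""
    return k * k * n_in * n_out * bytes_per_weight


def weight_traffic(k, n_in, n_out, height, scheme):
    """Bytes of weights read for one layer under a weight reuse scheme.

    row   : the whole weight set is re-read for each of the `height` rows.
    frame : each tiled weight block is reused for the entire frame, so the
            weight set is read exactly once.
    """
    size = weight_bytes(k, n_in, n_out)
    if scheme == "row":
        return size * height
    if scheme == "frame":
        return size
    raise ValueError(f"unknown scheme: {scheme}")


# Hypothetical 3x3 layer: 256 input channels, 512 output channels, 13 rows.
row = weight_traffic(3, 256, 512, 13, "row")
frame = weight_traffic(3, 256, 512, 13, "frame")
print(row // frame)  # ratio of weight reads between the two schemes
```

In this simplified model the ratio equals the number of rows, which is why frame-based reuse suits layers with large weights and small feature-maps.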
III. SHORTCUTFUSION 
A. Overview 
A block diagram of the proposed framework is shown in Fig. 4.

Fig. 3. The scheduling for streaming convolutional layer: (a) Frame-based

frameworks for deep learning research: Tensorflow [29], 
Pytorch [30], Caffe [31] and Darknet [24]. Tensorflow is one of 
the most popular DNN frameworks that was developed by 
Google. Hence, this paper uses Tensorflow in the front-end of 
ShortcutFusion. 
It is well known that CNNs are tolerant to errors. Previous
research in [26] and [27] shows that 8-bit precision is efficient for various
DNN inference tasks. The Google TPU [28] also uses 8-bit for
both training and inference. Therefore, this study adopted 8-bit
quantization for the accelerator design. The Tensorflow
model file (protobuf file) is exported to the CNN parser &
analyzer, which parses the architecture of the given CNN and
extracts the quantized parameters. As depicted in Fig. 5(a),
the nodes found in the CNN architecture are then
reorganized into groups as supported by the back-end
accelerator. Existing frameworks such as CloudDNN [17], [33],
TensorRT [39] and Xilinx ML-Suite [44] support the fusion of
CONV, Relu and BatchNorm and/or Pooling only. The lack of
reuse of cross-layer shortcut data might cause large off-chip
accesses, thereby degrading the system performance. To address
this issue, ShortcutFusion tries to group as many nodes as
possible to reduce the intermediate data movement and runtime
overhead. For example, Convolution, Activation,
Normalization, Pooling, Element-wise (shortcut pass), and/or
Up-sampling layers are fused together as supported by the back-end
accelerator. Like TensorRT, the feature merging for
concatenation in the row-reuse mode of this work also
redirects the output to the eventual destination of the
concatenation to avoid data movement. The parameters
extracted by the CNN parser & analyzer are used in the
unified software reference code for hardware verification. In
addition, the CNN architecture information is used by the block-wise
optimizer to select the optimal data access scheduling for
each group of nodes in terms of latency, on-chip buffer
requirement and DRAM access. The selected on-chip buffer sizes
and parallelism factors are used to configure the accelerator. It is
noteworthy that the reuse-aware shortcut optimizer satisfies a
strict constraint on the DRAM access, in which the parameters
are accessed exactly once, the inputs/outputs of some layers are
accessed from the DRAM exactly once, and the inputs/outputs
of other layers are stored on-chip. This constraint is used in the
optimization problem in Section IV-B (Eq. (10)). Finally,
the inference code generates instructions for all layers of the
CNN. As depicted in Fig. 5(b), the instruction set of each node
group consists of 11 words describing the convolution size,
activation type, pooling/upsampling option, fused element-wise
operation, etc. It is noteworthy that the inference code packs the
parameters, input and all instructions and sends them at once to
the hardware accelerator. The details of the reuse-aware shortcut
optimizer are presented in Section IV.
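The node grouping step can be pictured with a small sketch. This is a simplified illustration and not the authors' CNN analyzer: the operator names, the `FUSABLE_AFTER_CONV` set and the greedy rule (each convolution opens a group and absorbs the fusable operators that follow it) are assumptions chosen to mirror the list of fused layer types above.

```python
# Hypothetical operator labels for layer types that may follow a convolution
# inside one fused group.
FUSABLE_AFTER_CONV = {"BatchNorm", "Activation", "Pooling", "Eltwise", "Upsample"}


def group_nodes(ops):
    """Greedily fuse a linear list of operator names into execution groups.

    A group is opened by every operator that cannot be fused. Only groups that
    start with "Conv" absorb following fusable operators, and each operator
    type is absorbed at most once per group.
    """
    groups = []
    for op in ops:
        can_fuse = (
            bool(groups)
            and groups[-1][0] == "Conv"
            and op in FUSABLE_AFTER_CONV
            and op not in groups[-1]
        )
        if can_fuse:
            groups[-1].append(op)
        else:
            groups.append([op])
    return groups


ops = ["Conv", "BatchNorm", "Activation",
       "Conv", "BatchNorm", "Eltwise",
       "Conv", "Pooling", "Concat", "Activation"]
for group in group_nodes(ops):
    print(" + ".join(group))
```

A real front-end works on a graph rather than a list and must also check what the back-end supports for each combination; the sketch only shows why fusing reduces the number of groups that need instructions and intermediate buffers.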
B. Architecture of the FPGA-based CNN accelerator to 
support ShortcutFusion 
Fig. 6 describes the architecture of the accelerator that is tailored to
ShortcutFusion. As mentioned in [8], the shortcut connection
accounts for 40% of the feature-map accesses in ResNet-152.
Therefore, to maximize the shortcut data reuse, the proposed
accelerator has an additional buffer for shortcut data. The
physical buffers {0,1,2} are three interchangeable buffers for
storing the input, output, shortcut data, or parameters of an
entire layer. The group-wise dataflow controller is able to
switch between two levels of weight reuse based on the per-group
instructions to balance computation and off-chip
memory access. The wide circular row buffer and double
weight buffer provide high bandwidth for feeding the
sliding windows and weights to the parallel convolution kernels.
The partial sums from the MAC arrays are stored temporarily in
the out buffer. As different CNN layers may require different
value ranges for data, the proposed design supports a dynamic
fixed-point format to preserve the accuracy. Finally, there is a
chain of modules such as max-pooling, average-pooling,
element-wise addition, and up-sampling. The outputs from the
parallel kernels are forwarded directly to this chain without
being stored back to the memory, which reduces the data movement. It
should be noted that these modules support the same data
reuse schemes as the convolutional module and are therefore
connected to the three physical buffers.

Fig. 4. Block diagram of the entire framework with ShortcutFusion.

Fig. 5. (a) EfficientNet: CNN analyzer re-organizes 418 nodes to 139 groups for execution. (b) Group-wise instructions (figure labels: instruction format for a given CNN; Group 0 (11 words), ..., Group N (11 words); per-group configuration).

1) Convolution kernel design with shared MACs 
A single DSP48E2 in Ultrascale and Ultrascale+ devices supports
two INT8 multiplications sharing the same operand [32] to
increase the DSP efficiency. As the proposed CNN accelerator
targets multiple CNNs in various applications, it supports both
8-bit signed and unsigned feature-maps. In addition, the
weights use 8-bit non-zero quantization, which has been shown
to have a higher accuracy than zero-centered
quantization [23]. Therefore, the proposed design requires
signed 9x9 multiplication, which is not inherently supported by the
DSP48E2 in the double MAC mode. To make it possible,
correction logic is added as described in Fig. 7(a), where Mult0
= I×W0 and Mult1 = I×W1.
Double multiplication can be applied to the normal
convolution because the input feature-maps are shared among
the different weight filters. However, the depth-wise
convolution does not have such input reuse. Fig. 7(b) shows the
block design of a shared MAC. It supports double
multiplication for the normal convolution and single
multiplication for the depth-wise convolution. Finally, as
depicted in Fig. 8, this study proposes a convolution kernel
design with a shared MAC array to utilize the DSPs better. Fig.
8(a) depicts the mapping of a 3×3 depth-wise kernel to the
MAC array. Because recent popular CNNs such as
MobileNet v3, EfficientNet and EfficientDet use 1×1/3×3/5×5
depth-wise kernels, the MAC array is able to process a kernel
in one cycle. In the case where the kernel size is greater than
7×7, multiple cycles are needed to process a kernel. Fig. 8(b)
shows the detailed mapping of the computation of two output kernels
to the two MAC arrays. Sixty-four multiplications for each
normal CONV kernel are interleaved into two shared MAC
arrays, I[63:0] × W{0;1}[63:0], whereas each depth-wise kernel
is processed by a separate MAC array: I_DW0[31:0] ×
W_DW0[31:0] (top array) and I_DW1[31:0] × W_DW1[31:0]
(bottom array). Depending on whether the layer is a normal or depth-wise
CONV, the output from the adder tree is forwarded to the accumulation
unit or to the output buffer directly.
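The arithmetic idea behind two multiplications that share one operand can be checked in software. The sketch below is a generic model of operand packing with a sign correction, not the circuit of Fig. 7(a): the function name, the field offset `SHIFT = 18` and the way the correction is applied are assumptions for illustration only.

```python
SHIFT = 18  # assumed bit offset between the two packed weights


def packed_double_mult(i, w0, w1):
    """Return (i*w0, i*w1) using a single wide multiplication.

    i, w0 and w1 are signed 9-bit values in [-256, 255].
    """
    for v in (i, w0, w1):
        if not -256 <= v <= 255:
            raise ValueError("operands must fit in signed 9 bits")
    packed = (w1 << SHIFT) + w0          # w1 in the upper field, w0 in the lower
    product = i * packed                 # equals i*w1*2**SHIFT + i*w0
    low = product & ((1 << SHIFT) - 1)   # lower field, read as unsigned
    if low >= 1 << (SHIFT - 1):          # correction: sign-extend the lower field
        low -= 1 << SHIFT
    high = (product - low) >> SHIFT      # remove the borrow before shifting
    return low, high


print(packed_double_mult(-7, 5, -3))
```

The correction is needed because a negative lower product borrows from the upper field; subtracting the sign-extended lower product before shifting restores the upper product. The offset of 18 bits leaves enough room because the magnitude of a signed 9x9 product never exceeds 2^16.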
2) Shortcut data reuse in different weight reuse schemes 
To realize ShortcutFusion, the back-end hardware
needs to support cross-layer shortcut data reuse. As illustrated in
Fig. 9, the second inputs of the element-wise addition layer
(shortcut layer) are fetched whenever the first inputs from the
convolution kernels are available. Therefore, in the row-reuse
mode, the fused layers (CONV+shortcut) require only one
Write and two Reads instead of two Writes and three Reads
from the off-chip memory. Similarly, the
frame-reuse mode uses two fewer on-chip data movements.
Moreover, the element-wise layer does not incur an additional
timing overhead, thereby reducing the total latency.
ShortcutMining [8] reuses shortcut data on-chip by reserving
an untouched buffer space for shortcut data. Since it uses a fixed
reuse scheme for all layers, it requires a large buffer size. On
the other hand, the proposed scheme carefully selects the weight
reuse scheme in the memory allocation for shortcut layer data and is
thereby very efficient in reducing the total latency, off-chip
access, and on-chip buffer size.
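The read/write counts quoted for the row-reuse mode can be tallied explicitly. The sketch below is only a bookkeeping aid under the assumption that each tensor is transferred once per use; the tensor labels are hypothetical names, not identifiers from the design.

```python
def feature_map_transfers(fused):
    """Return (reads, writes) of off-chip feature-map transfers for one CONV
    layer followed by an element-wise (shortcut) addition in row-reuse mode."""
    if fused:
        # The CONV result is forwarded on-chip to the adder.
        reads = ["conv_input", "shortcut_data"]
        writes = ["sum_output"]
    else:
        # The CONV result makes a round trip through off-chip memory.
        reads = ["conv_input", "conv_output", "shortcut_data"]
        writes = ["conv_output", "sum_output"]
    return len(reads), len(writes)


saved = sum(feature_map_transfers(False)) - sum(feature_map_transfers(True))
print(feature_map_transfers(True), feature_map_transfers(False), saved)
```

The two saved transfers are the write and the re-read of the intermediate CONV output.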
IV. REUSE-AWARE SHORTCUT OPTIMIZER 
Because this study supports shortcut data reuse for different
reuse schemes, the proposed scheme is able to switch the data
reuse scheme between blocks of CONV layers, which is called block-wise data
reuse. As shown in Fig. 10, a block of layers is defined as a
residual block or a single CNN layer which does not belong to
any residual block. Given the buffer constraints, the proposed
optimizer searches for the optimal switching between the two
weight reuse schemes (row-reuse and frame-reuse) for each
block to optimize the latency, on-chip buffer size, and DRAM
access. It should be noted that both the weights and feature-maps
are read from the off-chip memory exactly once.

Fig. 7. (a) Mapping of 2 signed 9x9 MULTs to a DSP48E2. (b)
Shared MAC for a normal/depth-wise convolution.

A cut-point is defined as the position in the CNN graph where
the data reuse scheme switches. A CNN comprising N basic
blocks might have up to 2^N different switching schemes. An
exhaustive search for the optimum data reuse policy is
impractical for general CNNs which have hundreds of layers.
Moreover, given a buffer constraint, it is not possible to get an
optimized reuse policy for all blocks because each block
requires a different constraint for its distributed buffers. It has been
observed that, in all the recent CNNs, the feature-map
size monotonically increases or decreases over a certain sequence
of blocks. Therefore, this study proposes a coarse-grained
block-wise shortcut reuse scheme which has been validated on
recent very deep CNNs. In the proposed relaxation, a sequence
of blocks of increasing or decreasing size is assumed to have
exactly one cut-point. Cut-points divide a CNN into segments
as illustrated in Fig. 11. Layers in the same segment use the same
weight reuse scheme. The number of cut-points depends on the
CNN architecture, which varies from a plain or residual
structure to a multi-scale, multi-branch architecture. In Fig. 11, a
classification CNN has a single cut-point because the CNN
structure goes from the largest scale to the smallest scale. By
the same reasoning, an auto-encoder CNN has two cut-points.
Fig. 12 shows the details of the CNN categorization according
to the Feature Network, which might require a different number
of cut-points. Fig. 12(a) shows an object detector with the
Feature Pyramid Network (FPN [34]). Yolov3 [21] and
RetinaNet [36] also use an FPN. These CNNs require
two cut-points. PANet [35] fuses the feature-maps both top-down
and bottom-up. Therefore, PANet requires three cut-points.
For the recent state-of-the-art object
detection/segmentation network EfficientDet [2], the number of cut-points
depends on the number of BiFPN (Bidirectional Feature
Pyramid Network) layers. For example, if there is one repeated block,
there are only three cut-points because there is only one
top-down and one bottom-up path aggregation. If there is
more than one repeated block, the number of
cut-points is equal to (2 × repeated_blocks + 1).
The challenge is to find the cut-point positions that achieve the
minimum latency while satisfying the buffer constraints and
DRAM access constraints.
A. Reuse-aware static memory allocation 
For the row-reuse mode, memory space for inputs, outputs
and shortcuts is allocated in the off-chip memory. On the other
hand, in the frame-reuse mode, the inputs, outputs and shortcuts
are allocated to one of the three physical buffers to eliminate
the off-chip access. Given the CNN architecture, the memory
allocator statically allocates buffers for each layer by assigning
the three variables {alloc_input, alloc_output, alloc_shortcut}
values in {0, 1, 2} such that the shortcut data stored on-chip
can be reused. It should be noted that the data of a long-path shortcut
connection for concatenation are stored off-chip to avoid long-lifetime
data in the on-chip buffers.

Fig. 8. (a) Mapping a 3x3 depthwise kernel to the shared MAC array.
(b) Mapping of two normal/depthwise output kernels to 2 shared
MAC arrays.

Fig. 9. Shortcut data reuse in row-reuse mode (left) and frame-reuse
(right).

Fig. 10. Block-wise data reuse switching.

(Figure labels recovered near Figs. 10 and 11: CONV1, CONV2, CONV3; Frame reuse, Row reuse, Frame reuse; A block.)

Fig. 11. Examples of a single cut-point (left) and double cut-points
in CNNs.

Fig. 12. Categorization of CNNs according to the architecture of the

Fig. 13 shows detailed examples of the on/off-chip memory
access management for different network structures. In Fig.
13(a), networks with a plain structure such as VGGNet,
Darknet19 and SimYolov2 [20] require only two buffers. On
the other hand, a CNN with residual blocks as in Fig. 13(b)
requires three buffers to reuse the shortcut data. The outputs
from the last CONV layer in a residual block are forwarded to
the Shortcut layer to reduce intermediate data access.
Fig. 13(c) and 13(d) show the memory allocation for the
residual block with the Squeeze-and-Excitation optimization
under different weight reuse schemes. In the row-reuse mode
shown in Fig. 13(c), the outputs from Global Average Pooling
and the two FC layers are stored on-chip because their size is small.
The last layer in the SE block is a scale layer (i.e., the red
multiplier in the figure) that works in the same way as a 1x1
depthwise CONV layer without batch normalization. Unlike
the row-reuse mode, the frame-reuse mode in Fig.
13(d) allocates all data to the three on-chip buffers. The
outputs from the depthwise CONV layer (DW CONV) are
stored in buffer B1. In parallel, the outputs from (DW CONV +
Pooling) are stored in buffer B0 with an offset address to avoid
overwriting the input in buffer B0. Similar to Fig. 13(c), the
outputs from the two FC layers and the data in buffer B1 serve as
the weights and inputs of the scale layer, respectively. The residual
block with the SE optimization is known to be inefficient on
GPUs even though it does not incur much computation
overhead. This dataflow-aware static memory allocation and
on-chip data forwarding are very efficient in reducing off-chip
memory access for residual blocks with SE optimization.
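The buffer counts for the plain and residual cases can be reproduced with a small allocation sketch. This is a hypothetical illustration, not the paper's `buff_alloc`: the layer flags `shortcut_source` and `adds_shortcut` and the lowest-free-index rule are assumptions. It handles a single live shortcut at a time and ignores the SE branch and the offset addressing described above.

```python
def allocate(layers, num_buffers=3):
    """Assign on-chip buffers to a chain of layers in frame-reuse mode.

    layers: list of dicts with optional boolean keys
      "shortcut_source": the input of this layer is kept for a later addition
      "adds_shortcut":   the output of this layer is fused with the kept data
    Returns one (alloc_input, alloc_output, alloc_shortcut) tuple per layer,
    where alloc_shortcut is the buffer holding live shortcut data or None.
    """
    plan = []
    in_buf = 0
    live_shortcut = None
    for layer in layers:
        if layer.get("shortcut_source"):
            live_shortcut = in_buf        # keep the block input untouched
        blocked = {in_buf, live_shortcut}
        out_buf = next(b for b in range(num_buffers) if b not in blocked)
        plan.append((in_buf, out_buf, live_shortcut))
        if layer.get("adds_shortcut"):
            live_shortcut = None          # shortcut consumed, buffer is free again
        in_buf = out_buf                  # the output feeds the next layer
    return plan


residual_block = [{"shortcut_source": True}, {}, {"adds_shortcut": True}]
plain_network = [{}, {}, {}]
print(allocate(residual_block))
print(allocate(plain_network))
```

With these assumptions a plain chain ping-pongs between two buffers, while a residual block touches all three because the block input must survive until the addition.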
B. Optimizing shortcut data reuse with given constraints 
Let L denote the data reuse policy. It is noteworthy
that layers in the same basic block use the same data reuse scheme. To
calculate the required size of each buffer {0,1,2} according to
the data reuse policy L, the static buffer allocation for each layer i
needs to be considered as shown in step 1 of Algorithm 1.
{alloc_in(i), alloc_out(i), alloc_shortcut(i)} are the buffer
allocations for the input/output/shortcut data of layer i,
respectively.
To derive the total raw SRAM size, the sizes of the following
buffers need to be calculated: weight_buff, row_buff, out_buff,
and write_buff. It is noteworthy that all the buffers have the
same number of banks, which equals the parallelism factors Ti=To,
to avoid the logic congestion of the buffer bank selection. In
the row-reuse mode, the entire weights of a layer are pre-loaded
to the on-chip buffer. Therefore, the required buffer size for
weights is as follows:
\[ \mathrm{weight\_buff}(L) = \max_{i=\mathrm{row\_reuse}} \mathrm{weight\_size}(i) \tag{1} \]
It should be noted that the double weight block buffer for
feeding weight blocks to the parallel convolution kernels is stored
in the LUT-RAMs of the FPGA chip because its depth is small
(2×3×3 = 18). Because buffer 1 is shared by both feature-maps
and weights, the size of buffer 1 is as follows:
\[ \mathrm{buff}[1](L) = \max\big(\mathrm{weight\_buff}(L),\ \mathrm{buff}[1](L)\big) \tag{2} \]
The proposed convolutional kernel design focuses on the 3x3
normal convolution used in most CNNs and the 1x1/3x3/5x5
depthwise convolutions of EfficientNet/Det. However, it can
also support convolutions with a filter size larger than 7x7 by
merely increasing the number of row buffers and the double weight
block buffer depth. In the proposed design, there are
six rows in the row buffer, including one row for input
prefetching:
\[ \mathrm{row\_buff}(L) = \max_i 6 \times \mathrm{in\_rowsize}(i) = \max_i 6 \times w_i \times N_i \tag{3} \]
where w_i and N_i are the width and the number of input channels of layer i,
respectively. Regarding the buffer for temporary partial sums,
the buffer size for the frame-based reuse mode is larger than
that for the row-based reuse mode because the row-reuse mode
needs to buffer only one row. Therefore, the size of the partial sum buffer
(4-byte width) is derived as below:
\[ \mathrm{out\_buff}(L) = \max_{i=\mathrm{frame\_reuse}} \mathrm{out\_buff}(i) = \max_{i=\mathrm{frame\_reuse}} \mathrm{out\_w}_i \times \mathrm{out\_h}_i \times T_o \times 4 \tag{4} \]
Finally, the output from the accelerator is buffered in the
write buffer before being written to the off-chip memory. In the
frame-based reuse mode, except for the final outputs of the
CNN and long-path shortcut/concatenation, all intermediate
data are stored on-chip. Therefore, the size of the write buffer
is as below:
\[ \begin{aligned} \mathrm{write\_buff}(L) &= \max\Big(\max_{i=\mathrm{row\_reuse}} \mathrm{write\_buff}(i),\ \max_{i=\mathrm{frame\_reuse}\,\&\,i=\mathrm{final\_layers}} \mathrm{write\_buff}(i)\Big) \\ &= \max\Big(\max_{i=\mathrm{row\_reuse}} \mathrm{out\_w}_i \times T_o,\ \max_{i=\mathrm{frame\_reuse}\,\&\,i=\mathrm{final\_layers}} \mathrm{out\_w}_i \times \mathrm{out\_h}_i \times T_o\Big) \end{aligned} \tag{5} \]

Fig. 13. On/Off-chip buffer management in ShortcutFusion. (a) Plain
network. (b) Network with the residual block. (c) Residual block w/
the Squeeze-and-Excitation (SE) block in row-based weight reuse. (d)

In summary, the required raw SRAM size is as follows:
\[ \mathrm{SRAM\_size}(L) = \mathrm{row\_buff}(L) + \mathrm{out\_buff}(L) + \mathrm{write\_buff}(L) + \mathrm{buff}[0](L) + \mathrm{buff}[1](L) + \mathrm{buff}[2](L) \tag{6} \]
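Equations (1)-(6) can be evaluated directly once the per-layer shapes and the reuse mode of each layer are known. The following sketch is an illustrative transcription, not the authors' tool: the `Layer` fields, the function name and the toy numbers are assumptions, and the feature-map requirements `buff` are taken as given inputs (they come from Algorithm 1).

```python
from dataclasses import dataclass


@dataclass
class Layer:
    mode: str              # "row" or "frame" (weight reuse scheme of the layer)
    in_w: int              # input width w_i
    n_in: int              # number of input channels N_i
    out_w: int             # output width
    out_h: int             # output height
    weight_size: int       # weight size of the layer in bytes
    is_final: bool = False  # output must be written off-chip in frame mode


def sram_size(layers, buff, t_o, rows=6, psum_bytes=4):
    """Raw SRAM size in bytes; buff = [buff0, buff1, buff2] feature-map needs."""
    row_layers = [l for l in layers if l.mode == "row"]
    frame_layers = [l for l in layers if l.mode == "frame"]
    weight_buff = max((l.weight_size for l in row_layers), default=0)      # (1)
    buff1 = max(weight_buff, buff[1])                                      # (2)
    row_buff = max((rows * l.in_w * l.n_in for l in layers), default=0)    # (3)
    out_buff = max((l.out_w * l.out_h * t_o * psum_bytes
                    for l in frame_layers), default=0)                     # (4)
    write_buff = max(
        max((l.out_w * t_o for l in row_layers), default=0),
        max((l.out_w * l.out_h * t_o
             for l in frame_layers if l.is_final), default=0),
    )                                                                      # (5)
    return row_buff + out_buff + write_buff + buff[0] + buff1 + buff[2]    # (6)


# Toy two-layer example with made-up sizes.
example_layers = [
    Layer("row", in_w=8, n_in=4, out_w=8, out_h=8, weight_size=100),
    Layer("frame", in_w=4, n_in=8, out_w=4, out_h=4, weight_size=500, is_final=True),
]
print(sram_size(example_layers, buff=[64, 32, 0], t_o=2))
```

Note that the weight buffer does not appear as a separate term in (6) because it shares buffer 1 with the feature-maps through (2).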
The raw SRAM size does not physically reflect the real
BRAM utilization of an FPGA chip. Therefore, the number of
[...]                     (7)
Regarding the necessity of reducing the total DRAM access
in a CNN computation, the previous study in [37] shows that
the energy consumed by an off-chip access is much larger than
that of an on-chip access or an arithmetic operation in an ASIC
chip. Therefore, it is important to constrain the off-chip access.
The proposed design supports data forwarding as discussed in
Section III-B-2. Hence, the DRAM access for feature-maps and the
total DRAM access are calculated as shown in (8) and (9),
respectively.
\[ \mathrm{DRAM\_FM}(L) = \dots + \sum_{i=\mathrm{frame\_reuse},\ i=\mathrm{concat/route}} 2 \times \mathrm{in\_size}(i) \tag{8} \]
\[ \mathrm{TotalDRAM}(L) = \mathrm{DRAM\_FM}(L) + \sum_{i} \mathrm{weight\_size}(i) \tag{9} \]
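The optimization problem is to find the data reuse policy L
as follows:
\[ \begin{aligned} \min_{L}\ & \mathrm{latency}(L, T_i, T_o) \quad \text{s.t.:} \\ & \mathrm{DSP}(T_i, T_o) < \alpha \qquad (*) \\ & \mathrm{BRAM18K}(L, T_i, T_o) < \beta \\ & \text{weights access} = 1,\ \text{feature maps access} \le 1 \end{aligned} \tag{10} \]
where the constraint in (10) is explained in Section III-A.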
A problem arises in that latency estimation by
running an RTL simulation for each candidate takes a very
long time. Because the number of candidates in the design space
can be large, the RTL simulation approach is not feasible.
Therefore, this work builds a cycle-accurate timing simulator to
estimate the latency of a CNN layer running the different reuse
schemes described in Fig. 3 and Fig. 4. The latency
estimation model was verified against the RTL simulation of the
CNN for all the candidates for policy L in the design space.
Single cut-point optimization: In Fig. 11 (left), the i-th layer is
in the row-reuse mode if i < L and in the frame-reuse mode
otherwise.
Multiple cut-points optimization: Fig. 14 shows an example
of double cut-points in RetinaNet. The network “cut” is the
position that divides RetinaNet into two sub-networks: from
the beginning to the smallest scale and from the smallest scale
to the end. From this network cut, the relative position of the
data reuse policy L = (L1, L2) is defined, as shown in Fig. 15,
where 0 ≤ L1 < N1 and 0 ≤ L2 < N2. The real layer indexes of
L are (L1, N1 + L2). Layer i is in the row-reuse mode if i < L1 or i ≥ N1 + L2
and in the frame-reuse mode otherwise. The other multiple cut-point
cases can be handled with the same exhaustive search for the
optimum policy in polynomial time O(N^k), where N and
k are the depth of the sub-networks and the number of
cut-points in the CNN, respectively.
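The double cut-point search can be sketched as a plain enumeration over (L1, L2). The code below is an illustrative toy under stated assumptions: `latency` and `bram` are caller-supplied stand-ins for the timing simulator and the BRAM model, the DSP constraint is omitted, and the feature-map sizes and cost numbers are made up.

```python
from itertools import product


def layer_modes(n1, n2, l1, l2):
    """Row-reuse if i < L1 or i >= N1 + L2, frame-reuse otherwise."""
    return ["row" if (i < l1 or i >= n1 + l2) else "frame"
            for i in range(n1 + n2)]


def search_double_cut(n1, n2, latency, bram, beta):
    """Enumerate all N1 * N2 policies and keep the fastest one with BRAM < beta.

    Returns (best_latency, (L1, L2)) or None if no policy is feasible.
    """
    best = None
    for l1, l2 in product(range(n1), range(n2)):
        modes = layer_modes(n1, n2, l1, l2)
        if bram(modes) >= beta:
            continue
        candidate = (latency(modes), (l1, l2))
        if best is None or candidate < best:
            best = candidate
    return best


# Toy network: feature-map size shrinks over 4 layers, then grows over 4 layers.
fm_size = [8, 4, 2, 1, 1, 2, 4, 8]


def toy_latency(modes):
    # Assumed cost: a row-reuse layer costs 2 units, a frame-reuse layer 1 unit.
    return sum(2 if m == "row" else 1 for m in modes)


def toy_bram(modes):
    # Assumed cost: on-chip need is the largest feature-map among frame-reuse layers.
    return max((s for s, m in zip(fm_size, modes) if m == "frame"), default=0)


print(search_double_cut(4, 4, toy_latency, toy_bram, beta=3))
```

In this toy setting only layers with small feature-maps fit on-chip, so the search places the frame-reuse segment around the smallest scale, which mirrors the intuition behind the network cut.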
Algorithm 1: Calculation of the required buffer size.
Input: CNN architecture with N layers
Output: required buffer size w.r.t. the data reuse
policy L: buff[0](L), buff[1](L), buff[2](L)

For each layer i in frame-reuse do
 1. {alloc_in(i), alloc_out(i), alloc_shortcut(i)}
      = buff_alloc(layer i);
 2. buff[alloc_in(i)](L) =
      max(buff[alloc_in(i)](L), input_size(i));
 3. if (to_residual(i) == yes) // layer i is used for residual layer
      buff[alloc_shortcut(i)](L) =
      max(buff[alloc_shortcut(i)](L), output_size(i));
 4. if (NextLayer(i) == Maxpool2x2) // fused conv+pool
      buff[alloc_out(i)](L) =
      max(buff[alloc_out(i)](L), output_size(i)/4);
    else
      buff[alloc_out(i)](L) =
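The visible steps of Algorithm 1 translate into a short Python function. This is a sketch under assumptions rather than the authors' code: the dictionary keys and the `buff_alloc` callback are hypothetical, and the `else` branch, which is cut off in the listing above, is assumed to use the full output size.

```python
def required_buffer_sizes(layers, buff_alloc):
    """Compute [buff0, buff1, buff2] for the layers that run in frame-reuse mode.

    layers: list of dicts with keys "mode", "input_size", "output_size",
            "to_residual" (bool) and "next_is_maxpool2x2" (bool).
    buff_alloc: function mapping a layer to (alloc_in, alloc_out, alloc_shortcut).
    """
    buff = [0, 0, 0]
    for layer in layers:
        if layer["mode"] != "frame":
            continue
        a_in, a_out, a_sc = buff_alloc(layer)                      # step 1
        buff[a_in] = max(buff[a_in], layer["input_size"])          # step 2
        if layer["to_residual"]:                                   # step 3
            buff[a_sc] = max(buff[a_sc], layer["output_size"])
        out_size = layer["output_size"]
        if layer["next_is_maxpool2x2"]:                            # step 4
            out_size //= 4   # fused 2x2 pooling shrinks the stored output
        # Assumption: without fused pooling the full output size is used.
        buff[a_out] = max(buff[a_out], out_size)
    return buff


example = [
    {"mode": "frame", "input_size": 100, "output_size": 80,
     "to_residual": False, "next_is_maxpool2x2": True, "alloc": (0, 1, 2)},
    {"mode": "frame", "input_size": 20, "output_size": 40,
     "to_residual": True, "next_is_maxpool2x2": False, "alloc": (1, 0, 2)},
    {"mode": "row", "input_size": 999, "output_size": 999,
     "to_residual": False, "next_is_maxpool2x2": False, "alloc": (0, 1, 2)},
]
print(required_buffer_sizes(example, lambda layer: layer["alloc"]))
```

In the example the allocation is simply stored with each layer; in practice it would come from the static memory allocator of Section IV-A.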






Fig. 14. RetinaNet with a network “cut.” 
 
Fig. 15. Double cut-points (L1, L2) in RetinaNet. 
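
Algorithm 1 can be sketched in Python as follows. Several parts are assumptions made only for illustration: the Layer record, the ping-pong buff_alloc stand-in (the actual allocation function is defined elsewhere in the paper), a single cut-point to decide which layers are in frame-reuse, and sizes expressed as integer byte counts.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    input_size: int                   # bytes of the input feature-map
    output_size: int                  # bytes of the output feature-map
    to_residual: bool = False         # output is later used as shortcut data
    next_is_maxpool2x2: bool = False  # conv is fused with a 2x2 max-pool


def buff_alloc(i):
    """Hypothetical stand-in for buff_alloc: ping-pong between buffers 0
    and 1 for input/output and keep shortcut data in buffer 2."""
    alloc_in = i % 2
    return alloc_in, 1 - alloc_in, 2


def required_buffer_size(layers, cut_point):
    """Buffer sizes [buff0, buff1, buff2] needed by the frame-reuse layers."""
    buff = [0, 0, 0]
    for i, layer in enumerate(layers):
        if i < cut_point:  # row-reuse layers are skipped by Algorithm 1
            continue
        a_in, a_out, a_sc = buff_alloc(i)
        buff[a_in] = max(buff[a_in], layer.input_size)
        if layer.to_residual:
            buff[a_sc] = max(buff[a_sc], layer.output_size)
        # A fused 2x2 max-pool shrinks the stored output by a factor of 4.
        out = layer.output_size // 4 if layer.next_is_maxpool2x2 else layer.output_size
        buff[a_out] = max(buff[a_out], out)
    return buff


layers = [
    Layer(100, 80),
    Layer(80, 64, next_is_maxpool2x2=True),
    Layer(16, 16, to_residual=True),
    Layer(16, 16),
]
print(required_buffer_size(layers, cut_point=1))
```

Moving the cut-point toward the first layer pulls larger feature-maps into the frame-reuse region, so the required buffer sizes can only stay the same or grow.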
 
 
V. EXPERIMENTAL RESULTS 
A. Reuse-aware shortcut optimizer 
Fig. 16(a) and 16(b) show the buffer size, DRAM access, and 
latency (i.e., inference time per single image) with regard to the 
cut-point position for YOLO v2. The minimum SRAM required 
is 0.76 MB, corresponding to layer 12 (CONV9). Compared to 
the baseline, which uses a fixed row-based weight reuse scheme, 
the proposed scheme achieves a 2.17 times speed-up, as shown 
in Fig. 16(c), while requiring a 5.73 times smaller buffer size. 
The speed-up is seen from CONV9 because the proposed 
scheme reuses feature-maps on-chip and the weight load time 
is hidden by frame-based computation (as shown in Fig. 4(a)). 
Fig. 17 shows the performance of various CNNs, namely YOLO 
v3 (77 CONV layers), ResNet152 (152 CONV layers), and 
EfficientNet-B1 (139 CONV layers), with respect to the 
cut-point positions. The optimizer performs an exhaustive 
search to find the minimum buffer size that satisfies the DRAM 
access constraints. It should be noted that in Fig. 17, the weights 
are accessed exactly once. Hence, the total DRAM access is 
always larger than the DRAM access for the feature-maps (FM) 
by the weight size. It can be seen that for all 
CNNs, a cut-point at the beginning achieves a better latency 
at the cost of a larger buffer size. As long as the buffer 
constraints are satisfied, the frame-based weight reuse scheme 
is better than the row-based weight reuse scheme in terms of 
both latency and DRAM access reduction. 
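The selection step of the optimizer can be illustrated with a small sketch. The exact form of the DRAM access constraint is given by equation 10 of the paper and is not reproduced here; the sketch simply assumes an upper limit on the total access, with the total computed as feature-map access plus the weights loaded once. The function name, the dictionary keys, and all numbers below are made up for illustration.

```python
def pick_cut_point(candidates, weight_mb, dram_limit_mb):
    """Return the candidate with the smallest buffer whose total DRAM
    access (feature-maps + weights loaded once) stays within the limit.

    `candidates` is a list of dicts with the hypothetical keys 'cut',
    'buffer_mb' and 'fm_dram_mb'. Returns None if no candidate fits.
    """
    feasible = [
        c for c in candidates
        if c["fm_dram_mb"] + weight_mb <= dram_limit_mb
    ]
    return min(feasible, key=lambda c: c["buffer_mb"], default=None)


# Made-up candidates: an earlier cut-point needs more buffer but less DRAM.
candidates = [
    {"cut": 0, "buffer_mb": 3.0, "fm_dram_mb": 1.0},
    {"cut": 10, "buffer_mb": 1.5, "fm_dram_mb": 20.0},
    {"cut": 20, "buffer_mb": 0.8, "fm_dram_mb": 60.0},
]
print(pick_cut_point(candidates, weight_mb=25.0, dram_limit_mb=50.0))
```

In this toy setting the candidate with cut 10 is chosen: cut 20 has the smallest buffer but exceeds the limit, and cut 0 meets the limit with a larger buffer.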
Table I compares the proposed design with 
the previous work on ResNet152 inference. For a fair 
comparison, the proposed accelerator is designed with 16-bit 
precision, and the BRAM constraint is set similar to that of 
Shortcut Mining [8]. In this case, each multiplication is mapped to a 
single DSP. The proposed scheme achieves a similar DSP 
efficiency while significantly reducing the off-chip access for 
the weights and feature-maps. Shortcut Mining uses a large 
number of parallel buffer banks that are shared by both the 
feature-maps and partial sums. Because the bit width of the 
partial sums is many times larger than that of the feature-maps, 
some of the buffer space might be wasted. In addition, the 
fixed data reuse scheme in [8] might require a very high BRAM 
utilization and frequent off-chip access. 
[Fig. 17 panels: (a) YOLO v3 buffer size; (b) YOLO v3 DRAM access & latency; (c) ResNet152 buffer size; (d) ResNet152 DRAM access & latency; (e) EfficientNet-B1 buffer size; (f) EfficientNet-B1 DRAM access & latency]

TABLE I
RESNET152 - PERFORMANCE COMPARISON TO PREVIOUS WORKS
Features        HPCA’19 [8]     Proposed
FPGA board      VC707           KCU1500
Frequency       150 MHz         200 MHz
Logics (K)      283.8 (86%)     215.3 (33%)
DSPs            2800 (100%)     2240 (41%)
BRAM18K         2040 (99%)      1945 (45%)
Input size      224x224         224x224
Precision       16-bit          16-bit
Weights         112.6 MB        112.6 MB
Throughput      608.3 GOPS      607.5 GOPS
DSP efficiency  72.4%           71.1%
Weight load     Multiple times  Once

TABLE II
MINIMUM BUFFER SIZE FOR EACH CNN
CNN              Input size  Layers [*]  Buffer size
YOLO v2          416x416     21          0.762 MB
VGG-CONV         224x224     13          0.712 MB
YOLO v3          416x416     106         1.682 MB
RetinaNet        512x512     137         2.392 MB
ResNet50/152     224x224     68/204      1.039 MB
EfficientNet-B1  256x256     181         0.43 MB
[*] including other layers such as shortcut, concatenation, etc.

TABLE III
BUFFER SIZE TO MINIMIZE OFF-CHIP ACCESS FOR VGG-CONV
             OLAccel [38]  SmartShuttle [12]  Proposed
Network      VGG-CONV (all designs)
Precision    Mixed (4,8)   8-bit              8-bit
SRAM size    2.4 MB        0.75 MB            0.712 MB
DRAM access  42.8 MB       58.1 MB            42.8 MB

B. Minimum buffer size requirement to satisfy the DRAM 
access constraints 
Table II shows the minimum required buffer size for various 
CNNs to meet the DRAM access constraints (equation 10). 
These buffer sizes are not only practical for small- to medium-size 
FPGA chips but also ASIC chips where the size of the 
SRAM might dictate the chip size. For example, the Google 
TPU [28] contains 28 MiB of on-chip buffer, which accounts 
for 30% of the chip area. The DRAM access constraints limit 
the DRAM accesses, which is also important for reducing the 
energy consumption of ASIC chips [37]. Table III compares 
the buffer size and DRAM access of the proposed scheme 
with those of previous works for the VGG16 CONV layers. 
With the same amount of DRAM access (i.e., inputs/outputs are 
accessed once), the proposed scheme requires a 3.4 times 
smaller buffer than [38] due to the adaptive reuse policy. 
Compared to SmartShuttle, which proposes layer-wise data 
reuse schemes, the proposed scheme reduces DRAM access by 
1.36 times with a smaller on-chip buffer size. This demonstrates 
the efficiency of adaptive switching between row-based and 
frame-based weight reuse schemes over the 
reuse schemes using tiled inputs/outputs in SmartShuttle. 
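The two ratios quoted above follow directly from the SRAM size and DRAM access rows of Table III. The short check below only redoes that arithmetic; the variable names are illustrative.

```python
# Values copied from Table III (VGG-CONV).
sram_mb = {"OLAccel": 2.4, "SmartShuttle": 0.75, "Proposed": 0.712}
dram_mb = {"OLAccel": 42.8, "SmartShuttle": 58.1, "Proposed": 42.8}

# How many times smaller the proposed buffer is than the OLAccel buffer.
buffer_ratio = sram_mb["OLAccel"] / sram_mb["Proposed"]
# How many times less DRAM traffic the proposed scheme needs than SmartShuttle.
dram_ratio = dram_mb["SmartShuttle"] / dram_mb["Proposed"]

print(f"buffer vs. OLAccel: {buffer_ratio:.1f} times smaller")
print(f"DRAM access vs. SmartShuttle: {dram_ratio:.2f} times lower")
```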
Finally, the performance of various state-of-the-art CNNs 
using the proposed scheme is shown in Table IV. Depending 
on the on-chip buffer constraints, the proposed scheme 
minimizes the DRAM access for the feature-maps while 
accessing parameters exactly once for all the CNNs due to strict 
off-chip access constraints in the proposed optimization. 
Compared to the baseline, in which all data are accessed off-
chip exactly once, the proposed scheme reduces the total 
DRAM access by 47.8-84.8% for the various CNNs. 
Table V compares several end-to-end 
frameworks for ResNet50 inference. All three previous 
works utilize the large Ultra-RAM of a cloud-scale Xilinx FPGA, 
which has 6840 DSPs and 270 Mb of Ultra-RAM. However, 
none of them supports flexible data reuse. In contrast, the proposed 
framework supports adaptive data reuse schemes with 
in-hardware shortcut fusion, thereby completely removing the 
off-chip access for intermediate data, including the shortcut data, 
while using less SRAM resource. Compared to Cloud-DNN, 
the proposed work utilizes 7.4 times less SRAM resource while 
having 1.07 times higher DSP efficiency. In addition, the proposed 
work achieves a competitive GOPS and 2.4 times higher DSP 
efficiency than ML-Suite while requiring 6.0 times less SRAM 
resource and running at a 2.5 times lower frequency. 
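The ratios in this paragraph can be rechecked from the SRAM size, DSP efficiency, and frequency rows of Table V. Which table column belongs to Cloud-DNN and which to ML-Suite is an assumption here, inferred from the ratios quoted in the text rather than read from the table itself.

```python
def times(a, b):
    """Ratio a / b, i.e. how many times larger a is than b."""
    return a / b


# Values copied from Table V; the column-to-framework mapping is assumed.
proposed = {"sram_mb": 5.2, "dsp_eff": 56.14, "freq_mhz": 200}
cloud_dnn = {"sram_mb": 38.3, "dsp_eff": 52.58}
ml_suite = {"sram_mb": 31.2, "dsp_eff": 23.47, "freq_mhz": 500}

print(f"SRAM vs. Cloud-DNN: {times(cloud_dnn['sram_mb'], proposed['sram_mb']):.1f}")
print(f"DSP eff. vs. Cloud-DNN: {times(proposed['dsp_eff'], cloud_dnn['dsp_eff']):.2f}")
print(f"DSP eff. vs. ML-Suite: {times(proposed['dsp_eff'], ml_suite['dsp_eff']):.1f}")
print(f"SRAM vs. ML-Suite: {times(ml_suite['sram_mb'], proposed['sram_mb']):.1f}")
print(f"Frequency vs. ML-Suite: {times(ml_suite['freq_mhz'], proposed['freq_mhz']):.1f}")
```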
TABLE IV
PERFORMANCE OF THE VARIOUS CNNs USING THE PROPOSED SCHEME
                    ResNet50     ResNet152    YOLO v2      YOLO v3      RetinaNet    EfficientNet-B1
Platform            Xilinx KCU1500 (all networks)
Frequency           200 MHz (all networks)
Data format         8-bit (all networks)
Input size          256x256      256x256      416x416      416x416      512x512      256x256
CNN size (GOP)      11.76        31.16        17.18        65.86        102.2        1.38
LUTs/FFs (K)        212.7/361.5  212.7/361.5  203.1/331.0  213.3/352.0  264.3/367.2  264.1/375.7
DSP utilization     2240         2240         2240         2240         2240         2240
BRAM18K             2368 (55%)   2368 (55%)   2304 (53%)   3020 (70%)   3766 (87%)   2594 (50%)
Latency (ms)        11.69        26.78        14.73        57.57        93.16        4.69
GOPS                1006         1163         1166         1142         1097         317.1
MAC efficiency      61.4%        71.0%        71.2%        69.7%        67.0%        19.37%
Weight load         Once         Once         Once         Once         Once         Once
Off-chip FMs        0.19 MB      0.19 MB      0.66 MB      90.6 MB      136.4 MB     0.19 MB
Total off-chip [*]  59.09 MB     130.2 MB     48.9 MB      153.5 MB     261.34 MB    60.7 MB
Off-chip reduction  60.62%       56.7%        70.31%       60.34%       47.81%       84.81%
[*] Total off-chip memory access if data (weights/inputs/outputs) are accessed exactly once.

TABLE V
URAM size                      270 Mb      270 Mb      270 Mb      --
Frequency                      500 MHz     125 MHz     214 MHz     200 MHz
Framework                      TensorFlow  TensorFlow  Caffe       TensorFlow
Network                        ResNet50 (all designs)
Input size                     224x224     224x224     224x224     256x256
Precision                      8-bit       8-bit       16-bit      8-bit
Latency                        7.77 ms     23.8 ms     8.12 ms     11.9 ms
LUTs                           612K        605K        696K        217K
DSPs                           5493        6005        5489        2240
GOPS                           1290        328         1235        1006
Data reuse                     Fixed       Fixed       Fixed       Flexible
Shortcut reuse & fusion in HW  No          No          No          Yes
SRAM size (MB)                 31.2 [*]    18.8 [*]    38.3 [*]    5.2
DSP efficiency                 23.47%      21.85%      52.58%      56.14%
[*]: Ultra-RAM + BRAMs utilization

TABLE VI
EFFICIENTNET-B1 INFERENCE PERFORMANCE ON THE PROPOSED DESIGN
Resolution          256x256      512x512      768x768
FPGA board          KCU1500 (all resolutions)
Frequency           200 MHz (all resolutions)
LUTs/FFs (K)        264.1/375.7  264.5/375.5  271.7/375.4
DSPs                2176 (all resolutions)
BRAM18Ks            2594 (60%)   2723 (62%)   3845 (89%)
GOPS                317.1        267.4        274.4
DSP efficiency      19.37%       16.3%        16.75%
Off-chip FMs        0.19 MB      144 MB       344 MB
Total off-chip [*]  60.7 MB      216 MB       475 MB
Off-chip weights    Once (all resolutions)
Off-chip reduction  84.81%       29.2%        27.6%
Power (W)           21.09        23.76        26.71
GOPS/W              15.0         11.3         10.3
[*]: Total off-chip memory access if weights/inputs/outputs are accessed from DRAM exactly once.

C. Scalability and power efficiency for SOTA CNNs 
A larger input size leads to higher accuracy while causing an 
increase in the on-chip buffer requirement, DRAM access, and 
latency. Table VI presents the performance of the EfficientNet-B1 
inference with various high-resolution images to 
demonstrate the scalability of the proposed scheme. For 
example, with a 768768 input size, the total DRAM access if 
the inputs/outputs are accessed exactly once is 475 MB. Out of 
this, the proposed scheme requires only 344 MB for the feature-
maps access which results in a 27.6% reduction. The reduction 
is 29.2% and 84.81% for 512512 and 256256, respectively. 
The power of the accelerator is estimated as the sum of 
the FPGA-chip power and the DRAM power. The FPGA-chip 
power is calculated by the Xilinx Power Estimator with the signal 
switching frequency from RTL simulation. The DRAM access 
energy is estimated from the total DRAM access and the energy 
per access from [56]. The CPU used in the experiment is an Intel 
Xeon E3-1245 v5 at 3.5 GHz with OpenMP enabled. The GPUs are 
tested with CUDA 10.0 and cuDNN, and the GPU power is obtained 
from nvidia-smi. Fig. 18 shows a detailed comparison of the 
proposed work with CPUs/GPUs. The NVIDIA RTX 2080 Ti 
outperforms the other three CPUs/GPUs in terms of both speed and 
power efficiency. In terms of speed, this work is 2.23, 1.35, and 
1.45 times faster than the NVIDIA RTX 2080 Ti for 256x256, 512x512, 
and 768x768 input sizes, respectively. The reasons are the 
reduction of the DRAM access and more efficient hardware 
utilization for the residual block with SE optimization. In terms 
of power efficiency, this work is 6.3-7.9 times more power 
efficient than the NVIDIA RTX 2080 Ti. The DSP efficiency is 
low for EfficientNet (less than 20%) due to the low density of 
multiplications in depthwise convolution. However, compared 
to GPUs, which offer massive parallelism, the proposed 
design still shows a significant speed-up thanks to the 
reuse-aware static memory allocation and shared MAC design. 
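The power model described above can be sketched as follows. The function name, its parameters, and the idea of averaging the DRAM energy over the latency of one image are assumptions made for illustration; the per-access energy value from [56] is not reproduced, so the first example uses made-up inputs. The second part only rechecks the GOPS/W row of Table VI from the GOPS and Power rows.

```python
def total_power_w(fpga_power_w, dram_bytes_per_image, energy_per_byte_j, latency_s):
    """Accelerator power as FPGA-chip power plus average DRAM power.

    The DRAM power is taken as the off-chip access energy of one image
    divided by the time needed to process that image (an assumption).
    """
    dram_power_w = dram_bytes_per_image * energy_per_byte_j / latency_s
    return fpga_power_w + dram_power_w


# Made-up inputs: 1 MB of traffic, 1 nJ per byte, 1 ms per image.
print(total_power_w(20.0, 1e6, 1e-9, 1e-3))

# (resolution, GOPS, power in W) copied from Table VI.
table_vi = [
    ("256x256", 317.1, 21.09),
    ("512x512", 267.4, 23.76),
    ("768x768", 274.4, 26.71),
]
efficiency = {res: round(gops / power, 1) for res, gops, power in table_vi}
print(efficiency)
```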
VI. RELATED WORKS 
Hardware accelerators use various dataflows to increase 
resource utilization, for example, weight stationary [7], [8], [16], 
[28], [48], [53], output stationary [49], and row stationary [13], 
[50], [51]. MAESTRO [52] analyzes the energy-performance 
trade-off for the various dataflows above to choose an 
optimized one for a given CNN. These fixed dataflow designs 
result in a sub-optimal on-chip buffer size and off-chip memory 
access when running different layers of a CNN with different 
characteristics. FlexFlow [11] presents an optimization from 
the on-chip buffer to the PEs enabling the mixing of multiple 
parallelism types of the feature-maps, neurons, and synapses to 
boost resource utilization which is orthogonal to the proposed 
work. DNA [10] and SmartShuttle [12] propose layer-wise 
data reuse schemes that support switching between two 
of the three schemes: Input-Reuse, Output-Reuse, and 
Weight-Reuse. These works reduce the off-chip access efficiently 
compared to previous works that have a similar global buffer 
size. However, in these works, a larger buffer size (i.e., 512 KB) 
yields less benefit even though the accelerator supports a flexible 
tile size. 
In the literature, there are several compilers for scheduling an 
FPGA-based CNN accelerator, such as TVM [43], Xilinx 
ML-Suite [44], Intel DLA [45], and DNNVM [46]. TVM optimizes 
the standard CNN inference on the software side for a vanilla 
deep learning accelerator by overlapping the tensor 
computation with the memory load/store operations. However, 
the machine learning-based optimization for tuning each 
convolution layer has a considerable time cost, which is 
burdensome for very deep networks. DLA optimizes the CNN 
graph by adding a 1x1 identity layer and merging the element-wise 
addition into the previous layer. Finally, ML-Suite and DNNVM 
fuse many adjacent layers such as convolution, batch norm, 
ReLU, and pooling to reduce the off-chip access for intermediate 
data. Nevertheless, the lack of in-hardware flexible data 
reuse and shortcut reuse reduces the MAC efficiency. In contrast, 
the proposed hardware/software co-optimizer, while being fast, 
provides adaptive data reuse to minimize the off-chip memory 
access and improve the MAC efficiency even under on-chip 
buffer constraints. 
VII. CONCLUSION 
This paper presents a tool for FPGA-based CNN inference 
which uses a reuse-aware shortcut optimizer to minimize the 
latency and the off-chip memory access and to improve the MAC 
efficiency given the on-chip buffer constraints. Comprehensive 
comparisons to previous works demonstrate the efficiency of 
the proposed approach. In addition, the proposed work achieves 
superior performance compared to NVIDIA GPUs when 
running state-of-the-art Squeeze-and-Excitation-based CNNs 
such as EfficientNets/EfficientDet/MobileNet v3. 
[Fig. 18 panels: (a) Latency (ms) w/ different input size; (b) Power and power efficiency. Recovered legend entries: CPU (i7) w/ OpenMP, NVIDIA GTX 1080 Ti, NVIDIA Titan Xp]
REFERENCES 
[1] M. Tan, Q. V. Le, “EfficientNet: Rethinking Model Scaling for 
Convolutional Neural Networks,” [Online]. Available: 
arxiv.org/abs/1905.11946, 2020. 
[2] M. Tan, R. Pang, Q. V. Le, “EfficientDet: Scalable and Efficient Object 
Detection,” [Online]. Available: arxiv.org/abs/1911.09070v7, 2020. 
[3] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, 
M. Andreetto, H. Adam, “MobileNets: Efficient Convolutional Neural 
Networks for Mobile Vision Applications,” [Online]. Available: 
arxiv.org/abs/1704.04861, 2017. 
[4] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, 
“MobileNetV2: Inverted Residuals and Linear Bottlenecks,” in Proc. 
IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018. 
[5] X. Zhang, X. Zhou, M. Lin, J. Sun, “ShuffleNet: An Extremely Efficient 
Convolutional Neural Network for Mobile Devices,” in Proc. IEEE 
Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017. 
[6] J. Hu, L. Shen, G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE 
Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018. 
[7] M. Alwani, H. Chen, M. Ferdman, and P. Milder, “Fused-layer CNN 
accelerators,” in Proc. IEEE/ACM Int. Symp. Microarchitecture 
(MICRO), 2016. 
[8] A. Azizimazreah, L. Chen, “Shortcut Mining: Exploiting Cross-layer 
Shortcut reuse in DCNN Accelerator,” in Proc. IEEE Int. Symp. High Perf. 
Comput. Archit. (HPCA), 2019. 
[9] X. Chen, Y. Han, Y. Wang, “Communication lower bound in 
convolutional accelerator,” in Proc. IEEE Int. Symp. High Perf. Comput. 
Archit. (HPCA), 2020. 
[10] F. Tu, S. Yin, P. Ouyang, S. Tang, L. Liu, S. Wei, “DNA: Deep 
Convolutional Neural Network Architecture with Reconfigurable 
Computation Patterns,” in IEEE Trans. Very Large Scale Integr. (VLSI) 
Syst. , vol. 25, no. 8, pp. 2220-2233, 2017. 
[11] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, X. Li, “FlexFlow: A Flexible 
Dataflow Accelerator Architecture for Convolutional Neural Networks,” 
in Proc. IEEE Int. Symp. High Perf. Comput. Archit. (HPCA), 2017. 
[12] J. Li, G. Yan, W. Lu, S. Jiang, S. Gong, J. Wu, X. Li, “SmartShuttle: 
Optimizing Off-Chip Memory Accesses for Deep Learning Accelerators,” 
in Proc. Des. Auto. & Test in Europe Conf. (DATE), 2018. 
[13] Y.-H. Chen, T.-J Yang, J. Emer, V. Sze, “Eyeriss v2: A Flexible 
Accelerator for Emerging Deep Neural Networks on Mobile Devices,” in 
IEEE Journal of Emerg. Sel. Topics Circuits Syst. (JETCAS), vol. 9, no. 
2, pp. 292-308, 2019. 
[14] L. Bai, Y. Zhao, X. Huang, “A CNN accelerator on FPGA using 
depthwise Separable Convolution,” in IEEE Trans. Circuit Syst.-II: 
Express briefs, Vol. 65, No. 10, 2018. 
[15] Y. Guan, H. Lang, N. Xu, W. Wang, S. Shi, X. Chen, G. Sun, W. Zhang, 
J. Cong, “FP-DNN: An Automated Framework for Mapping Deep Neural 
Networks onto FPGAs with RTL-HLS Hybrid Templates,” in IEEE Annu. 
Int. Symp. Field-Programmable Custom Comput. Machine (FCCM), 
2017. 
[16] Y. Shen, T. Ji, M. Ferdman, P. Milder, “Argus: An End-to-End 
Framework for Accelerating CNNs on FPGAs,” in Proc. IEEE/ACM Int. 
Symp. Microarchitecture (MICRO), 2019. 
[17] Y. Chen, J. He, X. Zhang, C. Hao, D. Chen, “Cloud-DNN: An Open 
Framework for Mapping DNN Models to Cloud FPGAs,” in Proc. 
ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), 2019. 
[18] M. Blott, T. B. Preußer, N. J. Fraser, G. Gambardella, K. O’Brien, Y. 
Umuroglu, M. Leeser, K. Vissers, “FINN-R: An End-to-End 
Deep-Learning Framework for Fast Exploration of Quantized Neural 
Networks,” in ACM Trans. Reconfigurable Technol. Syst. (TRETS), vol. 
11, no. 3, Article 16, pp. 1-23, 2018. 
[19] X. Zhang, X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W. Hwu, D. Chen. 
“DNNBuilder: an Automated Tool for Building High-Performance DNN 
Hardware Accelerators for FPGAs,” in Proc. Int. Conf. Comput-Aided. 
Des. (ICCAD), 2018. 
[20] J. Redmon, A. Farhadi, “YOLO9000: Better, Faster, Stronger,” [Online]. 
Available: arxiv.org/abs/1612.08242. 
[21] J. Redmon, A. Farhadi, “YOLOv3: An Incremental Improvement,” 
[Online]. Available: arxiv.org/abs/1804.02767. 
[22] K. He, X. Zhang, S. Ren, J. Sun, “Deep residual learning for image 
recognition”, [Online]. Available: arxiv.org/abs/1512.03385. 
[23] D. T. Nguyen, T. N. Nguyen, H. Kim, and H.-J Lee. “A High-Throughput 
and Power-Efficient FPGA Implementation of YOLO CNN for Object 
Detection,” in IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 27, 
no. 8, pp. 1861-1873, 2019. 
[24] J. Redmon, “Darknet: An Open Source Neural Networks in C,” 2013. 
[25] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing 
FPGA-based accelerator design for deep convolutional neural networks,” 
in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), 
2015. 
[26] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou, “Dorefa-net: 
Training low bitwidth convolutional neural networks with low bitwidth 
gradients,” [Online]. Available: arxiv.org/abs/1606.06160. 
[27] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep 
neural networks with pruning, trained quantization and huffman coding,” 
[Online]. Available: arxiv.org/abs/1510.00149. 
[28] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. 
Bates, S. Bhatia, N. Boden, A. Borchers, et al., “In-datacenter performance 
analysis of a tensor processing unit,” in Proc. Int. Symp. Comput. 
Archit. (ISCA), 2017. 
[29] M. Abadi et al., TensorFlow: Large-scale machine learning on 
heterogeneous systems, 2015. Software available from tensorflow.org. 
[30] Paszke et al., PyTorch: An Imperative Style, High-Performance Deep 
Learning Library. Available: https://pytorch.org. 
[31] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long , “Caffe: 
Convolutional architecture for fast feature embedding”, in Proc. 22nd 
ACM Int. Conf. Multimed., 2014. 
[32] Ultrascale and Ultrascale+ FPGA, “Deep Learning with INT8 
Optimization on Xilinx Devices,” Xilinx, 2017. 
[33] S. Hadjis, K. Olukotun, “Tensorflow to cloud FPGA: Tradeoffs for 
accelerating Deep Neural Networks,” Proc. 22nd ACM Int. Conf. 
Multimed., 2019. 
[34] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, 
“Feature pyramid networks for object detection,” in Proc. IEEE Conf. 
Comput. Vis. Pattern Recognit. (CVPR), 2017. 
[35] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for 
instance segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern 
Recognit. (CVPR), 2018. 
[36] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, 
“Focal loss for dense object detection,” in Proc. Int. Conf. Comput. Vis. 
(ICCV), 2017. 
[37] S. Han, J. Pool, J. Tran, W. J. Dally, “Learning both weights and 
connections for efficient neural network,” [Online]. Available: 
arxiv.org/abs/1506.02626. 
[38] E. Park, D. Kim, S. Yoo, “Energy-efficient Neural Network Accelerator 
Based on Outlier-aware Low-precision Computation,” in Proc. Int. Symp. 
Comput. Archit. (ISCA), 2018. 
[39] TensorRT, June 2019. https://developer.nvidia.com/tensorrt. 
[40] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, 
Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, H. Adam, “Searching for 
MobileNetV3,” in Proc. Int. Conf. Comput. Vis. (ICCV), 2019. 
[41] https://ai.googleblog.com/2019/08/efficientnet-edgetpu-creating.html. 
[42] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, 
H. Esmaeilzadeh, “From high-level deep neural models to FPGAs,” in 
Proc. IEEE/ACM Int. Symp. Microarchitecture (MICRO), 2016. 
[43] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. 
Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “TVM: An 
automated end-to-end optimizing compiler for deep learning,” in USENIX 
Symp. Operating Syst. Design and Implementation (OSDI), 2018. 
[44] Xilinx, “Xilinx DNN processor: An inference engine, network compiler 
and runtime for Xilinx FPGAs.” in Hot Chips, 2018. 
[45] M. S. Abdelfattah, D. Han, A. Bitar, R. DiCecco, S. O'Connell, N. 
Shanker, J. Chu, I. Prins, J. Fender, A. C. Ling, G. R. Chiu, “DLA: 
Compiler and FPGA overlay for neural network inference acceleration,” 
in Proc. Int. Conf. Field Programmable Logic and Appl. (FPL), 2018. 
[46] Y. Xing , S. Liang, L. Sui, X. Jia, J. Qiu, X. Liu, Y. Wang, Y. Shan, and 
Y. Wang, “DNNVM: End-to-End Compiler Leveraging Heterogeneous 
Optimizations on FPGA-Based CNN Accelerators,” in IEEE Trans.  
Comput.-Aided Des. of Integr. Circuits Syst. (TCAD), 2020. 
[47] Z. Qin, Z. Zhang, D. Li, Y. Zhang, Y. Peng, “Diagonalwise 
Refactorization: An Efficient Training Method for Depthwise 
Convolutions,” [Online]. Available: arxiv.org/pdf/1803.09926, 2018. 
[48] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, 
S. Song et al., “Going deeper with embedded fpga platform for 
convolutional neural network,” in Proc. ACM/SIGDA Symp. Field-
Program. Gate Arrays (FPGA), 2016. 
[49] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, 
and O. Temam, “Shidiannao: Shifting vision processing closer to the 
sensor,” in Proc. Int. Symp. Comput. Archit.(ISCA), 2015. 
[50] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “Tetris: Scalable 
and efficient neural network acceleration with 3d memory,” in Proc. Int. 
Conf. Archit. Support Program. Languages Operating Syst. (ASPLOS), 
2017. 
[51] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: A spatial architecture for 
energy-efficient dataflow for convolutional neural networks,” in Proc. Int. 
Symp. Comput. Archit. (ISCA), 2016. 
[52] H. Kwon, P. Chatarasi, M. Pellauer, A. Parashar, V. Sarkar, and T. 
Krishna, “Understanding reuse, performance, and hardware cost of dnn 
dataflow: A data-centric approach,” in Proc. IEEE/ACM Int. Symp. 
Microarchitecture (MICRO), 2019. 
[53] Y. Shen, M. Ferdman, P. Milder, “Maximizing CNN Accelerator 
Efficiency Through resource partitioning,” in Proc. Int. Symp. Comput. 
Archit. (ISCA), 2017. 
[54] A. Parashar, P. Raina, Y. S. Shao, Y. Chen, V. A. Ying, A. Mukkara, R. 
Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, “Timeloop: A 
systematic approach to dnn accelerator evaluation,” in Proc. IEEE Int. 
Symp. Perf. Anal. Syst. Softw. (ISPASS), March 2019. 
[55] X. Yang, M. Gao, Q. Liu, J. Setter, J. Pu, A. Nayak, S. Bell, K. Cao, H. 
Ha, P. Raina, C. Kozyrakis, and M. Horowitz, "Interstellar: Using 
Halide’s Scheduling Language to Analyze DNN Accelerators," in Proc. 
Int. Conf. Archit. Support Program. Languages Operating Syst. 
(ASPLOS), Mar 2020. 
[56] K. T. Malladi, B. C. Lee, F. A. Nothaft, C. Kozyrakis, K. Periyathambi, 
and M. Horowitz, “Towards energy-proportional datacenter memory with 
mobile DRAM,” in Proc. Int. Symp. Comput. Archit. (ISCA), 2012. 
 
 
