Synetgy: Algorithm-hardware Co-design for ConvNet Accelerators on
  Embedded FPGAs by Yang, Yifan et al.
Yifan Yang1,2,∗, Qijing Huang1, Bichen Wu1, Tianjun Zhang1, Liang Ma3, Giulio Gambardella4,
Michaela Blott4, Luciano Lavagno3, Kees Vissers4, John Wawrzynek1, Kurt Keutzer1
1UC Berkeley; 2Tsinghua University; 3Politecnico di Torino; 4Xilinx Research Labs
{yifan-yang,qijing.huang,bichen,tianjunz,johnw,keutzer}@berkeley.edu;
{luciano.lavagno,liang-ma}@polito.it;{giuliog,mblott,keesv}@xilinx.com
ABSTRACT
Using FPGAs to accelerate ConvNets has attracted significant at-
tention in recent years. However, FPGA accelerator design has
not leveraged the latest progress of ConvNets. As a result, the key
application characteristics such as frames-per-second (FPS) are
ignored in favor of simply counting GOPs, and results on accu-
racy, which is critical to application success, are often not even
reported. In this work, we adopt an algorithm-hardware co-design
approach to develop a ConvNet accelerator called Synetgy and a
novel ConvNet model called DiracDeltaNet†. Both the accelerator
and ConvNet are tailored to FPGA requirements. DiracDeltaNet,
as the name suggests, is a ConvNet with only 1 × 1 convolutions
while spatial convolutions are replaced by more efficient shift oper-
ations. DiracDeltaNet achieves competitive accuracy on ImageNet
(88.7% top-5), but with 42× fewer parameters and 48× fewer OPs
than VGG16. We further quantize DiracDeltaNet’s weights to 4 bits
and activations to 4 bits, with less than 1% accuracy loss. These
quantizations exploit the nature of FPGA hardware well. In short,
DiracDeltaNet’s small model size, low computational OP count, low
precision and simplified operators allow us to co-design a highly
customized computing unit for an FPGA. We implement the com-
puting units for DiracDeltaNet on an Ultra96 SoC system through
high-level synthesis. Our accelerator’s final top-5 accuracy of 88.1%
on ImageNet is higher than that of all previously reported embedded
FPGA accelerators. In addition, the accelerator reaches an inference
speed of 96.5 FPS on the ImageNet classification task, surpassing
prior works with similar accuracy by at least 16.9×.
1 INTRODUCTION
ConvNets power state-of-the-art solutions on a wide range of com-
puter vision tasks. However, the high computational complexity
of ConvNets hinders their deployment on embedded and mobile
devices, where computational resources are limited. Using FPGAs
to accelerate ConvNets has attracted significant research attention
in recent years. FPGAs excel at low-precision computation, and
their adaptability to new algorithms lends itself to supporting
rapidly changing ConvNet models.
Despite recent efforts to use FPGAs to accelerate ConvNets, as
[1] points out, there still exists a wide gap between accelerator
architecture design and ConvNet model design. The computer vi-
sion community has been primarily focusing on improving the
accuracy of ConvNets on target benchmarks with only secondary
attention to the computational cost of ConvNets. As a consequence,
recent ConvNets have been trending toward more layers [2], more
complex structures [3, 4], and more complicated operations [5].
*Work done while interning at UC Berkeley.
† Source code and pre-trained model are available at https://github.com/Yang-YiFan/DiracDeltaNet.
On the other hand, FPGA accelerator design has not leveraged
the latest progress of ConvNets. Many FPGA designs still focus on
networks trained on CIFAR10 [6], a small dataset consisting of 32x32
thumbnail images. Such a dataset is usually used for experimental
purposes and is too small to have practical value. More recent
designs aim to accelerate inefficient ConvNets such as AlexNet
[7] or VGG16 [8], both of which have fallen out of use in state-of-
the-art computer vision applications. In addition, we observe that
in many previous designs, key application characteristics such as
frames-per-second (FPS) are ignored in favor of simply counting
GOPs, and accuracy, which is critical to applications, is often not
even reported.
Specifically, we see a gap between ConvNet architectures and
accelerator design in the following areas:
Inefficient ConvNet models: Many FPGA accelerators still target
older, inefficient models such as AlexNet and VGG16, which
require orders-of-magnitude greater storage and computational re-
sources than newer, efficient models that achieve the same accuracy.
With an inefficient model, an accelerator with high throughput in
terms of GOPs can actually have low inference speed in terms of
FPS, where FPS is the more essential metric of efficiency. To achieve
AlexNet-level accuracy, SqueezeNet [9] is 50x smaller than AlexNet;
SqueezeNext [10] is 112x smaller; ShiftNet-C [11], with 1.6% higher
accuracy, is 77x smaller. However, not many designs target those
efficient models. Additionally, techniques for accelerating older
models may not generalize to newer ConvNets.
ConvNet structures: Most ConvNets are structured solely for
better accuracy. Some ConvNets are structured for optimal GPU
efficiency, but few, if any, are designed for optimal FPGA efficiency.
For example, the commonly used additive skip connection [12]
alleviates the difficulty of training deep ConvNets and significantly
boosts accuracy. Despite its mathematical simplicity, the additive
skip connection is difficult to efficiently implement on FPGAs. Ad-
ditive skip connections involve adding the output data from a previ-
ous layer to the current layer, which requires either using on-chip
memory to buffer the previous layer’s output or fetching the output
from off-chip memory. Both options are inefficient on FPGAs.
ConvNet operators: ConvNet models contain many different
types of operators. Commonly used operators include 1×1, 3×3,
5×5 convolutions, 3×3 max-pooling, etc. More recent models also
contain depth-wise, group, dilated, and factorized convolutions.
Not all of these operators can be efficiently implemented on FPGAs.
If a ConvNet contains many different types of operators, one
must either allocate more dedicated compute units or make the
compute unit more general. Either solution can potentially lead to
high resource requirements, limited parallelism, and more complicated
control flow. Also, hardware development will require more
engineering effort.
arXiv:1811.08634v3 [cs.CV] 21 May 2019
Quantization: ConvNet quantization has been widely used to
convert weights and activations from floating point to low-precision
numbers to reduce the computational cost. However, many of the
previous methods are not practically useful for FPGAs due to the
following problems: 1) Quantization can lead to serious accuracy
loss, especially if the network is quantized to low precision numbers
(less than 4 bits). Accuracy is vital for many computer vision applica-
tions. Unfortunately, carefully reporting accuracy has not been the
norm in the FPGA community. 2) Many of the previously presented
quantization methods are only effective on large ConvNet models
such as VGG16, AlexNet, ResNet, etc. Since those models are known
to be redundant, quantizing those to low-precision is much easier.
We are not aware of any previous work tested on efficient models
such as MobileNet or ShuffleNet. 3) Many methods do not quantize
weights and activations directly to fixed point numbers. Usually,
quantized weights and activations are represented by fixed-point
numbers multiplied by some shared floating point coefficients. Such
a representation requires more complicated computation than purely
fixed-point operations, and is therefore more expensive.
In this work, we adopt an algorithm-hardware co-design ap-
proach to develop a ConvNet accelerator called Synetgy and a
novel ConvNet model called DiracDeltaNet. Both the accelerator
and the ConvNet are tailored to FPGAs and are optimized for Ima-
geNet classification accuracy and inference speed (in terms of FPS).
Our co-design approach produces a novel ConvNet architecture
DiracDeltaNet that is based on ShuffleNetV2 [13], one of the state-
of-the-art efficient models with small model size, low FLOP counts,
hardware friendly skip connections, and competitive accuracy. We
optimize the network by replacing all 3×3 convolutions with shift
operations [11] and 1×1 convolution, enabling us to implement a
compute unit customized for 1×1 convolutions for better efficiency.
The name “DiracDeltaNet” comes from the fact that the network
only convolves input feature maps with 1×1 kernels. Such kernel
functions can be seen as discrete 2D Dirac Delta functions. We
further quantize the network to 4-bit weights and 4-bit activations,
exploiting the strengths of FPGAs, with an accuracy drop of less
than 1%. In short, DiracDeltaNet’s small model size, low operation
count, low precision and simplified operators allow us to co-design
a highly customized and efficient FPGA accelerator. Furthermore,
the implementation only took two people working for one month
using High-Level Synthesis (HLS).
We trained DiracDeltaNet on ImageNet, implemented it on our
accelerator architecture, Synetgy, and deployed it on a low-cost FPGA
board (Ultra96). Our inference speed reaches 96.5 FPS, surpass-
ing previous works with similar accuracy by at least 16.9x. The
DiracDeltaNet on our accelerator architecture also achieves 88.1%
top-5 classification accuracy – the highest among all the previously
reported embedded FPGA accelerators.
2 BACKGROUND
2.1 Efficient ConvNet Models
For the task of image classification, improving accuracy on the
ImageNet [14] dataset has been the primary focus of the computer
vision community. For applications that are sensitive to accuracy,
even a 1% improvement in accuracy on ImageNet is worth doubling
or tripling model complexity. As a concrete example, ResNet152
[12] achieves 1.36% higher ImageNet accuracy than ResNet50 at
the cost of 3x more layers. In recent years, efficient ConvNet mod-
els have begun to receive more research attention. SqueezeNet [9]
is one of the early models focusing on reducing the parameter
size. While SqueezeNet is designed for image classification, later
models, including SqueezeDet [15] and SqueezeSeg [16, 17], extend
the scope to object detection and point-cloud segmentation. More
recent models such as MobileNet [18, 19] and ShuffleNet [13, 20]
further reduce model complexity. However, without a target com-
puting platform in mind, most models designed for “efficiency” can
only target intermediate proxies to efficiency, such as parameter
size or FLOP count, instead of focusing on more salient efficiency
metrics, such as speed and energy. Recent works also try to bring in
hardware insight to improve the actual efficiency. SqueezeNext[10]
uses a hardware simulator to adjust the macro-architecture of the
network for better efficiency. ShiftNet[11] proposes a hardware-
friendly shift operator to replace expensive spatial convolutions.
AddressNet[21] designed three shift-based primitives to accelerate
GPU inference.
2.2 ConvNet Quantization
ConvNet quantization aims to convert full-precision weights and
activations of a network to low-precision representations to reduce
the computation and storage cost. Early works [22, 23] mainly focus
on quantizing weights while still using full-precision activations.
Later works [24–27] quantize both weights and activations. Many
previous works [23–25] see serious accuracy loss if the network is
quantized to low precisions. Normally, an accuracy loss of more
than 1% is already considered significant. Also, in many works
[23, 26], quantized weights or activations are represented by low-
precision numbers multiplied with some floating point coefficients.
This can bring several challenges to hardware implementation. Last,
but not least, most of the previous works report quantization results
on inefficient models such as VGG, AlexNet, and ResNet. Given that
those models are redundant, quantizing them to lower precisions
is much easier. We have not yet seen any work which successfully
applies quantization to efficient models.
2.3 Hardware Designs
Most existing ConvNet hardware research has focused on improv-
ing the performance of either standalone 3×3 convolution layers or
a full-fledged, large ConvNet on large FPGA devices. [28] quantita-
tively studies the computation throughput and memory bandwidth
requirement for ConvNets. [29, 30] present their own optimiza-
tion for ConvNets based on analytical performance models. They
achieve high throughput on VGG16 using their proposed design
methodology with OpenCL. [31] designs convolution in frequency
domain to reduce the compute intensity of the ConvNet. They
demonstrate good power performance results on VGG16, AlexNet,
and GoogLeNet. [32] implements a ternary neural network on high-
end Intel FPGAs and achieves higher performance/Watt than Titan
X GPU. Most of the works mentioned above, and others [33–35],
target inefficient ConvNets on mid-range to high-end FPGA devices.
For compact ConvNets, [36] demonstrates a binary neural network
(BNN) FPGA design that performs CIFAR10 classification at
21906 frames per second (FPS) with 283 µs latency on a Xilinx ZC706
device. The BNN reports an accuracy of 80.1%. [37, 38] run the
BNN on a smaller device ZC7020. Although all three works achieve
promising frame rates, they have not implemented larger neural
networks for the ImageNet classification. It should be noted that
classification on CIFAR10 dataset is orders of magnitude simpler
than ImageNet, since CIFAR10 contains 100x fewer classes, 26x
fewer images, and 49x fewer pixels in each image. Networks trained
on the CIFAR10 dataset also have far lower complexity than
those trained on ImageNet. In comparison, networks for ImageNet
classification are closer to real-world applicability. [39] first
attempted to deploy VGG16 for ImageNet classification on the embedded
device ZC7020 and achieved a frame rate of 4.45 FPS. Later, [40]
improved the frame rate to 5.7 FPS. However, their frame rate was
relatively low for real-time image classification tasks. [39, 41, 42]
have achieved high frame rates on smaller devices; however, the
accuracy of their networks is not on par with [40] for ImageNet
classification.
3 CONVNET DESIGN
We discuss the ConvNet design in this section. The design of our
ConvNet incorporates the feedback from both the computer vision
applications and hardware accelerator design. Specifically, an ideal
ConvNet model for embedded FPGA acceleration should satisfy
the following aspects: 1) The network should not contain too many
parameters or FLOPs but should still maintain a competitive accu-
racy. 2) The network structure should be hardware friendly to allow
efficient scheduling. 3) The network’s operation set should be sim-
plified for efficient FPGA implementation. 4) The network’s weights
and activations should be quantized to low-precision fixed-point
numbers without much accuracy loss.
3.1 ShuffleNetV2
We select ShuffleNetV2-1.0x [13] as our starting point. ShuffleNetV2
is one of the state-of-the-art efficient models. It has a top-1 accuracy
of 69.4% on ImageNet (2% lower than VGG16), but contains only
2.3M parameters (60x smaller than VGG16) and 146M FLOPs (109x
smaller than VGG16).
The block-level structure of ShuffleNetV2 is illustrated in Fig.
1(a). The input feature map of the block is first split into two parts
along the channel dimension. The first branch of the network does
nothing to the input data and directly feeds the input to the output.
The second branch applies a 1×1 convolution, a 3×3 depth-wise
convolution, and another 1×1 convolution
to the input. Outputs of the two branches are then concatenated along
the channel dimension. Channel shuffle [20] is then applied to exchange
information between branches. In down-sampling blocks,
depth-wise 3×3 convolutions with a stride of 2 are applied to both
branches of the block to reduce the spatial resolution. 1×1 convolutions
are used to double the channel size of input feature maps.
These blocks are cascaded to build a deep ConvNet. We refer readers
to [13] for the macro-structure description of the ShuffleNetV2.
Figure 1: ShuffleNetV2 blocks vs. DiracDeltaNet blocks
(a) ShuffleNetV2 blocks [13].
(b) Our modified DiracDeltaNet blocks. We replace depth-wise convolutions
with shift operations. In the downsampling blocks, we use
stride-2 max-pooling and shift operations to replace stride-2 depth-wise
convolutions. We also double the filter number of the 1st 1×1
convolution on the non-skip branch in each module.
We select ShuffleNetV2-1.0x not only because of its small model
size and low FLOP count but also because it uses concatenative
skip connections instead of additive skip connections. Additive
skip connections, as illustrated in Fig. 2(a), were first proposed in
[12]. They effectively alleviate the difficulty of training deep neural
networks and therefore improve accuracy. They are widely used in
many ConvNet designs. However, additive skip connections are
not efficient on FPGAs. As illustrated in Fig. 2(a), both the skip and
the residual branches’ data need to be fetched on-chip to conduct
the addition. Though addition does not cost too much computation,
the data movement is expensive. Concatenative skip connections,
as illustrated in Fig. 2(b), were first proposed in [3]. They have a similar
positive impact on network training and accuracy. With concatenative
skip connections, data from the skip branch is already in
off-chip DRAM, so we can concatenate the two branches simply
by writing the residual branch data next to the skip branch data.
This avoids the extra memory access in additive skip connections
and alleviates the memory bandwidth pressure.
Figure 2: Additive Skip Connections vs. Concatenative Skip
Connections. (a) Additive skip connection; (b) Concatenative skip
connection. Rectangles represent data tensors.
3.2 DiracDeltaNet
Based on ShuffleNetV2, we build DiracDeltaNet through the follow-
ing modifications: 1) we replace all the 3×3 convolutions with shift
and 1×1 convolutions; 2) we reduce the kernel size of max-pooling
from 3×3 to 2×2; 3) we modify the order of channel shuffle.
We replace all of the 3×3 convolutions and 3×3 depth-wise convo-
lutions with shift operations and 1×1 convolutions. The motivation
is that smaller convolution kernel sizes require less reuse of the
feature map, resulting in simpler data movement scheduling,
control flow, and timing constraints. As pointed out by [11], ConvNets
rely on spatial convolutions (3×3 convolutions and 3×3 depth-wise
convolutions) to aggregate spatial information from neighboring
pixels to the center position. However, spatial convolutions can be
replaced by a more efficient operator called shift. The shift operator
aggregates spatial information by copying nearby pixels directly to
the center position. This is equivalent to shifting one channel of
the feature map in a certain direction. When we shift different
channels in different directions, the output feature map’s channels
will encode all the spatial information. A comparison between 3×3
convolution and shift is illustrated in Fig. 4. A module containing a
shift and 1×1 convolution is illustrated in Fig. 5.
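The shift operator described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function name `shift_channels`, the zero padding at the borders, and the way directions are assigned to channels are assumptions made for illustration. The five offsets (identity plus the four cardinal directions) follow the description in this section.

```python
def shift_channels(fmap, directions):
    """Shift each channel of a feature map by its own (dy, dx) offset.

    fmap: list of C channels, each an H x W list of lists.
    directions: list of C (dy, dx) offsets.
    out[c][y][x] = fmap[c][y + dy][x + dx], with zeros outside the map
    (zero padding is an assumption of this sketch).
    """
    out = []
    for channel, (dy, dx) in zip(fmap, directions):
        h, w = len(channel), len(channel[0])
        shifted = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                sy, sx = y + dy, x + dx
                if 0 <= sy < h and 0 <= sx < w:
                    shifted[y][x] = channel[sy][sx]
        out.append(shifted)
    return out


# Identity plus the four cardinal directions, one per channel.
DIRECTIONS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

# Five identical 3x3 channels make the effect easy to see.
fmap = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]] for _ in DIRECTIONS]
shifted = shift_channels(fmap, DIRECTIONS)

# After the shift, the center position of each channel holds a different
# neighbor of the original center, so a following 1x1 convolution over
# the channel dimension can mix spatial information.
center = [channel[1][1] for channel in shifted]
print(center)  # [5, 2, 8, 4, 6]
```

No multiplications are involved: the operator only moves data, which is why it is cheap compared with a spatial convolution.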
For 3×3 depth-wise convolutions, we directly replace them with
shift operations, as shown in Fig. 1(b). This direct replacement
can lead to some accuracy loss. To mitigate this, we double the
output filter number of the first 1×1 convolution on the non-skip
branch in Fig. 1(b). Nominally, doubling the output channel size
increases both FLOP count and parameter size by a factor of 2.
However, getting rid of 3×3 convolutions allows us to design a com-
puting unit customized for 1×1 convolutions with higher execution
efficiency than a comparable unit for 3×3 depth-wise convolutions.
In the downsample block, we directly replace the strided 3×3 depth-
wise convolutions with a stride-2 2×2 max-pooling. Unlike [11], our
shift operation only uses 4 cardinal directions (up, down, left, right)
in addition to the identity mapping (no-shift). This simplifies our
hardware implementation of the shift operation without hurting
accuracy.
The first stage of ShuffleNetV2 consists of a 3×3 convolution
with a stride of 2 and filter number of 24. It is then followed by a
3×3 max-pooling with a stride of 2. We replace these two layers with a
module consisting of a series of 1×1 convolution, 2×2 max-pooling,
and shift operations, as shown in Table 1. Compared with the original
3×3 convolution, our proposed module has more parameters
(2144 vs 648) and FLOPs (30.5M vs 8.1M). But the implementation
and execution cost of the proposed first stage is negligible compared
to a 3×3 convolution layer. After training the network, we
find that this module gives nearly the same accuracy as the original
3×3 convolution module. With our new module, we can eliminate
the remaining 3×3 convolutions from our network, enabling us to
allocate more computational resources to 1×1 convolutions, and
thereby increasing parallelism and throughput.
Figure 3: Transpose Based Shuffle (ShuffleNetV2) vs. Our
HW Efficient Shuffle (DiracDeltaNet). (a) Transpose based channel
shuffle; (b) Our channel shuffle.
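The parameter and FLOP counts quoted for the first stage can be checked with a short calculation. The sketch below assumes that biases are ignored, that one multiply-accumulate is counted as one FLOP, that the original stride-2 3×3 convolution produces a 112×112 output from a 224×224 input, and that the layer sizes are those of Conv1 and Conv2 in Table 1. The helper name `conv_cost` is hypothetical.

```python
def conv_cost(kernel, in_ch, out_ch, out_size):
    """Return (parameters, multiply-accumulates) of a square convolution.

    Biases are ignored; out_size is the output feature map width/height.
    """
    params = kernel * kernel * in_ch * out_ch
    macs = params * out_size * out_size
    return params, macs


# Original first stage: 3x3 convolution, 3 -> 24 channels, 112x112 output.
orig_params, orig_macs = conv_cost(3, 3, 24, 112)

# Proposed first stage: Conv1 (1x1, 3 -> 32 at 224x224) and
# Conv2 (1x1, 32 -> 64 at 112x112). Pooling and shift have no weights.
conv1_params, conv1_macs = conv_cost(1, 3, 32, 224)
conv2_params, conv2_macs = conv_cost(1, 32, 64, 112)
new_params = conv1_params + conv2_params
new_macs = conv1_macs + conv2_macs

print(orig_params, round(orig_macs / 1e6, 1))  # 648 8.1
print(new_params, round(new_macs / 1e6, 1))    # 2144 30.5
```

Under these assumptions the numbers match the 2144 vs 648 parameters and 30.5M vs 8.1M FLOPs stated in the text.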
In addition to replacing all 3×3 convolutions, we also reduce
the max-pooling kernel size from 3×3 to 2×2. By using the same
pooling kernel size as the stride, we eliminate the need to buffer
extra data on the pooling kernel boundaries, thereby achieving
better efficiency. Our experiments also show that reducing the
max-pooling kernel size does not impact accuracy.
We also modify the channel shuffle’s order to make it more
hardware efficient. ShuffleNetV2 uses transpose operation to mix
channels from two branches. This is illustrated in Fig. 3(a), where
blue and red rectangles represent channels from different branches.
The transpose based shuffling is not hardware friendly since it
breaks the contiguous data layout. Performing channel shuffle in
this manner will require multiple passes of memory read and write.
We propose a more efficient channel shuffle, shown in Fig. 3(b).
We perform a circular shift to the feature map along the channel
dimension. We can have the same number of channels exchanged
between two branches while preserving the contiguity of the feature
map and minimizing the memory accesses.
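The difference between the two shuffles can be sketched on a list of channel labels. This is an illustration only: the shift amount of one quarter of the channels is an assumption chosen so that each half ends up with an equal mix of both branches, and the function names are hypothetical.

```python
def transpose_shuffle(channels):
    """Interleave the two halves, as a transpose based shuffle does."""
    half = len(channels) // 2
    mixed = []
    for a, b in zip(channels[:half], channels[half:]):
        mixed.extend([a, b])
    return mixed


def circular_shift_shuffle(channels, shift):
    """Rotate the channel list by `shift` positions.

    The result is two contiguous runs of the input, so it can be
    produced with two block copies instead of per-channel scatter.
    """
    shift %= len(channels)
    if shift == 0:
        return list(channels)
    return channels[-shift:] + channels[:-shift]


def skip_count(channels):
    """Count channels that came from the skip branch."""
    return sum(1 for c in channels if c.startswith("skip"))


# Four channels from the skip branch followed by four from the residual one.
channels = [f"skip{i}" for i in range(4)] + [f"res{i}" for i in range(4)]

# Assumed shift amount: a quarter of the total channel count.
shuffled = circular_shift_shuffle(channels, len(channels) // 4)
first_half, second_half = shuffled[:4], shuffled[4:]
print(shuffled)
print(skip_count(first_half), skip_count(second_half))  # 2 2
```

Both shuffles leave each half of the output with two skip channels and two residual channels, but only the circular shift keeps the channels in contiguous runs.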
We name the modified ShuffleNetV2-1.0x model DiracDeltaNet.
The name comes from the fact that our network only contains 1×1
convolutions. With a kernel size of 1, the kernel functions can be
seen as discrete 2D Dirac Delta functions. DiracDeltaNet’s macro-structure
is summarized in Table 1. Stages 2, 3, and 4 consist of chained
DiracDeltaNet blocks depicted in Fig. 1 with different feature map
size, channel size and stride. We adopt the training recipe and
hyperparameters described in [13]. We train DiracDeltaNet for 90
epochs with linear learning rate decay, an initial learning rate of
0.5, a batch size of 1024, and a weight decay of 4e-5. A comparison between
ShuffleNetV2-1.0x and our DiracDeltaNet is summarized in Table 2.
3.3 ConvNet Quantization
To further reduce the cost of DiracDeltaNet, we apply quantization
to convert floating point weights and activations to low-precision
integer values. For network weights, we follow DoReFa-Net [25] to
quantize full-precision weights as
$$ w_k = 2\,Q_k\!\left(\frac{\tanh(w)}{2\max(|\tanh(w)|)} + 0.5\right) - 1. \tag{1} $$
Table 1: Macro-structure of DiracDeltaNet
Layer        Output size   Kernel size   Stride   #Repeat   Output channel
Image        224           -             -        -         3
Conv1        224           1             1        1         32
Maxpool      112           2             2        1         32
shift        112           3             1        1         32
Conv2        112           1             1        1         64
Maxpool      56            2             2        1         64
shift        56            3             1        1         64
Stage 2      28            -             2        1         128
             28            -             1        3         128
Stage 3      14            -             2        1         256
             14            -             1        7         256
Stage 4      7             -             2        1         512
             7             -             1        3         512
Conv5        7             1             1        1         1024
GlobalPool   1             7             -        1         1024
FC           -             -             -        1         1000
Table 2: ShuffleNetV2-1.0x vs. DiracDeltaNet
Model               MACs   #Params   Top-1 acc   Top-5 acc
ShuffleNetV2-1.0x   146M   2.3M      69.4%       -
DiracDeltaNet       330M   3.3M      68.9%       88.7%
Figure 4: 3×3 Convolution vs. Shift. (a) 3×3 convolution; (b) Shift.
In 3×3 convolutions, pixels in a 3×3 region are aggregated to compute
one output pixel at the center position. In the shift operation, a
neighboring pixel is directly copied to the center position.
Figure 5: Using shift and 1×1 convolutions to replace 3×3
convolutions. This figure is from [11].
Here, $w$ denotes the latent full-precision weight of the convolution
kernel. $Q_k(\cdot)$ is a function that quantizes its input in the range of
$[0, 1]$ to its nearest neighbor in $\{ \frac{i}{2^k - 1} \mid i = 0, \dots, 2^k - 1 \}$.
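Equation (1) can be sketched in a few lines of plain Python. This is an illustrative forward pass only, not the training code: it assumes the maximum in Eq. (1) is taken over the weights passed in, that at least one weight is non-zero, and it uses Python's built-in `round` (which rounds exact ties to the even neighbor) for the nearest-neighbor step. The function names are hypothetical.

```python
import math


def quantize_unit(x, k):
    """Q_k: map x in [0, 1] to the nearest value i / (2^k - 1)."""
    levels = 2 ** k - 1
    return round(x * levels) / levels


def quantize_weights(weights, k):
    """Apply Eq. (1) to a list of latent full-precision weights.

    Assumes at least one weight is non-zero, otherwise max_abs is 0.
    """
    squashed = [math.tanh(w) for w in weights]
    max_abs = max(abs(v) for v in squashed)
    return [2 * quantize_unit(v / (2 * max_abs) + 0.5, k) - 1 for v in squashed]


latent = [-0.8, -0.1, 0.0, 0.3, 0.8]
quantized = quantize_weights(latent, k=4)
print(quantized)
```

With k = 4, every output lies on a grid of 16 evenly spaced values in [-1, 1], and the weights with the largest magnitude map to -1 and 1.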
We follow PACT [26] to quantize each layer’s activation as
$$ y^l = \mathrm{PACT}(x^l) = \frac{|x^l| - \big|x^l - |\alpha^l|\big| + |\alpha^l|}{2}, \qquad y^l_k = Q_k\!\left(\frac{y^l}{|\alpha^l|}\right) \cdot |\alpha^l|. \tag{2} $$
$x^l$ is the activation of layer $l$. $\mathrm{PACT}(\cdot)$ is a function that clips the
activation $x^l$ to the range $[0, |\alpha^l|]$. $\alpha^l$ is a layer-wise trainable
upper bound, determined by the training of the network. It is
observed that during training $\alpha^l$ can sometimes become negative,
which affects the correctness of the PACT [26] function. To
ensure the upper bound is always positive and to increase training
stability, we use the absolute value of the trainable parameter $\alpha^l$
rather than its original value. $y^l$ is the clipped activation from
layer $l$, and it is further quantized to $y^l_k$, a $k$-bit activation tensor.
Note that activations from the same layer share the same floating
point coefficient $\alpha^l$, but activations from different layers can have
different coefficients. This is problematic for the concatenative skip
connection: if the coefficients $\alpha^l$ and $\alpha^{l-1}$ are different, we need
to first cast $y^{l-1}_k$ and $y^l_k$ from fixed-point to floating point,
re-calculate a coefficient for the merged activation, and quantize it
again to new fixed-point numbers. This process is very inefficient.
In our experiment, we notice that most of the layers in the
DiracDeltaNet have coefficients with similar values. Therefore, we
rewrite equation (2) as
$$ y^l_k = Q_k\!\left(\frac{y^l}{|\alpha^l|}\right) \cdot |s|, \tag{3} $$
where $s$ is a coefficient shared by the entire network. This step
ensures that activations from different layers of the network are
quantized and normalized to the same scale of $[0, |s|]$. As a result, we
can concatenate activations from different layers directly without
extra computation. Moreover, by using the same coefficient $s$ across
the entire network, the convolution can be computed completely
via fixed-point operations. The coefficient $s$ can either be fixed
before training or left trainable. A general rule is that $s$ should have
a value similar to the $\alpha^l$ of the different layers. Otherwise, if $s/\alpha^l$ is
either too small or too large, it can cause gradient vanishing or
exploding problems in training, which leads to worse accuracy of
the network.
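The effect of the shared coefficient can be sketched as follows. The code is a scalar, inference-style illustration under stated assumptions: the clipping follows Eq. (2) with the absolute value of the layer-wise bound, the rescaling follows Eq. (3), and the function names, the example bounds (2.0 and -3.0), and the value of `s` are made up for the example.

```python
def pact_clip(x, alpha):
    """Clip x to [0, |alpha|] using the closed form of Eq. (2)."""
    a = abs(alpha)
    return (abs(x) - abs(x - a) + a) / 2


def act_quant(x, alpha, s, k=4):
    """Return (integer code, dequantized value) following Eq. (3).

    The code is in 0 .. 2^k - 1; the value lies in [0, |s|] regardless
    of the layer-wise bound alpha.
    """
    levels = 2 ** k - 1
    clipped = pact_clip(x, alpha)
    code = round(clipped / abs(alpha) * levels)
    return code, code / levels * abs(s)


s = 2.5  # hypothetical network-wide coefficient

# Two layers with different bounds; the second one is negative to show
# that only its absolute value matters.
layer_a = [act_quant(x, alpha=2.0, s=s) for x in (-1.0, 5.0)]
layer_b = [act_quant(x, alpha=-3.0, s=s) for x in (-1.0, 5.0)]
print(layer_a)  # [(0, 0.0), (15, 2.5)]
print(layer_b)  # [(0, 0.0), (15, 2.5)]
```

Because both layers map their integer codes to the same range [0, |s|], the codes can be concatenated directly, which is the property the concatenative skip connection relies on.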
In our network, we merge the PACT function and activation
quantization into one module and name it ActQuant. The input
to ActQuant is the output of 1×1 convolutions. Since the input
and weight of the convolution are both quantized into fixed-point
integers, the output is also an integer. ActQuant can therefore be implemented
as a look-up-table whose parameters are determined during training
and fixed during inference.
We follow [27] to quantize the network progressively from full-
precision to the desired low-precision numbers. The process is
illustrated in Fig. 6, where x-axis denotes bit-width of weights and
y-axis denotes the bit-width of activations. We start from the full-
precision network, train the network to convergence, and follow a
path to progressively reduce the precision for weights or activations.
At each point, we fine-tune the network for 50 epochs with step
learning rate decay. Formally, we denote each point in the grid
as a quantization configuration $C_{w,a}(N_w)$. Here $w$ represents the
bitwidth of weight, $a$ is the bitwidth of activation, and $N_w$ is the network
containing the quantized parameters. The starting configuration
would be the full precision network $C_{32,32}(N_{32})$. Starting from this
configuration, one can either go down to quantize the activation or
go right to reduce the bitwidth of weight. More aggressive steps
can be taken diagonally or even across several grids. The two-stage
and progressive optimization methods proposed in [27] can be
represented as two paths in Fig. 6.
Figure 6: Quantization Grid. The x-axis is the weight bit-width
(from 32 down to 1) and the y-axis is the activation bit-width.
Legend: Ours, 2-Stage [26], Progressive [26].
Table 3: Quantization Result on DiracDeltaNet
            full    w4a4
Top-1 Acc   68.9%   68.3%
Top-5 Acc   88.7%   88.1%
In our work, we start from $C_{32,32}(N_{32})$. Then we use $N_{32}$ to
initialize $N_{16}$ and obtain $C_{16,16}(N_{16})$. We then fine-tune $N_{16}$
with step learning rate decay to recover the accuracy lost to
quantization. After several epochs of fine-tuning, we get the desired
low-precision configuration $C_{16,16}(N'_{16})$ with no accuracy loss.
Following the same procedure, we continue diagonally through
the quantization grid to $C_{4,4}(N_4)$ with less than 1% top-5 accuracy
loss compared to its full precision counterpart.
We use a pre-trained ResNet50 label-refinery [43] to boost the
accuracy of the quantized model. Even with such low-precision
quantization, our quantized model still preserves a very competitive
top-5 accuracy of 88.1%. Most of the previous quantization works
[25–27] are only effective on large models such as VGG16, AlexNet
or ResNet50. Our quantization result is summarized in Table 3.
4 HARDWARE DESIGN
As mentioned in Section 3.2, we aggressively simplified ShuffleNetV2’s
operator set. Our modified network is mainly composed of the fol-
lowing operators:
• 1 × 1 convolution
• 2 × 2 max-pooling
• shift
• shuffle and concatenation
Our accelerator, Synetgy, is tailored to only support the operators
above. This allows us to design more specialized compute units with
simpler control, which enables us to further improve the hardware
efficiency. The compute of the fully-connected layer can be mapped
onto our convolution unit. The shuffle operation is not fully supported
on the FPGA; a CPU-based memory copy is needed to maintain the
memory layout. The remaining average-pooling layer, which is not
supported on the FPGA, is offloaded to the ARM processor on the
SoC platform.
Table 4: Notations
Notation   Type           Description
WIDTH      variable       width of feature map
HEIGHT     variable       height of feature map
IC_TOTAL   variable       total input channel size
OC_TOTAL   variable       total output channel size
IC         constant: 32   parallelism on input channel dimension
OC         constant: 32   parallelism on output channel dimension
The benefits of the simplified operator set come from the algorithm-
hardware co-design, which also increases the productivity of hardware
implementation. The accelerator implementation only took
two people working for one month using HLS.
4.1 The accelerator architecture
Fig. 7 shows the overall accelerator architecture design. Our accel-
erator, highlighted in light yellow, can be invoked by the CPU for
computing one 1×1 Conv-Pooling-Shift-Shuffle subgraph at a time.
The CPU provides supplementary support to the accelerator. Both
the FPGA and the CPU are used to run the network.
Figure 7: Accelerator Architecture. Labeled blocks: CPU, DDR DRAM,
Controller, Weights On-Chip Buffer, 1 x 1 Conv Units (32, 32),
Conversion, Pooling, Shift, Shuffle; streams: in_fmap_stream,
out_fmap_stream.
In quantized DiracDeltaNet, weights are 4-bit, input and output
activations are 4-bit, and the largest partial sum is 17-bit. The width
of the partial sum is determined by the input feature bit width and the
largest channel size. Given that the largest channel size is 512, there
are 2^4 × 2^4 × 512 = 2^17 possible outcomes from the convolution, which
requires 17 bits to represent.
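The bit-width argument above can be checked with a few lines of Python. This is only an illustrative sketch; the function name partial_sum_bits is hypothetical and not part of the accelerator code.

```python
def partial_sum_bits(weight_bits: int, act_bits: int, channels: int) -> int:
    """Bits needed to represent the number of possible accumulation outcomes."""
    # 2^weight_bits * 2^act_bits product values, accumulated over `channels` inputs
    outcomes = (2 ** weight_bits) * (2 ** act_bits) * channels
    # Number of bits needed to index `outcomes` distinct values
    return (outcomes - 1).bit_length()


bits = partial_sum_bits(weight_bits=4, act_bits=4, channels=512)
print(bits)
```

With 4-bit weights, 4-bit activations, and 512 channels this gives the 17 bits quoted in the text.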
4.1.1 Dataflow Architecture. Our hardware design is based on the
dataflow architecture template [44, 45]. As illustrated in Fig. 7, we
first extract a few process functions from the major operations
including 1×1 convolution, 2×2 max-pooling, shift, shuffle, and the
memory load and store. We then chain them together using FIFOs.
Reads from a FIFO are blocking; writes are non-blocking until the
FIFO is full, at which point they block. All the process functions are running
concurrently and the execution of each function is triggered by the
arrival of the data. Therefore, more task-level parallelism can be
explicitly exposed to the HLS tool in addition to the instruction-
level parallelism.
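As a software analogy of this dataflow template (a sketch only, not the HLS implementation), each process function can be modeled as a thread connected to its neighbors by bounded queues. The names stage and run_pipeline, the FIFO depth, and the end-of-stream marker are all assumptions made for illustration.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker (an assumption of this sketch)


def stage(fn, fifo_in, fifo_out):
    """One process function: fires whenever data arrives on its input FIFO."""
    while True:
        item = fifo_in.get()           # blocking read
        if item is SENTINEL:
            fifo_out.put(SENTINEL)
            return
        fifo_out.put(fn(item))         # blocks only when the FIFO is full


def run_pipeline(fns, inputs, depth=2):
    """Chain the functions with bounded FIFOs and run them concurrently."""
    fifos = [queue.Queue(maxsize=depth) for _ in range(len(fns) + 1)]
    workers = [
        threading.Thread(target=stage, args=(fn, fifos[i], fifos[i + 1]))
        for i, fn in enumerate(fns)
    ]

    def feed():
        for x in inputs:
            fifos[0].put(x)
        fifos[0].put(SENTINEL)

    workers.append(threading.Thread(target=feed))
    for t in workers:
        t.start()

    results = []
    while True:
        item = fifos[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in workers:
        t.join()
    return results


# Two toy stages standing in for, e.g., conv and pooling
out = run_pipeline([lambda x: x + 1, lambda x: x * 2], range(5))
```

Each stage runs as soon as its input FIFO has data, so the stages overlap in time, which is the task-level parallelism the template exposes.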
4.1.2 Convolution Unit. The notations used in this section are
listed in Table 4. As shown in Fig. 8, given an input feature map
of size WIDTH × HEIGHT × IC_TOTAL and a weight kernel of
size IC_TOTAL × OC_TOTAL, the generated output feature map is
of size WIDTH × HEIGHT × OC_TOTAL in 1×1 convolution. The
1×1 convolution is essentially a matrix-matrix multiplication.
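To make the equivalence concrete, the following pure-Python sketch computes a 1×1 convolution directly and as a (HEIGHT·WIDTH) × IC_TOTAL by IC_TOTAL × OC_TOTAL matrix product. The function names and the nested-list data layout are assumptions for illustration only.

```python
def conv1x1(fmap, weights):
    """Direct 1x1 convolution: fmap[h][w][ic], weights[ic][oc] -> out[h][w][oc]."""
    oc_total = len(weights[0])
    return [
        [
            [sum(px[ic] * weights[ic][oc] for ic in range(len(px)))
             for oc in range(oc_total)]
            for px in row
        ]
        for row in fmap
    ]


def matmul(a, b):
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a_row[k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for a_row in a
    ]


def conv1x1_as_matmul(fmap, weights):
    """Flatten the pixels into rows, multiply once, and restore the 2-D layout."""
    height, width = len(fmap), len(fmap[0])
    pixels = [px for row in fmap for px in row]      # (H*W) x IC_TOTAL
    flat = matmul(pixels, weights)                   # (H*W) x OC_TOTAL
    return [flat[h * width:(h + 1) * width] for h in range(height)]


fmap = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]          # 2 x 2 pixels, IC_TOTAL = 2
weights = [[1, 0, 2], [0, 1, 3]]                     # IC_TOTAL = 2, OC_TOTAL = 3
same = conv1x1(fmap, weights) == conv1x1_as_matmul(fmap, weights)
```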
[Figure 8 (diagram). Weights of size IC_TOTAL × OC_TOTAL applied as a 1x1
kernel to Input Feature Maps of size WIDTH × HEIGHT × IC_TOTAL, producing
Output Feature Maps of size WIDTH × HEIGHT × OC_TOTAL.]
Figure 8: 1×1 Convolution
Although [1] suggests a weight stationary dataflow for 1 × 1
convolution-dominant ConvNets, we find it not applicable to our
design, as the bit width of the weights is much smaller than that of the
partial sums (4 bits vs. 17 bits). Transferring the partial sums on and off-chip
will incur more traffic on the memory bus. Therefore, we adopt
the output stationary dataflow by retaining the partial sums in the
local register file until an output feature is produced.
Figure 9: Pseudo Code for Kernel Compute Scheduling
Fig. 9 shows how we schedule the workload onto the accelerator.
Note that the nested loops starting at lines 17 and 19 are automatically
unrolled. Weights are prefetched into the on-chip BRAM weight_buf.
We first block our inputs so that IC × OC multiplications can be mapped
onto the compute units at each iteration (lines 13∼21). In every
iteration, IC input features are fetched from the DRAM. They are
convolved with OC weight vectors of size IC to produce
OC partial sums. Each iteration of the loop nest along the input
channel dimension at line 12 takes 7 ∼ 38 cycles to finish based on
the Vivado HLS report. Equivalently, it takes 7 ∼ 38 cycles to finish
IC × OC 4-bit by 4-bit multiplications. The partial sums are stored in
registers, which can be accessed simultaneously in every cycle. The
parameters IC and OC were tuned for the area-performance tradeoff.
Increasing them increases overall resource utilization but helps to
reduce the total number of execution cycles.
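The blocking scheme described above can be sketched as a loop nest in Python. This is not the pseudo code of Fig. 9: the loop order, the function name, and the list-based data layout are assumptions, chosen only to show partial sums staying local until an output feature is complete.

```python
def conv1x1_output_stationary(pixels, weights, ic_tile=32, oc_tile=32):
    """pixels[p][ic], weights[ic][oc] -> out[p][oc], tiled by (ic_tile, oc_tile)."""
    ic_total, oc_total = len(weights), len(weights[0])
    out = []
    for px in pixels:                                  # one pixel at a time
        row = [0] * oc_total
        for oc0 in range(0, oc_total, oc_tile):        # block of output channels
            n_oc = min(oc_tile, oc_total - oc0)
            psum = [0] * n_oc                          # partial sums kept locally
            for ic0 in range(0, ic_total, ic_tile):    # walk the input channels
                ic1 = min(ic0 + ic_tile, ic_total)
                # One ic_tile x oc_tile block of multiply-accumulates;
                # in hardware these two loops would be fully unrolled.
                for o in range(n_oc):
                    for i in range(ic0, ic1):
                        psum[o] += px[i] * weights[i][oc0 + o]
            row[oc0:oc0 + n_oc] = psum                 # written out once, when complete
        out.append(row)
    return out


# Small deterministic check against the untiled definition
pixels = [[(3 * p + i) % 16 for i in range(6)] for p in range(4)]
weights = [[(i + 2 * o) % 16 for o in range(5)] for i in range(6)]
reference = [
    [sum(px[i] * weights[i][o] for i in range(6)) for o in range(5)]
    for px in pixels
]
tiled = conv1x1_output_stationary(pixels, weights, ic_tile=2, oc_tile=2)
```

Because each partial sum is only written after all input-channel blocks have been accumulated, no wide partial sums travel over the memory bus in this schedule.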
Based on the roofline model [46], when a kernel is bandwidth bound, the
attainable throughput is the compute-to-communication (CTC) ratio
multiplied by the bandwidth. The CTC ratio of our compute
unit for the input features is OC_TOTAL, which varies by layer (at most 512
in DiracDeltaNet). A larger output channel size
indicates a higher CTC ratio. According to our measurement, the
maximum bandwidth of the DDR channel is 6 GB/s, which means
6 × 2 Giga input features (1 byte contains two 4-bit features) can
be loaded per second. The theoretical memory-bound throughput should be
512 × 6 × 2 = 6144 GMACs = 12288 GOPs. For compute-bound
problems, the attainable throughput depends on the compute
capability. In our case, it is IC × OC × freq = 32 × 32 × 250 MHz =
256 GMACs = 512 GOPs. Based on the analysis, the convolution unit
will reach the bandwidth bound before it hits the computation
roofline.
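The two ceilings quoted above can be reproduced with a short calculation. The sketch below only restates the arithmetic in the text, counting input-feature traffic alone as the text does; the function names are hypothetical.

```python
def bandwidth_ceiling_gmacs(ctc_ratio, bandwidth_gb_s, features_per_byte=2):
    """Input-feature bandwidth ceiling: MACs per feature times features per second."""
    return ctc_ratio * bandwidth_gb_s * features_per_byte


def compute_ceiling_gmacs(ic, oc, freq_mhz):
    """Compute ceiling: IC x OC multiply-accumulates per cycle."""
    return ic * oc * freq_mhz / 1000.0


bw_gmacs = bandwidth_ceiling_gmacs(ctc_ratio=512, bandwidth_gb_s=6)
cp_gmacs = compute_ceiling_gmacs(ic=32, oc=32, freq_mhz=250)

# One MAC is counted as two operations
print(bw_gmacs, 2 * bw_gmacs)   # GMACs, GOPs for the bandwidth ceiling
print(cp_gmacs, 2 * cp_gmacs)   # GMACs, GOPs for the compute ceiling
```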
[Figure 10 (diagram). Memory layout of the Input Feature Maps
(WIDTH × HEIGHT × IC_TOTAL, addressed in groups of IC channels) and of the
Weights (IC_TOTAL × OC_TOTAL, addressed in IC × OC blocks, with
IC_TOTAL / IC and OC_TOTAL / OC blocks along each dimension).]
Figure 10: Input Layout in DRAM
4.1.3 Conversion Unit. The high bitwidth to low bitwidth conver-
sion is performed immediately after the kernel computation. It is a
step function with 16 intervals that converts a 17-bit partial sum to a
4-bit activation. The threshold values are different for each layer.
All of the read-only threshold values are stored in on-chip BRAMs.
An index number should be specified by the user function to select
which set of threshold values to use for the compute of the current
layer. In hardware, this unit is implemented by using 16 compara-
tors. They are mapped onto a binary tree structure to reduce the
circuit latency.
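A step function of this kind can be sketched in Python with a binary search over sorted thresholds, which mirrors the tree of comparators in spirit. The threshold values below are made up, and the sketch assumes 15 cut points to separate 16 intervals; the per-layer tables and the index argument are likewise illustrative assumptions.

```python
from bisect import bisect_right

# Hypothetical per-layer threshold tables (read-only, like the on-chip BRAM contents).
# 15 sorted cut points split the partial-sum range into 16 intervals (codes 0..15).
THRESHOLD_SETS = [
    [8 * k for k in range(1, 16)],     # layer index 0: 8, 16, ..., 120
    [100 * k for k in range(1, 16)],   # layer index 1: 100, 200, ..., 1500
]


def convert(partial_sum, layer_index):
    """Map a wide partial sum to a 4-bit activation code for the given layer."""
    thresholds = THRESHOLD_SETS[layer_index]
    # Binary search: the code is the number of thresholds <= partial_sum
    return bisect_right(thresholds, partial_sum)


codes = [convert(v, 0) for v in (0, 7, 8, 63, 120, 100000)]
```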
4.1.4 Pooling Unit. We adopt the line buffer design described in
[37] to implement the 2 × 2 max-pooling layer. For every iteration,
(WIDTH + 1) IC-deep pixels are first fetched into the line buffers.
Once the next pixel value is fetched, a 2 × 2 sliding window
is formed. Every 2 cycles, we compare the values in the 2 × 2
sliding window, output the largest one, and fetch the next 2 values.
It takes IC_TOTAL/IC iterations to finish the computation.
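The line-buffer idea can be illustrated for a single channel in Python. This is a sketch under simplifying assumptions (one value per pixel, even WIDTH and HEIGHT, row-major streaming); the function name is hypothetical and the code does not model cycle timing.

```python
from collections import deque


def maxpool2x2_stream(pixels, width):
    """Stream row-major pixels of one channel and yield 2x2 max-pooled values."""
    # Holding the most recent (width + 2) pixels is enough to see the window:
    # current pixel, its left neighbor, and the two pixels directly above them.
    line_buffer = deque(maxlen=width + 2)
    for index, value in enumerate(pixels):
        line_buffer.append(value)
        row, col = divmod(index, width)
        if row % 2 == 1 and col % 2 == 1:      # bottom-right corner of a window
            yield max(
                line_buffer[-1],               # (row,     col)
                line_buffer[-2],               # (row,     col - 1)
                line_buffer[-1 - width],       # (row - 1, col)
                line_buffer[-2 - width],       # (row - 1, col - 1)
            )


image = list(range(16))                        # a 4 x 4 image holding 0..15
pooled = list(maxpool2x2_stream(image, width=4))
```

The first output becomes available once WIDTH + 2 pixels have arrived, which matches the description of fetching WIDTH + 1 pixels and then one more to form the first window.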
4.1.5 Shift Unit. The line buffer design is also used for the shift
operation. In the shift unit, the input images are first padded with one
zero-valued pixel on each side along the width and height dimensions. (2 × (WIDTH +
2) + 2) pixels are then buffered and a 3 × 3 sliding window is
formed. The shift direction is different for different input channels.
It is calculated based on the input channel index. After initialization,
the unit is able to produce 1 output pixel per cycle.
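Functionally, the shift operation moves each channel by one pixel in a direction chosen from the 3 × 3 neighborhood, filling with zeros at the border. The sketch below shows this behavior in Python; the mapping from channel index to direction (channel index modulo 9) is an assumption for illustration, since the text only says the direction is computed from the channel index.

```python
# The nine (dy, dx) offsets of a 3x3 window; entry 4 is the center (no shift)
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def shift(fmap):
    """fmap[c][h][w] -> same shape; channel c reads from (h + dy, w + dx)."""
    shifted = []
    for channel, plane in enumerate(fmap):
        dy, dx = OFFSETS[channel % len(OFFSETS)]   # ASSUMED channel-to-direction rule
        height, width = len(plane), len(plane[0])
        shifted.append([
            [
                plane[h + dy][w + dx]
                if 0 <= h + dy < height and 0 <= w + dx < width
                else 0                              # zero padding outside the image
                for w in range(width)
            ]
            for h in range(height)
        ])
    return shifted


plane = [[1, 2], [3, 4]]
result = shift([plane] * 6)                         # six identical 2 x 2 channels
```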
4.1.6 Shuffle Unit. Shuffle is implemented by changing the ad-
dress offset of output features during the writeback phase. Since
the shuffle operation still requires us to concatenate the outputs
from the previous DiracDeltaNet block to the current DiracDeltaNet
block outputs, the CPU is used to copy the output from the previous
DiracDeltaNet unit to the shuffled address. The memory copy operation
should be done concurrently with the computation of the current
DiracDeltaNet unit.
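The address-offset idea can be sketched as follows, assuming a ShuffleNetV2-style shuffle with two groups in which the previous block's output forms the first group and the current output forms the second. The group count, the ordering of the two halves, and the function names are assumptions for illustration.

```python
def shuffled_index(channel, channels_per_group, groups=2):
    """Destination channel index after concatenation followed by channel shuffle."""
    group, position = divmod(channel, channels_per_group)
    return position * groups + group


def shuffle_by_address(prev_out, cur_out):
    """Write both halves straight to their shuffled destinations."""
    n = len(cur_out)
    if len(prev_out) != n:
        raise ValueError("both halves must have the same number of channels")
    dst = [None] * (2 * n)
    for k, value in enumerate(prev_out):   # done by CPU memory copy in the text
        dst[shuffled_index(k, n)] = value
    for k, value in enumerate(cur_out):    # done at accelerator writeback in the text
        dst[shuffled_index(n + k, n)] = value
    return dst


mixed = shuffle_by_address(["a0", "a1", "a2"], ["b0", "b1", "b2"])
```

No data is moved after it is written: the interleaving is achieved purely by where each channel lands.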
4.1.7 Fully Connected Unit. We don’t explicitly design a dedicated
unit to compute FC layer. Instead, we map the compute of FC layer
onto our existing hardware convolution unit. The feature map size
is 1 for the FC layer. Because the convolution unit only supports 4-bit
weights, the FC layer’s computation is mapped in a bit-serial-like
manner. The convolution unit processes each bit of the FC weight
iteratively and bit shift is done by configuring the step function in
the conversion unit.
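The arithmetic behind a bit-serial mapping can be sketched in Python: a wide weight is split into bit planes, each plane is multiplied like a narrow weight, and the results are combined with shifts. The sketch assumes unsigned weights of a given width and performs the shift in software, whereas the text applies it through the conversion unit; the function name is hypothetical.

```python
def bit_serial_dot(activations, weights, weight_bits):
    """Dot product computed one weight bit at a time (unsigned weights assumed)."""
    total = 0
    for bit in range(weight_bits):
        # Bit plane: each entry is 0 or 1, so it fits a narrow multiplier
        plane = [(w >> bit) & 1 for w in weights]
        partial = sum(a * p for a, p in zip(activations, plane))
        total += partial << bit            # weight the plane by 2^bit
    return total


activations = [3, 5, 7]
weights = [200, 17, 255]                   # made-up 8-bit weights
serial = bit_serial_dot(activations, weights, weight_bits=8)
direct = sum(a * w for a, w in zip(activations, weights))
```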
4.2 Software
We use the ARM processor to control the layer-based accelerator
and to compute the last 7 × 7 average-pooling layer that is not sup-
ported by the accelerator. The host application runs on a full Linux
system on the ARM CPU, which controls the memory-mapped accelerator
through the UIO driver interface. The Xilinx Python-based
PYNQ APIs [47] are used for fast deployment of the host software
code on the Ultra96 board.
4.3 Experimental Results
We implement our accelerator, Synetgy, on the Ultra96 development
board with Xilinx Zynq UltraScale+ MPSoC targeted at embedded
applications. Table 5 shows the overall resource utilization of our
implementation. We are able to utilize 34% of the total LUTs on
the FPGA, as the bit-level 4/4bit multiplications are mapped onto
LUTs. BRAMs are mainly used for implementing the FIFO channels.
DSPs are used for the address calculation for the AXI protocol.
Our implementation runs at 250 MHz. Power measurements are
obtained via a power monitor. We measured 5.3 W with no workload
running on the programmable logic side and 5.5 W max power on
the Ultra96 power supply line when running our network.
Table 5: Resource Usage
LUT FF BRAM DSP
24130 (34.2%) 29867 (21.2%) 170 (78.7%) 37 (10.3%)
We compare our accelerator against previous work in Table 6. As
explained before, ConvNets for ImageNet classification are usually
orders of magnitude more complex than CIFAR10 classification.
Therefore, we only compare accelerators targeting ConvNets for
ImageNet classification with reasonable accuracy. Our work focuses
on achieving competitive accuracy while improving the actual in-
ference speed in terms of frames per second. Our experiments
show that we successfully achieve those two goals. From the table,
we can make the following observations: 1) Synetgy achieves the
highest top-1 and top-5 accuracy on ImageNet. The only previous
work that comes close to our accuracy is [40], but its frame rate
is 16.9× slower than ours. 2) Among the embedded accelerators
whose top-1 accuracy is higher than 60%, which is a loose con-
straint, our model achieves the fastest inference speed. 3) Without
the accuracy constraint, the speed of [41, 42, 48] can go as fast as
864.7 frames per second. But their accuracy is rather low. 4) The
peak attainable throughput of our accelerator is 418 GOPs, which is
close to the theoretical compute roofline. Our average throughput
(47.09 GOPs) is currently limited by the low hardware utilization.
The inefficiency comes mainly from the software shuffle operations and
the first convolution layer, whose input channel dimension of 3 is
much smaller than the hardware tiling factor IC. However, Synetgy
still achieves competitive frame rate, demonstrating the efficacy
of our co-design methodology. We see the opportunity of signifi-
cant frame rate improvement through further algorithm-hardware
co-design.
The reported frame rate is achieved with batch size set to 16.
There is a fixed software overhead for invoking the poll-based hard-
ware accelerator. The computation latency of the DiracDelta Block1
in Table 9 is 0.15ms when the batch size is equal to 1. The latency
for a single read on the accelerator control register is 0.40ms, which
is greater than the actual compute time. In order to minimize this
software overhead, we increase the batch size to schedule more
computation running on the accelerator per invocation. Further-
more, the weights stored in on-chip BRAM get reused more when
batch size is increased. The frame rates of implementations with
different batch sizes are summarized in Table 7.
We break down the runtime of the whole heterogeneous system
by bypassing one component of the system at a time and measuring the
runtime. The result is shown in Table 8. The whole system runs at 95.9
FPS on ImageNet classification at a batch size of 10, including both
hardware PE execution and software execution of average pooling,
and shuffle. We see from the table that the CPU-based memory copy
for the shuffle operation significantly degrades the performance.
All other non-conv components impact the overall performance
slightly.
To further understand the efficiency of various operators (1×1
conv, 2×2 max-pooling, shift, and shuffle) implemented on FPGA
and CPU, we measure the runtime of the DiracDeltaNet blocks
with different configurations on Synetgy. The result is summarized
in Table 9. We test 2 blocks with different input feature map and
channel sizes. Note that the theoretical OPs of Block1 and Block2
are the same. As shown in the table, pooling and shift incur almost
no performance drop. This is because the process functions for
performing these operations do not impose new bottlenecks on the
dataflow pipeline. Software memory copy latency of shuffle is more
significant on Block1 than Block2. This is because memory copy
overhead is proportional to HEIGHT × WIDTH × OC_TOTAL, while the
total OPs, HEIGHT × WIDTH × IC_TOTAL × OC_TOTAL, remain
the same, which means that a smaller feature map needs less time
for memory copy. The memory copy overhead can possibly be
alleviated by running bare-metal C code on the CPU.
Table 6: Performance Comparison of Synetgy and Previous Works
                   VGG-SVD [39]  AlexNet [48]  VGG16 [49]  VGG16 [40]  DoReFa [42]  FINN-R [41]  Ours
Platform           Zynq XC7Z045  Stratix-V     Stratix-V   Zynq 7Z020  Zynq 7Z020   Zynq ZU3EG   Zynq ZU3EG
Frame Rate (fps)   4.5           864.7         3.8         5.7         106.0        200.0        96.5
Top-1 Acc          64.64%        42.90%        66.58%      67.72%      46.10%       50.3%        68.30%
Top-5 Acc          86.66%        66.80%        87.48%      88.06%      73.10%       N/A          88.12%
Precision          16b           16b           8-16b       8b          2b           1-2b         4-4b
Throughput (GOPs)  136.97        1963.96       117.80      123         410.22       400          47.09 (Overall), 418 (Peak)
Frequency (MHz)    150           150           120         214         200          220          250
Power (W)          3.0           26.2          19.1        3.0         2.3          10.2         5.5
Table 7: Frame Rate on Different Batch Size
Batch Size         1     2     4     8     10    16
Frame Rate (fps)   58.7  72.9  84.1  94.4  95.9  96.5
Table 8: Runtime Latency for Different Functional Parts of
the Whole System (Batch=10)
                   Runtime (ms)  Frame Rate (fps)
Overall            104.3         95.9
w/o sw avg pool    100.3         99.7
w/o fc             104.0         96.1
w/o PYNQ API call  104.2         96.0
w/o sw shuffle     70.4          142.1
hw only            65.7          152.2
Table 9: Runtime Analysis for the First and Last
DiracDeltaNet Blocks in Different Operator Configurations (Batch=10)
                   Block1  Block2
feature map size   28      7
in&out channel     128     512
Runtime (ms)
conv only          1.531   0.989
conv+pool          1.530   0.993
conv+shift         1.537   0.996
conv+shuffle       4.409   1.636
overall            4.364   1.441
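The Block1 and Block2 configurations in Table 9 give a quick check of this argument. The sketch below simply evaluates the two proportionality expressions from the text and compares them with the shuffle overhead implied by the table (conv+shuffle minus conv only); the function names are hypothetical.

```python
def conv1x1_macs(height, width, ic_total, oc_total):
    """Multiply-accumulates of a 1x1 convolution."""
    return height * width * ic_total * oc_total


def shuffle_copy_elems(height, width, oc_total):
    """Number of features the memory copy has to move."""
    return height * width * oc_total


# Feature map size and channel counts from Table 9
block1 = {"size": 28, "channels": 128}
block2 = {"size": 7, "channels": 512}

macs1 = conv1x1_macs(28, 28, 128, 128)
macs2 = conv1x1_macs(7, 7, 512, 512)
copy_ratio = shuffle_copy_elems(28, 28, 128) / shuffle_copy_elems(7, 7, 512)

# Shuffle overhead implied by Table 9: (conv+shuffle) - (conv only), in ms
measured_ratio = (4.409 - 1.531) / (1.636 - 0.989)

print(macs1 == macs2, copy_ratio, round(measured_ratio, 2))
```

The MAC counts are identical while Block1 copies four times as many features, which is in line with the roughly 4.4× larger shuffle overhead measured for Block1.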
5 CONCLUSION AND FUTURE WORK
In this paper, we adopt an algorithm-hardware co-design approach
to develop a ConvNet accelerator called Synetgy and a novel Con-
vNet model called DiracDeltaNet. Based on ShuffleNetV2, we opti-
mize the network’s operators by replacing all the 3×3 convolutions
with shift operations and 1×1 convolutions. This allows us to build
a compute unit exclusively customized for 1×1 convolutions for
better efficiency. We quantize the network’s weights to 4-bit and
activations to 4-bit fixed-point numbers with less than 1% accuracy
loss. These quantizations exploit the nature of FPGA hardware
well. As a result, DiracDeltaNet has a small parameter size, low
computational OPs, hardware-friendly skip connections, low preci-
sion, and simplified operators. These features allow us to implement
highly customized and efficient accelerators on FPGA. We implement
the network on an Ultra96 SoC system. The implementation
only took two people one month using HLS tools. Our accelera-
tor, Synetgy, achieves a top-5 accuracy of 88.1% on ImageNet, the
highest among all the previously published embedded FPGA accel-
erators. It also reaches an inference speed of 96.5 FPS, surpassing
prior works with similar accuracy by at least 16.9×. While we see
many more opportunities for further optimization, we believe this
demonstrates the efficacy of our co-design methodology.
In future work, we will focus on further optimization. For
example, we can add more layers in the dataflow architecture to
improve the compute-to-communication ratio. Correspondingly,
we will need to adjust the network such that the computation
subgraphs are more symmetric.
ACKNOWLEDGMENTS
We would like to thank all of the people who helped us realize
this project, especially the anonymous reviewers, Kostadin Ilov,
Rock Qu, Alessandro Pappalardo, Amir Gholaminejad, Peter Jin,
Ravi Krishna, and Alvin Wan. The information, data, or work pre-
sented herein was funded in part by the Advanced Research Projects
Agency-Energy (ARPA-E), U.S. Department of Energy, under Award
Number DE-AR0000849. The research was partially funded by
ADEPT Lab industrial sponsor Intel, and ADEPT Lab affiliates
Google, Siemens, and SK Hynix. The views and opinions of au-
thors expressed herein do not necessarily state or reflect those of
the United States Government or any agency thereof.
REFERENCES
[1] Kiseok Kwon, Alon Amid, Amir Gholami, Bichen Wu, Krste Asanovic, and Kurt
Keutzer. Co-design of deep neural nets and neural net accelerators for embedded
vision applications. arXiv preprint arXiv:1804.10642, 2018.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings
in deep residual networks. In European conference on computer vision, pages
630–645. Springer, 2016.
[3] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger.
Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
[4] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning
transferable architectures for scalable image recognition. arXiv:1707.07012, 2017.
[5] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convo-
lutions. arXiv preprint arXiv:1511.07122, 2015.
[6] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from
tiny images. Technical report, Citeseer, 2009.
[7] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In Advances in neural information
processing systems, pages 1097–1105, 2012.
[8] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for
large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[9] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J
Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer
parameters and< 0.5 mb model size. arXiv preprint arXiv:1602.07360, 2016.
[10] Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter Jin,
Sicheng Zhao, and Kurt Keutzer. Squeezenext: Hardware-aware neural network
design. arXiv preprint arXiv:1803.10615, 2018.
[11] Bichen Wu, Alvin Wan, Xiangyu Yue, Peter Jin, Sicheng Zhao, Noah Golmant,
Amir Gholaminejad, Joseph Gonzalez, and Kurt Keutzer. Shift: A zero flop, zero
parameter alternative to spatial convolutions. arXiv:1711.08141, 2017.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 770–778, 2016.
[13] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2:
Practical guidelines for efficient cnn architecture design. arXiv:1807.11164, 2018.
[14] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A
large-scale hierarchical image database. In Computer Vision and Pattern Recogni-
tion, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
[15] Bichen Wu, Forrest N Iandola, Peter H Jin, and Kurt Keutzer. Squeezedet: Uni-
fied, small, low power fully convolutional neural networks for real-time object
detection for autonomous driving. In CVPR Workshops, pages 446–454, 2017.
[16] Bichen Wu, Alvin Wan, Xiangyu Yue, and Kurt Keutzer. Squeezeseg: Convolu-
tional neural nets with recurrent crf for real-time road-object segmentation from
3d lidar point cloud. arXiv preprint arXiv:1710.07368, 2017.
[17] Bichen Wu, Xuanyu Zhou, Sicheng Zhao, Xiangyu Yue, and Kurt Keutzer.
Squeezesegv2: Improved model structure and unsupervised domain adaptation for
road-object segmentation from a lidar point cloud. arXiv:1809.08495, 2018.
[18] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun
Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Effi-
cient convolutional neural networks for mobile vision applications. arXiv preprint
arXiv:1704.04861, 2017.
[19] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-
Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
4510–4520, 2018.
[20] X Zhang, X Zhou, M Lin, and J Sun. Shufflenet: An extremely efficient convolu-
tional neural network for mobile devices. arXiv:1707.01083.
[21] H. Zhong, X. Liu, Y. He, and Y. Ma. Shift-based Primitives for Efficient Convolu-
tional Neural Networks. ArXiv e-prints, September 2018.
[22] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing
deep neural networks with pruning, trained quantization and huffman coding.
arXiv preprint arXiv:1510.00149, 2015.
[23] Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally. Trained ternary
quantization. arXiv preprint arXiv:1612.01064, 2016.
[24] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-
net: Imagenet classification using binary convolutional neural networks. In
European Conference on Computer Vision, pages 525–542. Springer, 2016.
[25] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou.
Dorefa-net: Training low bitwidth convolutional neural networks with low
bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
[26] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang,
Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized
clipping activation for quantized neural networks. arXiv:1805.06085, 2018.
[27] B. Zhuang, C. Shen, M. Tan, L. Liu, and I. Reid. Towards Effective Low-bitwidth
Convolutional Neural Networks. arXiv preprint arXiv:1711.00205, 2017.
[28] Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong.
Optimizing fpga-based accelerator design for deep convolutional neural net-
works. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, pages 161–170. ACM, 2015.
[29] Jialiang Zhang and Jing Li. Improving the performance of opencl-based fpga accel-
erator for convolutional neural network. In Proceedings of the 2017 International
Symposium on Field-Programmable Gate Arrays, pages 25–34, 2017.
[30] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. Optimizing loop operation
and dataflow in fpga acceleration of deep convolutional neural networks. In Pro-
ceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, pages 45–54. ACM, 2017.
[31] Chi Zhang and Viktor Prasanna. Frequency domain acceleration of convolutional
neural networks on cpu-fpga shared memory system. In Proceedings of the 2017
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages
35–44. ACM, 2017.
[32] Eriko Nurvitadhi, Ganesh Venkatesh, Jaewoong Sim, Debbie Marr, Randy Huang,
Jason Ong Gee Hock, Yeong Tat Liew, Krishnan Srivatsan, Duncan Moss, Suchit
Subhaschandra, et al. Can fpgas beat gpus in accelerating next-generation deep
neural networks? In Proceedings of the ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, pages 5–14. ACM, 2017.
[33] Huimin Li, Xitian Fan, Li Jiao, Wei Cao, Xuegong Zhou, and Lingli Wang. A
high performance fpga-based accelerator for large-scale convolutional neural net-
works. In Field Programmable Logic and Applications (FPL), 2016 26th International
Conference on, pages 1–9. IEEE, 2016.
[34] Utku Aydonat, Shane O’Connell, Davor Capalija, Andrew C Ling, and Gordon R
Chiu. An opencl deep learning accelerator on arria 10. In Proceedings of the 2017
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages
55–64. ACM, 2017.
[35] Xuechao Wei, Cody Hao Yu, Peng Zhang, Youxiang Chen, Yuxin Wang, Han Hu,
Yun Liang, and Jason Cong. Automated systolic array architecture synthesis for
high throughput cnn inference on fpgas. In Proceedings of the 54th Annual Design
Automation Conference 2017, page 29. ACM, 2017.
[36] Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella, Michaela Blott, Philip
Leong, Magnus Jahre, and Kees Vissers. Finn: A framework for fast, scalable bi-
narized neural network inference. In Proceedings of the ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, pages 65–74. ACM, 2017.
[37] Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-Hau Lin, Mani
Srivastava, Rajesh Gupta, and Zhiru Zhang. Accelerating binarized convolu-
tional neural networks with software-programmable fpgas. In Proceedings of the
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages
15–24. ACM, 2017.
[38] Hiroki Nakahara, Tomoya Fujii, and Shimpei Sato. A fully connected layer
elimination for a binarized convolutional neural network on an fpga. In Field
Programmable Logic and Applications (FPL), 2017 27th International Conference on,
pages 1–4. IEEE, 2017.
[39] Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng
Yu, Tianqi Tang, Ningyi Xu, Sen Song, et al. Going deeper with embedded fpga
platform for convolutional neural network. In Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pages 26–35, 2016.
[40] Kaiyuan Guo, Song Han, Song Yao, Yu Wang, Yuan Xie, and Huazhong Yang.
Software-hardware codesign for efficient neural network acceleration. IEEE
Micro, 37(2):18–25, 2017.
[41] Michaela Blott, Thomas Preusser, Nicholas Fraser, Giulio Gambardella, Kenneth
O’Brien, and Yaman Umuroglu. Finn-r: An end-to-end deep-learning framework
for fast exploration of quantized neural networks, 2018.
[42] Li Jiao, Cheng Luo, Wei Cao, Xuegong Zhou, and Lingli Wang. Accelerating
low bit-width convolutional neural networks with embedded fpga. In Field
Programmable Logic and Applications (FPL), 2017 27th International Conference on,
pages 1–4. IEEE, 2017.
[43] Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, and Ali Farhadi.
Label refinery: Improving imagenet classification through label progression.
arXiv preprint arXiv:1805.02641, 2018.
[44] Shaoyi Cheng and John Wawrzynek. High level synthesis with a dataflow
architectural template, 2016.
[45] Xilinx. Vivado Design Suite User Guide - High-Level Synthesis (UG902), 2018.
[46] Samuel Williams, Andrew Waterman, and David Patterson. Roofline: An in-
sightful visual performance model for floating-point programs and multicore
architectures. Communications of the Association for Computing Machinery, 2009.
[47] Xilinx. PYNQ Introduction, 2018. https://pynq.readthedocs.io/en/v2.3/.
[48] Shuang Liang, Shouyi Yin, Leibo Liu, Wayne Luk, and Shaojun Wei. Fp-bnn:
Binarized neural network on fpga. Neurocomputing, 275:1072–1086, 2018.
[49] Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma
Vrudhula, Jae-sun Seo, and Yu Cao. Throughput-optimized opencl-based fpga
accelerator for large-scale convolutional neural networks. In Proceedings of the
2016 International Symposium on Field-Programmable Gate Arrays, pages 16–25.
ACM, 2016.
