Accelerating Deterministic and Stochastic Binarized Neural Networks on
  FPGAs Using OpenCL by Lammie, Corey et al.
978-1-7281-0397-6/19/$31.00 2019 IEEE
Corey Lammie, Wei Xiang, and Mostafa Rahimi Azghadi
College of Science and Engineering, James Cook University, Queensland 4814, Australia
Email:{corey.lammie, wei.xiang, mostafa.rahimiazghadi}@jcu.edu.au
Abstract—Recent technological advances have proliferated the
available computing power, memory, and speed of modern Cen-
tral Processing Units (CPUs), Graphics Processing Units (GPUs),
and Field Programmable Gate Arrays (FPGAs). Consequently,
the performance and complexity of Artificial Neural Networks
(ANNs) are burgeoning. While GPU-accelerated Deep Neural
Networks (DNNs) currently offer state-of-the-art performance,
they consume large amounts of power. Training such networks
on CPUs is inefficient, as data throughput and parallel
computation are limited. FPGAs are considered a suitable candidate
for performance-critical, low-power systems, e.g. Internet of
Things (IoT) edge devices. Using the Xilinx SDAccel or Intel
FPGA SDK for OpenCL development environments, networks
described using the high-level OpenCL framework can be
accelerated on heterogeneous platforms. Moreover, the resource
utilization and power consumption of DNNs can be further
reduced by utilizing regularization techniques that binarize
network weights. In this paper, we introduce, to the best of our
knowledge, the first FPGA-accelerated stochastically binarized
DNN implementations, and compare them to implementations
accelerated on both GPUs and FPGAs. All our developed networks
are trained and benchmarked using the popular MNIST and
CIFAR-10 datasets. For our binarized and conventional
FPGA-based networks, we achieve a >16-fold reduction in power
consumption, compared to their GPU-accelerated counterparts.
Also, our binarized FPGA-based networks require >25% shorter
inference times, compared to their GPU-based counterparts.
I. INTRODUCTION
Deep Neural Network (DNN) architectures have become integral
to a variety of applications in Artificial Intelligence (AI)
and Machine Learning (ML). While these learning
networks and their underpinning elements have been actively
researched since 1974 [1], the advent of modern
GPUs and faster CPU architectures has greatly facilitated
Neural Network (NN) research and enabled the development
of highly accurate and complex DNNs.
However, high-performance CPU- and GPU-accelerated
DNNs are known to consume large amounts of power.
As a result, accelerating DNNs on low-power and
resource-constrained devices, such as portable smart electronics and
IoT edge devices, is challenging. Considerable efforts
are currently being made to utilize customized hardware
solutions using FPGAs, which offer significant reductions in
power consumption for both Fully Connected (FC) and
Convolutional Neural Networks (CNNs) [2]–[4].
Despite the many improvements that recent FPGA studies
offer in boosting parallelism and power efficiency, due to
the large number of high-resolution multiplications required
during learning and inference, such accelerated implementations
are limited by the number of dedicated multipliers and
Digital Signal Processing (DSP) blocks available on FPGAs.
Therefore, new techniques have recently been developed to
account for the limited hardware resources available. A very
popular technique, which quantizes network weights to binary
states, has been proposed in order to greatly reduce resource
utilization, and as a result, power consumption, while ex-
hibiting minimal performance degradation [5]. Within these
networks, denoted Binarized Neural Networks (BNNs), many
resource-hungry multiply-accumulate operations, required dur-
ing learning and inference, are replaced with simple accumu-
lations.
Deterministic [6], stochastic [5], and recursive [7] BNNs
binarize weights during forward and backward propagation
cycles, while retaining the precision of the stored
weights in which gradients are accumulated. Self-binarizing
networks [7] train using a unique representation of network
weights, involving a smooth activation function, which is
iteratively sharpened during training until it becomes a binary
representation equivalent to the sign activation function.
While hardware implementations of deterministic BNNs are
plentiful [3], [8], to the best of our knowledge, there are no
current FPGA implementations of stochastic BNNs. Therefore,
here we propose the first FPGA implementations of stochastic
BNNs, as it has been demonstrated that stochastic BNNs
further improve the learning performance of BNNs, compared
to their deterministic counterparts [5].
In addition, we provide comprehensive results through in-
vestigating the acceleration of deterministic and stochastic
BNNs on both GPUs and FPGAs using High Level Synthesis
(HLS) techniques utilizing the OpenCL framework, to en-
courage deployment using heterogeneous platforms. Resource
usage and performance of the implemented networks are
also compared for permutation-invariant DNNs, and CNNs,
trained and tested for MNIST [9] and CIFAR-10 [10]. For all
our hardware implementations, we draw comparisons among
designs utilizing deterministic, stochastic, or no regularization
techniques. Our specific contributions are as follows:
• We implement and present the first FPGA-accelerated
stochastically binarized DNNs and CNNs.
• We employ complete FPGA-accelerated DNNs and
CNNs on a standalone System On a Chip (SoC), requiring
no host computer or extra device for partial computation.
arXiv:1905.06105v1 [cs.LG] 15 May 2019
• We demonstrate that our new binarized FPGA-accelerated
DNNs and CNNs offer significantly reduced power usage
and shorter inference times, compared to their equivalent
full resolution counterparts, on MNIST and CIFAR-10,
implemented on both GPU and FPGA.
• We report and investigate the learning times required for
all of our implemented networks.
II. PRELIMINARIES
This section briefly reviews and presents the algorithms and
methods used in our developed networks for the MNIST and
CIFAR-10 classification benchmarks.
A. Binary Weight Regularization
Binary weight regularization [5] constrains network
weights to the binary states {+1, −1} during forward and
backward propagation. The binarization operation transforms
the full-precision weights into binary values, using either a
deterministic or a stochastic approach.
1) Deterministic Binarization: Deterministic binarization is
defined in Equation (1).
$$w_b = \begin{cases} -1 & \text{if } w \leq 0, \\ +1 & \text{otherwise,} \end{cases} \tag{1}$$
where $w_b$ is the binarized weight and $w$ is the real-valued
full-precision weight.
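As a minimal sketch of Equation (1), the following Python function maps a list of real-valued weights to binary values. The function name `binarize_deterministic` is a hypothetical choice made for illustration and is not taken from the authors' repository.

```python
def binarize_deterministic(weights):
    """Map each real-valued weight to -1 if w <= 0, otherwise +1 (Equation (1))."""
    return [-1 if w <= 0 else 1 for w in weights]


if __name__ == "__main__":
    # A weight of exactly zero falls in the "w <= 0" branch and maps to -1.
    print(binarize_deterministic([-0.7, 0.0, 0.2, 1.5]))  # [-1, -1, 1, 1]
```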
2) Stochastic Binarization: Stochastic binarization is an
alternative binarization technique, which stochastically bina-
rizes weights. The stochastic binary projection is presented in
Equation (2).
$$w_b = \begin{cases} +1 & \text{with probability } \rho = \sigma(w), \\ -1 & \text{with probability } 1 - \rho, \end{cases} \tag{2}$$
where $\sigma$ is the hard sigmoid function described in Equation
(3).
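Equations (2) and (3) can be sketched together in a few lines of Python. This is an illustrative sketch only: the function names and the use of a seeded `random.Random` generator are assumptions made here, not details of the authors' OpenCL kernels.

```python
import random


def hard_sigmoid(x):
    """Hard sigmoid of Equation (3): clip((x + 1) / 2, 0, 1)."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def binarize_stochastic(weights, rng):
    """Draw +1 with probability hard_sigmoid(w), otherwise -1 (Equation (2))."""
    return [1 if rng.random() < hard_sigmoid(w) else -1 for w in weights]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
    samples = [binarize_stochastic([0.5], rng)[0] for _ in range(10000)]
    # The empirical frequency of +1 should be close to hard_sigmoid(0.5) = 0.75.
    print(sum(s == 1 for s in samples) / len(samples))
```

Weights at or beyond the clipping bounds are binarized without randomness: a weight of +1 or more always yields +1, and a weight of −1 or less always yields −1.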
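Algorithm 1 Training Algorithm of the Accelerated Binarized
Neural Networks
Input: a mini-batch of (inputs, targets), previous parameters
$w_{t-1}$ and $b_{t-1}$, and a learning rate $\eta$.
Output: updated parameters $w_t$ and $b_t$.
1: Forward Propagation
   $w_b \leftarrow \text{binarize}(w_{t-1})$.
   For $k = 1$ to $L$, compute $a_k$ knowing $a_{k-1}$, $w_b$, and $b_{t-1}$.
2: Backward Propagation
   Initialize the output layer's activation gradient $\frac{\partial C}{\partial a_L}$.
   For $k = L$ to $2$, compute $\frac{\partial C}{\partial a_{k-1}}$ using $\frac{\partial C}{\partial a_k}$ and $w_b$.
3: Parameter Update
   Compute $\frac{\partial C}{\partial w_b}$ and $\frac{\partial C}{\partial b_{t-1}}$, using $\frac{\partial C}{\partial a_k}$ and $a_{k-1}$.
   $w_t \leftarrow \text{clip}(w_{t-1} - \eta \frac{\partial C}{\partial w_b})$
   $b_t \leftarrow b_{t-1} - \eta \frac{\partial C}{\partial b_{t-1}}$.
4: Weight Normalization
   $w \leftarrow \text{clip}(w)$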
$$\sigma(x) = \text{clip}\left(\frac{x + 1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x + 1}{2}\right)\right). \tag{3}$$
B. Training Algorithm
Algorithm 1 provides a high-level overview of the training
algorithm used for deterministic and stochastic BNNs. Here,
$w$, $b$, and $\eta$ represent the weights, biases, and learning rate,
while $C$ denotes the cost function for each mini-batch. Furthermore,
$w_b$ represents the binary weights and $a_k$ represents the
activations of the $k$th layer, while binarize() implements
Equation (1) or (2), depending on the utilized regularization, and clip()
clips values between −1 and +1. By adopting this training
algorithm, during learning and inference, network outputs can
be determined using simple accumulations, in place of the
Multiply and Accumulate (MAC) operations that rely on
dedicated multiplier blocks [6].
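The steps of Algorithm 1 can be sketched for a single linear output unit as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses one layer, deterministic binarization, and a squared-error cost C = 0.5 (y − target)², and all function names are hypothetical. The forward pass adds or subtracts each input according to the sign of its binary weight, so it needs no multiplications.

```python
def clip(x, lo=-1.0, hi=1.0):
    """Clip x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def binarize(weights):
    # Deterministic variant (Equation (1)); a stochastic variant could be swapped in.
    return [-1.0 if w <= 0 else 1.0 for w in weights]


def forward(x, wb, b):
    # With weights in {-1, +1}, each product reduces to adding or subtracting x_i.
    acc = b
    for xi, wi in zip(x, wb):
        acc = acc + xi if wi > 0 else acc - xi
    return acc


def train_step(x, target, w, b, lr):
    """One update of the real-valued parameters (w, b) on a single sample."""
    wb = binarize(w)                      # 1: forward propagation with binary weights
    y = forward(x, wb, b)
    dC_dy = y - target                    # 2: gradient of the assumed squared-error cost
    dC_dwb = [dC_dy * xi for xi in x]     # 3: gradients w.r.t. binary weights and bias
    dC_db = dC_dy
    # The gradient taken w.r.t. the binary weights updates the stored real-valued
    # weights, which are then clipped to [-1, +1].
    w_new = [clip(wi - lr * g) for wi, g in zip(w, dC_dwb)]
    b_new = b - lr * dC_db
    return w_new, b_new, 0.5 * dC_dy ** 2


if __name__ == "__main__":
    w, b = [0.5, -0.5], 0.0
    for step in range(5):
        w, b, cost = train_step([1.0, 2.0], 3.0, w, b, lr=0.1)
        print(step, cost, w, b)
```

Keeping the real-valued weights between updates lets small gradient steps accumulate until a weight changes sign, at which point its binarized value flips.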
III. NETWORK ARCHITECTURE
The complete architecture of the implemented networks
consists of two main components: A. The Software Architec-
ture, and B. The Hardware Architecture. The software archi-
tecture defines the targeted neural network structure, which
will be described in C++ and OpenCL kernels. The hardware
architecture describes the integration between the hardware
used to run the OpenCL kernels and a host controller, which
is the program executed on a host processor. This processor is
used to launch OpenCL kernels and to manage device memory.
A. Software Architecture
We implement two distinct neural network architectures: a
permutation-invariant FC network for MNIST, and the VGG-16 [11]
CNN for CIFAR-10. Details pertaining to each network
are provided in a publicly available GitHub repository (footnote 1).
To decrease the quantization error that binarization
introduces, the output of each layer is normalized using batch
normalization. The output of the final layer is fed through
a Softmax activation layer, and the network's loss is
minimized using cross-entropy. SGD with momentum is used to
optimize the network parameters, with an initial learning rate,
η[0] = 0.001, and momentum, m = 0.9. In order to accelerate
convergence, and maximize each network's performance, an
adaptive decaying learning rate, η, is used, as described in
Equation (4).
$$\eta[\text{epoch}] = \eta[\text{epoch} - 1] \cdot 0.01^{\frac{\text{epoch}}{100}} \tag{4}$$
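The recurrence in Equation (4) can be unrolled with a short Python sketch. It assumes that the exponent of 0.01 in Equation (4) is epoch/100, and the function name `learning_rate_schedule` is hypothetical.

```python
def learning_rate_schedule(eta0=0.001, epochs=10):
    """Return [eta[0], ..., eta[epochs]] with eta[e] = eta[e - 1] * 0.01 ** (e / 100)."""
    etas = [eta0]
    for epoch in range(1, epochs + 1):
        etas.append(etas[-1] * 0.01 ** (epoch / 100))
    return etas


if __name__ == "__main__":
    for epoch, eta in enumerate(learning_rate_schedule()):
        print(epoch, eta)
```

Because the decay factors multiply, the recurrence under this assumption has the closed form η[e] = η[0] · 0.01^(e(e+1)/200), so the learning rate falls faster as training progresses.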
B. Hardware Architecture
The developed hardware architectures consist of C++ host
controllers and multiple OpenCL kernels, which are
accelerated using either an FPGA or a GPU. For x86-based systems,
OpenCL kernels accelerated using FPGAs typically reside
on an FPGA development board, which is connected to a
separate, independent host system through the PCI Express (PCIe)
interface [2]. For ARM-based systems, the FPGA is typically
connected to a Hard Processor System (HPS) on a SoC
through specialized bridges, as in the case of the Intel
DE1-SoC development board that we used here. This allows the
proposed networks to be run completely independently on the
SoC without using a separate device for computation. The full
top level flow diagram of our implemented FPGA-accelerated
networks is presented in Figure (1).

[Figure 1: block diagram with the labels SoC [HPS + FPGA]; C/C++ Host Code; GCC C/C++ Compiler; Processor System; OpenCL Kernel Code; Intel FPGA SDK for OpenCL; FPGA Accelerator; Hard FPGA <-> HPS Bridge.]
Fig. 1. Top level flow diagram of the proposed network implemented on a
SoC consisting of a processor running the host controller and an FPGA to
run OpenCL kernels.

1 https://github.com/coreylammie/Accelerating-Stochastically-Binarized-Neural-Networks-on-FPGAs-using-OpenCL
In addition to accelerating the targeted MNIST and CIFAR-10
networks on an FPGA development board, each network is
also accelerated on a system that uses a state-of-the-art Titan V
GPU to execute OpenCL kernels and an AMD Ryzen 2700X CPU,
overclocked (OC) to 4.10 GHz, to drive the operating system.
IV. IMPLEMENTATION RESULTS
In order to validate and investigate the performance of the
proposed FPGA- and GPU-accelerated BNN architectures, the
MNIST and CIFAR-10 datasets are used. To ensure a fair
comparison, on account of the limited resource availability
of the Intel DE1-SoC development board used, the batch size
was fixed to 4 for all networks. The validation accuracy for
all developed networks over 200 training epochs is presented
in Figures (2) and (3).

[Figure 2: validation accuracy (%), 60 to 100, plotted against epoch, 0 to 200, for the No Regularizer, Deterministic, and Stochastic networks.]
Fig. 2. Validation accuracy during training for the FPGA- and GPU-accelerated
permutation-invariant FCNN on the MNIST test set. Solid lines represent
the validation accuracy for networks accelerated using FPGA and dashed lines
represent the validation accuracy for networks accelerated using GPU.
From Figures (2) and (3), it can be observed that both GPU-
and FPGA-accelerated networks achieve very similar
validation accuracy rates during learning. For all implementations,
regularized networks require more training epochs to converge.
The variations in validation accuracy trends reported
between platforms can be attributed to the different initial
weights generated using the He weight initialization
technique. Figures (2) and (3) also demonstrate that networks
employing stochastic and deterministic binarization techniques
perform very similarly to their baseline architectures
employing no binary regularization techniques. For our
FPGA-accelerated networks learning MNIST, the validation
accuracy degrades by only 0.37% (for stochastic) and 0.94%
(for deterministic), compared to no regularization. For our
FPGA-accelerated networks learning CIFAR-10, a validation
accuracy decrease of 0.24% was observed for our network
employing deterministic binarization, while our network with
stochastic binarization showed a validation
accuracy increase of 0.03%. These findings are in good
agreement with the software implementations of the binarized
networks reported in [5].
To comprehensively compare the implemented FPGA- and
GPU-accelerated networks, the total kernel power usage,
learning time per epoch, inference time per image, and
learning performance were determined and are presented in Table I.
The total kernel power usage was determined using the
Post Place & Route Estimator for FPGA post-synthesis, and
NVIDIA-SMI for GPU. It was found that the power
consumption of all FPGA-accelerated networks is >16 times
lower than that of their GPU-accelerated counterparts.
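The power comparison can be reproduced from the values reported in Table I with a short Python sketch. The dictionary layout and the function name `power_ratios` are illustrative choices; the numbers are copied from Table I.

```python
# Total kernel power usage in watts, copied from Table I as (FPGA, GPU).
POWER_W = {
    ("MNIST", "No Regularizer"): (7.0, 126.1),
    ("MNIST", "Deterministic"): (6.3, 125.9),
    ("MNIST", "Stochastic"): (6.3, 125.4),
    ("CIFAR-10", "No Regularizer"): (7.9, 128.4),
    ("CIFAR-10", "Deterministic"): (6.5, 126.3),
    ("CIFAR-10", "Stochastic"): (6.6, 126.9),
}


def power_ratios(table):
    """Return the GPU-to-FPGA power ratio for each (dataset, regularizer) pair."""
    return {key: gpu / fpga for key, (fpga, gpu) in table.items()}


if __name__ == "__main__":
    for (dataset, regularizer), ratio in power_ratios(POWER_W).items():
        print(f"{dataset:9s} {regularizer:15s} {ratio:5.1f}x")
```

The smallest ratio belongs to the non-regularized CIFAR-10 network, which sets the >16 bound quoted in the text.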
Despite this drastic reduction in power consumption, our
deterministic and stochastic regularized FPGA-accelerated
networks require similar training durations to their
GPU-accelerated counterparts, which have much higher operating
frequencies than our utilized FPGA. As reported
in Table I, our FPGA-accelerated permutation-invariant FC
deterministic and stochastic BNNs for MNIST require 1.10×
and 1.41× longer training intervals, respectively, than their
GPU-accelerated counterparts. Our FPGA-accelerated
deterministic and stochastic CNNs adopting the VGG-16 architecture
accelerate learning by 2.06× and 1.68×, respectively. These findings
are in agreement with [6], which investigates execution times
for FC and CNN BNNs, and demonstrates that convolutional
operations are accelerated to a greater extent than matrix
multiplications, which are required for FC layers.

[Figure 3: validation accuracy (%), 50 to 90, plotted against epoch, 0 to 200, for the No Regularizer, Deterministic, and Stochastic networks.]
Fig. 3. Validation accuracy for each epoch during training for the FPGA-
and GPU-accelerated VGG-16 CNN on the CIFAR-10 test set. Solid lines
represent the validation accuracy for networks accelerated using FPGA and dashed
lines represent the validation accuracy for networks accelerated using GPU.

TABLE I
IMPLEMENTATION RESULTS OBTAINED USING THE MNIST AND CIFAR-10 DATASETS FOR GPU- AND FPGA-ACCELERATED NETWORKS. THE LEARNING
TIME PER EPOCH AND INFERENCE TIME PER IMAGE METRICS ARE AVERAGED OVER ALL RECORDED SAMPLES DURING 200 TRAINING EPOCHS.

| Dataset  | Regularizer    | Total Kernel Power Usage (W), FPGA | Total Kernel Power Usage (W), GPU | Learning Time per Epoch (s), FPGA | Learning Time per Epoch (s), GPU | Inference Time per Image (s), FPGA | Inference Time per Image (s), GPU | Validation Accuracy (%), FPGA | Validation Accuracy (%), GPU |
|----------|----------------|------|-------|-------|-------|----------|----------|-------|-------|
| MNIST    | No Regularizer | 7.0  | 126.1 | 26.09 | 5.13  | 7.04E-05 | 3.12E-05 | 98.70 | 98.54 |
| MNIST    | Deterministic  | 6.3  | 125.9 | 9.75  | 8.87  | 6.84E-06 | 9.71E-06 | 97.76 | 97.94 |
| MNIST    | Stochastic     | 6.3  | 125.4 | 11.58 | 8.20  | 7.12E-06 | 9.92E-06 | 98.33 | 98.23 |
| CIFAR-10 | No Regularizer | 7.9  | 128.4 | 43.97 | 28.45 | 1.15E-02 | 5.09E-03 | 86.72 | 86.73 |
| CIFAR-10 | Deterministic  | 6.5  | 126.3 | 16.91 | 34.86 | 1.11E-03 | 1.63E-03 | 86.48 | 86.46 |
| CIFAR-10 | Stochastic     | 6.6  | 126.9 | 20.08 | 33.79 | 1.16E-03 | 1.66E-03 | 86.75 | 86.76 |
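The training-time ratios quoted in this paragraph follow directly from the learning times per epoch in Table I. The sketch below recomputes them. The helper names are hypothetical, and the values are copied from Table I for the binarized networks only.

```python
# Learning time per epoch in seconds, copied from Table I as (FPGA, GPU).
LEARNING_TIME_S = {
    ("MNIST", "Deterministic"): (9.75, 8.87),
    ("MNIST", "Stochastic"): (11.58, 8.20),
    ("CIFAR-10", "Deterministic"): (16.91, 34.86),
    ("CIFAR-10", "Stochastic"): (20.08, 33.79),
}


def fpga_slowdown(fpga_s, gpu_s):
    """How many times longer the FPGA takes per epoch than the GPU."""
    return fpga_s / gpu_s


def fpga_speedup(fpga_s, gpu_s):
    """How many times faster the FPGA is per epoch than the GPU."""
    return gpu_s / fpga_s


if __name__ == "__main__":
    for (dataset, regularizer), (fpga_s, gpu_s) in LEARNING_TIME_S.items():
        if fpga_s > gpu_s:
            print(dataset, regularizer, f"FPGA {fpga_slowdown(fpga_s, gpu_s):.2f}x slower")
        else:
            print(dataset, regularizer, f"FPGA {fpga_speedup(fpga_s, gpu_s):.2f}x faster")
```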
When considering the inference time, all our FPGA-
accelerated stochastic and deterministic regularized networks
require shorter times to perform inference, compared to their
GPU-accelerated counterparts. This is notable, considering
our GPU-accelerated networks use the state-of-the-art Titan V
GPU to execute OpenCL kernels, while the limited resources
available on the utilized FPGA create a large bottleneck
on the maximum synthesizable frequency, and thus limit
the speed of our FPGA-accelerated networks. We believe
the shorter inference times observed are mainly due to the
binarized parameters during inference, which accelerate the
required computations. This also explains why, when no regu-
larizer is used, our GPU-accelerated implementations require
shorter inference times than our FPGA-accelerated implemen-
tations. Modern FPGAs such as the Stratix V GXA7 and
Virtex-7 VX485T, used in other recent works [2], [12], are
expected to demonstrate even more significant improvements
in speed during training and inference. This promises further
inference acceleration for FPGA-based deterministically and
stochastically binarized networks.
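The relative inference-time reduction of the binarized FPGA networks over their GPU counterparts can be checked against Table I with the following sketch. The function name `relative_reduction` is a hypothetical choice, and the values are copied from Table I.

```python
# Inference time per image in seconds, copied from Table I as (FPGA, GPU).
INFERENCE_TIME_S = {
    ("MNIST", "Deterministic"): (6.84e-06, 9.71e-06),
    ("MNIST", "Stochastic"): (7.12e-06, 9.92e-06),
    ("CIFAR-10", "Deterministic"): (1.11e-03, 1.63e-03),
    ("CIFAR-10", "Stochastic"): (1.16e-03, 1.66e-03),
}


def relative_reduction(fpga_s, gpu_s):
    """Fraction by which the FPGA inference time is shorter than the GPU time."""
    return 1.0 - fpga_s / gpu_s


if __name__ == "__main__":
    for (dataset, regularizer), (fpga_s, gpu_s) in INFERENCE_TIME_S.items():
        print(dataset, regularizer, f"{100 * relative_reduction(fpga_s, gpu_s):.1f}% shorter")
```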
V. CONCLUSION
We designed and implemented various FC and convolutional
BNN architectures using the high-level OpenCL framework.
We then accelerated the developed networks on both GPUs
and FPGAs. The performance, power, and learning/inference
times of these network architectures were investigated. It was
found that FPGA-accelerated BNNs with both deterministic
and stochastic regularizers reduce inference times on
MNIST and CIFAR-10 by an order of magnitude, compared
to the non-regularized FPGA case. They also require >25% shorter
inference times than their GPU counterparts. Moreover, our
FPGA-accelerated BNNs consumed more than 16× less power
than non-regularized GPU-accelerated networks. Finally,
our BNNs achieved only slightly degraded validation
accuracy on MNIST, and in some instances, outperformed our
baseline non-regularized GPU-accelerated networks on CIFAR-10.
In summary, our modular and scalable FC and CNN network
architectures can be extrapolated to accelerate larger and more
complex networks.
REFERENCES
[1] H. Wang, B. Raj, and E. P. Xing, “On the origin of deep learning,”
arxiv.org/abs/1702.07800, 2017.
[2] D. Wang, K. Xu, and D. Jiang, “PipeCNN: An OpenCL-based open-
source FPGA accelerator for convolution neural networks,” in 2017
International Conference on Field Programmable Technology (ICFPT),
Dec 2017, pp. 279–282.
[3] Y. Li, Z. Liu, K. Xu, H. Yu, and F. Ren, “A GPU-Outperforming FPGA
accelerator architecture for binary convolutional neural networks,” ACM
Journal on Emerging Technologies in Computing (JETC), To Appear.
[Online]. Available: https://arxiv.org/abs/1702.06392
[4] C. Lammie and M. R. Azghadi, “Stochastic Computing for Low-Power
and High-Speed Deep Learning on FPGA,” in 2019 IEEE International
Symposium on Circuits and Systems (ISCAS), May 2019. [Online].
Available: https://www.doi.org/10.1109/ISCAS.2019.8702248
[5] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training
deep neural networks with binary weights during propagations,” in
Advances in Neural Information Processing Systems, 2015, pp. 3123–
3131.
[6] M. Courbariaux and Y. Bengio, “Binarynet: Training deep neural
networks with weights and activations constrained to +1 or -1,”
arxiv.org/abs/1602.02830, 2016.
[7] C. Sakr, J. Choi, Z. Wang, K. Gopalakrishnan, and N. R. Shanbhag,
“True gradient-based training of deep binary activated neural networks
via continuous binarization,” 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2018.
[8] L. Yang, Z. He, and D. Fan, “A fully onchip binarized convolutional
neural network fpga impelmentation with accurate inference,” in
Proceedings of the International Symposium on Low Power Electronics
and Design, ser. ISLPED ’18. New York, NY, USA: ACM, 2018,
pp. 50:1–50:6. [Online]. Available: http://doi.acm.org/10.1145/3218603.3218615
[9] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.
[Online]. Available: http://yann.lecun.com/exdb/mnist/
[10] A. Krizhevsky, V. Nair, and G. Hinton, “Cifar-10 (canadian institute
for advanced research).” [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
[11] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arxiv.org/abs/1409.1556, 2014.
[12] J.-H. Lin, T. Xing, R. Zhao, Z. Zhang, M. Srivastava, Z. Tu, and
R. K. Gupta, “Binarized convolutional neural networks with separable
filters for efficient hardware acceleration,” Computer Vision and Pattern
Recognition, 2017. [Online]. Available: https://arxiv.org/abs/1707.04693
