Training Progressively Binarizing Deep Networks Using FPGAs by Lammie, Corey et al.
978-1-7281-0397-6/20/$31.00 © 2020 IEEE
Training Progressively Binarizing Deep Networks
Using FPGAs
Corey Lammie, Wei Xiang, and Mostafa Rahimi Azghadi
College of Science and Engineering, James Cook University, Queensland 4814, Australia
Email:{corey.lammie, mostafa.rahimiazghadi, wei.xiang}@jcu.edu.au
Abstract—While hardware implementations of inference rou-
tines for Binarized Neural Networks (BNNs) are plentiful, current
realizations of efficient BNN hardware training accelerators,
suitable for Internet of Things (IoT) edge devices, leave much
to be desired. Conventional BNN hardware training accelerators
perform forward and backward propagations with parameters
adopting binary representations, and optimization using parameters
adopting floating- or fixed-point real-valued representations,
which requires two distinct sets of network parameters. In this paper,
we propose a hardware-friendly training method that, contrary
to conventional methods, progressively binarizes a singular set
of fixed-point network parameters, yielding notable reductions
in power and resource utilizations. We use the Intel FPGA SDK
for OpenCL development environment to train our progressively
binarizing DNNs on an OpenVINO FPGA. We benchmark our
training approach on both GPUs and FPGAs using CIFAR-10
and compare it to conventional BNNs.
Index Terms—Deep Learning, Binarized Neural Networks,
Progressive Binarization, Deep Neural Networks, Convolutional
Neural Networks, CIFAR-10
I. INTRODUCTION
Binarization has been used to augment the performance of Deep Neural
Networks (DNNs) by quantizing network parameters to binary states,
replacing many resource-hungry multiply-accumulate operations with
simple accumulations [1]. It has been demonstrated that Binarized Neural
Networks (BNNs) implemented on customized hardware can
perform inference faster than conventional DNNs on state-
of-the-art Graphics Processing Units (GPUs) [2], [3], while
offering notable improvements in power consumption and
resource utilizations [4]–[6]. However, there is still a per-
formance gap between DNNs and conventional BNNs [7],
which binarize parameters deterministically or stochastically.
Moreover, the training routines of conventional BNNs are
inherently unstable [8].
During backward propagations of conventional BNN train-
ing routines, gradients are approximated using a Straight-
Through Estimator (STE) as the signum function is not
continuously differentiable [1]. The gap in performance, and
the general instability of conventional BNNs compared to
DNNs, can be largely attributed to the lack of an accurate
derivative for weights and activations in BNNs, which creates
a mismatch between binary and floating- or fixed-point real-
valued representations [9].
The training routines of DNNs that utilize continuously
differentiable and adjustable functions in place of the signum
function, which we denote Progressively Binarizing NNs
(PBNNs), transform a complex and non-smooth optimization
problem into a sequence of smooth sub-optimization problems.
Training routines that progressively binarize network parameters
were first used to binarize the last layer of DNNs, yielding
significant multimedia retrieval performance on standard
benchmarks [10]. Since then, various works have detailed training
routines for complete PBNNs [11]–[13]. However, efficient
customized hardware implementations of PBNNs are yet to
be explored.
In this paper, we use the Intel FPGA SDK for OpenCL
development environment to implement and train novel and
scalable PBNNs on an OpenVINO FPGA, which progressively
binarize a singular set of fixed-point network parameters. We
compare our approach to conventional BNNs and benchmark
our implementations using CIFAR-10 [14]. Our specific con-
tributions are as follows:
• We implement and present the first PBNNs using cus-
tomized hardware and fixed-point number representa-
tions;
• We use a Piece-Wise Linear (PWL) function for binariza-
tion and activations with a constant derivative to simplify
computations;
• We demonstrate that, compared to training BNNs deterministically
or stochastically on CIFAR-10, PBNNs yield a marginal, yet
consistent, increase in classification accuracy, and decrease
both resource and power utilizations.
II. PRELIMINARIES
A. Conventional BNNs
The training routines of conventional BNNs binarize pa-
rameters either deterministically or stochastically after per-
forming parameter optimizations. Deterministic binarization is
performed as per Eq. (1).
$$\theta_b = \begin{cases} -1 & \text{if } \theta \le 0 \\ +1 & \text{if } \theta > 0, \end{cases} \tag{1}$$
where θb denotes binarized parameters and θ denotes real-
valued full-precision parameters. Stochastic binarization is
performed as per Eq. (2), where σ is the hard sigmoid function
described in Eq. (3).
$$\theta_b = \begin{cases} +1 & \text{with probability } \rho = \sigma(\theta), \\ -1 & \text{with probability } 1 - \rho \end{cases} \tag{2}$$
Fig. 1: Depiction of (A) network parameter representations and (B) binarization and activation functions required during training
for various DNNs and BNNs. Different levels of discretization are depicted using various shade palettes. DNNs require one
set of real-valued (A1) or limited-precision parameters (A2) and typically have continuously differentiable activation functions
(B1). In addition to the real-valued parameters (A3), BNNs also require another set of binarized parameters (A4), for which
they use an STE (B2) to determine gradients of the signum function, which is not continuously differentiable. In contrast to
BNNs, PBNNs require one set of real-valued parameters (A5) and use a continuously differentiable activation and binarization
function, with a shape that progressively evolves during training (B3).
$$\sigma(\theta) = \operatorname{clip}\left(\frac{\theta + 1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{\theta + 1}{2}\right)\right) \tag{3}$$
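The two binarization rules of Eqs. (1) to (3) can be sketched in a few lines of Python. This is a minimal scalar illustration rather than the OpenCL kernels used in this work; the function names and the `rng` argument are hypothetical.

```python
import random


def hard_sigmoid(theta):
    """Eq. (3): clip((theta + 1) / 2, 0, 1)."""
    return max(0.0, min(1.0, (theta + 1.0) / 2.0))


def binarize_deterministic(theta):
    """Eq. (1): -1 if theta <= 0, +1 otherwise."""
    return 1 if theta > 0 else -1


def binarize_stochastic(theta, rng=random):
    """Eq. (2): +1 with probability hard_sigmoid(theta), -1 otherwise.

    rng is any object with a random() method returning a float in [0, 1).
    """
    return 1 if rng.random() < hard_sigmoid(theta) else -1
```

For |θ| ≥ 1 the hard sigmoid saturates at 0 or 1, so both rules return the same value; they differ only for parameters inside (−1, 1).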
During backward propagations, large parameters are clipped
using tclip, as per Eq. (4), where J denotes the objective
function.
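$$\frac{\partial J}{\partial \theta} = \frac{\partial J}{\partial \theta_b} \, \mathbf{1}_{|\theta| \le t_{\text{clip}}} \tag{4}$$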
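A minimal Python sketch of the clipped straight-through gradient in Eq. (4) follows. The function name is hypothetical, and the default `t_clip = 1.0` is an assumption made for illustration, since the threshold value is not stated here.

```python
def ste_backward(grad_theta_b, theta, t_clip=1.0):
    """Eq. (4): copy dJ/dtheta_b to dJ/dtheta where |theta| <= t_clip, else 0.

    grad_theta_b and theta are equal-length sequences of floats.
    t_clip = 1.0 is an assumed default, not a value taken from the paper.
    """
    return [g if abs(t) <= t_clip else 0.0
            for g, t in zip(grad_theta_b, theta)]
```

Parameters whose magnitude already exceeds the threshold receive no gradient, which prevents them from growing without bound.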
B. Progressively Binarizing DNNs
PBNNs use a set of constrained real-valued parameters θl at
each layer l, which are not directly learnable, but are a function
of learnable parameters P at each layer [11], [12]. The shape of
θ(P) evolves during training, to closely resemble the signum
function once training is complete. The hyperbolic tangent
function is commonly used to relate θ and P, as described
in Eq. (5).
$$\theta(P) = \tanh(v \cdot P), \tag{5}$$
where v is an adjustable scale parameter, which is used to
evolve the shape of Eq. (5). The derivative of Eq. (5) is
described in Eq. (6).
$$\frac{\partial \tanh(v \cdot P)}{\partial P} = v \cdot \left(1 - \tanh^2(v \cdot P)\right). \tag{6}$$
As v increases, the shape of Eq. (5) better mimics that
of the signum function. During training, the parameters P
are optimized to minimize a loss function, while
v is progressively increased. Once training is complete, v
is sufficiently large that the parameters θ are very close to
−1 or +1, as depicted in Fig. 1. The final binary parameters
can then be obtained simply by passing θ(P) through the signum
function.
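The effect of the scale parameter can be illustrated numerically. The sketch below evaluates Eqs. (5) and (6) for scalar inputs and prints θ(P) for a few values of v; the function names and the sample values of P and v are chosen here for illustration only.

```python
import math


def theta(p, v):
    """Eq. (5): constrained parameter as a function of the learnable parameter."""
    return math.tanh(v * p)


def dtheta_dp(p, v):
    """Eq. (6): derivative of Eq. (5) with respect to P."""
    return v * (1.0 - math.tanh(v * p) ** 2)


def sign(x):
    """Signum with the convention of Eq. (1): values <= 0 map to -1."""
    return 1 if x > 0 else -1


# As v grows, theta(P) moves towards sign(P), even for small |P|.
for v in (1, 10, 1000):
    print(v, [round(theta(p, v), 4) for p in (-0.5, -0.01, 0.01, 0.5)])
```

Because θ(P) always has the same sign as P, applying the signum function after training changes each value by at most the remaining gap to ±1.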
III. IMPLEMENTATION DETAILS
A. Our Progressive Binarization Training Routine
We employ a PWL function, described in Eq. (7) and depicted in Fig. 2,
to approximate the hyperbolic tangent function and simplify
computations. In addition to replacing a non-linear function with a
piece-wise linear one, this choice makes the derivative of Eq. (7)
constant within its linear region, so that it does not depend on P.
Consequently, when all activations are computed simultaneously, the
output of each layer during forward propagations, al[n], does not
need to be stored in memory to determine gradients during
backward propagations.

Fig. 2: Activation and binarization functions used for our
progressive binarization training routine, plotted as θ against P
over [−1, 1] for v = 1, v = 2, and v = 100.
$$\theta(P) = \begin{cases} -1 & \text{if } P < -1 \\ v \cdot P & \text{if } -1 \le P \le 1 \\ +1 & \text{if } P > 1 \end{cases} \tag{7}$$
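A minimal Python sketch of Eq. (7) and its derivative follows. Eq. (7) states its saturation conditions on P; the sketch instead saturates the scaled value v · P at ±1, which is an assumption made here so that the function stays continuous for v > 1 and remains within the bounded range plotted in Fig. 2. For v = 1 the two readings coincide. The names `pwl` and `pwl_grad` are hypothetical.

```python
def pwl(p, v):
    """Piece-wise linear stand-in for tanh(v * p): v * p saturated to [-1, 1].

    Assumption: saturation is applied to v * p rather than to p alone.
    """
    return max(-1.0, min(1.0, v * p))


def pwl_grad(p, v):
    """Derivative of pwl with respect to p.

    Inside the linear region the derivative is the constant v, independent
    of p; in the saturated regions it is zero.
    """
    return float(v) if abs(v * p) < 1.0 else 0.0
```

Since the gradient in the linear region depends only on v, a backward pass needs to know only whether an element was saturated, not its forward value.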
Algorithm 1 provides a high-level overview of our progressive
binarization training routine. Trained binary parameters
can be computed after each training epoch to determine
performance on the test set during training. Here, al[n] denotes
the output of the lth layer at the nth epoch. When v ≫ 1, the sign
of the output of Batch Normalization (BN) is reformulated to
reduce computation, as per Eq. (8) [12].

$$\operatorname{sign}(a_l[n]) = \operatorname{XNOR}(I > T, \gamma > 0), \tag{8}$$

where T is defined in Eq. (9). I denotes the input, β and γ are
parameters that define an affine transform, and µr and σr are
the running mean and standard deviation of the feature maps
that pass through the BN layer.

$$T = \mu_r - \frac{\sigma_r \cdot \beta}{\gamma} \tag{9}$$

Algorithm 1 The training routine adopted by all of our
progressively-binarizing DNNs.
Input: Network hyperparameters (the learning rate schedule,
η, scale parameter schedule, v, batch size, ℑ, gradient
optimizer, loss function, J(θ, y′_i(a_{−1}), y_i), and the number
of training epochs).
Output: Trained binary weights and biases, θ_b.
for each training epoch do
    1. Forward Propagation
    η, v = η[epoch], v[epoch]
    for each training batch do
        for each layer do
            Determine a_l[n] = θ_l[n−1](P_l[n−1])
        end for
    end for
    2. Backward Propagation
    Determine J(θ, y′_i(a_{−1}), y_i)
    for all other layers do
        Determine ∂J/∂a_{l−1}[n] using ∂J/∂a_l[n] and θ_l[n−1]
    end for
    3. Parameter Optimization
    for each layer do
        Determine ∂J/∂θ_l[n−1] using ∂J/∂a_l[n]
        Determine θ[n] using ∂J/∂θ_l[n−1] and η
    end for
end for
4. Determine the Trained Binary Parameters
θ_b = []
for each layer do
    θ_b = concat(θ_b, sign(θ_l))
end for
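The equivalence behind Eqs. (8) and (9) can be checked with a short Python sketch. It assumes the usual BN form γ · (I − µr)/σr + β with σr > 0 and γ ≠ 0; the function names are hypothetical, and inputs that land exactly on the threshold T are left out of the comparison.

```python
def bn_sign_direct(i, gamma, beta, mu_r, sigma_r):
    """Sign of the BN output computed explicitly (assumed BN form, sigma_r > 0)."""
    out = gamma * (i - mu_r) / sigma_r + beta
    return 1 if out > 0 else -1


def bn_sign_xnor(i, gamma, beta, mu_r, sigma_r):
    """Eqs. (8) and (9): one comparison against a precomputed threshold T."""
    t = mu_r - sigma_r * beta / gamma      # Eq. (9); requires gamma != 0
    agree = (i > t) == (gamma > 0)         # XNOR of the two boolean tests
    return 1 if agree else -1
```

Because T depends only on the BN parameters and running statistics, it can be computed once per feature map, leaving a single comparison per input.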
We trained all networks for 50 epochs, after which improvement on
the test set was negligible, with a batch size of ℑ = 8. This is
the largest possible batch size that makes comparison across
devices possible. The initial learning rate was η = 1e−3,
which was decayed by an order of magnitude every 20 training
epochs, i.e. when mod(epoch, 20) = 0.
During training, each network’s scale parameter, v, was
increased logarithmically from 1 at the first epoch to 1000 at
the final epoch. Eq. (8) was used to determine the output of all
batch normalization layers when v ≥ 500. Adam [15] was used
to optimize network parameters and Cross Entropy (CE) [16]
was used to determine network losses. After the trained binary
parameters were determined, for all our implementations, a
conventional OpenCL BNN inference accelerator was used to
perform inference on the CIFAR-10 test set.
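The two schedules described above can be sketched as follows. The sketch assumes zero-indexed epochs and reads "increased logarithmically" as log-spaced (geometric) interpolation between the end points; both are assumptions, and the function names are hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.1, step=20):
    """Learning rate decayed by an order of magnitude every `step` epochs."""
    return base_lr * decay ** (epoch // step)


def scale_parameter(epoch, num_epochs=50, v_start=1.0, v_end=1000.0):
    """Scale parameter v, log-spaced from v_start to v_end (assumed spacing)."""
    frac = epoch / (num_epochs - 1)
    return v_start * (v_end / v_start) ** frac


# Per-epoch schedules in the form Algorithm 1 indexes them: eta[epoch], v[epoch].
eta = [learning_rate(e) for e in range(50)]
v = [scale_parameter(e) for e in range(50)]
```

With log spacing, v is multiplied by the same factor at every epoch, so most of the training run is spent at moderate values of v and the sharpest functions are reached only in the final epochs.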
B. Network Architecture
The network architecture, previously used in [17], was used
in all of our DNNs. This architecture is a variant of the
VGG [18] family of network architectures. It is summarized
in Table I. For each convolutional and pooling layer, f denotes
the number of filters, k determines the filter size, s is the
stride length, and p denotes the padding. Here, N is the
number of output neurons for each fully connected layer.
All convolutional and fully connected layers are sequenced
with batch normalization and activation layers. The last fully
connected layer adopts real-valued representations.
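The output shapes listed in Table I follow from the standard output-size formula for convolution and pooling layers. The sketch below assumes 3×32×32 CIFAR-10 inputs and assumes that each max pooling layer uses a stride of 2 with no padding, since that is what reproduces the listed shapes; the names are hypothetical.

```python
def out_size(size, k, s=1, p=0):
    """Spatial output size for kernel size k, stride s, and padding p."""
    return (size + 2 * p - k) // s + 1


# (layer kind, number of filters) for the convolutional part of the network.
LAYERS = [
    ("conv", 128), ("conv", 128), ("pool", None),
    ("conv", 128), ("conv", 256), ("pool", None),
    ("conv", 256), ("conv", 512), ("pool", None),
]


def feature_shapes(size=32, channels=3):
    """Return the (channels, height, width) shape after each layer."""
    shapes = []
    for kind, filters in LAYERS:
        if kind == "conv":
            channels = filters
            size = out_size(size, k=3, s=1, p=1)   # 3x3, stride 1, padding 1
        else:
            size = out_size(size, k=2, s=2, p=0)   # assumed 2x2, stride 2
        shapes.append((channels, size, size))
    return shapes
```

The final feature map of 512×4×4 flattens to 8192 inputs for the first fully connected layer.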
C. Hardware Architecture
All of our implementations are described using the heterogeneous
OpenCL [20] framework, in which multiple OpenCL
kernels are accelerated using either FPGAs or GPUs that
are controlled using C++ host controllers. For FPGA
implementations, an SoC is used as the host controller, whereas
for GPU implementations a CPU is used. We note that the
power consumption of our FPGA implementations could be
further decreased by realizing them using a Hardware Description
Language (HDL) and removing the host controller; however,
this would make fair comparisons between GPU and FPGA
implementations difficult [21].

TABLE I: Adopted Network Architecture.

Layer                                          Output Shape    Binarized
Convolutional, f = 128, k = 3, s = 1, p = 1    (128×32×32)     Yes
Convolutional, f = 128, k = 3, s = 1, p = 1    (128×32×32)     Yes
Max Pooling, k = 2, p = 2                      (128×16×16)     Yes
Convolutional, f = 128, k = 3, s = 1, p = 1    (128×16×16)     Yes
Convolutional, f = 256, k = 3, s = 1, p = 1    (256×16×16)     Yes
Max Pooling, k = 2, p = 2                      (256×8×8)       Yes
Convolutional, f = 256, k = 3, s = 1, p = 1    (256×8×8)       Yes
Convolutional, f = 512, k = 3, s = 1, p = 1    (512×8×8)       Yes
Max Pooling, k = 2, p = 2                      (512×4×4)       Yes
Fully Connected, N = 1024                      (1024)          Yes
Fully Connected, N = 1024                      (1024)          Yes
Fully Connected, N = 10                        (10)            No

TABLE II: Implementation results obtained using the CIFAR-10 dataset for GPU and FPGA accelerated networks. ¹The mean
and standard deviations reported for the Training Time per Epoch (s) metric are determined over 50 training epochs. ²Similarly
to conventional networks, the unbounded ReLU [19] activation function was used instead of Eq. (7) for the real-valued FP32
baseline implementation on GPU. ³The same test set accuracy was achieved for GPU and FPGA implementations.

Training Routine     Total Kernel Power Usage (W)   Total Training Time (s)   Training Time per Epoch (s)¹   Test Set Accuracy (%)³
                     FPGA      GPU                  FPGA        GPU           FPGA          GPU              FPGA    GPU
8-bit Fixed Point
Stochastic           8.06      133.9                1,592.73    2,613.67      31.85±0.21    52.27±0.37       85.91   85.91
Deterministic        7.95      133.0                1,523.17    2,497.72      30.46±0.18    49.95±0.31       85.56   85.56
Progressive          7.60      130.3                1,383.17    2,315.97      27.66±0.17    46.31±0.32       86.28   86.28
16-bit Fixed Point
Stochastic           10.19     134.2                1,989.25    3,147.23      39.78±0.17    62.94±0.31       86.45   86.45
Deterministic        10.03     132.8                1,907.17    2,909.62      38.14±0.19    58.19±0.36       86.16   86.16
Progressive          9.27      130.5                1,729.32    2,685.22      34.58±0.22    53.70±0.34       86.94   86.94
FP32 Baseline
Real-valued²         n/a       137.1                n/a         2,524.20      n/a           50.48±0.35       n/a     86.77
IV. IMPLEMENTATION RESULTS
In order to investigate the performance of our progressively
binarizing training routine, CIFAR-10 was used. Prior to
training, the color channels of each image were normalized
using mean and standard deviation values of (0.4914, 0.2023),
(0.4822, 0.1994), and (0.4465, 0.2010), for the red, green,
and blue image channels, respectively. This normalization
was performed because it has demonstrated significant per-
formance on the ImageNet dataset [22]. We compare FPGA
implementations adopting 16-bit and 8-bit fixed-point real-
valued representations, as a large degradation in performance
was observed when using smaller bit widths.
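The per-channel normalization described above can be sketched as follows. The sketch assumes that pixel intensities are first scaled to [0, 1], which is not stated here; the constant and function names are hypothetical.

```python
# Per-channel (red, green, blue) statistics quoted in the text.
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2023, 0.1994, 0.2010)


def normalize_pixel(rgb):
    """Normalize one (r, g, b) pixel, assumed to be scaled to [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, MEAN, STD))


def normalize_image(image):
    """Normalize an image given as a list of rows of (r, g, b) tuples."""
    return [[normalize_pixel(px) for px in row] for row in image]
```

A pixel equal to the channel means maps to zero in every channel, so the normalized inputs are centered before they reach the first convolutional layer.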
To compile OpenCL kernels for the OpenVINO FPGA, the
Intel FPGA SDK for OpenCL Offline Compiler (IOC) was
used, as part of the Intel FPGA SDK for OpenCL and Quartus
Prime Design Suite 18.1. For our GPU implementations, a
Titan V GPU was used to execute OpenCL kernels and an
AMD Ryzen 2700X @ 4.10 GHz Overclocked (OC) CPU was
used to drive the host controller. We used version 430.50 of
the Titan V GPU driver to launch compute kernels. We report
all GPU and FPGA implementation results in Table II.

TABLE III: Comparison of FPGA device utilization for
various binarization training approaches. The numbers are
extracted from acl_quartus_report.txt, generated by Quartus
Prime Design Suite 18.1. Device: Intel FPGA OpenVINO. Dataset: CIFAR-10.

Training Routine     Deterministic   Stochastic   Progressive
8-bit Fixed Point
Flip Flops (%)       63.19           66.42        62.95
ALMs (%)             81.38           84.87        76.92
DSPs (%)             100.00          100.00       93.20
16-bit Fixed Point
Flip Flops (%)       96.06           98.43        91.96
ALMs (%)             90.40           94.31        85.54
DSPs (%)             100.00          100.00       100.00
From Table II, it can be observed that our progressive
training routine consumed the least power and had the smallest
total training time on FPGA. Moreover, when adopting 16-bit
fixed-point real-valued representations during training, it
achieved the largest test set accuracy. We believe that, similarly
to [1], this can be attributed to the additional regularization
that binarized parameters introduce. We note that the total
training times of our GPU and FPGA implementations are
not indicative of those with larger batch sizes, and that the
available resources on the FPGA used restricted us to
ℑ = 8 across all devices.
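As a rough illustration of how the power and time columns of Table II combine, the sketch below multiplies kernel power by total training time for the 8-bit fixed-point rows. The resulting energy figures are not reported in this paper; they assume the listed power is an average sustained over the whole run, and the GPU stochastic training time is read as 2,613.67 s. The names are hypothetical.

```python
# (kernel power in W, total training time in s), 8-bit fixed point, Table II.
RESULTS_8BIT = {
    ("FPGA", "Stochastic"): (8.06, 1592.73),
    ("FPGA", "Deterministic"): (7.95, 1523.17),
    ("FPGA", "Progressive"): (7.60, 1383.17),
    ("GPU", "Stochastic"): (133.9, 2613.67),
    ("GPU", "Deterministic"): (133.0, 2497.72),
    ("GPU", "Progressive"): (130.3, 2315.97),
}


def energy_kj(device, routine):
    """Estimated kernel energy in kJ: power (W) times time (s) divided by 1000."""
    power_w, seconds = RESULTS_8BIT[(device, routine)]
    return power_w * seconds / 1000.0


for key in sorted(RESULTS_8BIT):
    print(key, round(energy_kj(*key), 2))
```

Under these assumptions, the ordering of the three routines on the FPGA is the same for estimated energy as for power and training time individually.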
The device utilization of our FPGA implementations is
presented in Table III. Our progressively binarizing training
routine uses fewer Adaptive Logic Modules (ALMs)
and Flip Flops than the deterministic and stochastic routines for
both 16- and 8-bit fixed-point representations; for example, at
8 bits it occupies 76.92% of ALMs, compared with 81.38% and
84.87%, respectively. Digital Signal Processor (DSP) utilization
is similar to that of the deterministic and stochastic routines,
and decreases only marginally (from 100.00% to 93.20%) when
8-bit fixed-point real-valued representations are adopted.
V. CONCLUSION
We proposed and implemented novel and scalable PBNNs
on GPUs and FPGAs. We compared our approach to conven-
tional BNNs and real-valued DNNs using GPUs and FPGAs
and demonstrated notable reductions in power and resource
utilizations for CIFAR-10. This was achieved through approx-
imations and hardware optimizations, as well as using only one
set of network parameters compared to conventional BNNs.
We leave further hardware-level dissemination, upscaling,
hyperparameter optimization, and tuning to future work.
REFERENCES
[1] M. Courbariaux and Y. Bengio, “BinaryNet: Training Deep Neural
Networks with Weights and Activations Constrained to +1 or
-1,” CoRR, vol. abs/1602.02830, 2016. [Online]. Available: http:
//arxiv.org/abs/1602.02830
[2] E. Nurvitadhi, D. Sheffield, Jaewoong Sim, A. Mishra, G. Venkatesh,
and D. Marr, “Accelerating Binarized Neural Networks: Comparison of
FPGA, CPU, GPU, and ASIC,” in 2016 International Conference on
Field-Programmable Technology (FPT), Dec 2016, pp. 77–84.
[3] C. Lammie, A. Olsen, T. Carrick, and M. Rahimi Azghadi, “Low-Power
and High-Speed Deep FPGA Inference Engines for Weed Classification
at the Edge,” IEEE Access, vol. 7, pp. 51 171–51 184, 2019.
[4] L. Yang, Z. He, and D. Fan, “A Fully Onchip Binarized
Convolutional Neural Network FPGA Impelmentation with Accurate
Inference,” in Proceedings of the International Symposium on
Low Power Electronics and Design, ser. ISLPED ’18. New
York, NY, USA: ACM, 2018, pp. 50:1–50:6. [Online]. Available:
http://doi.acm.org/10.1145/3218603.3218615
[5] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “FP-BNN: Binarized
neural network on FPGA,” Neurocomputing, vol. 275, pp. 1072 – 1086,
2018. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/S0925231217315655
[6] C. Lammie, W. Xiang, and M. R. Azghadi, “Accelerating Deterministic
and Stochastic Binarized Neural Networks on FPGAs Using OpenCL,”
in 2019 IEEE 62nd International Midwest Symposium on Circuits and
Systems (MWSCAS), Aug 2019, pp. 626–629.
[7] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia, “BNN+:
Improved Binary Network Training,” CoRR, vol. abs/1812.11800, 2018.
[Online]. Available: http://arxiv.org/abs/1812.11800
[8] W. Tang, G. Hua, and L. Wang, “How to Train a Compact Binary
Neural Network with High Accuracy?” 2017. [Online]. Available:
https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14619
[9] X. Lin, C. Zhao, and W. Pan, “Towards Accurate Binary Convolutional
Neural Network,” in Advances in Neural Information Processing
Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates,
Inc., 2017, pp. 345–353. [Online]. Available: http://papers.nips.cc/
paper/6638-towards-accurate-binary-convolutional-neural-network.pdf
[10] Z. Cao, M. Long, J. Wang, and P. S. Yu, “HashNet: Deep Learning
to Hash by Continuation,” CoRR, vol. abs/1702.00758, 2017. [Online].
Available: http://arxiv.org/abs/1702.00758
[11] C. Sakr, J. Choi, Z. Wang, K. Gopalakrishnan, and N. Shanbhag, “True
Gradient-Based Training of Deep Binary Activated Neural Networks
Via Continuous Binarization,” in 2018 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp.
2346–2350.
[12] F. Lahoud, R. Achanta, P. Márquez-Neila, and S. Süsstrunk, “Self-
Binarizing Networks,” CoRR, vol. abs/1902.00730, 2019. [Online].
Available: http://arxiv.org/abs/1902.00730
[13] Z. Li, D. He, F. Tian, W. Chen, T. Qin, L. Wang, and T. Liu,
“Towards Binary-Valued Gates for Robust LSTM Training,” CoRR, vol.
abs/1806.02988, 2018. [Online]. Available: http://arxiv.org/abs/1806.
02988
[14] A. Krizhevsky et al., “Learning Multiple Layers of Features from Tiny
Images,” Citeseer, Tech. Rep., 2009.
[15] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,”
CoRR, vol. abs/1412.6980, 2014.
[16] Z. Zhang and M. R. Sabuncu, “Generalized Cross Entropy Loss
for Training Deep Neural Networks with Noisy Labels,” CoRR, vol.
abs/1805.07836, 2018. [Online]. Available: http://arxiv.org/abs/1805.
07836
[17] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-
Supervised Nets,” in Proceedings of the Eighteenth International
Conference on Artificial Intelligence and Statistics, ser. Proceedings of
Machine Learning Research, G. Lebanon and S. V. N. Vishwanathan,
Eds., vol. 38. San Diego, California, USA: PMLR, 09–12 May
2015, pp. 562–570. [Online]. Available: http://proceedings.mlr.press/
v38/lee15a.html
[18] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks
for Large-Scale Image Recognition,” CoRR, vol. abs/1409.1556, 2014.
[19] X. Glorot, A. Bordes, and Y. Bengio, “Deep Sparse Rectifier Neural
Networks,” in Proceedings of the Fourteenth International Conference
on Artificial Intelligence and Statistics, ser. Proceedings of Machine
Learning Research, G. Gordon, D. Dunson, and M. Dudik, Eds., vol. 15.
Fort Lauderdale, FL, USA: PMLR, 11–13 Apr 2011, pp. 315–323.
[Online]. Available: http://proceedings.mlr.press/v15/glorot11a.html
[20] J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A Parallel Programming
Standard for Heterogeneous Computing Systems,” Computing in Science
Engineering, vol. 12, no. 3, pp. 66–73, May 2010.
[21] T. Sorensen and A. F. Donaldson, “The Hitchhiker’s Guide to
Cross-Platform OpenCL Application Development,” in Proceedings
of the 4th International Workshop on OpenCL, ser. IWOCL ’16.
New York, NY, USA: ACM, 2016, pp. 2:1–2:12. [Online]. Available:
http://doi.acm.org/10.1145/2909437.2909440
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification
with Deep Convolutional Neural Networks,” Commun. ACM, vol. 60,
no. 6, pp. 84–90, May 2017. [Online]. Available: http://doi.acm.org/10.
1145/3065386
