Convolutional neural networks have recently achieved significant breakthroughs in various image classification tasks. However, they are computationally expensive, which can make their implementation on embedded and low-power devices difficult. In this paper, convolutional neural network binarization is implemented on GPU-based platforms for real-time inference on resource-constrained devices. In binarized networks, all weights and intermediate computations between layers are quantized to +1 and -1, allowing multiplications and additions to be replaced with bit-wise operations between 32-bit words. This representation completely eliminates the need for floating point multiplications and additions and decreases both the computational load and the memory footprint compared to a full-precision network implemented in floating point, making it well-suited for resource-constrained environments. We compare the performance of our implementation with an equivalent floating point implementation on one desktop and two embedded GPU platforms. Our implementation achieves a maximum speed up of 7.4× with only 4.4% loss in accuracy compared to a reference implementation.
Introduction
In recent years, convolutional neural networks (CNNs) have shown impressive performance in image classification [16] [4], face recognition [17] [19], audio classification [14], and speech recognition [7].
Large neural network models can be computationally expensive, making them unsuitable for deployment to small resource-constrained mobile devices. For this reason, contemporary CNN-based solutions often acquire the input data on a mobile device, but transmit the data to a remote server for CNN-based processing. However, performing the CNN-based processing on the mobile device (a.k.a. edge computing) would reduce the overall system complexity and enable real-time applications.
The emerging CNN subfield of model compression aims to retain the accuracy of the neural network while minimizing redundant network parameters and reducing computational load. Many such techniques have already been proposed.
One technique [9] is based on pruning of parameters, where the majority of the parameters of the network are removed without significantly impacting accuracy. Reduction of parameters initially leads to a significant drop in accuracy; however, retraining (fine-tuning) of the parameters restores most of the network's accuracy. The authors report a 13× reduction of memory requirements with no loss in accuracy [9].
Another approach, low-rank approximation of convolutional kernels [13], approximates 2D convolutions with convolutions by vectors. The separable kernels can be obtained either by training the network with separable filters [1] or by posing it as an optimization problem to minimize the reconstruction error of the feature maps. Depending on the approach [13] [1], speedups between 2× and 4× have been reported on CPU implementations.
Binarized neural networks (BNNs) were first introduced in [11], where their performance was demonstrated on the CIFAR-10 dataset. The weights and activations for intermediate computations are binarized to +1 and −1. The authors present a speed up of 7× on a network for the MNIST dataset. In a further work [21] the approach was refined for CPU implementation and evaluated on the ImageNet dataset.
In this paper, an approach for the implementation of binarized convolutional neural networks (BCNNs) on GPU platforms is presented. A recent work (to appear in ICLR 2018) [20] presents similar BCNN optimizations on GPUs. Our work was developed simultaneously with [20] but differs in several respects. We provide a more detailed (layerwise) performance comparison with state-of-the-art full-precision GPU implementations such as cuDNN, cuBLAS, and the ARM Compute Library. Also, instead of concentrating only on the desktop GPU, we test and evaluate the performance of our GPU-optimized BCNNs (written in both CUDA and OpenCL) on several different platforms, including embedded ones. Additionally, we present our implementation with an application use case of vehicle type classification [12]. Finally, we employ alternative input binarization strategies and study the impact of each approach on the classification accuracy. Our results show significant speedups in real-time inference compared to a floating point version of an equivalent neural network.
In summary, the contributions of this work are as follows:
• Detailed presentation of efficiently implementing CNN binarization, including the convolutional layers, on GPU-based platforms.
• Study and comparison of different approaches for binarizing input data, and how each approach impacts the classification accuracy.
• Performance (execution time) measurements and comparisons on several platforms against state-of-the-art full-precision GPU implementations.
The source code for our CUDA implementation is publicly available 1 .
Experimental Setup

Binarizing the network
Our binarized network architecture is based on the original vehicle classifier network presented in [12]. We implement a binarized version of the same architecture in several steps. We do not use any ReLU [6] activations in the binarized version. In the original binarization work [5], the authors suggest two approaches for binarization: stochastic and deterministic. For binarizing the weights and intermediate computations, we use the deterministic sign function, which is defined as
$$\operatorname{sign}(x) = \begin{cases} +1, & x \geq 0, \\ -1, & \text{otherwise}. \end{cases} \qquad (1)$$
For training the BNN, following [10], we explicitly define the sign function to act as the identity function in the backward pass, such that ∂sign(x)/∂x = 1. The non-binarized network is trained with the RMSprop optimizer [23], while the binarized version is trained with the Adam [15] optimizer. After training, only the binarized weights are used for inference for the binarized network.
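The interplay between the deterministic sign function and the identity backward pass can be illustrated with a small Python sketch. It is not taken from the released source code: the function names, the toy single-neuron model, the squared-error loss, and the plain gradient-descent update are all assumptions made for illustration (the paper trains with Adam), and sign(0) is assumed to map to +1.

```python
# Illustrative sketch only: a toy neuron with binarized weights and a
# straight-through (identity) backward pass. All names are hypothetical.

def binarize(x):
    """Deterministic sign function; sign(0) = +1 is an assumed convention."""
    return 1.0 if x >= 0 else -1.0

def binarize_backward(grad_output):
    """Backward pass of sign treated as the identity: the gradient passes through."""
    return grad_output

def sgd_step(real_weights, grads, lr=0.1):
    """Update the real-valued weights; they are re-binarized on every forward pass."""
    return [w - lr * g for w, g in zip(real_weights, grads)]

# Toy forward pass: y = sum_i sign(w_i) * x_i, loss = 0.5 * (y - target)^2.
real_w = [0.05, -0.3, 0.8]
x = [1.0, -1.0, 1.0]
target = 1.0

w_b = [binarize(w) for w in real_w]              # binarized weights used in the forward pass
y = sum(wb * xi for wb, xi in zip(w_b, x))
dloss_dy = y - target
grad_wb = [dloss_dy * xi for xi in x]            # gradient w.r.t. the binarized weights
grad_w = [binarize_backward(g) for g in grad_wb] # identity: same gradient for the real weights
new_w = sgd_step(real_w, grad_w)
```

Only the real-valued weights are updated; at inference time they can be discarded and the binarized copies kept, which matches the statement that only the binarized weights are used for inference.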
The network is trained with a dataset consisting of 6555 images of vehicles that have been captured by a camera and manually categorized into four different classes: bus, normal, truck, and van. Each image has size 96 × 96 and is in full color. The data has been split into a training set (90%) and a test set (10%). We augment the training set using flipping and filtering with a 2D Gaussian filter with σ = 0.5, resulting in a total training set size of 14,108 images, 20% of which are used for validation. Throughout this text, our accuracy reports are for the performance of the network on the test set that corresponds to the best validation set accuracy.
Testing pipeline
For obtaining runtime results, we use the built-in GPU timers to measure the runtime of the kernels for our CUDA and OpenCL programs. Our kernel execution time measurements do not include memory transfer times to/from the GPU, as they can be affected by various factors, some of which are hardware-dependent; for example, on the Nvidia Jetson, host and device memory are shared. The correctness and accuracy of the profiling results have been verified with the Nvidia Visual Profiler for the same CUDA programs.
1. github.com/Valentin4869/BinCNN
For each test run, 1000 images are randomly generated and fed to the network one at a time. The timer begins after the memory is copied, and the timer ends after the last kernel's computation is completed. Our final result is the total accumulated time per sample averaged over all 1000 samples.
Input binarization
In this section we describe our methods for binarizing the inputs to the first layer of our BNN. We pre-process the data set using these techniques and evaluate the accuracy of the BNN on the pre-processed data set.
Thresholding. A constant threshold T can be applied to the input X before binarizing it. We simply substitute the input X to the first layer with sign(X + T), for X ∈ R^{H×W×C} and T ∈ R^{1×1×C}, so that a negative entry of T acts as a subtracted per-channel threshold. The motivation is to shift the range of values taken by X such that binarization with the sign function produces meaningful results, as opposed to a constant output, since standard pixel-value ranges do not include negative numbers. The network is trained as before but in two stages: first, the network is trained for 50 epochs and the loss is minimized with respect to all network parameters except for T. Then a second stage of tuning is entered, where we minimize the loss with respect to the parameter T using the validation set. We repeat this process for several thousand training epochs until the performance on the validation set no longer improves.
Local Binary Patterns (LBP). A well-known technique called local binary patterns, for extracting multi-resolution and scale-invariant features from images, was introduced in [18]. We use a similar approach in our application for image binarization, with a slight modification: we operate on the grayscale image and process each pixel by examining its neighborhood at a radius of 1 pixel. We generate 3 artificial color channels and select 3 pixels at a clockwise stride of 3 in the neighborhood to distribute to these channels. The value of each of these pixels is then set to 1 if it exceeds the value of the center pixel and 0 otherwise. An example of this transformation on an image from the dataset is shown in the second row of Figure 2.3.
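Both input binarization schemes can be sketched in a few lines of Python. This is only one possible reading of the description above, not the authors' pre-processing code: the function names are hypothetical, the clockwise neighbor ordering is assumed to start at the top-left neighbor, and image borders are assumed to be handled by clamping coordinates, none of which is specified in the text.

```python
# Illustrative sketch only; names, neighbor ordering and border handling are assumptions.

def sign(v):
    """Deterministic sign with sign(0) = +1 (assumed convention)."""
    return 1 if v >= 0 else -1

def threshold_binarize(image, T):
    """Return sign(X + T) for image[y][x][c] and a per-channel offset T[c]."""
    return [[[sign(px[c] + T[c]) for c in range(len(T))] for px in row]
            for row in image]

# The 8 neighbors at radius 1 in clockwise order, starting at the top-left (assumed start).
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_binarize(gray, start=0, stride=3):
    """Map a grayscale image to 3 binary channels from 3 neighbors at a clockwise stride."""
    h, w = len(gray), len(gray[0])
    picks = [NEIGHBOURS[(start + stride * k) % 8] for k in range(3)]
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            bits = []
            for dy, dx in picks:
                ny = min(max(y + dy, 0), h - 1)  # clamp at the border (assumption)
                nx = min(max(x + dx, 0), w - 1)
                bits.append(1 if gray[ny][nx] > gray[y][x] else 0)
            row.append(bits)
        out.append(row)
    return out

# Example inputs: one RGB pixel in [0, 1] and a 3 x 3 grayscale image.
thresholded = threshold_binarize([[[0.2, 0.5, 0.9]]], T=[-0.5, -0.5, -0.5])
lbp = lbp_binarize([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Note that the LBP output uses the values 0 and 1, as in the description, so it would still have to be mapped to the packed bit representation used by the network.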
Packing binary-valued vectors
To avoid confusion with terminology, we denote by packing the encapsulation/conversion of an array of 1-bit values into an individual 32-bit unsigned integer. Formally, for a binary-valued vector x ∈ {−1, +1}^D, assuming D is divisible by B, the packed representation of x, x_p ∈ {0, 1, 2, 3, ..., 2^B − 1}^{D/B} for a packing bitwidth B ≤ 32 (assuming a 32-bit word) and positive D, is given by
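As a concrete illustration of packing, the following Python sketch maps +1 to a set bit and −1 to a cleared bit and stores B consecutive elements per word. The function names are hypothetical, and the bit order (element j of each group goes to bit j, least significant bit first) is an assumption; any fixed order works as long as both operands of a later dot product use the same one.

```python
# Illustrative sketch only; bit order and names are assumptions.

def pack(x, B=32):
    """Pack a vector over {-1, +1} into len(x) / B unsigned words of B bits each."""
    if not 1 <= B <= 32:
        raise ValueError("packing bitwidth B must satisfy 1 <= B <= 32")
    if len(x) == 0 or len(x) % B != 0:
        raise ValueError("len(x) must be positive and divisible by B")
    words = []
    for i in range(0, len(x), B):
        word = 0
        for j in range(B):
            bit = 1 if x[i + j] == 1 else 0  # +1 -> 1, -1 -> 0
            word |= bit << j                 # element j occupies bit j (assumed order)
        words.append(word)
    return words

def unpack(words, B=32):
    """Inverse of pack: recover the {-1, +1} vector from the packed words."""
    return [1 if (w >> j) & 1 else -1 for w in words for j in range(B)]

packed = pack([1, -1, -1, 1, -1, -1, -1, -1], B=4)
```

Each packed word lies in the range 0 to 2^B − 1, which agrees with the set given in the definition above.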
Implementation
In this section, we present the details of our CUDA implementation of the binarized neural network architecture described in Section 2. We use CUDA terminology throughout this section.
Convolutional layers
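The convolutional layer in a neural network can significantly improve image classification accuracy compared to standard multi-layer perceptrons. Given a kernel H ∈ R^{K×K×C}, indexed from −R to R along its two spatial dimensions, and an image X ∈ R^{H×W×C}, an output feature map F ∈ R^{H×W} is given by the expression
$$F(i, j) = \sum_{c=1}^{C} \sum_{u=-R}^{R} \sum_{v=-R}^{R} H(u, v, c)\, X(i + u, j + v, c) \qquad (3)$$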
for odd K, and the kernel radius R = (K − 1)/2. It should be noted that equation (3) in fact computes cross-correlation (not convolution), which is the convention in deep learning. A common approach for computing convolutions efficiently is through matrix multiplication [2], where the weights and image tensors are reshaped into 2-dimensional matrices, which then allows us to compute the convolution through a single matrix multiplication. The reshaping for the weights is trivial, and this step can often be skipped if the weights are already stored in this layout; however, the process of arranging the input image into the matrix of columns used for computing the convolution can be difficult to optimize. This is due to inefficient access patterns, complicated index calculations that involve many division and modulo operations, and the overhead of storing the large output matrix to global memory.
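The matrix-multiplication formulation can be made concrete with a small full-precision Python sketch. It is a plain CPU reference, not the GPU implementation: the function names and the patch layout (rows, then columns, then channels) are assumptions, and zero padding is assumed so that the output keeps the H × W size of the input.

```python
# Illustrative CPU reference only; names and memory layout are assumptions.

def im2col(image, K):
    """Arrange every K x K x C patch of image[y][x][c] as one flat row (zero-padded)."""
    H, W, C = len(image), len(image[0]), len(image[0][0])
    R = (K - 1) // 2
    cols = []
    for i in range(H):
        for j in range(W):
            patch = []
            for u in range(-R, R + 1):
                for v in range(-R, R + 1):
                    y, x = i + u, j + v
                    inside = 0 <= y < H and 0 <= x < W
                    for c in range(C):
                        patch.append(image[y][x][c] if inside else 0.0)
            cols.append(patch)
    return cols

def conv_as_matmul(image, kernels, K):
    """Cross-correlate with each flattened kernel; returns out[m][i][j]."""
    H, W = len(image), len(image[0])
    cols = im2col(image, K)
    out = []
    for w in kernels:  # one row of the weight matrix per output feature map
        flat = [sum(a * b for a, b in zip(w, patch)) for patch in cols]
        out.append([flat[i * W:(i + 1) * W] for i in range(H)])
    return out

image = [[[float(3 * i + j + 1)] for j in range(3)] for i in range(3)]
ones = [1.0] * 9
result = conv_as_matmul(image, [ones], K=3)
```

The patch matrix has H·W rows of K·K·C entries each, which is the large intermediate whose construction and storage the text identifies as the bottleneck.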
A straightforward approach for avoiding inefficient access patterns is to load regions from the image into shared memory (on-chip memory) and then extract the patches from shared memory [3]. For an image with dimensions H × W × C corresponding to height, width, and channels respectively, and a K × K × C kernel with a radius of R = (K − 1)/2, we use thread-block dimensions of S × W (S = 2 in our case), which covers the entire width of the image, eliminating the need to redundantly load the horizontal non-zero halo regions, which are difficult to load with an efficient access pattern. Each thread block then loads an image region of dimensions (S + 2R) × W into a region in shared memory in three steps, starting by loading the top vertical halo region, then the middle part, then the bottom vertical halo region (except when loading from the bottom of the image). The shared memory buffer is zero-initialized in order to implicitly handle horizontal zero-padding. Loading vertical halo regions can be done very efficiently since all threads in the thread block load from contiguous regions in the image array.
In the second stage, the patches of size K × K × C are extracted from shared memory. We avoid division and modulo operations in the patch-extraction stage by using an integer counter register. This results in a 2× performance boost in our case. Since the network is binarized, the packing and patch-extraction steps can be fused into one step to avoid redundant accesses to global memory, reducing global memory stores by a factor of K × K. The algorithm for the combined step of extracting the patches and packing them is shown in Algorithm 1.
Algorithm 1: Combined patch extraction and packing (returns the packed patch v).
In Algorithm 1, sh_block is the region of the image loaded into shared memory using the previously described steps, including the halo regions. _t_x and _t_y are the thread indices for the x and y dimensions of the thread block, corresponding to the CUDA threadIdx.x and threadIdx.y variables. B is the packing bitwidth, chosen to be 25 in our case, << is the left bit-shift operator, and v is the packed extracted patch.
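The idea behind the fused step can be sketched in Python as follows. This is not a reproduction of Algorithm 1: the signature is hypothetical, the patch is assumed to start at the thread's own position inside sh_block, one channel is packed per call, and the bit order is an assumption. What it does illustrate is the use of a running counter for the bit position, so that no division or modulo is needed, and the packing of a whole K × K patch into a single word (K × K = 25 bits for K = 5).

```python
# Illustrative sketch only, not the authors' Algorithm 1; layout and names are assumptions.

def extract_and_pack(sh_block, t_x, t_y, K, channel):
    """Pack the K x K patch whose top-left corner is sh_block[t_y][t_x] into one word.

    sh_block[row][col][c] holds the staged image region including halo rows/columns;
    values greater than zero become a set bit, all other values a cleared bit.
    """
    v = 0
    k = 0  # running bit position, incremented instead of computed with div/mod
    for dy in range(K):
        for dx in range(K):
            if sh_block[t_y + dy][t_x + dx][channel] > 0:
                v |= 1 << k
            k += 1
    return v

# A 3 x 3 single-channel checkerboard standing in for the staged region.
block = [[[1], [-1], [1]],
         [[-1], [1], [-1]],
         [[1], [-1], [1]]]
v = extract_and_pack(block, t_x=0, t_y=0, K=3, channel=0)
```

Because the packed word is produced directly from the staged region, the unpacked K × K patch never has to be written to global memory.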
For computing the convolution we implement a standard matrix multiplication subroutine in a manner similar to [22], where tiles from each matrix are loaded successively into shared memory and used to compute a submatrix of the output, such that each thread computes a single element in the output matrix. Instead of computing multiplications, however, we compute xnors and bit-counts following an approach similar to what was suggested in [11]:
$$a \cdot b = W - 2 \times \operatorname{popcount}(\operatorname{xor}(A, B)), \qquad (4)$$
where A and B are both 32-bit unsigned integer registers containing the packed representations of vectors a, b ∈ {−1, +1}^W respectively. We denote by · the real-valued dot product. The operation xor is the bit-wise xor operation, and popcount is a function for computing the number of bits set to 1. The packing bitwidth W is the number of elements that are packed together in a single unsigned integer register.
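Equation (4) can be checked numerically with a short Python sketch. The helper names are hypothetical and the packing convention (+1 as a set bit, element j in bit j) is an assumption; unused high bits are zero in both words, so they cancel in the xor.

```python
# Illustrative sketch only; helper names and bit order are assumptions.

def pack_word(vec):
    """Pack a short {-1, +1} vector into one integer: +1 -> bit set, -1 -> bit cleared."""
    word = 0
    for j, value in enumerate(vec):
        if value == 1:
            word |= 1 << j
    return word

def popcount(word):
    """Number of bits set to 1."""
    return bin(word).count("1")

def binary_dot(A, B, W):
    """Dot product of two packed {-1, +1} vectors of length W, as in equation (4)."""
    return W - 2 * popcount(A ^ B)

a = [1, -1, 1, 1, -1]
b = [1, 1, -1, 1, -1]
reference = sum(p * q for p, q in zip(a, b))
fast = binary_dot(pack_word(a), pack_word(b), W=len(a))
```

Positions where the two vectors agree contribute +1 and positions where they differ contribute −1; popcount(xor) counts the differing positions, which gives W − 2 × popcount. The equivalent xnor form is 2 × popcount(xnor(A, B)) − W when the xnor result is restricted to the W valid bits.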
Fully connected layer
For the fully-connected layer, we follow a slightly different approach from standard matrix multiplication. For a packed weights matrix W ∈ R^{L×D} and a packed vector x ∈ R^{D×1}, we divide the computation of the dot product of each weight vector and x into 64 segments, such that each of the 64 threads handling a weight vector computes the partial sum of the dot product between its segment of the weight vector and x through xnor operations, and stores the result in shared memory. The partial sums are then combined in a parallel reduction sum that does not require synchronization (for a warp size of 32 on the target platform).
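The segment-wise computation can be emulated sequentially in Python. This is a sketch of the general scheme, not the CUDA kernel: the function names are hypothetical, the assignment of contiguous word ranges to threads is an assumption, each loop iteration stands in for one GPU thread, and the reduction is a standard pairwise tree that requires the number of threads to be a power of two.

```python
# Illustrative sequential emulation only; names and segment layout are assumptions.

def popcount(word):
    return bin(word).count("1")

def fc_row(packed_w, packed_x, bits_per_word=32, n_threads=64):
    """Dot product of one packed weight row with the packed input vector."""
    D = len(packed_x)
    seg = (D + n_threads - 1) // n_threads  # packed words handled per thread
    partial = [0] * n_threads               # stands in for the shared-memory buffer
    for t in range(n_threads):              # each iteration plays the role of one thread
        for i in range(t * seg, min((t + 1) * seg, D)):
            partial[t] += bits_per_word - 2 * popcount(packed_w[i] ^ packed_x[i])
    stride = n_threads // 2                 # pairwise tree reduction of the partial sums
    while stride > 0:
        for t in range(stride):
            partial[t] += partial[t + stride]
        stride //= 2
    return partial[0]

def fc_layer(packed_weights, packed_x, bits_per_word=32, n_threads=64):
    """Apply fc_row to every weight row; returns one integer pre-activation per output."""
    return [fc_row(row, packed_x, bits_per_word, n_threads) for row in packed_weights]

w_row = [(i * 2654435761) & 0xFFFFFFFF for i in range(128)]
x_vec = [(i * 40503 + 7) & 0xFFFFFFFF for i in range(128)]
y = fc_row(w_row, x_vec)
```

On a GPU the final steps of such a reduction operate within a single warp, which is the setting in which the text states that no explicit synchronization is needed.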
Results
In this section we present our results for the impact of input binarization on classification accuracy and the performance improvement achieved.
Input binarization. In Table 3 we report the classification accuracy results we obtained using each input binarization scheme for our binarized version of the vehicle classifier [12]. We can observe that accuracy is best retained when the first layer is not binarized; however, only a moderate loss in accuracy occurs when using LBP and RGB Thresholding. Considering that RGB Thresholding is much simpler to implement and results in almost no additional computational overhead, we choose this approach for our final binarized architecture, for which we report the speed up results in the following section.
Performance boost. We time our binarized implementation on three different hardware platforms: Nvidia GTX 1080, Nvidia Jetson (Tegra X2), and the Mali-T860. We derive an OpenCL version of our implementation for testing on the Mali-T860, which is a straightforward process. We compare the performance of our implementation against an equivalent full-precision version of the same network implemented with highly optimized libraries on each target platform; in our case these are cuDNN on Nvidia platforms and the ARM Compute Library on the Mali-T860. We list in Table 1 the average execution times of the full network on each platform. We can see that our binarized implementation can achieve up to 7.5× speed up on the GTX 1080 and about 5.5× on the Tegra X2. We also notice that the relative performance improvement on the Mali GPU is much smaller, at about 1.7× for the fully binarized version. In our optimizations, we rely heavily on local memory (in OpenCL terms), which resides on-chip in most workstation GPUs and the Nvidia Tegra X2, but this does not offer any performance benefits on Mali GPUs since local memory is allocated in global memory. It should be noted that cuDNN is optimized for batch processing and that our results are for one sample at a time, which means these results may not necessarily reflect the full potential of cuDNN; however, batch processing is not a suitable option for real-time applications where a single input is processed at a time. Additionally, we note that for our cuDNN implementations, we use the explicit GEMM convolution algorithm, which can be slightly slower than the implicit GEMM algorithm. For example, cuDNN with implicit GEMM can run at 316µs for the first convolutional layer in our network on the GTX 1080. For a more detailed comparison, we present the execution times for each individual layer in Table 2.
In Table 2, each layer's name is followed by the dimensions of the input, except for the convolution layers, where the dimensions are for the kernels and the input dimensions can be inferred from the previous layer. This table compares the execution time of our binarized implementations with the full-precision versions of the same layer in cuDNN on the GTX 1080. We omit from the table the computation times for ReLU activations, which are present in the full-precision version of the network but are absent from the binarized version. We also omit the last two fully-connected layers since they are too small and in most practical applications it would be more efficient to implement them on the CPU. We include the computation time for packing the outputs of the previous layer in the binarized version of the fully-connected layer for a fair comparison. The results in Table 2 have been obtained directly from the Nvidia Visual Profiler.
It should be noted that the runtime for the fully-connected layer for full-precision cuDNN in Table 2 includes a matrix transposition. The runtime excluding matrix transposition is about 100µs; however, the transposition is a necessary step for evaluating this layer. Our full-precision matrix multiplication kernel is in fact 2× slower than cuBLAS (as measured in this network), yet a significant speed-up is still achievable through binarization.
Conclusion and Future Work
We presented an efficient implementation of a binarized convolutional neural network on GPUs that can achieve a significant decrease in runtime while reasonably preserving classification accuracy. In the future we wish to restructure our algorithms to achieve a similar performance improvement on other embedded platforms. We are also planning to extend this work to alternative convolution algorithms such as implicit GEMM, which can be faster than explicit GEMM. Finally, we plan to extend our study of how input binarization impacts classification accuracy on larger datasets with more difficult classification tasks.
